diff --git a/doc/source/user_guide/io.rst b/doc/source/user_guide/io.rst
index c33d4ab92d4c6..1002eb9ee8568 100644
--- a/doc/source/user_guide/io.rst
+++ b/doc/source/user_guide/io.rst
@@ -302,7 +302,7 @@ date_format : str or dict of column -> format, default ``None``
     format. For anything more complex,
     please read in as ``object`` and then apply :func:`to_datetime` as-needed.
 
-      .. versionadded:: 2.0.0
+    .. versionadded:: 2.0.0
 dayfirst : boolean, default ``False``
   DD/MM format dates, international and European format.
 cache_dates : boolean, default True
@@ -385,9 +385,9 @@ on_bad_lines : {{'error', 'warn', 'skip'}}, default 'error'
     Specifies what to do upon encountering a bad line (a line with too many fields).
     Allowed values are :
 
-        - 'error', raise an ParserError when a bad line is encountered.
-        - 'warn', print a warning when a bad line is encountered and skip that line.
-        - 'skip', skip bad lines without raising or warning when they are encountered.
+      - 'error', raise a ParserError when a bad line is encountered.
+      - 'warn', print a warning when a bad line is encountered and skip that line.
+      - 'skip', skip bad lines without raising or warning when they are encountered.
 
     .. versionadded:: 1.3.0
 
@@ -1998,12 +1998,12 @@ fall back in the following manner:
 * if an object is unsupported it will attempt the following:
 
-  * check if the object has defined a ``toDict`` method and call it.
+  - check if the object has defined a ``toDict`` method and call it.
     A ``toDict`` method should return a ``dict`` which will then
     be JSON serialized.
 
-  * invoke the ``default_handler`` if one was provided.
+  - invoke the ``default_handler`` if one was provided.
 
-  * convert the object to a ``dict`` by traversing its contents. However this will often fail
+  - convert the object to a ``dict`` by traversing its contents. However this will often fail
     with an ``OverflowError`` or give unexpected results.
 
 In general the best approach for unsupported objects or dtypes is to provide a ``default_handler``.
@@ -2092,19 +2092,19 @@ preserve string-like numbers (e.g. '1', '2') in an axes.
 Large integer values may be converted to dates if ``convert_dates=True`` and the data and
 / or column labels appear 'date-like'. The exact threshold depends on the ``date_unit``
 specified. 'date-like' means that the column label meets one of the following criteria:
 
-       * it ends with ``'_at'``
-       * it ends with ``'_time'``
-       * it begins with ``'timestamp'``
-       * it is ``'modified'``
-       * it is ``'date'``
+     * it ends with ``'_at'``
+     * it ends with ``'_time'``
+     * it begins with ``'timestamp'``
+     * it is ``'modified'``
+     * it is ``'date'``
 
 .. warning::
 
    When reading JSON data, automatic coercing into dtypes has some quirks:
 
-     * an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization
-     * a column that was ``float`` data will be converted to ``integer`` if it can be done safely, e.g. a column of ``1.``
-     * bool columns will be converted to ``integer`` on reconstruction
+   * an index can be reconstructed in a different order from serialization, that is, the returned order is not guaranteed to be the same as before serialization
+   * a column that was ``float`` data will be converted to ``integer`` if it can be done safely, e.g. a column of ``1.``
+   * bool columns will be converted to ``integer`` on reconstruction
 
 Thus there are times where you may want to specify specific dtypes via the ``dtype`` keyword argument.
@@ -2370,19 +2370,19 @@ A few notes on the generated table schema:
 
 * The default naming roughly follows these rules:
 
-    * For series, the ``object.name`` is used. If that's none, then the
+    - For series, the ``object.name`` is used. If that's none, then the
      name is ``values``
-    * For ``DataFrames``, the stringified version of the column name is used
-    * For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
+    - For ``DataFrames``, the stringified version of the column name is used
+    - For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
      fallback to ``index`` if that is None.
-    * For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
+    - For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
      then ``level_<i>`` is used.
 
 ``read_json`` also accepts ``orient='table'`` as an argument. This allows for
 the preservation of metadata such as dtypes and index names in a
 round-trippable manner.
 
-    .. ipython:: python
+.. ipython:: python
 
    df = pd.DataFrame(
        {
@@ -2780,20 +2780,20 @@ parse HTML tables in the top-level pandas io function ``read_html``.
 
 * Benefits
 
-  * |lxml|_ is very fast.
+  - |lxml|_ is very fast.
 
-  * |lxml|_ requires Cython to install correctly.
+  - |lxml|_ requires Cython to install correctly.
 
 * Drawbacks
 
-  * |lxml|_ does *not* make any guarantees about the results of its parse
+  - |lxml|_ does *not* make any guarantees about the results of its parse
     *unless* it is given |svm|_.
 
-  * In light of the above, we have chosen to allow you, the user, to use the
+  - In light of the above, we have chosen to allow you, the user, to use the
     |lxml|_ backend, but **this backend will use** |html5lib|_ if |lxml|_
     fails to parse
 
-  * It is therefore *highly recommended* that you install both
+  - It is therefore *highly recommended* that you install both
     |BeautifulSoup4|_ and |html5lib|_, so that you will still get a valid
     result (provided everything else is valid) even if |lxml|_ fails.
@@ -2806,22 +2806,22 @@ parse HTML tables in the top-level pandas io function ``read_html``.
 
 * Benefits
 
-  * |html5lib|_ is far more lenient than |lxml|_ and consequently deals
+  - |html5lib|_ is far more lenient than |lxml|_ and consequently deals
     with *real-life markup* in a much saner way rather than just, e.g.,
     dropping an element without notifying you.
 
-  * |html5lib|_ *generates valid HTML5 markup from invalid markup
+  - |html5lib|_ *generates valid HTML5 markup from invalid markup
     automatically*. This is extremely important for parsing HTML tables,
     since it guarantees a valid document. However, that does NOT mean that
     it is "correct", since the process of fixing markup does not have a
     single definition.
 
-  * |html5lib|_ is pure Python and requires no additional build steps beyond
+  - |html5lib|_ is pure Python and requires no additional build steps beyond
     its own installation.
 
 * Drawbacks
 
-  * The biggest drawback to using |html5lib|_ is that it is slow as
+  - The biggest drawback to using |html5lib|_ is that it is slow as
     molasses. However consider the fact that many tables on the web are not
     big enough for the parsing algorithm runtime to matter. It is more
     likely that the bottleneck will be in the process of reading the raw
@@ -3211,7 +3211,7 @@ supports parsing such sizeable files using `lxml's iterparse`_ and `etree's iter
 which are memory-efficient methods to iterate through an XML tree and extract
 specific elements and attributes. without holding entire tree in memory.
 
-    .. versionadded:: 1.5.0
+.. versionadded:: 1.5.0
 
 .. _`lxml's iterparse`: https://lxml.de/3.2/parsing.html#iterparse-and-iterwalk
 .. _`etree's iterparse`: https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse
diff --git a/doc/source/user_guide/pyarrow.rst b/doc/source/user_guide/pyarrow.rst
index 63937ed27b8b2..61b383afb7c43 100644
--- a/doc/source/user_guide/pyarrow.rst
+++ b/doc/source/user_guide/pyarrow.rst
@@ -37,10 +37,10 @@ which is similar to a NumPy array. To construct these from the main pandas data
 
 .. note::
 
-  The string alias ``"string[pyarrow]"`` maps to ``pd.StringDtype("pyarrow")`` which is not equivalent to
-  specifying ``dtype=pd.ArrowDtype(pa.string())``. Generally, operations on the data will behave similarly
-  except ``pd.StringDtype("pyarrow")`` can return NumPy-backed nullable types while ``pd.ArrowDtype(pa.string())``
-  will return :class:`ArrowDtype`.
+   The string alias ``"string[pyarrow]"`` maps to ``pd.StringDtype("pyarrow")`` which is not equivalent to
+   specifying ``dtype=pd.ArrowDtype(pa.string())``. Generally, operations on the data will behave similarly
+   except ``pd.StringDtype("pyarrow")`` can return NumPy-backed nullable types while ``pd.ArrowDtype(pa.string())``
+   will return :class:`ArrowDtype`.
 
 .. ipython:: python
 
@@ -62,10 +62,14 @@ into :class:`ArrowDtype` to use in the ``dtype`` parameter.
     ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
     ser
 
+.. ipython:: python
+
     from datetime import time
     idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))
     idx
 
+.. ipython:: python
+
     from decimal import Decimal
     decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
     data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
@@ -78,7 +82,10 @@ or :class:`DataFrame` object.
 
 .. ipython:: python
 
-    pa_array = pa.array([{"1": "2"}, {"10": "20"}, None])
+    pa_array = pa.array(
+        [{"1": "2"}, {"10": "20"}, None],
+        type=pa.map_(pa.string(), pa.string()),
+    )
     ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
     ser
 
@@ -133,9 +140,13 @@ The following are just some examples of operations that are accelerated by nativ
     ser.isna()
     ser.fillna(0)
 
+.. ipython:: python
+
     ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
     ser_str.str.startswith("a")
 
+.. ipython:: python
+
     from datetime import datetime
     pa_type = pd.ArrowDtype(pa.timestamp("ns"))
     ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
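The three ``on_bad_lines`` modes touched up in the first io.rst hunk can be exercised end to end. A minimal sketch, assuming only pandas (>= 1.3) is installed; the sample CSV string and variable names are illustrative, not from the patch:

```python
import io

import pandas as pd

# The second data row has four fields instead of three, making it a "bad line".
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# on_bad_lines="error" (the default) raises a ParserError on the 4-field row.
try:
    pd.read_csv(io.StringIO(data))
except pd.errors.ParserError as exc:
    print(f"error mode: {exc}")

# on_bad_lines="skip" silently drops the malformed row and keeps the rest;
# on_bad_lines="warn" would keep the same rows but emit a warning instead.
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(df)  # two rows survive: (1, 2, 3) and (8, 9, 10)
```

Note that ``'warn'`` and ``'skip'`` produce the same DataFrame; they differ only in whether the dropped line is reported.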