@@ -435,18 +435,48 @@ individual columns:
     df = pd.read_csv(StringIO(data), dtype={'b': object, 'c': np.float64})
     df.dtypes

+Fortunately, ``pandas`` offers more than one way to ensure that your column(s)
+contain only one ``dtype``. For instance, you can use the ``converters``
+argument of :func:`~pandas.read_csv`:
+
+.. ipython:: python
+
+    data = "col_1\n1\n2\n'A'\n4.22"
+    df = pd.read_csv(StringIO(data), converters={'col_1': str})
+    df
+    df['col_1'].apply(type).value_counts()
+
+Or you can use the :func:`~pandas.to_numeric` function to coerce the
+dtypes after reading in the data,
+
+.. ipython:: python
+
+    df2 = pd.read_csv(StringIO(data))
+    df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
+    df2
+    df2['col_1'].apply(type).value_counts()
+
+which would convert all valid parsing to floats, leaving the invalid parsing
+as ``NaN``.
+
+Ultimately, how you deal with reading in columns containing mixed dtypes
+depends on your specific needs. In the case above, if you wanted to ``NaN`` out
+the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
+However, if you want all of the data to be coerced, no matter the type, then
+using the ``converters`` argument of :func:`~pandas.read_csv` would certainly be
+worth trying.
+
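Taken together, the two snippets added above can be run as one self-contained script (a sketch for illustration only; it simply mirrors the ``converters`` and :func:`~pandas.to_numeric` examples from this diff):

```python
from io import StringIO

import pandas as pd

# Mixed column: two integers, a quoted string, and a float
data = "col_1\n1\n2\n'A'\n4.22"

# Approach 1: force a single dtype at read time via ``converters``
df = pd.read_csv(StringIO(data), converters={'col_1': str})
# every value is now a str, including '1' and '4.22'

# Approach 2: read as-is, then coerce after the fact
df2 = pd.read_csv(StringIO(data))
df2['col_1'] = pd.to_numeric(df2['col_1'], errors='coerce')
# valid entries become floats; the "'A'" row becomes NaN
```

The trade-off is the one the text describes: ``converters`` preserves every value but leaves the column as strings, while ``to_numeric`` keeps a numeric dtype at the cost of turning anomalies into ``NaN``.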
 .. note::
     The ``dtype`` option is currently only supported by the C engine.
     Specifying ``dtype`` with ``engine`` other than ``'c'`` raises a
     ``ValueError``.

 .. note::
-
-    Reading in data with columns containing mixed dtypes and relying
-    on ``pandas`` to infer them is not recommended. In doing so, the
-    parsing engine will infer the dtypes for different chunks of the data,
-    rather than the whole dataset at once. Consequently, you can end up with
-    column(s) with mixed dtypes. For example,
+    In some cases, reading in abnormal data with columns containing mixed dtypes
+    will result in an inconsistent dataset. If you rely on pandas to infer the
+    dtypes of your columns, the parsing engine will infer the dtypes for
+    different chunks of the data, rather than for the whole dataset at once.
+    Consequently, you can end up with column(s) with mixed dtypes. For example,

     .. ipython:: python
         :okwarning:
@@ -458,45 +488,11 @@ individual columns:
         mixed_df['col_1'].dtype

     will result with `mixed_df` containing an ``int`` dtype for certain chunks
-    of the column, and ``str`` for others due to a problem during parsing.
-    It is important to note that the overall column will be marked with a
-    ``dtype`` of ``object``, which is used for columns with mixed dtypes.
-
-    Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
-    contain only one ``dtype``. For instance, you could use the ``converters``
-    argument of :func:`~pandas.read_csv`
-
-    .. ipython:: python
-
-        fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
-        fixed_df1['col_1'].apply(type).value_counts()
-
-    Or you could use the :func:`~pandas.to_numeric` function to coerce the
-    dtypes after reading in the data,
-
-    .. ipython:: python
-        :okwarning:
-
-        fixed_df2 = pd.read_csv('foo')
-        fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-        fixed_df2['col_1'].apply(type).value_counts()
-
-    which would convert all valid parsing to floats, leaving the invalid parsing
-    as ``NaN``.
-
-    Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
-    to ``False``. Such as,
-
-    .. ipython:: python
+    of the column, and ``str`` for others due to the mixed dtypes in the
+    data that was read in. It is important to note that the overall column will be
+    marked with a ``dtype`` of ``object``, which is used for columns with mixed dtypes.

-        fixed_df3 = pd.read_csv('foo', low_memory=False)
-        fixed_df3['col_1'].apply(type).value_counts()

-    Ultimately, how you deal with reading in columns containing mixed dtypes
-    depends on your specific needs. In the case above, if you wanted to ``NaN`` out
-    the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
-    However, if you wanted for all the data to be coerced, no matter the type, then
-    using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work.

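The chunk-by-chunk inference described in the note above can be reproduced without an intermediate file (a sketch; the 500,000-row split is an arbitrary size chosen so the C parser is likely to process the column in more than one chunk):

```python
from io import StringIO

import pandas as pd

# A long column of integers followed by strings: the parser may infer
# int for early chunks and str for later ones
data = "col_1\n" + "\n".join(["1"] * 500000 + ["a"] * 500000)
mixed_df = pd.read_csv(StringIO(data))

# However the chunks were inferred, the column as a whole is reported
# as ``object`` because it holds values of more than one type
print(mixed_df['col_1'].dtype)
```

Running this typically also emits a ``DtypeWarning`` pointing at the offending column, which is the usual first hint that one of the fixes above is needed.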
 Naming and Using Columns
 ''''''''''''''''''''''''