@@ -444,41 +444,55 @@ individual columns:

 Reading in data with mixed dtypes and relying on ``pandas``
 to infer them is not recommended. In doing so, the parsing engine will
-loop over all the dtypes, trying to convert them to an actual
-type; if something breaks during that process, the engine will go to the
-next ``dtype`` and the data is left modified in place. For example,
+infer the dtypes for different chunks of the data, rather than the whole
+dataset at once. Consequently, you can end up with column(s) with mixed
+dtypes. For example,

 .. ipython:: python
+   :okwarning:

-   from collections import Counter
    df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
    df.to_csv('foo')
    mixed_df = pd.read_csv('foo')
-   Counter(mixed_df['col_1'].apply(lambda x: type(x)))
+   mixed_df['col_1'].apply(lambda x: type(x)).value_counts()
+   mixed_df['col_1'].dtype

 will result in `mixed_df` containing an ``int`` dtype for the first
 262,143 values, and ``str`` for others due to a problem during
-parsing. Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
+parsing. It is important to note that the overall column will be marked with a
+``dtype`` of ``object``, which is used for columns with mixed dtypes.
+
+Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
 contain only one ``dtype``. For instance, you could use the ``converters``
 argument of :func:`~pandas.read_csv`:

 .. ipython:: python

    fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
-   Counter(fixed_df1['col_1'].apply(lambda x: type(x)))
+   fixed_df1['col_1'].apply(lambda x: type(x)).value_counts()

 Or you could use the :func:`~pandas.to_numeric` function to coerce the
 dtypes after reading in the data,

 .. ipython:: python
+   :okwarning:

    fixed_df2 = pd.read_csv('foo')
    fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-   Counter(fixed_df2['col_1'].apply(lambda x: type(x)))
+   fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()

 which would convert all valid parsing to floats, leaving the invalid parsing
 as ``NaN``.

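Specifying the ``dtype`` argument of :func:`~pandas.read_csv` is a closely related
way to pin the column to a single type at parse time; the minimal sketch below
(``typed_df`` is just an illustrative name) reads ``'col_1'`` entirely as strings:

.. ipython:: python

   # Sketch: declaring the dtype up front skips inference for this column,
   # so every value in 'col_1' comes back as a string.
   typed_df = pd.read_csv('foo', dtype={'col_1': str})
   typed_df['col_1'].apply(lambda x: type(x)).value_counts()

Like the ``converters`` approach, this trades dtype inference for an explicit
declaration of the intended type.
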
+Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
+to ``False``. For example,
+
+.. ipython:: python
+
+   fixed_df3 = pd.read_csv('foo', low_memory=False)
+   fixed_df3['col_1'].apply(lambda x: type(x)).value_counts()
+
+which achieves a similar result, since the dtypes are then inferred for the
+dataset as a whole rather than for separate chunks.
 Naming and Using Columns
 ''''''''''''''''''''''''
