@@ -442,24 +442,24 @@ individual columns:

 .. note::

-   Reading in data with mixed dtypes and relying on ``pandas``
-   to infer them is not recommended. In doing so, the parsing engine will
-   infer the dtypes for different chunks of the data, rather than the whole
-   dataset at once. Consequently, you can end up with column(s) with mixed
-   dtypes. For example,
+   Reading in data with columns containing mixed dtypes and relying
+   on ``pandas`` to infer them is not recommended. In doing so, the
+   parsing engine will infer the dtypes for different chunks of the data,
+   rather than the whole dataset at once. Consequently, you can end up with
+   column(s) with mixed dtypes. For example,

    .. ipython:: python
       :okwarning:

       df = pd.DataFrame({'col_1': range(500000) + ['a', 'b'] + range(500000)})
       df.to_csv('foo')
       mixed_df = pd.read_csv('foo')
-      mixed_df['col_1'].apply(lambda x: type(x)).value_counts()
+      mixed_df['col_1'].apply(type).value_counts()
       mixed_df['col_1'].dtype

-   will result with `mixed_df` containing an ``int`` dtype for the first
-   262,143 values, and ``str`` for others due to a problem during
-   parsing. It is important to note that the overall column will be marked with a
+   will result in `mixed_df` containing an ``int`` dtype for certain chunks
+   of the column, and ``str`` for others due to a problem during parsing.
+   It is important to note that the overall column will be marked with a
    ``dtype`` of ``object``, which is used for columns with mixed dtypes.

 Fortunately, ``pandas`` offers a few ways to ensure that the column(s)
@@ -469,7 +469,7 @@ individual columns:
 .. ipython:: python

    fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
-   fixed_df1['col_1'].apply(lambda x: type(x)).value_counts()
+   fixed_df1['col_1'].apply(type).value_counts()

 Or you could use the :func:`~pandas.to_numeric` function to coerce the
 dtypes after reading in the data,
@@ -479,9 +479,9 @@ individual columns:

    fixed_df2 = pd.read_csv('foo')
    fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
-   fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()
+   fixed_df2['col_1'].apply(type).value_counts()

-which would convert all valid parsing to ints, leaving the invalid parsing
+which would convert all valid parsing to floats, leaving the invalid parsing
 as ``NaN``.

 Alternatively, you could set the ``low_memory`` argument of :func:`~pandas.read_csv`
@@ -490,9 +490,14 @@ individual columns:
 .. ipython:: python

    fixed_df3 = pd.read_csv('foo', low_memory=False)
-   fixed_df2['col_1'].apply(lambda x: type(x)).value_counts()
+   fixed_df3['col_1'].apply(type).value_counts()
+
+Ultimately, how you deal with reading in columns containing mixed dtypes
+depends on your specific needs. In the case above, if you wanted to ``NaN`` out
+the data anomalies, then :func:`~pandas.to_numeric` is probably your best option.
+However, if you wanted all of the data to be coerced, no matter the type, then
+using the ``converters`` argument of :func:`~pandas.read_csv` would certainly work.

-which achieves a similar result.
 Naming and Using Columns
 ''''''''''''''''''''''''

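For readers who want to run the revised example end to end, here is a minimal
standalone sketch of the behavior this change describes. It is an illustration
rather than part of the commit: the ``list(range(...))`` wrapping and the
``foo.csv`` file name are adaptations for Python 3 (the snippet in the diff uses
Python 2 list concatenation), while the ``converters``, :func:`~pandas.to_numeric`,
and ``low_memory`` usages follow the examples above::

   import pandas as pd

   # A column that cannot be parsed as a single dtype: mostly ints, two strings.
   # list(...) is required on Python 3; the snippet in the diff predates that.
   df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] + list(range(500000))})
   df.to_csv('foo.csv', index=False)

   # Default read: the parser infers dtypes chunk by chunk, so 'col_1' comes
   # back as object holding a mixture of int and str values.
   mixed_df = pd.read_csv('foo.csv')
   print(mixed_df['col_1'].apply(type).value_counts())

   # Fix 1: coerce every value to str at parse time via converters.
   fixed_df1 = pd.read_csv('foo.csv', converters={'col_1': str})
   print(fixed_df1['col_1'].apply(type).value_counts())

   # Fix 2: read first, then coerce to numeric; the two strings become NaN
   # and the column is upcast to float64.
   fixed_df2 = pd.read_csv('foo.csv')
   fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
   print(fixed_df2['col_1'].dtype)

   # Fix 3: low_memory=False makes the parser consider the whole column at
   # once, so inference is consistent across the file (here, every value is
   # read back as str because the column cannot be parsed as numeric).
   fixed_df3 = pd.read_csv('foo.csv', low_memory=False)
   print(fixed_df3['col_1'].apply(type).value_counts())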