@@ -440,6 +440,45 @@ individual columns:
 Specifying ``dtype`` with ``engine`` other than 'c' raises a
 ``ValueError``.
 
+.. note::
+
+    Reading in data with mixed dtypes and relying on ``pandas``
+    to infer them is not recommended. In doing so, the parsing engine will
+    loop over the candidate dtypes, trying to convert the data to each in
+    turn; if a conversion fails partway through, the engine moves on to the
+    next ``dtype`` and the partially converted data is left modified in
+    place. For example,
+
+    .. ipython:: python
+
+        from collections import Counter
+
+        # wrap range in list so the two halves can be concatenated with the strings
+        df = pd.DataFrame({'col_1': list(range(500000)) + ['a', 'b'] +
+                           list(range(500000))})
+        df.to_csv('foo')
+        mixed_df = pd.read_csv('foo')
+        Counter(mixed_df['col_1'].apply(lambda x: type(x)))
+
+    will result in ``mixed_df`` containing an ``int`` dtype for the first
+    262,143 values and ``str`` for the rest, due to a problem encountered
+    during parsing. Fortunately, ``pandas`` offers a few ways to ensure that
+    the column(s) contain only one ``dtype``. For instance, you could use the
+    ``converters`` argument of :func:`~pandas.read_csv`:
+
+    .. ipython:: python
+
+        fixed_df1 = pd.read_csv('foo', converters={'col_1': str})
+        Counter(fixed_df1['col_1'].apply(lambda x: type(x)))
+
+    Or you could use the :func:`~pandas.to_numeric` function to coerce the
+    dtypes after reading in the data,
+
+    .. ipython:: python
+
+        fixed_df2 = pd.read_csv('foo')
+        fixed_df2['col_1'] = pd.to_numeric(fixed_df2['col_1'], errors='coerce')
+        Counter(fixed_df2['col_1'].apply(lambda x: type(x)))
+
+    which would convert all valid parsing to floats, leaving the invalid
+    parsing as ``NaN`` (the ``NaN`` values force the column to a ``float``
+    dtype rather than ``int``).
+
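+    Alternatively, since the troublesome column is known ahead of time, you
+    could specify its ``dtype`` up front when reading, using the ``dtype``
+    argument described earlier in this section (a short sketch; the
+    ``fixed_df3`` name is only illustrative):
+
+    .. ipython:: python
+
+        # force col_1 to be read as strings so no inference takes place
+        fixed_df3 = pd.read_csv('foo', dtype={'col_1': str})
+        Counter(fixed_df3['col_1'].apply(lambda x: type(x)))
+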
 Naming and Using Columns
 ''''''''''''''''''''''''
 