@@ -72,123 +72,201 @@ CSV & Text files
7272----------------
7373
7474The two workhorse functions for reading text files (a.k.a. flat files) are
75- :func: `~pandas.io.parsers.read_csv ` and :func: `~pandas.io.parsers.read_table `.
76- They both use the same parsing code to intelligently convert tabular
77- data into a DataFrame object. See the :ref: `cookbook<cookbook.csv> `
78- for some advanced strategies
75+ :func: `read_csv ` and :func: `read_table `. They both use the same parsing code to
76+ intelligently convert tabular data into a DataFrame object. See the
77+ :ref: `cookbook<cookbook.csv> ` for some advanced strategies.
78+
79+ Parsing options
80+ '''''''''''''''
81+
82+ :func: `read_csv ` and :func: `read_table ` accept the following arguments:
83+
84+ Basic
85+ +++++
86+
87+ filepath_or_buffer : various
88+ Either a path to a file (a :class: `python:str `, :class: `python:pathlib.Path `,
89+ or :class: `py:py._path.local.LocalPath `), URL (including http, ftp, and S3
90+ locations), or any object with a ``read() `` method (such as an open file or
91+ :class: `~python:io.StringIO `).
92+ sep : str, defaults to ``',' `` for :func: `read_csv `, ``\t `` for :func: `read_table `
93+ Delimiter to use. If sep is ``None ``, the parser will try to determine
94+ the delimiter automatically by sniffing. Regular expressions are also
95+ accepted; using a regular expression forces the python parsing engine
96+ and causes quotes in the data to be ignored.
97+ delimiter : str, default ``None ``
98+ Alternative argument name for sep.
99+
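As a minimal sketch of the basic options above (hypothetical in-memory data, not from the docs themselves), both an explicit separator and automatic sniffing can be shown with a ``StringIO `` buffer:

```python
import io

import pandas as pd

# Pipe-delimited data held in memory; any object with a read() method works
data = "a|b|c\n1|2|3\n4|5|6\n"

# Explicit separator
df1 = pd.read_csv(io.StringIO(data), sep="|")

# sep=None sniffs the delimiter, which requires the python engine
df2 = pd.read_csv(io.StringIO(data), sep=None, engine="python")
```

Both calls should produce the same two-row, three-column DataFrame.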
100+ Column and Index Locations and Names
101+ ++++++++++++++++++++++++++++++++++++
102+
103+ header : int or list of ints, default ``'infer' ``
104+ Row number(s) to use as the column names, and the start of the data. Default
105+ behavior is as if ``header=0 `` if no ``names `` passed, otherwise as if
106+ ``header=None ``. Explicitly pass ``header=0 `` to be able to replace existing
107+ names. The header can be a list of ints that specify row locations for a
108+ multi-index on the columns e.g. ``[0,1,3] ``. Intervening rows that are not
109+ specified will be skipped (e.g. 2 in this example is skipped). Note that
110+ this parameter ignores commented lines and empty lines if
111+ ``skip_blank_lines=True ``, so header=0 denotes the first line of data
112+ rather than the first line of the file.
113+ names : array-like, default ``None ``
114+ List of column names to use. If file contains no header row, then you should
115+ explicitly pass ``header=None ``.
116+ index_col : int or sequence or ``False ``, default ``None ``
117+ Column to use as the row labels of the DataFrame. If a sequence is given, a
118+ MultiIndex is used. If you have a malformed file with delimiters at the end of
119+ each line, you might consider ``index_col=False `` to force pandas to *not * use
120+ the first column as the index (row names).
121+ usecols : array-like, default ``None ``
122+ Return a subset of the columns. Results in much faster parsing time and lower
123+ memory usage.
124+ squeeze : boolean, default ``False ``
125+ If the parsed data only contains one column then return a Series.
126+ prefix : str, default ``None ``
127+ Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
128+ mangle_dupe_cols : boolean, default ``True ``
129+ Duplicate columns will be renamed 'X', 'X.1', ..., 'X.N', rather than 'X'...'X'.
130+
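A short sketch of the column and index options above, again with made-up in-memory data:

```python
import io

import pandas as pd

data = "date,a,b\n20090101,1,4\n20090102,2,5\n"

# index_col=0 uses the first column as the row labels
df = pd.read_csv(io.StringIO(data), index_col=0)

# usecols reads only a subset of the columns
sub = pd.read_csv(io.StringIO(data), usecols=["date", "b"])
```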
131+ General Parsing Configuration
132+ +++++++++++++++++++++++++++++
133+
134+ dtype : Type name or dict of column -> type, default ``None ``
135+ Data type for data or columns. E.g. ``{'a': np.float64, 'b': np.int32} ``
136+ (unsupported with ``engine='python' ``). Use `str ` or `object ` to preserve and
137+ not interpret dtype.
138+ engine : {``'c' ``, ``'python' ``}
139+ Parser engine to use. The C engine is faster while the python engine is
140+ currently more feature-complete.
141+ converters : dict, default ``None ``
142+ Dict of functions for converting values in certain columns. Keys can either be
143+ integers or column labels.
144+ true_values : list, default ``None ``
145+ Values to consider as ``True ``.
146+ false_values : list, default ``None ``
147+ Values to consider as ``False ``.
148+ skipinitialspace : boolean, default ``False ``
149+ Skip spaces after delimiter.
150+ skiprows : list-like or integer, default ``None ``
151+ Line numbers to skip (0-indexed) or number of lines to skip (int) at the start
152+ of the file.
153+ skipfooter : int, default ``0 ``
154+ Number of lines at bottom of file to skip (unsupported with engine='c').
155+ nrows : int, default ``None ``
156+ Number of rows of file to read. Useful for reading pieces of large files.
157+
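To illustrate a few of the parsing-configuration options above together (the data and the converter choice are illustrative assumptions):

```python
import io

import numpy as np
import pandas as pd

data = "a,b\n1,x\n2,y\n3,z\n"

# Force column 'a' to float via dtype, transform column 'b' via a
# converter, and stop after two rows with nrows
df = pd.read_csv(
    io.StringIO(data),
    dtype={"a": np.float64},
    converters={"b": str.upper},
    nrows=2,
)
```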
158+ NA and Missing Data Handling
159+ ++++++++++++++++++++++++++++
160+
161+ na_values : str, list-like or dict, default ``None ``
162+ Additional strings to recognize as NA/NaN. If dict passed, specific per-column
163+ NA values. By default the following values are interpreted as NaN:
164+ ``'-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'NA',
165+ '#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan', '' ``.
166+ keep_default_na : boolean, default ``True ``
167+ If na_values are specified and keep_default_na is ``False `` the default NaN
168+ values are overridden, otherwise they're appended to.
169+ na_filter : boolean, default ``True ``
170+ Detect missing value markers (empty strings and the value of na_values). In
171+ data without any NAs, passing ``na_filter=False `` can improve the performance
172+ of reading a large file.
173+ verbose : boolean, default ``False ``
174+ Indicate number of NA values placed in non-numeric columns.
175+ skip_blank_lines : boolean, default ``True ``
176+ If ``True ``, skip over blank lines rather than interpreting as NaN values.
177+
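A minimal sketch of custom NA handling, assuming a file that uses the (made-up) sentinel ``'MISSING' ``:

```python
import io

import pandas as pd

data = "a,b\n1,MISSING\n2,3\n"

# 'MISSING' is recognized as NaN in addition to the default sentinels,
# since keep_default_na is True by default
df = pd.read_csv(io.StringIO(data), na_values=["MISSING"])
```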
178+ Datetime Handling
179+ +++++++++++++++++
180+
181+ parse_dates : boolean or list of ints or names or list of lists or dict, default ``False ``.
182+ - If ``True `` -> try parsing the index.
183+ - If ``[1, 2, 3] `` -> try parsing columns 1, 2, 3 each as a separate date
184+ column.
185+ - If ``[[1, 3]] `` -> combine columns 1 and 3 and parse as a single date
186+ column.
187+ - If ``{'foo' : [1, 3]} `` -> parse columns 1, 3 as date and call result 'foo'.
188+ A fast-path exists for iso8601-formatted dates.
189+ infer_datetime_format : boolean, default ``False ``
190+ If ``True `` and parse_dates is enabled for a column, attempt to infer the
191+ datetime format to speed up the processing.
192+ keep_date_col : boolean, default ``False ``
193+ If ``True `` and parse_dates specifies combining multiple columns then keep the
194+ original columns.
195+ date_parser : function, default ``None ``
196+ Function to use for converting a sequence of string columns to an array of
197+ datetime instances. The default uses ``dateutil.parser.parser `` to do the
198+ conversion. Pandas will try to call date_parser in three different ways,
199+ advancing to the next if an exception occurs: 1) Pass one or more arrays (as
200+ defined by parse_dates) as arguments; 2) concatenate (row-wise) the string
201+ values from the columns defined by parse_dates into a single array and pass
202+ that; and 3) call date_parser once for each row using one or more strings
203+ (corresponding to the columns defined by parse_dates) as arguments.
204+ dayfirst : boolean, default ``False ``
205+ DD/MM format dates, international and European format.
206+
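The simplest ``parse_dates `` usage, sketched with hypothetical ISO-formatted input:

```python
import io

import pandas as pd

data = "date,value\n2009-01-01,1\n2009-01-02,2\n"

# Parse the named column as datetime64; ISO 8601 dates hit the fast path
df = pd.read_csv(io.StringIO(data), parse_dates=["date"])
```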
207+ Iteration
208+ +++++++++
209+
210+ iterator : boolean, default ``False ``
211+ Return `TextFileReader ` object for iteration or getting chunks with
212+ ``get_chunk() ``.
213+ chunksize : int, default ``None ``
214+ Return `TextFileReader ` object for iteration. See :ref: `iterating and chunking
215+ <io.chunking>` below.
216+
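Chunked reading can be sketched as follows; with 10 data rows and ``chunksize=4 `` the reader yields DataFrames of 4, 4, and 2 rows:

```python
import io

import pandas as pd

# Build 10 rows of sample data
data = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(10))

# chunksize returns a reader that yields DataFrames instead of one frame
chunks = list(pd.read_csv(io.StringIO(data), chunksize=4))
```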
217+ Quoting, Compression, and File Format
218+ +++++++++++++++++++++++++++++++++++++
219+
220+ compression : {``'infer' ``, ``'gzip' ``, ``'bz2' ``, ``None ``}, default ``'infer' ``
221+ For on-the-fly decompression of on-disk data. If 'infer', then use gzip or bz2
222+ if filepath_or_buffer is a string ending in '.gz' or '.bz2', respectively, and
223+ no decompression otherwise. Set to ``None `` for no decompression.
224+ thousands : str, default ``None ``
225+ Thousands separator.
226+ decimal : str, default ``'.' ``
227+ Character to recognize as decimal point. E.g. use ``',' `` for European data.
228+ lineterminator : str (length 1), default ``None ``
229+ Character to break file into lines. Only valid with C parser.
230+ quotechar : str (length 1), default ``'"' ``
231+ The character used to denote the start and end of a quoted item. Quoted items
232+ can include the delimiter and it will be ignored.
233+ quoting : int or ``csv.QUOTE_* `` instance, default ``None ``
234+ Control field quoting behavior per ``csv.QUOTE_* `` constants. Use one of
235+ ``QUOTE_MINIMAL `` (0), ``QUOTE_ALL `` (1), ``QUOTE_NONNUMERIC `` (2) or
236+ ``QUOTE_NONE `` (3). Default (``None ``) results in ``QUOTE_MINIMAL ``
237+ behavior.
238+ escapechar : str (length 1), default ``None ``
239+ One-character string used to escape delimiter when quoting is ``QUOTE_NONE ``.
240+ comment : str, default ``None ``
241+ Indicates remainder of line should not be parsed. If found at the beginning of
242+ a line, the line will be ignored altogether. This parameter must be a single
243+ character. Like empty lines (as long as ``skip_blank_lines=True ``), fully
244+ commented lines are ignored by the parameter `header ` but not by `skiprows `.
245+ For example, if ``comment='#' ``, parsing '#empty\\na,b,c\\n1,2,3' with
246+ `header=0 ` will result in 'a,b,c' being treated as the header.
247+ encoding : str, default ``None ``
248+ Encoding to use when reading/writing (e.g. ``'utf-8' ``). `List of
249+ Python standard encodings
250+ <https://docs.python.org/3/library/codecs.html#standard-encodings> `_.
251+ dialect : str or :class: `python:csv.Dialect ` instance, default ``None ``
252+ If ``None `` defaults to Excel dialect. Ignored if sep longer than 1 char. See
253+ :class: `python:csv.Dialect ` documentation for more details.
254+ tupleize_cols : boolean, default ``False ``
255+ Leave a list of tuples on columns as is (default is to convert to a MultiIndex
256+ on the columns).
257+
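Several of the quoting and format options above can be combined in one hedged sketch (the data is invented): a comment line is skipped entirely, a quoted field keeps its embedded delimiter, and a thousands separator is stripped from a numeric field:

```python
import io

import pandas as pd

# First line is a comment; "a,b" is a quoted field containing the
# delimiter; "1,000" uses ',' as a thousands separator
data = '# generated file\nlabel,big\n"a,b","1,000"\n'

df = pd.read_csv(io.StringIO(data), comment="#", thousands=",")
```

With the comment line ignored, 'label,big' becomes the header, and '1,000' parses as the integer 1000.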
258+ Error Handling
259+ ++++++++++++++
79260
80- They can take a number of arguments:
81-
82- - ``filepath_or_buffer ``: Either a path to a file (a :class: `python:str `,
83- :class: `python:pathlib.Path `, or :class: `py:py._path.local.LocalPath `), URL
84- (including http, ftp, and S3 locations), or any object with a ``read ``
85- method (such as an open file or :class: `~python:io.StringIO `).
86- - ``sep `` or ``delimiter ``: A delimiter / separator to split fields
87- on. With ``sep=None ``, ``read_csv `` will try to infer the delimiter
88- automatically in some cases by "sniffing".
89- The separator may be specified as a regular expression; for instance
90- you may use '\|\\ s*' to indicate a pipe plus arbitrary whitespace, but ignores quotes in the data when a regex is used in separator.
91- - ``delim_whitespace ``: Parse whitespace-delimited (spaces or tabs) file
92- (much faster than using a regular expression)
93- - ``compression ``: decompress ``'gzip' `` and ``'bz2' `` formats on the fly.
94- Set to ``'infer' `` (the default) to guess a format based on the file
95- extension.
96- - ``dialect ``: string or :class: `python:csv.Dialect ` instance to expose more
97- ways to specify the file format
98- - ``dtype ``: A data type name or a dict of column name to data type. If not
99- specified, data types will be inferred. (Unsupported with
100- ``engine='python' ``)
101- - ``header ``: row number(s) to use as the column names, and the start of the
102- data. Defaults to 0 if no ``names `` passed, otherwise ``None ``. Explicitly
103- pass ``header=0 `` to be able to replace existing names. The header can be
104- a list of integers that specify row locations for a multi-index on the columns
105- E.g. [0,1,3]. Intervening rows that are not specified will be
106- skipped (e.g. 2 in this example are skipped). Note that this parameter
107- ignores commented lines and empty lines if ``skip_blank_lines=True `` (the default),
108- so header=0 denotes the first line of data rather than the first line of the file.
109- - ``skip_blank_lines ``: whether to skip over blank lines rather than interpreting
110- them as NaN values
111- - ``skiprows ``: A collection of numbers for rows in the file to skip. Can
112- also be an integer to skip the first ``n `` rows
113- - ``index_col ``: column number, column name, or list of column numbers/names,
114- to use as the ``index `` (row labels) of the resulting DataFrame. By default,
115- it will number the rows without using any column, unless there is one more
116- data column than there are headers, in which case the first column is taken
117- as the index.
118- - ``names ``: List of column names to use as column names. To replace header
119- existing in file, explicitly pass ``header=0 ``.
120- - ``na_values ``: optional string or list of strings to recognize as NaN (missing
121- values), either in addition to or in lieu of the default set.
122- - ``true_values ``: list of strings to recognize as ``True ``
123- - ``false_values ``: list of strings to recognize as ``False ``
124- - ``keep_default_na ``: whether to include the default set of missing values
125- in addition to the ones specified in ``na_values ``
126- - ``parse_dates ``: if True then index will be parsed as dates
127- (False by default). You can specify more complicated options to parse
128- a subset of columns or a combination of columns into a single date column
129- (list of ints or names, list of lists, or dict)
130- [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column
131- [[1, 3]] -> combine columns 1 and 3 and parse as a single date column
132- {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
133- - ``keep_date_col ``: if True, then date component columns passed into
134- ``parse_dates `` will be retained in the output (False by default).
135- - ``date_parser ``: function to use to parse strings into datetime
136- objects. If ``parse_dates `` is True, it defaults to the very robust
137- ``dateutil.parser ``. Specifying this implicitly sets ``parse_dates `` as True.
138- You can also use functions from community supported date converters from
139- date_converters.py
140- - ``dayfirst ``: if True then uses the DD/MM international/European date format
141- (This is False by default)
142- - ``thousands ``: specifies the thousands separator. If not None, this character will
143- be stripped from numeric dtypes. However, if it is the first character in a field,
144- that column will be imported as a string. In the PythonParser, if not None,
145- then parser will try to look for it in the output and parse relevant data to numeric
146- dtypes. Because it has to essentially scan through the data again, this causes a
147- significant performance hit so only use if necessary.
148- - ``lineterminator `` : string (length 1), default ``None ``, Character to break file into lines. Only valid with C parser
149- - ``quotechar `` : string, The character to used to denote the start and end of a quoted item.
150- Quoted items can include the delimiter and it will be ignored.
151- - ``quoting `` : int,
152- Controls whether quotes should be recognized. Values are taken from `csv.QUOTE_* ` values.
153- Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL,
154- QUOTE_NONNUMERIC and QUOTE_NONE, respectively.
155- - ``skipinitialspace `` : boolean, default ``False ``, Skip spaces after delimiter
156- - ``escapechar `` : string, to specify how to escape quoted data
157- - ``comment ``: Indicates remainder of line should not be parsed. If found at the
158- beginning of a line, the line will be ignored altogether. This parameter
159- must be a single character. Like empty lines, fully commented lines
160- are ignored by the parameter `header ` but not by `skiprows `. For example,
161- if comment='#', parsing '#empty\n 1,2,3\n a,b,c' with `header=0 ` will
162- result in '1,2,3' being treated as the header.
163- - ``nrows ``: Number of rows to read out of the file. Useful to only read a
164- small portion of a large file
165- - ``iterator ``: If True, return a ``TextFileReader `` to enable reading a file
166- into memory piece by piece
167- - ``chunksize ``: An number of rows to be used to "chunk" a file into
168- pieces. Will cause an ``TextFileReader `` object to be returned. More on this
169- below in the section on :ref: `iterating and chunking <io.chunking >`
170- - ``skip_footer ``: number of lines to skip at bottom of file (default 0)
171- (Unsupported with ``engine='c' ``)
172- - ``converters ``: a dictionary of functions for converting values in certain
173- columns, where keys are either integers or column labels
174- - ``encoding ``: a string representing the encoding to use for decoding
175- unicode data, e.g. ``'utf-8` `` or ``'latin-1' ``. `Full list of Python
176- standard encodings
177- <https://docs.python.org/3/library/codecs.html#standard-encodings> `_
178- - ``verbose ``: show number of NA values inserted in non-numeric columns
179- - ``squeeze ``: if True then output with only one column is turned into Series
180- - ``error_bad_lines ``: if False then any lines causing an error will be skipped :ref: `bad lines <io.bad_lines >`
181- - ``usecols ``: a subset of columns to return, results in much faster parsing
182- time and lower memory usage.
183- - ``mangle_dupe_cols ``: boolean, default True, then duplicate columns will be specified
184- as 'X.0'...'X.N', rather than 'X'...'X'
185- - ``tupleize_cols ``: boolean, default False, if False, convert a list of tuples
186- to a multi-index of columns, otherwise, leave the column index as a list of
187- tuples
188- - ``float_precision `` : string, default None. Specifies which converter the C
189- engine should use for floating-point values. The options are None for the
190- ordinary converter, 'high' for the high-precision converter, and
191- 'round_trip' for the round-trip converter.
261+ error_bad_lines : boolean, default ``True ``
262+ Lines with too many fields (e.g. a csv line with too many commas) will by
263+ default cause an exception to be raised, and no DataFrame will be returned. If
264+ ``False ``, then these "bad lines" will be dropped from the DataFrame that is
265+ returned (only valid with C parser). See :ref: `bad lines <io.bad_lines >`
266+ below.
267+ warn_bad_lines : boolean, default ``True ``
268+ If error_bad_lines is ``False ``, and warn_bad_lines is ``True ``, a warning for
269+ each "bad line" will be output (only valid with C parser).
192270
193271.. ipython :: python
194272 :suppress:
@@ -500,11 +578,10 @@ Date Handling
500578Specifying Date Columns
501579+++++++++++++++++++++++
502580
503- To better facilitate working with datetime data,
504- :func: `~pandas.io.parsers.read_csv ` and :func: `~pandas.io.parsers.read_table `
505- uses the keyword arguments ``parse_dates `` and ``date_parser `` to allow users
506- to specify a variety of columns and date/time formats to turn the input text
507- data into ``datetime `` objects.
581+ To better facilitate working with datetime data, :func: `read_csv ` and
582+ :func: `read_table ` use the keyword arguments ``parse_dates `` and ``date_parser ``
583+ to allow users to specify a variety of columns and date/time formats to turn the
584+ input text data into ``datetime `` objects.
508585
509586The simplest case is to just pass in ``parse_dates=True ``:
510587
@@ -929,10 +1006,9 @@ should pass the ``escapechar`` option:
9291006Files with Fixed Width Columns
9301007''''''''''''''''''''''''''''''
9311008
932- While ``read_csv `` reads delimited data, the :func: `~pandas.io.parsers.read_fwf `
933- function works with data files that have known and fixed column widths.
934- The function parameters to ``read_fwf `` are largely the same as `read_csv ` with
935- two extra parameters:
1009+ While ``read_csv `` reads delimited data, the :func: `read_fwf ` function works
1010+ with data files that have known and fixed column widths. The function parameters
1011+ to ``read_fwf `` are largely the same as `read_csv ` with two extra parameters:
9361012
9371013 - ``colspecs ``: A list of pairs (tuples) giving the extents of the
9381014 fixed-width fields of each line as half-open intervals (i.e., [from, to[ ).