[SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document #20666
Changes from all commits
4ad330b
4400cf2
1d03d3b
4f9b148
654a59b
daa326d
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -209,13 +209,13 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, | |
| :param mode: allows a mode for dealing with corrupt records during parsing. If None is | ||
| set, it uses the default value, ``PERMISSIVE``. | ||
|
|
||
| - * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \ | ||
| - record, and puts the malformed string into a field configured by \ | ||
| - ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \ | ||
| - a string type field named ``columnNameOfCorruptRecord`` in an user-defined \ | ||
| - schema. If a schema does not have the field, it drops corrupt records during \ | ||
| - parsing. When inferring a schema, it implicitly adds a \ | ||
| - ``columnNameOfCorruptRecord`` field in an output schema. | ||
| + * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \ | ||
| + into a field configured by ``columnNameOfCorruptRecord``, and sets other \ | ||
| + fields to ``null``. To keep corrupt records, an user can set a string type \ | ||
| + field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \ | ||
| + schema does not have the field, it drops corrupt records during parsing. \ | ||
| + When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \ | ||
| + field in an output schema. | ||
| * ``DROPMALFORMED`` : ignores the whole corrupted records. | ||
| * ``FAILFAST`` : throws an exception when it meets corrupted records. | ||
|
|
||
|
|
@@ -393,13 +393,15 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non | |
| :param mode: allows a mode for dealing with corrupt records during parsing. If None is | ||
| set, it uses the default value, ``PERMISSIVE``. | ||
|
|
||
| - * ``PERMISSIVE`` : sets other fields to ``null`` when it meets a corrupted \ | ||
| - record, and puts the malformed string into a field configured by \ | ||
| - ``columnNameOfCorruptRecord``. To keep corrupt records, an user can set \ | ||
| - a string type field named ``columnNameOfCorruptRecord`` in an \ | ||
| - user-defined schema. If a schema does not have the field, it drops corrupt \ | ||
| - records during parsing. When a length of parsed CSV tokens is shorter than \ | ||
| - an expected length of a schema, it sets `null` for extra fields. | ||
| + * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \ | ||
| + into a field configured by ``columnNameOfCorruptRecord``, and sets other \ | ||
| + fields to ``null``. To keep corrupt records, an user can set a string type \ | ||
|
Contributor
we can't say
Member
Author
I think this is talking about a corrupted record, not a record with fewer/more tokens. If the CSV parser fails to parse a record, all other fields are set to null.
Contributor
Ah, I think we need to explain that for CSV, a record with fewer/more tokens is not a malformed record.
Member
Author
Ok. Added. |
||
| + field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \ | ||
| + schema does not have the field, it drops corrupt records during parsing. \ | ||
| + A record with less/more tokens than schema is not a corrupted record to CSV. \ | ||
| + When it meets a record having fewer tokens than the length of the schema, \ | ||
| + sets ``null`` to extra fields. When the record has more tokens than the \ | ||
| + length of the schema, it drops extra tokens. | ||
| * ``DROPMALFORMED`` : ignores the whole corrupted records. | ||
| * ``FAILFAST`` : throws an exception when it meets corrupted records. | ||
|
|
||
|
|
||
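As a rough illustration of the CSV wording in the diff above, the token-count rule can be modeled in plain Python. This is only a sketch and not Spark's parser: `fit_tokens_to_schema` is a hypothetical helper introduced here, and `None` stands in for ``null``.

```python
def fit_tokens_to_schema(tokens, schema_fields):
    """Hypothetical model of the documented CSV rule; not Spark code."""
    n = len(schema_fields)
    # Fewer tokens than schema fields: the missing trailing fields become None.
    padded = list(tokens) + [None] * (n - len(tokens))
    # More tokens than schema fields: the extra tokens are dropped.
    return dict(zip(schema_fields, padded[:n]))


fields = ["a", "b", "c"]
short_row = fit_tokens_to_schema(["1", "2"], fields)
long_row = fit_tokens_to_schema(["1", "2", "3", "4"], fields)
```

In this model neither row is treated as corrupted, which matches the sentence added in the review thread: a token-count mismatch alone does not make a CSV record malformed.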
I think we should say `it implicitly adds ... if a corrupted record is found` while we are here? I think it only adds `columnNameOfCorruptRecord` when it meets a corrupted record during schema inference.

When users set a string type field named `columnNameOfCorruptRecord` in a user-defined schema, I think the field is still added even if there is no corrupted record. Or did I misread this sentence?

Ah, I thought this: describes schema inference, because it adds the `columnNameOfCorruptRecord` column if a malformed record was found during schema inference. I mean ..: but yes, I think I misread it. Here we already describe things mainly about malformed records.
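
To make the three modes in the docstring concrete, here is a minimal pure-Python model for line-delimited JSON with a user-defined schema. It is a sketch under stated assumptions, not Spark's implementation: `parse_json_lines` and its parameters are hypothetical, schema inference is not modeled, and the sketch reads "drops corrupt records" as discarding the malformed string while still emitting a row of nulls. That reading is an assumption and should be checked against Spark itself.

```python
import json


def parse_json_lines(lines, schema_fields, mode="PERMISSIVE",
                     corrupt_column="_corrupt_record"):
    """Hypothetical model of the documented modes; not Spark code."""
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError("unknown mode: " + mode)
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("not a JSON object")
            row = {name: record.get(name) for name in schema_fields}
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            if mode == "FAILFAST":
                raise  # throw on the first corrupted record
            if mode == "DROPMALFORMED":
                continue  # ignore the whole corrupted record
            # PERMISSIVE: every regular field is set to None (null).
            row = {name: None for name in schema_fields}
            # The malformed string is kept only if the schema has the column
            # (assumption: otherwise the string is discarded, the row stays).
            if corrupt_column in schema_fields:
                row[corrupt_column] = line
        rows.append(row)
    return rows


lines = ['{"a": 1}', '{"a": ']
with_column = parse_json_lines(lines, ["a", "_corrupt_record"])
without_column = parse_json_lines(lines, ["a"])
dropped = parse_json_lines(lines, ["a"], mode="DROPMALFORMED")
```

Here `with_column` keeps the malformed text of the second line in `_corrupt_record`, while `without_column` has no place to store it.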