[SPARK-23448][SQL] Clarify JSON and CSV parser behavior in document #20666
Conversation
cc @cloud-fan @HyukjinKwon To keep the CSV reader behavior for corrupted records, we don't bother refactoring. But we should update the document and explicitly disable partial results for corrupted records.
Test build #87641 has finished for PR 20666 at commit

retest this please.
    * in an user-defined schema. If a schema does not have the field, it drops corrupt records
    * during parsing. When a length of parsed CSV tokens is shorter than an expected length
    * of a schema, it sets `null` for extra fields.</li>
    * during parsing. It supports partial result for the records just with less or more tokens
I think there are the same instances to update in DataStreamReader, readwriter.py and streaming.py too.
Yes. Will update accordingly.
Test build #87642 has finished for PR 20666 at commit

Test build #87644 has finished for PR 20666 at commit
HyukjinKwon left a comment:
LGTM
python/pyspark/sql/readwriter.py (outdated):

    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    It supports partial result for the records just with less or more tokens \
    than the schema. When it meets a malformed record whose parsed tokens is \
How about changing "a malformed record whose parsed tokens is" -> "a malformed record having the length of parsed tokens shorter than the length of a schema"?
Ok.
    fields to ``null``. To keep corrupt records, an user can set a string type \
    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
I think we should say it implicitly adds ... if a corrupted record is found while we are here? I think it only adds `columnNameOfCorruptRecord` when it meets a corrupted record during schema inference.
When users set a string type field named columnNameOfCorruptRecord in a user-defined schema, even with no corrupted records, I think the field is still added. Or did I misread this sentence?
Ah, I thought this:

    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` field in an output schema.

describes schema inference, because it adds the columnNameOfCorruptRecord column if a malformed record was found during schema inference. I mean:

    scala> spark.read.json(Seq("""{"a": 1}""", """{"a":""").toDS).printSchema()
    root
     |-- _corrupt_record: string (nullable = true)
     |-- a: long (nullable = true)

    scala> spark.read.json(Seq("""{"a": 1}""").toDS).printSchema()
    root
     |-- a: long (nullable = true)

but yes, I think I misread it. Here we describe things mainly about malformed records already.
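A rough plain-Python analogue of the inference behavior shown in the Scala snippet above: the corrupt-record column appears in the inferred schema only when at least one malformed record is found during inference. This is a hypothetical sketch (the `infer_schema` helper is invented for illustration), not Spark code.

```python
import json

def infer_schema(lines, corrupt_col="_corrupt_record"):
    """Collect field names from JSON lines; prepend the corrupt-record
    column only if some line fails to parse, mimicking the behavior
    discussed above."""
    fields, saw_malformed = set(), False
    for line in lines:
        try:
            fields.update(json.loads(line).keys())
        except json.JSONDecodeError:
            saw_malformed = True
    schema = sorted(fields)
    return ([corrupt_col] + schema) if saw_malformed else schema

print(infer_schema(['{"a": 1}', '{"a":']))  # ['_corrupt_record', 'a']
print(infer_schema(['{"a": 1}']))           # ['a']
```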
python/pyspark/sql/readwriter.py (outdated):

    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    field in an output schema. It doesn't support partial results. Even just one \
It's trivial, but how about we avoid an abbreviation like "doesn't"? It's usually what I do for docs, although I am not sure if it actually matters.
Ok.
python/pyspark/sql/readwriter.py (outdated):

    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
    field in an output schema. It does not support partial results. Even just one \
I think we can drop the last sentence. The doc is pretty clear, saying "and sets other fields to null".
    an expected length of a schema, it sets `null` for extra fields.
    * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
      into a field configured by ``columnNameOfCorruptRecord``, and sets other \
      fields to ``null``. To keep corrupt records, an user can set a string type \
We can't say "and sets other fields to null", as it's not the case for CSV.
I think this is talking about a corrupted record, not a record with less/more tokens. If CSV parser fails to parse a record, all other fields are set to null.
Ah, I think we need to explain that: for CSV, a record with less/more tokens is not a malformed record.
Ok. Added.
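The distinction the reviewers settle on above can be sketched in plain Python. This is an illustrative model, not Spark's actual CSV parser: in PERMISSIVE mode, a record with fewer or more tokens than the schema is not malformed and yields a partial result, while a record the parser fails on sets every field to null. The helper name `permissive_csv_row` is hypothetical.

```python
def permissive_csv_row(tokens, schema_len, parse_ok=True):
    """Return a row of length schema_len following the PERMISSIVE-mode
    rules for CSV described in this thread."""
    if not parse_ok:
        # Truly malformed record: all fields become null.
        return [None] * schema_len
    if len(tokens) < schema_len:
        # Fewer tokens than the schema: pad missing fields with null.
        return tokens + [None] * (schema_len - len(tokens))
    # More tokens than the schema: drop the extras.
    return tokens[:schema_len]

print(permissive_csv_row(["a", "1"], 3))            # ['a', '1', None]
print(permissive_csv_row(["a", "1", "x", "y"], 3))  # ['a', '1', 'x']
print(permissive_csv_row(["garbage"], 3, parse_ok=False))  # [None, None, None]
```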
Test build #87663 has finished for PR 20666 at commit

Test build #87689 has finished for PR 20666 at commit
retest this please.

> On Tue, Feb 27, 2018, 1:43 PM UCB AMPLab wrote:
> Test FAILed.
> Refer to this link for build results (access rights to CI server needed):
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87689/
retest this please

Test build #87699 has finished for PR 20666 at commit
python/pyspark/sql/readwriter.py (outdated):

    fields to ``null``. To keep corrupt records, an user can set a string type \
    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    A record with less/more tokens than schema is not a corrupted record. \
.. not a corrupted record to CSV.
python/pyspark/sql/readwriter.py (outdated):

    It supports partial result for such records. When it meets a record having \
    the length of parsed tokens shorter than the length of a schema, it sets \
    ``null`` for extra fields. When a length of tokens is longer than a schema, \
    it drops extra tokens.
When it meets a record having fewer tokens than the length of the schema, set ``null`` to extra fields.
When the record has more tokens than the length of the schema, it drops extra tokens.
python/pyspark/sql/readwriter.py (outdated):

    field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
    schema does not have the field, it drops corrupt records during parsing. \
    A record with less/more tokens than schema is not a corrupted record. \
    It supports partial result for such records. When it meets a record having \
"It supports partial result for such records." — this doesn't look very useful; I think the following sentences explain this case well.
python/pyspark/sql/readwriter.py (outdated):

    ``columnNameOfCorruptRecord`` field in an output schema.
    * ``PERMISSIVE`` : when it meets a corrupted record, puts the malformed string \
      into a field configured by ``columnNameOfCorruptRecord``, and sets other \
      fields to ``null``. It does not support partial results. To keep corrupt \
"It does not support partial results." — I think we don't need to mention this for JSON.
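For contrast with CSV, the JSON PERMISSIVE behavior being documented here can be modeled in plain Python: no partial results — a malformed record puts its raw text into the corrupt-record column and nulls every data field. This is an illustrative sketch with a hypothetical helper `permissive_json_row`, not Spark's implementation.

```python
import json

def permissive_json_row(line, fields, corrupt_col="_corrupt_record"):
    """Parse one JSON line under PERMISSIVE-mode rules as described in
    this thread: on failure, null all data fields and keep the raw text."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        row = {f: None for f in fields}
        row[corrupt_col] = line
        return row
    row = {f: obj.get(f) for f in fields}
    row[corrupt_col] = None
    return row

print(permissive_json_row('{"a": 1, "b": 2}', ["a", "b"]))
print(permissive_json_row('{"a": 1', ["a", "b"]))  # malformed: data fields null
```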
Test build #87711 has finished for PR 20666 at commit

Test build #87720 has finished for PR 20666 at commit
Force-pushed from fe260c9 to daa326d.
LGTM

Test build #87721 has finished for PR 20666 at commit

retest this please.

Test build #87735 has finished for PR 20666 at commit
## What changes were proposed in this pull request?

Clarify JSON and CSV reader behavior in document.

JSON doesn't support partial results for corrupted records.
CSV only supports partial results for the records with more or less tokens.

## How was this patch tested?

Pass existing tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes #20666 from viirya/SPARK-23448-2.

(cherry picked from commit b14993e)
Signed-off-by: hyukjinkwon <[email protected]>
Merged to master and branch-2.3.

Thanks @HyukjinKwon @cloud-fan!
Hi guys, I am a Spark user. Could I get that data back in Spark 2+? @viirya

To be specific, the JSON file (Sanity4.json) is this code:

Then in Spark 1.6 the result is root, but in Spark 2.2 the result is root
That's not related to this change. The issue itself seems to be a behaviour change between 1.6 and 2.x in treating an empty string as null or not for double and float, which is rather a corner case and which does look like an issue. Let me try to fix it while I'm here.
Hi, thanks for the help. So can I get some other suggestions? @HyukjinKwon Thanks a lot.