[SPARK-12057] [SQL] Prevent failure on corrupt JSON records #10288

yhuai · 2015-12-14T05:01:46Z

This PR makes the JSON parser and schema inference handle more cases where records cannot be parsed. It is based on #10043. The last commit fixes the failing test and updates the schema inference logic.

Regarding the schema inference change, if we have something like

{"f1":1}
[1,2,3]

Originally, we would get a DataFrame without any columns.
After this change, we get a DataFrame with columns f1 and _corrupt_record. For the second row, the raw text [1,2,3] is stored as the value of _corrupt_record.
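To make the described behavior concrete, here is a minimal plain-Python sketch. It is not Spark's implementation and does not use the Spark API; the helper names (`parse_line`, `infer_columns`, `to_rows`) are hypothetical, and only the `_corrupt_record` column name and the two input lines come from the description above.

```python
import json

CORRUPT_COLUMN = "_corrupt_record"  # column name taken from the PR description


def parse_line(line):
    """Return a dict for a JSON object; wrap anything else as a corrupt record."""
    try:
        value = json.loads(line)
    except ValueError:
        # Not valid JSON at all: keep the raw text.
        return {CORRUPT_COLUMN: line}
    if isinstance(value, dict):
        return value
    # Valid JSON but not an object (e.g. a top-level array): keep the raw text.
    return {CORRUPT_COLUMN: line}


def infer_columns(lines):
    """Union of field names across all rows, in first-seen order."""
    columns = []
    for line in lines:
        for name in parse_line(line):
            if name not in columns:
                columns.append(name)
    return columns


def to_rows(lines):
    """Return the inferred columns and one tuple per input line."""
    columns = infer_columns(lines)
    rows = [tuple(parse_line(line).get(c) for c in columns) for line in lines]
    return columns, rows


lines = ['{"f1":1}', '[1,2,3]']
columns, rows = to_rows(lines)
print(columns)  # ['f1', '_corrupt_record']
print(rows)     # [(1, None), (None, '[1,2,3]')]
```

The point of the sketch is that a record which is not a JSON object still contributes a column during inference, instead of causing the whole schema to come out empty.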

When merging this PR, please make sure that the author is @simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

yhuai · 2015-12-14T05:02:56Z

@simplyianm @srowen I created this PR based on #10043. Please see yhuai@d8722bb for my change. This PR should fix the test. It also makes schema inference more robust to records that we cannot parse.

SparkQA · 2015-12-14T06:37:55Z

Test build #47641 has finished for PR 10288 at commit d8722bb.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-14T08:57:17Z

Test build #47646 has finished for PR 10288 at commit 9971300.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2015-12-14T23:15:38Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

Should the other error cases be handled this way too? The String -> float/double conversions.

yhuai · 2015-12-16

Yeah, that is a good point. I added the change.
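As a rough illustration of the review point, the sketch below routes a failed String -> double conversion to the corrupt-record column rather than failing the whole read. It is plain Python, not the JacksonParser code; `RecordParseError`, `string_to_double`, `convert_record`, and the set of accepted special strings are assumptions made for the example.

```python
import math


class RecordParseError(RuntimeError):
    """Hypothetical error type: a value could not be converted to the target type."""


# Assumed for illustration: only these strings are accepted as doubles.
SPECIAL_FLOATS = {"nan", "infinity", "-infinity"}


def string_to_double(text):
    """Convert a JSON string to a float, or raise RecordParseError."""
    if text.lower() in SPECIAL_FLOATS:
        return float(text)
    raise RecordParseError(f"Cannot parse {text!r} as a double")


def convert_record(record, field):
    """record is (raw_text, value); on failure the raw text becomes the corrupt record."""
    raw_text, value = record
    try:
        if isinstance(value, str):
            converted = string_to_double(value)
        else:
            converted = float(value)
        return {field: converted, "_corrupt_record": None}
    except RecordParseError:
        return {field: None, "_corrupt_record": raw_text}


ok = convert_record(('{"f1":"NaN"}', "NaN"), "f1")
bad = convert_record(('{"f1":"abc"}', "abc"), "f1")
print(math.isnan(ok["f1"]))    # True
print(bad["_corrupt_record"])  # {"f1":"abc"}
```

Raising a dedicated error type, rather than aborting, lets the caller decide per record whether to keep the row as corrupt.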

Return failed record when a record cannot be parsed. Allows parsing of files containing corrupt records of any form.

SparkQA · 2015-12-17T00:32:19Z

Test build #47863 has finished for PR 10288 at commit ad71433.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2015-12-17T00:58:57Z

LGTM

rxin · 2015-12-17T07:13:05Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala

this should technically extend RuntimeException

rxin · 2015-12-17T07:14:53Z

I'm going to merge this into master & branch-1.6.

This PR makes JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we will get a DF without any column. After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merge this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao <[email protected]>
Author: Yin Huai <[email protected]>

Closes #10288 from yhuai/handleCorruptJson.

(cherry picked from commit 9d66c42)
Signed-off-by: Reynold Xin <[email protected]>

nongli reviewed Dec 14, 2015

macalinao and others added 4 commits December 16, 2015 14:56

Prevent failure on corrupt JSON records

03fe63a

Return failed record when a record cannot be parsed. Allows parsing of files containing corrupt records of any form.

Add regression test for corrupt record JSON parsing

3dd4b7b

Correct schema

d774bfe

Handle more cases.

ad71433

rxin reviewed Dec 17, 2015

asfgit closed this in 9d66c42 Dec 17, 2015
