@yhuai
Contributor

@yhuai yhuai commented Dec 14, 2015

This PR makes the JSON parser and schema inference handle more cases where records cannot be parsed. It is based on #10043. The last commit fixes the failed test and updates the schema inference logic.

Regarding the schema inference change, if we have something like

{"f1":1}
[1,2,3]

originally, we would get a DataFrame without any columns.
After this change, we get a DataFrame with columns f1 and _corrupt_record. For the second row, [1,2,3] is the value of _corrupt_record.
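
To make the before/after behavior concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's implementation and does not use any Spark API: the helper names `parse_line`, `infer_columns`, and `to_rows` are hypothetical, and the column order is illustrative only. It assumes that any line that is not a JSON object counts as unparsable, and that such lines contribute a `_corrupt_record` column instead of emptying the schema.

```python
import json

# Column name taken from the PR description above.
CORRUPT_COLUMN = "_corrupt_record"


def parse_line(line):
    """Return a dict for a JSON object, or None if the line is not a usable record.

    Assumption: malformed JSON and valid JSON that is not an object
    (such as the array [1,2,3]) are both treated as unparsable.
    """
    try:
        value = json.loads(line)
    except ValueError:
        return None
    return value if isinstance(value, dict) else None


def infer_columns(lines):
    """Collect field names from parsable records; add the corrupt column if needed."""
    columns = []
    saw_corrupt = False
    for line in lines:
        record = parse_line(line)
        if record is None:
            saw_corrupt = True
            continue
        for key in record:
            if key not in columns:
                columns.append(key)
    if saw_corrupt:
        columns.append(CORRUPT_COLUMN)
    return columns


def to_rows(lines):
    """Build one row per input line; unparsable lines keep their raw text."""
    columns = infer_columns(lines)
    rows = []
    for line in lines:
        record = parse_line(line)
        row = dict.fromkeys(columns)  # every column starts as None
        if record is None:
            row[CORRUPT_COLUMN] = line
        else:
            row.update(record)
        rows.append(row)
    return columns, rows


columns, rows = to_rows(['{"f1":1}', '[1,2,3]'])
print(columns)
for row in rows:
    print(row)
```

In this sketch the first row carries `f1` with an empty `_corrupt_record`, and the second row carries the raw text `[1,2,3]` in `_corrupt_record` with `f1` left empty.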

When merging this PR, please make sure that the author is @simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

@yhuai
Contributor Author

yhuai commented Dec 14, 2015

@simplyianm @srowen I created this PR based on #10043. Please see yhuai@d8722bb for my change. This PR should fix the test. It also makes schema inference more robust to records that we cannot parse.

@SparkQA

SparkQA commented Dec 14, 2015

Test build #47641 has finished for PR 10288 at commit d8722bb.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 14, 2015

Test build #47646 has finished for PR 10288 at commit 9971300.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Should the other error cases be handled this way too, for example the String -> float/double conversions?

Contributor Author

Yeah, that is a good point. I added the change.

macalinao and others added 4 commits December 16, 2015 14:56
Return failed record when a record cannot be parsed. Allows parsing of files containing corrupt records of any form.
@SparkQA

SparkQA commented Dec 17, 2015

Test build #47863 has finished for PR 10288 at commit ad71433.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor

nongli commented Dec 17, 2015

LGTM

Contributor

This should technically extend RuntimeException.

@rxin
Contributor

rxin commented Dec 17, 2015

I'm going to merge this into master & branch-1.6.

asfgit pushed a commit that referenced this pull request Dec 17, 2015
This PR makes the JSON parser and schema inference handle more cases where records cannot be parsed. It is based on #10043. The last commit fixes the failed test and updates the schema inference logic.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we would get a DataFrame without any columns.
After this change, we get a DataFrame with columns `f1` and `_corrupt_record`. For the second row, `[1,2,3]` is the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao <[email protected]>
Author: Yin Huai <[email protected]>

Closes #10288 from yhuai/handleCorruptJson.

(cherry picked from commit 9d66c42)
Signed-off-by: Reynold Xin <[email protected]>
@asfgit asfgit closed this in 9d66c42 Dec 17, 2015