[SPARK-12057] [SQL] Prevent failure on corrupt JSON records #10288
@simplyianm @srowen I created this PR based on #10043. Please see yhuai@d8722bb for my change. This PR should fix the test. It also makes schema inference more robust to records that we cannot parse.
Test build #47641 has finished for PR 10288 at commit
Test build #47646 has finished for PR 10288 at commit
Should the other error cases be handled this way too? The String -> float/double conversions.
yeah, it is a good point. I added the change.
Return failed record when a record cannot be parsed. Allows parsing of files containing corrupt records of any form.
Test build #47863 has finished for PR 10288 at commit
LGTM
this should technically extend RuntimeException
I'm going to merge this into master & branch-1.6.
This PR makes JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we will get a DF without any column.
After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merge this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao <[email protected]>
Author: Yin Huai <[email protected]>

Closes #10288 from yhuai/handleCorruptJson.

(cherry picked from commit 9d66c42)
Signed-off-by: Reynold Xin <[email protected]>
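For readers who want to see the described behavior concretely, here is a minimal toy model in plain Python. It is not Spark's implementation and uses no Spark API. The helper names `parse_record` and `infer_columns` are hypothetical, and treating only a top-level JSON object as a valid row is an assumption made for illustration. Only the column name `_corrupt_record` and the two sample records come from the description above.

```python
import json

# Column name taken from the PR description.
CORRUPT_COLUMN = "_corrupt_record"


def parse_record(line):
    """Return the record's fields, or the raw text under the corrupt column.

    Hypothetical helper: a toy stand-in for the parser, not Spark code.
    """
    try:
        value = json.loads(line)
    except ValueError:
        # Malformed JSON: keep the raw text instead of failing.
        return {CORRUPT_COLUMN: line}
    # Assumption for this sketch: only a top-level JSON object is a valid row.
    if isinstance(value, dict):
        return value
    return {CORRUPT_COLUMN: line}


def infer_columns(rows):
    """Union of field names across all rows, in first-seen order."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    return columns


lines = ['{"f1":1}', '[1,2,3]']
rows = [parse_record(line) for line in lines]
columns = infer_columns(rows)
# Fill missing fields with None so every row has every column.
table = [[row.get(name) for name in columns] for row in rows]

print(columns)
for row in table:
    print(row)
```

In this model the second record contributes a `_corrupt_record` column rather than wiping out the schema, which mirrors the "after this change" case in the description.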