[SPARK-3339][SQL] Support for skipping json lines that fail to parse #2680

yhuai · 2014-10-06T17:53:32Z

This PR provides a way to skip or query corrupt JSON records. To do so, we introduce an internal column that holds corrupt records. Its default name is _corrupt_record, which can be changed by setting spark.sql.columnNameOfCorruptRecord. When a record fails to parse, we put it, in its unparsed form, into this internal column. Users can then skip or query these records through SQL.

To query the corrupt records:

-- For Hive parser
SELECT `_corrupt_record`
FROM jsonTable
WHERE `_corrupt_record` IS NOT NULL
-- For our SQL parser
SELECT _corrupt_record
FROM jsonTable
WHERE _corrupt_record IS NOT NULL

To skip corrupt records and query only the regular records:

-- For Hive parser
SELECT field1, field2
FROM jsonTable
WHERE `_corrupt_record` IS NULL
-- For our SQL parser
SELECT field1, field2
FROM jsonTable
WHERE _corrupt_record IS NULL

Generally, changing the name of the internal column is not recommended. If the name has to be changed to avoid a name conflict, you can use sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>) or sqlContext.sql("SET spark.sql.columnNameOfCorruptRecord=<new column name>").
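
For readers who want to see the behavior described above without a Spark cluster, here is a minimal sketch in plain Python. It is not the code in this PR: the function name parse_json_lines, the dict-per-row layout, and the sample input are all hypothetical. Only the default column name _corrupt_record and the idea of keeping the unparsed text come from the description.

```python
import json

# Default column name taken from the PR description. Everything else here
# (function name, row layout, sample data) is a hypothetical illustration,
# not Spark code.
DEFAULT_CORRUPT_COLUMN = "_corrupt_record"


def parse_json_lines(lines, corrupt_column=DEFAULT_CORRUPT_COLUMN):
    """Parse each line as a JSON object; keep unparseable lines as raw text."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                # Assumption for this sketch: only JSON objects count as records.
                raise ValueError("top-level JSON value is not an object")
            row = dict(record)
            row[corrupt_column] = None  # parsed fine, so the corrupt column is null
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            row = {corrupt_column: line}  # keep the record in its unparsed form
        rows.append(row)
    return rows


lines = [
    '{"field1": 1, "field2": "a"}',
    '{"field1": 2, "field2": ',  # truncated, so it fails to parse
    '{"field1": 3, "field2": "c"}',
]
rows = parse_json_lines(lines)

# Rough equivalent of: WHERE _corrupt_record IS NOT NULL
corrupt = [r[DEFAULT_CORRUPT_COLUMN] for r in rows
           if r[DEFAULT_CORRUPT_COLUMN] is not None]

# Rough equivalent of: SELECT field1, field2 ... WHERE _corrupt_record IS NULL
regular = [(r["field1"], r["field2"]) for r in rows
           if r[DEFAULT_CORRUPT_COLUMN] is None]

# Renaming the column, loosely analogous to spark.sql.columnNameOfCorruptRecord.
renamed = parse_json_lines(["oops"], corrupt_column="_bad_record")
```

In this sketch a parsed record that already has a field named like the corrupt column would have that field overwritten with None, which is the kind of name conflict the rename option is meant to avoid.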

SparkQA · 2014-10-06T17:59:41Z

QA tests have started for PR 2680 at commit ee584c0.

This patch merges cleanly.

SparkQA · 2014-10-06T18:51:09Z

QA tests have finished for PR 2680 at commit ee584c0.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-06T18:51:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21335/
Test PASSed.

SparkQA · 2014-10-08T13:54:42Z

QA tests have started for PR 2680 at commit b4a3632.

This patch merges cleanly.

SparkQA · 2014-10-08T14:41:12Z

QA tests have finished for PR 2680 at commit b4a3632.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-08T14:41:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21464/
Test PASSed.

marmbrus · 2014-10-09T00:34:04Z

@rxin suggests we prefix the special column's name with _ to better avoid collisions.

rxin · 2014-10-09T00:36:38Z

Also maybe the setting doesn't need to include "JSON" in it. We can use this for other things in the future too.

AmplabJenkins · 2014-10-09T01:17:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21501/
Test FAILed.

yhuai · 2014-10-09T01:25:07Z

retest this please.

SparkQA · 2014-10-09T01:30:04Z

QA tests have started for PR 2680 at commit 4c9828e.

This patch merges cleanly.

SparkQA · 2014-10-09T02:16:09Z

QA tests have finished for PR 2680 at commit 4c9828e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class AbstractParams[T: TypeTag]
- class SparkIMain(

AmplabJenkins · 2014-10-09T02:16:13Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21506/
Test PASSed.

AmplabJenkins · 2014-10-09T20:32:30Z

Can one of the admins verify this patch?

marmbrus · 2014-10-09T21:58:06Z

Thanks! I've merged to master.

Provide a way to query corrupt json records as unparsed strings.

ee584c0

yhuai added 2 commits October 8, 2014 09:50

Set the column name of corrupt json record back to the default one after the unit test.

9375ae9

Merge remote-tracking branch 'upstream/master' into corruptJsonRecord

b4a3632

yhuai added 2 commits October 8, 2014 21:00

Change the default name of corrupt record to "_corrupt_record".

309616a

Merge remote-tracking branch 'upstream/master' into corruptJsonRecord

4c9828e

asfgit closed this in 1c7f0ab Oct 9, 2014

yhuai deleted the corruptJsonRecord branch October 10, 2014 13:05

gatorsmile mentioned this pull request Sep 5, 2017

[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file #18865

Closed
