Contributor

yhuai commented Oct 6, 2014

This PR provides a way to skip or query corrupt JSON records. To do so, we introduce an internal column that holds corrupt records. Its default name is _corrupt_record, and it can be changed by setting spark.sql.columnNameOfCorruptRecord. When a record fails to parse, we store it in its unparsed form in this internal column. Users can then skip or query corrupt records through SQL.

  • To query those corrupt records
-- For Hive parser
SELECT `_corrupt_record`
FROM jsonTable
WHERE `_corrupt_record` IS NOT NULL
-- For our SQL parser
SELECT _corrupt_record
FROM jsonTable
WHERE _corrupt_record IS NOT NULL
  • To skip corrupt records and query regular records
-- For Hive parser
SELECT field1, field2
FROM jsonTable
WHERE `_corrupt_record` IS NULL
-- For our SQL parser
SELECT field1, field2
FROM jsonTable
WHERE _corrupt_record IS NULL
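
To make the semantics above concrete, here is a minimal plain-Python sketch of the idea. It is not the code in this PR and does not use Spark. The helper name parse_json_lines and the sample data are hypothetical, and treating a non-object top-level value as corrupt is an assumption made only for this illustration.

```python
import json

CORRUPT_COLUMN = "_corrupt_record"  # default column name described in this PR


def parse_json_lines(lines, corrupt_column=CORRUPT_COLUMN):
    """Hypothetical helper: parse each line as a JSON object.

    A line that fails to parse is kept verbatim in `corrupt_column`;
    a line that parses gets `corrupt_column` set to None.
    """
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            # Assumption for this sketch: only JSON objects count as records.
            if not isinstance(record, dict):
                raise ValueError("top-level value is not a JSON object")
            record.setdefault(corrupt_column, None)
            rows.append(record)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            rows.append({corrupt_column: line})
    return rows


lines = [
    '{"field1": 1, "field2": "a"}',
    '{"field1": 2, "field2":',  # truncated, so it cannot be parsed
    '{"field1": 3, "field2": "c"}',
]
rows = parse_json_lines(lines)

# Equivalent of: SELECT _corrupt_record ... WHERE _corrupt_record IS NOT NULL
corrupt = [r[CORRUPT_COLUMN] for r in rows if r[CORRUPT_COLUMN] is not None]

# Equivalent of: SELECT field1, field2 ... WHERE _corrupt_record IS NULL
regular = [(r["field1"], r["field2"]) for r in rows if r[CORRUPT_COLUMN] is None]
```

The two list comprehensions mirror the two query patterns: one filters for rows where the corrupt-record column is set, the other for rows where it is empty.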

Generally, changing the name of the internal column is not recommended. If the name has to be changed to avoid a name conflict, you can use sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>) or sqlContext.sql("SET spark.sql.columnNameOfCorruptRecord=<new column name>").
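
As a rough illustration of why the column name is configurable, the plain-Python sketch below shows a data set that already has a field called _corrupt_record. It is not Spark code: parse_line and the replacement name _malformed are hypothetical, and raising an error on a collision is an assumption of this sketch. How Spark itself behaves on a collision is not described in this PR.

```python
import json


def parse_line(line, corrupt_column):
    """Hypothetical helper: return a row dict for one input line.

    An unparseable line is stored verbatim under `corrupt_column`.
    """
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {corrupt_column: line}
    if not isinstance(record, dict):
        return {corrupt_column: line}
    # Assumption for this sketch: treat a clash with a real field as an error.
    if corrupt_column in record:
        raise KeyError(f"data already has a field named {corrupt_column!r}")
    record[corrupt_column] = None
    return record


lines = ['{"_corrupt_record": "user data", "id": 1}', "not json"]

# With the default name, the user's own field collides with the internal column.
try:
    [parse_line(line, "_corrupt_record") for line in lines]
    collided = False
except KeyError:
    collided = True

# With a different (hypothetical) name, both columns can coexist.
rows = [parse_line(line, "_malformed") for line in lines]
```

Here the user's own _corrupt_record value survives untouched in the first row, while the unparseable second line ends up under the renamed column.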

SparkQA commented Oct 6, 2014

QA tests have started for PR 2680 at commit ee584c0.

  • This patch merges cleanly.

SparkQA commented Oct 6, 2014

QA tests have finished for PR 2680 at commit ee584c0.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21335/

SparkQA commented Oct 8, 2014

QA tests have started for PR 2680 at commit b4a3632.

  • This patch merges cleanly.

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2680 at commit b4a3632.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21464/

Contributor

marmbrus commented Oct 9, 2014

@rxin suggests we prefix the special column with _ to better avoid collisions.

Contributor

rxin commented Oct 9, 2014

Also maybe the setting doesn't need to include "JSON" in it. We can use this for other things in the future too.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21501/

Contributor Author

yhuai commented Oct 9, 2014

retest this please.

SparkQA commented Oct 9, 2014

QA tests have started for PR 2680 at commit 4c9828e.

  • This patch merges cleanly.

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2680 at commit 4c9828e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AbstractParams[T: TypeTag]
    • class SparkIMain(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21506/

@AmplabJenkins

Can one of the admins verify this patch?

Contributor

marmbrus commented Oct 9, 2014

Thanks! I've merged to master.
