[SPARK-3339][SQL] Support for skipping json lines that fail to parse #2680
QA tests have started for PR 2680 at commit
QA tests have finished for PR 2680 at commit
Test PASSed.
QA tests have started for PR 2680 at commit
QA tests have finished for PR 2680 at commit
Test PASSed.
@rxin suggests we prefix the special column with
Also maybe the setting doesn't need to include "JSON" in it. We can use this for other things in the future too.
Test FAILed.
retest this please.
QA tests have started for PR 2680 at commit
QA tests have finished for PR 2680 at commit
Test PASSed.
Can one of the admins verify this patch?
Thanks! I've merged to master.
This PR aims to provide a way to skip/query corrupt JSON records. To do so, we introduce an internal column to hold corrupt records. The default name is `_corrupt_record`; it can be changed by setting the value of `spark.sql.columnNameOfCorruptRecord`. When there is a parsing error, we put the corrupt record, in its unparsed form, into the internal column. Users can skip or query this column through SQL.

Generally, it is not recommended to change the name of the internal column. If the name has to be changed to avoid possible name conflicts, you can use `sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)` or `sqlContext.sql("SET spark.sql.columnNameOfCorruptRecord=<new column name>")`.
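As an illustration of how the internal column might be used from SQL, here is a minimal sketch. It assumes the `SQLContext` API of the Spark release this PR targets (`jsonRDD` and `registerTempTable`) and the default column name `_corrupt_record`. The object name, the table name `records`, and the sample input lines are hypothetical and are not taken from the patch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: the object name, table name, and sample data are made up for illustration.
object CorruptRecordSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("corrupt-record-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: two well-formed JSON lines and one truncated line.
    val lines = sc.parallelize(Seq(
      """{"id": 1, "name": "a"}""",
      """{"id": 2, "name": """,
      """{"id": 3, "name": "c"}"""))

    // Schema inference sees the malformed line, so the inferred schema
    // should include the internal column (default name assumed here).
    val records = sqlContext.jsonRDD(lines)
    records.registerTempTable("records")

    // Skip corrupt records: keep only rows where the internal column is null.
    sqlContext
      .sql("SELECT id, name FROM records WHERE _corrupt_record IS NULL")
      .collect()
      .foreach(println)

    // Inspect corrupt records: select the unparsed text of the failed lines.
    sqlContext
      .sql("SELECT _corrupt_record FROM records WHERE _corrupt_record IS NOT NULL")
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```

If the column was renamed through `spark.sql.columnNameOfCorruptRecord`, the queries would need to reference the new name instead.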