[SPARK-44079][SQL] Fix ArrayIndexOutOfBoundsException when parse array as struct using PERMISSIVE mode with corrupt record
#41662
cc @MaxGekk |
MaxGekk
left a comment
@Hisoka-X Thank you for the ping. Could you enable GAs in your fork: https://github.com/apache/spark/pull/41662/checks?check_run_id=14375296234
|
Hi @MaxGekk, please check https://github.com/Hisoka-X/spark/actions/runs/5312527795 for now. The CI has already started, but something went wrong with GitHub after I force-pushed the code. I will fix this after all CI runs are done. |
|
@MaxGekk Hi, CI passed. Please help me to review. Thanks! |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
|
LGTM in general. cc @HyukjinKwon |
…rces/json/JsonSuite.scala Co-authored-by: Maxim Gekk <[email protected]>
|
+1, LGTM. Merging to master. |
|
Thanks @MaxGekk |
|
@Hisoka-X Does 3.4 have the issue? If so, please open a PR with the backport.
…ray as struct using PERMISSIVE mode with corrupt record
### What changes were proposed in this pull request?
When PERMISSIVE mode is used to parse a JSON array as structs with a schema that includes `_corrupt_record`, the read fails with an error:
```scala
import org.apache.spark.sql.types._
import spark.implicits._ // needed for toDS(); already in scope in spark-shell

val data = """[{"a": "incorrect", "b": "correct"}, {"a": "incorrect", "b": "correct"}]"""
// "a" is declared as IntegerType, so both array elements fail to parse cleanly
val schema = new StructType(Array(StructField("a", IntegerType),
  StructField("b", StringType), StructField("_corrupt_record", StringType)))
spark.read.option("mode", "PERMISSIVE").option("multiline", "true").schema(schema)
  .json(Seq(data).toDS()).show()
// error
// org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 (TID 9) (charlottesmbp2.home executor driver): java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
//   at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
//   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
//   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get(rows.scala:37)
//   at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get$(rows.scala:37)
//   at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.get(rows.scala:195)
```
`JacksonParser` wraps the list of parsed rows in a new `InternalRow`, so `FailureSafeParser` can't handle it correctly when `_corrupt_record` is defined.
This PR makes sure `BadRecordException` can carry the list of rows parsed from a single string record, so that `FailureSafeParser` can handle the rows one by one.
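
The idea can be illustrated with a small, Spark-free sketch. Everything in it (`PartialResultsSketch`, `BadRecord`, `parseSafely`, and the `Row` alias) is a hypothetical stand-in invented for illustration; it is not the code of `BadRecordException` or `FailureSafeParser`, and it only models the behavior described above: the exception carries every partially parsed row, and the wrapper emits one output row per partial row with the raw record appended as the corrupt-record column.

```scala
// Simplified model only: none of these names are Spark's real classes.
object PartialResultsSketch {
  // A row is modeled as a plain vector of column values.
  type Row = Vector[Any]

  // Hypothetical exception that carries ALL partially parsed rows of one record.
  final case class BadRecord(record: String, partialResults: Seq[Row], cause: Throwable)
    extends Exception(cause)

  // Wraps a raw parser. `width` is the number of data columns (without the corrupt column).
  def parseSafely(record: String, width: Int)(rawParse: String => Seq[Row]): Seq[Row] =
    try {
      // Success: the corrupt-record column stays null.
      rawParse(record).map(_ :+ null)
    } catch {
      case BadRecord(raw, partials, _) =>
        if (partials.isEmpty) {
          // Nothing could be parsed: a single all-null row plus the raw record.
          Seq(Vector.fill[Any](width)(null) :+ raw)
        } else {
          // Handle the partial rows one by one instead of treating them as one row.
          partials.map(_ :+ raw)
        }
    }

  def main(args: Array[String]): Unit = {
    val record = """[{"a": "incorrect", "b": "correct"}, {"a": "incorrect", "b": "correct"}]"""
    // Stand-in parser: pretends field "a" failed in both array elements.
    val failing: String => Seq[Row] = r =>
      throw BadRecord(r, Seq(Vector(null, "correct"), Vector(null, "correct")),
        new NumberFormatException("a"))
    parseSafely(record, 2)(failing).foreach(println)
  }
}
```

In this model, each partial row keeps its own width and simply gains the corrupt-record column, which is why indexing past the end of a single wrapped row no longer occurs.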
### Why are the changes needed?
Fix the bug that occurs when parsing an array as structs in PERMISSIVE mode with a corrupt record.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test.
Closes apache#41662 from Hisoka-X/SPARK-44079_array_as_structs_corrput_record.
Authored-by: Jia Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
(cherry picked from commit 70f3427)
…se array as struct using PERMISSIVE mode with corrupt record
### What changes were proposed in this pull request?
cherry pick #41662, fix parse array as struct bug on branch 3.4
### Why are the changes needed?
Fix the bug when parse array as struct using PERMISSIVE mode with corrupt record
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
add new test.
Closes #41784 from Hisoka-X/SPARK-44079_3.4_cherry_pick.
Authored-by: Jia Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
HyukjinKwon
left a comment
Seems fine to me too