@Hisoka-X
Member

What changes were proposed in this pull request?

When PERMISSIVE mode is used to parse a JSON array as structs with a schema that includes _corrupt_record, the following error is reported.

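import org.apache.spark.sql.types._
import spark.implicits._  // needed for Seq(...).toDS()

// Field "a" is declared as an integer, but both array elements hold a string there.
val data = """[{"a": "incorrect", "b": "correct"}, {"a": "incorrect", "b": "correct"}]"""
val schema = new StructType(Array(StructField("a", IntegerType),
  StructField("b", StringType), StructField("_corrupt_record", StringType)))

spark.read.option("mode", "PERMISSIVE").option("multiline", "true").schema(schema)
  .json(Seq(data).toDS()).show()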

// error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 2.0 failed 1 times, most recent failure: Lost task 4.0 in stage 2.0 (TID 9) (charlottesmbp2.home executor driver): java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.genericGet(rows.scala:201)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getAs(rows.scala:35)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get(rows.scala:37)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.get$(rows.scala:37)
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.get(rows.scala:195)

JacksonParser wraps the list of rows parsed from the array in a new InternalRow, so FailureSafeParser cannot handle it correctly when _corrupt_record is defined.
This PR makes sure that BadRecordException can carry the whole list of rows parsed from a single string record, so that FailureSafeParser can handle the rows one by one.
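
To illustrate the idea, here is a minimal, self-contained sketch that does not depend on Spark. Every name in it (`CorruptRecordSketch`, `Row`, `OutRow`, `BadRecord`, `parseArray`, `failureSafeParse`) is hypothetical and only models the behavior described above; it is not the actual code of `JacksonParser`, `BadRecordException`, or `FailureSafeParser`. The assumption is that the exception carries every partially parsed row of the record, and the failure-safe wrapper emits one output row per partial row with the raw record attached.

```scala
import scala.util.Try

// Hypothetical, simplified model of the fix; these are NOT Spark's classes.
object CorruptRecordSketch {
  // A parsed row with columns "a" (Int) and "b" (String); None models null.
  final case class Row(a: Option[Int], b: Option[String])

  // An output row with the extra corrupt-record column appended.
  final case class OutRow(a: Option[Int], b: Option[String], corrupt: Option[String])

  // Carries ALL partially parsed rows of the record, not a single wrapped row.
  final class BadRecord(val record: String, val partialResults: Seq[Row])
    extends Exception(s"bad record: $record")

  // Toy parser: each array element is given as raw (a, b) strings,
  // and "a" must be parseable as an integer.
  def parseArray(record: String, elems: Seq[(String, String)]): Seq[Row] = {
    val rows = elems.map { case (a, b) => Row(Try(a.toInt).toOption, Some(b)) }
    // If any element failed to convert, report the record together with
    // every partial row that was produced.
    if (rows.exists(_.a.isEmpty)) throw new BadRecord(record, rows)
    rows
  }

  // PERMISSIVE-style wrapper: on failure, handle the partial rows one by one
  // and attach the raw record to each of them.
  def failureSafeParse(record: String, elems: Seq[(String, String)]): Seq[OutRow] =
    try parseArray(record, elems).map(r => OutRow(r.a, r.b, None))
    catch {
      case e: BadRecord =>
        e.partialResults.map(r => OutRow(r.a, r.b, Some(e.record)))
    }
}
```

In this model, a record with two malformed elements yields two output rows, each with a null "a", the parsed "b", and the raw record in the corrupt column, rather than failing on an index lookup.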

Why are the changes needed?

To fix the bug that occurs when parsing an array as structs in PERMISSIVE mode with a corrupt record column.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test.

@github-actions github-actions bot added the SQL label Jun 19, 2023
@Hisoka-X
Member Author

cc @MaxGekk

…ray as struct using PERMISSIVE mode with corrupt record
@Hisoka-X Hisoka-X force-pushed the SPARK-44079_array_as_structs_corrput_record branch from 15a061c to 32d8b87 Compare June 19, 2023 13:32
Member

@MaxGekk MaxGekk left a comment

@Hisoka-X Thank you for the ping. Could you enable GitHub Actions (GAs) in your fork? See https://github.com/apache/spark/pull/41662/checks?check_run_id=14375296234

@Hisoka-X
Member Author

Hi @MaxGekk, please check https://github.com/Hisoka-X/spark/actions/runs/5312527795 for now. The CI has already started, but something went wrong with GitHub after I force-pushed the code. I will fix this once all CI runs are done.

@Hisoka-X
Member Author

@MaxGekk Hi, CI has passed. Could you please review? Thanks!

@Hisoka-X Hisoka-X force-pushed the SPARK-44079_array_as_structs_corrput_record branch from 66b4e99 to 9a3a8c2 Compare June 27, 2023 12:19
@Hisoka-X Hisoka-X requested a review from MaxGekk June 28, 2023 07:39
@MaxGekk
Member

MaxGekk commented Jun 28, 2023

LGTM in general. cc @HyukjinKwon

@MaxGekk
Member

MaxGekk commented Jun 29, 2023

+1, LGTM. Merging to master.
Thank you, @Hisoka-X.

@MaxGekk MaxGekk closed this in 70f3427 Jun 29, 2023
@Hisoka-X
Member Author

Thanks @MaxGekk

@MaxGekk
Member

MaxGekk commented Jun 29, 2023

@Hisoka-X Does 3.4 have the issue? If so, please open a PR with the backport.

@Hisoka-X Hisoka-X deleted the SPARK-44079_array_as_structs_corrput_record branch June 29, 2023 06:41
Hisoka-X added a commit to Hisoka-X/spark that referenced this pull request Jun 29, 2023
…ray as struct using PERMISSIVE mode with corrupt record


Closes apache#41662 from Hisoka-X/SPARK-44079_array_as_structs_corrput_record.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>

(cherry picked from commit 70f3427)
MaxGekk pushed a commit that referenced this pull request Jun 29, 2023
…se array as struct using PERMISSIVE mode with corrupt record

### What changes were proposed in this pull request?
Cherry-pick #41662 to fix the parse-array-as-struct bug on branch 3.4.
### Why are the changes needed?
To fix the bug that occurs when parsing an array as structs in PERMISSIVE mode with a corrupt record column.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new test.

Closes #41784 from Hisoka-X/SPARK-44079_3.4_cherry_pick.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
Member

@HyukjinKwon HyukjinKwon left a comment

Seems fine to me too

viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023