[SPARK-40646][SQL] Fix returning partial results in JSON data source and JSON functions #38090

sadikovi · 2022-10-04T04:20:33Z

What changes were proposed in this pull request?

This PR is a follow-up for SPARK-33134 (#30031).

I found another case where, depending on the order of columns, a failure to parse one JSON field causes all subsequent fields to be returned as null:

With a file like this:

{"a": {"x": 1, "y": true}, "b": {"x": 1}}
{"a": {"x": 2}, "b": {"x": 2}}

Reading the file returns null for column b in the first row, even though b is valid in that record.

val df = spark.read
  .schema("a struct<x: int, y: struct<x: int>>, b struct<x: int>")
  .json("path") 

===

a                   b
null                null
{"x":2,"y":null}    {"x":2}

However, column b should be:

{"x": 1}
{"x": 2}

This particular example used to work in earlier Spark versions, but it regressed with SPARK-33134, which fixed a different bug: incorrect parsing in from_json. Because this case was not tested, we missed the regression at the time.

To fix both SPARK-33134 and SPARK-40646, we need to handle PartialResultException in the convertArray method so that errors in child objects are dealt with there. Without this handling, the code would not wrap the row in an array for from_json, resulting in a ClassCastException (SPARK-33134). With it, the isRoot check in convertObject is no longer needed, which unblocks SPARK-40646.

I updated the code to handle both cases. With these changes, we can correctly parse this case:

val df3 = Seq("""[{"c2": [19], "c1": 123456}]""").toDF("c0")
checkAnswer(df3.select(from_json($"c0", ArrayType(st))), Row(Array(Row(123456, null))))

which was previously returning null for the root row.
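The interaction between convertObject and convertArray described above can be illustrated with a small, self-contained sketch. It is a toy model and not Spark's JacksonParser: the Json and Schema types, ToyParser, PartialResult, and the handleInArray flag are hypothetical names introduced only for this illustration, and the schema bound to st (including the inner field name c3) is an assumption, because the snippet above does not show how st is defined. The sketch only aims to show why a partial row has to be caught at the array level so that it stays wrapped in the array.

```scala
import scala.util.control.NonFatal

// Toy JSON values and schema types for illustration; these are not Spark or Jackson classes.
sealed trait Json
final case class JNum(value: Long) extends Json
final case class JArr(items: List[Json]) extends Json
final case class JObj(fields: Map[String, Json]) extends Json

sealed trait Schema
case object LongT extends Schema
final case class ArrayT(element: Schema) extends Schema
final case class StructT(fields: List[(String, Schema)]) extends Schema

// Carries whatever was converted before the failure, plus the original cause.
final case class PartialResult(partial: Any, cause: Throwable)
  extends RuntimeException(cause)

// handleInArray = true models catching partial results inside array conversion.
final class ToyParser(handleInArray: Boolean) {

  def convert(json: Json, schema: Schema): Any = (json, schema) match {
    case (JNum(v), LongT) => v
    case (JArr(items), ArrayT(elem)) => convertArray(items, elem)
    case (JObj(fields), StructT(schemaFields)) => convertObject(fields, schemaFields)
    case _ => throw new IllegalArgumentException(s"type mismatch for $schema")
  }

  // A struct becomes a Vector with one slot per schema field.
  // A field that fails to convert becomes null; there is no "is root" condition.
  private def convertObject(
      fields: Map[String, Json],
      schemaFields: List[(String, Schema)]): Vector[Any] = {
    var firstError: Option[Throwable] = None
    val row = schemaFields.map { case (name, fieldType) =>
      fields.get(name) match {
        case None => null
        case Some(value) =>
          try convert(value, fieldType)
          catch {
            case NonFatal(e) =>
              firstError = firstError.orElse(Some(e))
              null
          }
      }
    }.toVector
    // Report the failure, but attach the fields that did convert.
    firstError.foreach(e => throw PartialResult(row, e))
    row
  }

  // An array becomes a List of converted elements.
  private def convertArray(items: List[Json], elem: Schema): List[Any] = {
    var firstError: Option[Throwable] = None
    val values = items.map { item =>
      try convert(item, elem)
      catch {
        // Keep the partially converted element inside the array instead of
        // letting the bare row escape to the caller.
        case PartialResult(partial, cause) if handleInArray =>
          firstError = firstError.orElse(Some(cause))
          partial
      }
    }
    firstError.foreach(e => throw PartialResult(values, e))
    values
  }
}

object PartialResultsSketch {
  // Assumed schema: c1 is a long, c2 is an array of structs.
  val st: Schema =
    StructT(List("c1" -> LongT, "c2" -> ArrayT(StructT(List("c3" -> LongT)))))

  // Toy equivalent of [{"c2": [19], "c1": 123456}]; 19 is not a struct, so c2 fails.
  val input: Json =
    JArr(List(JObj(Map("c2" -> JArr(List(JNum(19))), "c1" -> JNum(123456)))))

  // Returns the full value on success, or the partial value attached to the failure.
  def parse(handleInArray: Boolean): Any =
    try new ToyParser(handleInArray).convert(input, ArrayT(st))
    catch { case PartialResult(partial, _) => partial }

  def main(args: Array[String]): Unit = {
    println(parse(handleInArray = false)) // Vector(123456, null): a bare row where an array is expected
    println(parse(handleInArray = true))  // List(Vector(123456, null)): the row stays wrapped in the array
  }
}
```

In this model, the first call yields a row where the caller expects an array, which is the kind of shape mismatch that a cast would reject. The second call yields an array containing the partial row, which corresponds to the Row(Array(Row(123456, null))) expectation in the test above.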

Why are the changes needed?

Fixes a long-standing issue where a single incorrect field in a JSON record would break parsing of the entire record.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I added unit tests for SPARK-40646 as well as SPARK-33134.

sadikovi · 2022-10-04T04:28:56Z

cc @MaxGekk @HyukjinKwon, I would appreciate a review. Let me know if you have any concerns with the changes. Thank you.

MaxGekk · 2022-10-04T06:12:05Z

My related PR #23253

AmplabJenkins · 2022-10-05T16:50:58Z

Can one of the admins verify this patch?

MaxGekk · 2022-10-12T06:47:48Z

sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala

+      .add("c1", StringType)
+      .add("c2", ArrayType(new StructType().add("a", LongType)))
+
+    // Value of "c2.a" is a string instead of a long.


Could you explain this comment? It says "c2.a" is a string; doesn't that contradict "c2": {"a": 1}? Is 1 a string?

Thanks for pointing it out. The comment is incorrect; it belonged to the original test, which I later refactored. I will update it.

MaxGekk · 2022-10-12T06:49:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala

          } catch {
            case e: SparkUpgradeException => throw e
-            case NonFatal(e) if isRoot =>
+            case NonFatal(e) =>


If you allow this, please check all complex types: map, struct. Or have you already added some tests?

I have tried covering different cases in:

SPARK-40646: return partial results for JSON arrays with objects

SPARK-40646: return partial results for objects with values as JSON arrays

I think I am missing a test for map; I will add one, thanks.

I added a test!

sadikovi · 2022-10-13T22:58:53Z

@MaxGekk I have addressed your comments. Could you review again?

I think there is still an issue with complex fields:

test("SPARK-40646") {
  val st = new StructType()
    .add("c1", MapType(StringType, new StructType().add("a1", LongType)))
    .add("c2", StringType)

  // "k2" holds a string where a long is expected, so "c1" cannot be fully parsed.
  val df = Seq("""{"c1": {"k1": {"a1": 1}, "k2": {"a1": "A"}, "k3": {"a1": 2}}, "c2": "abc"}""").toDF("c0")
  checkAnswer(
    df.select(from_json($"c0", st)),
    Row(Row(null, "abc"))
  )
}

This would return (null, null) instead of (null, "abc").

Would it be okay to address that case in the follow-up? It might require a bit of time to fix that case.

sadikovi · 2022-10-15T08:02:30Z

cc @cloud-fan

MaxGekk · 2022-10-17T08:42:51Z

> Would it be okay to address that case in the follow-up?

Sure.

MaxGekk · 2022-10-17T08:46:29Z

+1, LGTM. Merging to master.
Thank you, @sadikovi.

sadikovi · 2022-10-17T21:19:30Z

Thanks for merging!

Closes apache#38090 from sadikovi/SPARK-40646. Authored-by: Ivan Sadikov <[email protected]> Signed-off-by: Max Gekk <[email protected]>

github-actions bot added the SQL label Oct 4, 2022

sadikovi added 2 commits October 4, 2022 18:22

update code

ce5c33e

fix scalastyle

d250eda

sadikovi force-pushed the SPARK-40646 branch from b0dfd47 to d250eda on October 4, 2022 05:38

Merge remote-tracking branch 'upstream/master' into SPARK-40646

e227b7a

MaxGekk reviewed Oct 12, 2022

handle maps

ffd336f

MaxGekk approved these changes Oct 17, 2022

MaxGekk closed this in 31721ba Oct 17, 2022
