@sadikovi sadikovi commented Oct 4, 2022

What changes were proposed in this pull request?

This PR is a follow-up for SPARK-33134 (#30031).

I found another case where, depending on the order of columns, a failure to parse one JSON field breaks all of the subsequent fields, so they all come back as null.

With a file like this:

{"a": {"x": 1, "y": true}, "b": {"x": 1}}
{"a": {"x": 2}, "b": {"x": 2}}

Reading the file returns null for column b in the first row, even though b is valid in that record.

val df = spark.read
  .schema("a struct<x: int, y: struct<x: int>>, b struct<x: int>")
  .json("path") 

===

a	                b
null	                null
{"x":2,"y":null}	{"x":2} 

However, column b should be:

{"x": 1}
{"x": 2}
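
For anyone who wants to reproduce this locally, here is a minimal sketch. It assumes Spark SQL is on the classpath and uses a local master; the object name `Spark40646Repro`, the `readSample` helper, and the temporary file are illustrative only and not part of the patch. The output depends on the Spark version, so none is shown.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the PR.
object Spark40646Repro {
  // Writes the two sample records to a temporary file and reads them back
  // with the schema from the description.
  def readSample(spark: SparkSession): DataFrame = {
    val lines = Seq(
      """{"a": {"x": 1, "y": true}, "b": {"x": 1}}""",
      """{"a": {"x": 2}, "b": {"x": 2}}""")
    val path = Files.createTempFile("spark-40646-", ".json")
    Files.write(path, lines.mkString("\n").getBytes(StandardCharsets.UTF_8))

    spark.read
      .schema("a struct<x: int, y: struct<x: int>>, b struct<x: int>")
      .json(path.toString)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a single-threaded local session is enough for this check.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-40646 repro")
      .getOrCreate()
    try readSample(spark).show(truncate = false)
    finally spark.stop()
  }
}
```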

This particular example worked in earlier Spark versions but regressed with SPARK-33134, which fixed a different bug: incorrect parsing in from_json. Because this case was not covered by tests, we missed the regression at the time.

To fix both SPARK-33134 and SPARK-40646, we need to handle PartialResultException in the convertArray method so that errors in child objects are processed there. Without that handling, the code would not wrap the row in an array for from_json, resulting in a ClassCastException (SPARK-33134). With it in place, the isRoot check in convertObject is no longer needed, which unblocks SPARK-40646.
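
To make the control flow easier to follow, here is a simplified model of the idea in plain Scala. It is a sketch, not the actual parser code: the `Json` and `Schema` types, `PartialResult`, and `parsePermissive` are hypothetical stand-ins, and the array schema used in `main` is an assumption because `st` is not shown in this description. Every object level records the first error, leaves the bad field as null, and rethrows with the partial row; the array level catches that and keeps the partial row inside the array.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

// Simplified model only; all names here are hypothetical.
object PartialResultsSketch {
  // Stand-ins for parsed JSON values.
  sealed trait Json
  final case class JNum(value: Long) extends Json
  final case class JBool(value: Boolean) extends Json
  final case class JObj(fields: Seq[(String, Json)]) extends Json
  final case class JArr(items: Seq[Json]) extends Json

  // Stand-ins for SQL data types.
  sealed trait Schema
  case object LongT extends Schema
  final case class StructT(fields: Seq[(String, Schema)]) extends Schema
  final case class ArrayT(element: Schema) extends Schema

  // Carries the values that could be converted, plus the first error seen.
  final case class PartialResult(partial: Any, cause: Throwable)
    extends RuntimeException(cause)

  def convert(json: Json, schema: Schema): Any = (json, schema) match {
    case (JNum(v), LongT) => v
    case (JObj(fields), StructT(schemaFields)) => convertObject(fields, schemaFields)
    case (JArr(items), ArrayT(element)) => convertArray(items, element)
    case _ => throw new IllegalArgumentException(s"Cannot convert $json to $schema")
  }

  private def convertObject(
      fields: Seq[(String, Json)],
      schemaFields: Seq[(String, Schema)]): Seq[Any] = {
    val row = Array.fill[Any](schemaFields.length)(null)
    var firstError: Option[Throwable] = None
    for ((name, value) <- fields) {
      val index = schemaFields.indexWhere(_._1 == name)
      if (index >= 0) {
        try {
          row(index) = convert(value, schemaFields(index)._2)
        } catch {
          // No root-only condition: record the error, leave this field null,
          // and continue with the remaining fields.
          case NonFatal(e) => firstError = firstError.orElse(Some(e))
        }
      }
    }
    firstError match {
      case Some(e) => throw PartialResult(row.toSeq, e)
      case None => row.toSeq
    }
  }

  private def convertArray(items: Seq[Json], element: Schema): Seq[Any] = {
    val values = ArrayBuffer.empty[Any]
    var firstError: Option[Throwable] = None
    for (item <- items) {
      try {
        values += convert(item, element)
      } catch {
        // A child object failed part-way: keep its partial row in the array.
        case PartialResult(partial, cause) =>
          firstError = firstError.orElse(Some(cause))
          values += partial
      }
    }
    firstError match {
      case Some(e) => throw PartialResult(values.toSeq, e)
      case None => values.toSeq
    }
  }

  // What a permissive caller would keep: the full result or the partial one.
  def parsePermissive(json: Json, schema: Schema): Any =
    try convert(json, schema)
    catch {
      case PartialResult(partial, _) => partial
      case NonFatal(_) => null
    }

  def main(args: Array[String]): Unit = {
    // Models [{"c2": [19], "c1": 123456}] against an assumed element schema.
    val elementSchema =
      StructT(Seq("c1" -> LongT, "c2" -> ArrayT(StructT(Seq("a" -> LongT)))))
    val input = JArr(Seq(JObj(Seq("c2" -> JArr(Seq(JNum(19))), "c1" -> JNum(123456)))))
    println(parsePermissive(input, ArrayT(elementSchema)))
  }
}
```

In this model the bad `c2` value only nulls out that field, and the row still ends up wrapped in the array rather than being returned bare or dropped.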

I updated the code to handle both cases. With these changes, we can correctly parse this case:

val df3 = Seq("""[{"c2": [19], "c1": 123456}]""").toDF("c0")
checkAnswer(df3.select(from_json($"c0", ArrayType(st))), Row(Array(Row(123456, null))))

which was previously returning null for the root row.

Why are the changes needed?

Fixes a long-standing issue where a single incorrect field in a JSON record would break parsing of the entire record.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I added unit tests for SPARK-40646 as well as SPARK-33134.

@github-actions github-actions bot added the SQL label Oct 4, 2022

sadikovi commented Oct 4, 2022

cc @MaxGekk @HyukjinKwon, I would appreciate a review. Let me know if you have any concerns with the changes. Thank you.

MaxGekk commented Oct 4, 2022

My related PR #23253

@AmplabJenkins

Can one of the admins verify this patch?

.add("c1", StringType)
.add("c2", ArrayType(new StructType().add("a", LongType)))

// Value of "c2.a" is a string instead of a long.
Member

Could you explain the comment saying that "c2.a" is a string? Doesn't it contradict "c2": {"a": 1}? Is 1 a string?

Contributor Author

Thanks for pointing it out. The comment is incorrect; it belonged to the original test, which I later refactored. I will update it.

Contributor Author

Updated!

} catch {
case e: SparkUpgradeException => throw e
case NonFatal(e) if isRoot =>
case NonFatal(e) =>
Member

If you allow this, please check all complex types (map, struct). Or have you already added tests for them?

Contributor Author

I have tried covering different cases in:

  • SPARK-40646: return partial results for JSON arrays with objects
  • SPARK-40646: return partial results for objects with values as JSON arrays

I think I am missing a test for map; I will add one, thanks.

Contributor Author

I added a test!

sadikovi commented Oct 13, 2022

@MaxGekk I have addressed your comments, could you review again?

I think there is still an issue with complex fields:

test("SPARK-40646") {
  val st = new StructType()
    .add("c1", MapType(StringType, new StructType().add("a1", LongType)))
    .add("c2", StringType)

  val df = Seq("""{"c1": {"k1": {"a1": 1}, "k2": {"a1": "A"}, "k3": {"a1": 2}}}, "c2": "abc"}""").toDF("c0")
  checkAnswer(
    df.select(from_json($"c0", st)),
    Row(Row(null, "abc"))
  )
}

This test would return null, null instead of null, "abc".

Would it be okay to address that case in the follow-up? Fixing it might take some time.

@sadikovi
Contributor Author

cc @cloud-fan

MaxGekk commented Oct 17, 2022

Would it be okay to address that case in the follow-up?

Sure.

MaxGekk commented Oct 17, 2022

+1, LGTM. Merging to master.
Thank you, @sadikovi.

@MaxGekk MaxGekk closed this in 31721ba Oct 17, 2022
@sadikovi
Contributor Author

Thanks for merging!

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…and JSON functions

Closes apache#38090 from sadikovi/SPARK-40646.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Max Gekk <[email protected]>