[SPARK-40646][SQL] Fix returning partial results in JSON data source and JSON functions #38090
cc @MaxGekk @HyukjinKwon, I would appreciate a review. Let me know if you have any concerns with the changes. Thank you.

My related PR #23253

Can one of the admins verify this patch?
    .add("c1", StringType)
    .add("c2", ArrayType(new StructType().add("a", LongType)))

    // Value of "c2.a" is a string instead of a long.
Could you explain the comment `"c2.a" is a string`? Doesn't it contradict `"c2": {"a": 1}`? Is 1 a string?
Thanks for pointing it out. The comment is incorrect; it belonged to the original test, which I later refactored. I will update it.
Updated!
      } catch {
        case e: SparkUpgradeException => throw e
    -   case NonFatal(e) if isRoot =>
    +   case NonFatal(e) =>
If you allow this, please check all complex types (map, struct). Or have you already added some tests?
I have tried covering different cases in:
- SPARK-40646: return partial results for JSON arrays with objects
- SPARK-40646: return partial results for objects with values as JSON arrays
I think I am missing a test for map, will add, thanks.
I added a test!
@MaxGekk I have addressed your comments, could you review again?

I think there is still an issue with complex fields:

    test("SPARK-40646") {
      val st = new StructType()
        .add("c1", MapType(StringType, new StructType().add("a1", LongType)))
        .add("c2", StringType)

      // "k2" holds a string where the schema expects a long.
      val df = Seq("""{"c1": {"k1": {"a1": 1}, "k2": {"a1": "A"}, "k3": {"a1": 2}}, "c2": "abc"}""").toDF("c0")
      checkAnswer(
        df.select(from_json($"c0", st)),
        Row(Row(null, "abc"))
      )
    }

would return [...]

Would it be okay to address that case in the follow-up? It might require a bit of time to fix that case.
cc @cloud-fan

Sure.

+1, LGTM. Merging to master.

Thanks for merging!
…and JSON functions

### What changes were proposed in this pull request?

This PR is a follow-up for [SPARK-33134](https://issues.apache.org/jira/browse/SPARK-33134) (apache#30031).

I found another case when, depending on the order of columns, parsing one JSON field breaks all of the subsequent fields resulting in all nulls:

With a file like this:
```
{"a": {"x": 1, "y": true}, "b": {"x": 1}}
{"a": {"x": 2}, "b": {"x": 2}}
```

Reading the file results in column `b` as null even though it is a valid column.
```scala
val df = spark.read
  .schema("a struct<x: int, y: struct<x: int>>, b struct<x: int>")
  .json("path")
===
a b
null null
{"x":2,"y":null} {"x":2}
```

However, b column should be:
```
{"x": 1}
{"x": 2}
```

This particular example actually used to work in earlier Spark versions but it was affected by SPARK-33134 which fixed another bug with the incorrect parsing in `from_json`. Because this case was not tested, we missed it at the time.

In order to fix both SPARK-33134 and SPARK-40646, we need to process `PartialResultException` in `convertArray` method to handle any errors in child objects. Without the fix, the code would not wrap the row in the array for `from_json` resulting in a ClassCastException (SPARK-33134). Because of this handling, we don't need `isRoot` check anymore in `convertObject` thus unblocking SPARK-40646.

I updated the code to handle both cases. With these changes, we can correctly parse this case:
```scala
val df3 = Seq("""[{"c2": [19], "c1": 123456}]""").toDF("c0")
checkAnswer(df3.select(from_json($"c0", ArrayType(st))), Row(Array(Row(123456, null))))
```
which was previously returning `null` for the root row.

### Why are the changes needed?

Fixes a long-standing issue when parsing a JSON with an incorrect field that would break parsing of the entire record.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added unit tests for SPARK-40646 as well as SPARK-33134.

Closes apache#38090 from sadikovi/SPARK-40646.

Authored-by: Ivan Sadikov <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
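
For readers who want to see the control flow in isolation, here is a simplified pure-Scala model of the handling described in the commit message. It is only a sketch and not Spark's `JacksonParser` code: the types `FieldType`, `LongT`, `StringT`, `StructT`, `ArrayT`, the exception `PartialResult`, and the object `PartialJsonModel` are hypothetical names, and already-parsed JSON is modelled as plain Scala `Map` and `List` values. The object converter catches any non-fatal error in a field regardless of nesting depth (no `isRoot` check), and the array converter catches the partial-result exception so that the partial row still ends up inside the array. As a further simplification, the array converter returns normally after collecting partial rows.

```scala
import scala.util.control.NonFatal

// Toy schema types (hypothetical; these are not Spark's DataType classes).
sealed trait FieldType
case object LongT extends FieldType
case object StringT extends FieldType
final case class StructT(fields: List[(String, FieldType)]) extends FieldType
final case class ArrayT(element: FieldType) extends FieldType

// Carries the partially filled row out of a failing struct conversion.
final case class PartialResult(row: List[Any], cause: Throwable)
  extends RuntimeException(cause)

object PartialJsonModel {
  def convert(value: Any, tpe: FieldType): Any = (value, tpe) match {
    case (null, _)                   => null
    case (n: Int, LongT)             => n.toLong
    case (n: Long, LongT)            => n
    case (s: String, StringT)        => s
    case (m: Map[_, _], s: StructT)  => convertObject(m.asInstanceOf[Map[String, Any]], s)
    case (xs: List[_], ArrayT(elem)) => convertArray(xs, elem)
    case _ =>
      throw new IllegalArgumentException(s"cannot convert $value to $tpe")
  }

  // A bad field becomes null and its siblings are still converted.
  // The first error is remembered and reported together with the partial row.
  def convertObject(obj: Map[String, Any], schema: StructT): List[Any] = {
    var firstError: Option[Throwable] = None
    val row = schema.fields.map { case (name, tpe) =>
      try convert(obj.getOrElse(name, null), tpe)
      catch {
        case NonFatal(e) =>
          firstError = firstError.orElse(Some(e))
          null
      }
    }
    firstError match {
      case Some(e) => throw PartialResult(row, e)
      case None    => row
    }
  }

  // A partially converted element row is kept inside the array.
  // Any other failure propagates to the caller.
  def convertArray(items: List[Any], elem: FieldType): List[Any] =
    items.map { item =>
      try convert(item, elem)
      catch { case PartialResult(row, _) => row }
    }

  // Top-level entry point: return the partial result if there is one.
  def parse(value: Any, tpe: FieldType): Any =
    try convert(value, tpe)
    catch {
      case PartialResult(row, _) => row
      case NonFatal(_)           => null
    }
}
```

In this model, the first record of the file example (`"y": true` where a struct is expected) parses to a row whose `a` field is null while `b` keeps its value. The array example keeps `c1` and nulls only `c2`, assuming a schema in which `c2` is an array of structs (the definition of `st` is not shown in the commit message).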