[SPARK-24391][SQL] Support arrays of any types by from_json #21439
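For context, here is a rough sketch of what this change enables through the public `from_json` API. It is illustrative rather than taken from the PR; it assumes a `SparkSession` named `spark` built with this patch and `import spark.implicits._` in scope (for example, a `spark-shell` session):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// A JSON array whose elements we want to read as maps rather than structs.
val df = Seq("""[{"a": 1}, {"b": 2}]""").toDF("json")

// Before this change, from_json only accepted a StructType or an ArrayType of StructType;
// an array of maps (or of ints, strings, ...) was rejected at analysis time.
val parsed = df.select(
  from_json($"json", ArrayType(MapType(StringType, IntegerType))).as("arr"))

parsed.show(false)
// Expected shape of the result: a single row holding the array [Map(a -> 1), Map(b -> 2)]
```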
```diff
@@ -61,6 +61,7 @@ class JacksonParser(
     dt match {
       case st: StructType => makeStructRootConverter(st)
       case mt: MapType => makeMapRootConverter(mt)
+      case at: ArrayType => makeArrayRootConverter(at)
```
Member:
This change accepts JSON datasource input that master can't parse? If so, I think we need tests in

Member (Author):
Right, it accepts arrays of any type, whereas master accepts only arrays of structs.
I added a few tests to

Member:
You've already added tests in

Member (Author):
Sorry, I didn't catch that you meant a specific class. Just in case, what is the reason for adding tests to

Member:
Ah, ok. You touched the
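The test suites named in this thread are cut off in the excerpt above. As a rough idea of the kind of end-to-end test being discussed, here is a hypothetical sketch; it assumes a suite built on Spark's `QueryTest` with `checkAnswer`, the shared-session implicits (`toDF`, `$`), and the usual `functions`/`types`/`Row` imports in scope:

```scala
// Hypothetical test sketch; the PR's actual tests are not visible in this excerpt.
test("from_json - array of primitive types") {
  // One input row containing a JSON array of ints.
  val df = Seq("[1, 2, 3]").toDF("a")
  val schema = ArrayType(IntegerType)
  // With this patch, from_json should yield the parsed array as a single column.
  checkAnswer(df.select(from_json($"a", schema)), Row(Seq(1, 2, 3)))
}
```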
```diff
     }
   }

@@ -101,6 +102,35 @@ class JacksonParser(
     }
   }

+  private def makeArrayRootConverter(at: ArrayType): JsonParser => Seq[InternalRow] = {
+    val elemConverter = makeConverter(at.elementType)
+    (parser: JsonParser) => parseJsonToken[Seq[InternalRow]](parser, at) {
+      case START_ARRAY => Seq(InternalRow(convertArray(parser, elemConverter)))
```
Member:
In line 87: should we also follow this?

Member (Author):
The code in line 87 returns … In the case when the schema is
```diff
+      case START_OBJECT if at.elementType.isInstanceOf[StructType] =>
```
Contributor:
Why do we need this? This means that if the user asks for an array of data and the data doesn't contain an array, we return an array with a single element. This seems wrong to me. I'd rather return null or throw an exception.

Member (Author):
This is for backward compatibility with the behavior introduced by PR #16929, most likely by this change: https://github.com/apache/spark/pull/16929/files#diff-6626026091295ad8c0dfb66ecbcd04b1R506. A dedicated test was even added: https://github.com/apache/spark/pull/16929/files#diff-88230f171af0b7a40791a867f9dd3a36R382. Please ask the author, @HyukjinKwon, about the reasons for the changes.

Member:
If an array of data is empty and the schema is an array, it should return an empty array: https://github.com/apache/spark/pull/16929/files#diff-88230f171af0b7a40791a867f9dd3a36R389. Returning an empty object or array was not introduced by me in the JSON datasource anyway.

Member:
It is a super weird case... Can we put this special handling code for backward compatibility in

Contributor:
I think what was done in that PR is pretty different; it is actually the opposite of what we are doing here. There, an array of structs is returned when a struct is specified as the schema and the JSON contains an array. Here, an array with one struct is returned when the schema is an array of structs and a struct is found instead of an array. Although I don't really like the behavior introduced in the PR you mentioned, I can understand it, as it was a way to support arrays of structs (the only option at the time), and I don't think we can change it before 3.0, at least for backward compatibility. But since we are introducing new behavior here, if an array is required and a struct is found, I think returning an array with one element is wrong/unexpected; returning null is what I'd expect as a user.

Member (Author):
I can only say that this … @maropu Initially I put the behavior under a flag in

Member:
Shall we add a comment on top of this

Member (Author):
I am adding a comment for this.
```diff
+        // This handles the case when an input JSON object is a structure but
+        // the specified schema is an array of structures. In that case, the input JSON is
```
Member:
Could you add an example here, like what we did in
```diff
+        // considered as an array of only one element of struct type.
+        // This behavior was introduced by changes for SPARK-19595.
+        //
+        // For example, if the specified schema is ArrayType(new StructType().add("i", IntegerType))
+        // and JSON input as below:
+        //
+        // [{"i": 1}, {"i": 2}]
+        // [{"i": 3}]
+        // {"i": 4}
+        //
+        // The last row is considered as an array with one element, and result of conversion:
+        //
+        // Seq(Row(1), Row(2))
+        // Seq(Row(3))
+        // Seq(Row(4))
+        //
+        val st = at.elementType.asInstanceOf[StructType]
+        val fieldConverters = st.map(_.dataType).map(makeConverter).toArray
+        Seq(InternalRow(new GenericArrayData(Seq(convertObject(parser, st, fieldConverters)))))
+    }
+  }

   /**
    * Create a converter which converts the JSON documents held by the `JsonParser`
    * to a value according to a desired schema.
```
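Seen from the user side, the compatibility behavior documented in the added comment looks roughly like this. This is a sketch, not taken from the PR; it assumes a `SparkSession` named `spark` with `spark.implicits._` imported, and the output shapes are approximate:

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Schema: an array of structs with a single int field "i".
val schema = ArrayType(new StructType().add("i", IntegerType))

// One record that is a real JSON array, one that is a bare JSON object.
val input = Seq("""[{"i": 1}, {"i": 2}]""", """{"i": 4}""").toDF("json")

input.select(from_json($"json", schema).as("parsed")).show(false)
// Expected values (formatting approximate):
//   row 1 -> an array of two structs: [{i = 1}, {i = 2}]
//   row 2 -> the bare object wrapped into a one-element array: [{i = 4}]
```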
```diff
@@ -39,3 +39,15 @@ select from_json('{"a":1, "b":"2"}', 'struct<a:int,b:string>');
 -- infer schema of json literal
 select schema_of_json('{"c1":0, "c2":[1]}');
 select from_json('{"c1":[1, 2, 3]}', schema_of_json('{"c1":[0]}'));
+
+-- from_json - array type
+select from_json('[1, 2, 3]', 'array<int>');
```
Member:
Add more cases?

Member (Author):
Added.
```diff
+select from_json('[1, "2", 3]', 'array<int>');
+select from_json('[1, 2, null]', 'array<int>');
+
+select from_json('[{"a": 1}, {"a":2}]', 'array<struct<a:int>>');
+select from_json('{"a": 1}', 'array<struct<a:int>>');
+select from_json('[null, {"a":2}]', 'array<struct<a:int>>');
+
+select from_json('[{"a": 1}, {"b":2}]', 'array<map<string,int>>');
+select from_json('[{"a": 1}, 2]', 'array<map<string,int>>');
```
Please also update the comment of `JsonToStructs`, which currently reads: "Converts an json input string to a [[StructType]] or [[ArrayType]] of [[StructType]]s."