[SPARK-24445][SQL] Schema in json format for from_json in SQL #21472
Conversation
Test build #91361 has finished for PR 21472 at commit
```scala
def validateSchemaLiteral(exp: Expression): DataType = exp match {
  case Literal(s, StringType) =>
    try {
      DataType.fromJson(s.toString)
```
I don't think we should support the JSON format here; the DDL-formatted schema is preferred. JSON in functions.scala is supported for backward compatibility because the SQL function wasn't added first. After that, we added the SQL function with DDL-formatted schema support.
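For reference, a minimal sketch of the DDL-formatted schema form that the SQL function already accepts; the literal values are illustrative, not from this PR:

```scala
// Assuming an active SparkSession named `spark` (e.g. in spark-shell).
// The DDL-formatted schema string "a INT" describes a struct with one int field.
val df = spark.sql("""select from_json('{"a":1}', 'a INT')""")
df.printSchema()
df.show()
```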
I believe we should support the JSON format because:
- The functionality of SQL and the Scala (and other languages) DSL should be equal; otherwise we push users towards the Scala DSL because SQL has fewer features.
- The feature allows saving/restoring a schema in JSON format. The customer's use case is to keep data in JSON format plus meta info, including the schema, in JSON format too. A schema in JSON format gives them more opportunities for programmatic processing.
- For now, the JSON format gives us more flexibility and allows MapType (and ArrayType) as the root type for the result of from_json (see the sketch below).
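A minimal sketch of what the PR aims to enable; this assumes the PR's change is applied, and the JSON literal matches the schema representation used elsewhere in this thread:

```scala
// Assuming an active SparkSession `spark` and this PR's change applied:
// a JSON-format schema whose root type is a MapType, not a StructType.
val mapSchema =
  """{"type":"map","keyType":"string","valueType":"integer","valueContainsNull":false}"""
val ds = spark.sql(s"""select from_json('{"a":1}', '$mapSchema')""")
ds.show()
```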
Usually they should be consistent, but we don't necessarily have to add the obsolete functionality to new places for consistency's sake. I'm not sure how common it is to write a JSON literal as a schema via SQL. How do they get the metadata, and how do they insert it into SQL? Is that the only way to do it?
> How do they get the metadata ...

The metadata is stored together with the data in a distributed FS and is loaded by the standard facilities of the language.

> and how do they insert it into SQL?

SQL statements are formed programmatically as strings, and the loaded schemas are inserted at particular positions in the string (you can think of it as quasiquotes in Scala). The formed SQL statements are sent via JDBC to Spark.
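A hypothetical sketch of the workflow described above; the path, table, and column names are made up for illustration:

```scala
// Load the schema JSON that was stored alongside the data (location is assumed),
// splice it into a SQL string, and send that string to Spark (e.g. over JDBC).
import scala.io.Source

val schemaJson = Source.fromFile("/data/events/schema.json").mkString   // hypothetical path
val query = s"""select from_json(payload, '$schemaJson') from events""" // hypothetical table/column
// `query` is then submitted through the customer's JDBC connection to Spark.
```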
> Is that the only way to do it?

It is probably possible to convert schemas from JSON format to DDL format, but:
- it requires much more effort and time than just modifying the 5 lines proposed in the PR;
- a schema in DDL supports only StructType as the root type, so it is not possible to specify MapType like in this test (see the sketch after this list): https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala#L330-L345
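To illustrate the point about root types, a small sketch (my own, not from the PR) showing that the JSON representation round-trips a MapType root, which the DDL table-schema form ("name type, ...") cannot express:

```scala
import org.apache.spark.sql.types._

val mapType = MapType(StringType, IntegerType)
val json = mapType.json                    // {"type":"map","keyType":"string",...}
assert(DataType.fromJson(json) == mapType) // the MapType root survives the round trip
```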
> Schema in DDL supports only StructType as the root type. It is not possible to specify MapType like in the test.

Shall we add support for the type itself with CatalystSqlParser.parseDataType too?
Also, are you able to use catalogString?
> Shall we add support for the type itself with CatalystSqlParser.parseDataType too?

I will, but it won't fully solve the customer's problem.

> Also, are you able to use catalogString?

I just checked that:

```scala
val schema = MapType(StringType, IntegerType).catalogString
val ds = spark.sql(
  s"""
     |select from_json('{"a":1}', '$schema')
   """.stripMargin)
ds.show()
```

and got this one:
```
extraneous input '<' expecting {'SELECT', 'FROM', ...}(line 1, pos 3)

== SQL ==
map<string,int>
---^^^
; line 2 pos 7
```

The same with `val schema = new StructType().add("a", IntegerType).catalogString`:

```
== SQL ==
struct<a:int>
------^^^
; line 2 pos 7

org.apache.spark.sql.AnalysisException
```
Am I doing something wrong?
I mean adding support for a type string parsed by CatalystSqlParser.parseDataType (like array<...> or map<...>) into from_json, so that this can support struct<a:int>, if I am not mistaken.
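In other words, something along these lines (a sketch using CatalystSqlParser, which already exposes parseDataType):

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// parseDataType accepts a bare type string, so map<...> / array<...> roots work,
// unlike the "fieldname type, ..." table-schema form.
val mapType    = CatalystSqlParser.parseDataType("map<string,int>")
val structType = CatalystSqlParser.parseDataType("struct<a:int>")
println(mapType.catalogString)     // map<string,int>
println(structType.catalogString)  // struct<a:int>
```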
I mean, like what we do on the Python side:

spark/python/pyspark/sql/types.py, lines 808 to 814 in 4f1e386:

```python
try:
    # DDL format, "fieldname datatype, fieldname datatype".
    return from_ddl_schema(s)
except Exception as e:
    try:
        # For backwards compatibility, "integer", "struct<fieldname: datatype>" and etc.
        return from_ddl_datatype(s)
```
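A rough Scala analogue of that Python fallback (a sketch, not the actual Spark implementation): try the table-schema form first, then fall back to a bare data-type string:

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

def parseSchemaString(s: String): DataType =
  try {
    // DDL table-schema form, e.g. "a INT, b STRING"
    CatalystSqlParser.parseTableSchema(s)
  } catch {
    // Fallback: a bare data type, e.g. "struct<a:int>" or "map<string,int>"
    case _: Exception => CatalystSqlParser.parseDataType(s)
  }
```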
If possible, I like @HyukjinKwon's approach. If I remember correctly, we keep the JSON schema format only for backward compatibility. In a future major release, I think we could possibly drop the support.
@maropu @HyukjinKwon Please, have a look at the PR: #21550
```sql
-- from_json - schema in json format
select from_json('{"a":1}', '{"type":"struct","fields":[{"name":"a","type":"integer", "nullable":true}]}');
select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}');
```
To make the output file changes smaller, can you add the new tests at the end of the file?
Can one of the admins verify this patch?
@HyukjinKwon Should I close this, or is there a chance that the PR will be merged to keep SQL consistent with Scala, Python, etc.?
Let's leave this closed for now. I get that there can be a use case for this now, but let's wait and see if there could be more compelling cases in the future. Currently, I am not positive on this. I assume you are unblocked anyway(?). FWIW, this was also something that, if I remember correctly, me, @maropu and a few(?) or a single(?) other committer(s) implicitly agreed upon. I tried to find the PR or JIRA but failed. I think it's legitimate to ask where the discussion was made if you feel you need to be sure on this. Please let me know; I will try to find it again.
+1
@HyukjinKwon @maropu I am trying to use DDL instead of JSON for schema specification. I cannot find how to specify
Is DDL really equal to the JSON format?
It's not equal, but I mean it's preferred. Does nullability matter in your case, and does our Jackson parser properly handle the nullability?
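For what it's worth, nullability does survive a round trip through the JSON schema representation; a quick check (my own sketch, not part of the PR):

```scala
import org.apache.spark.sql.types._

// valueContainsNull = false is preserved by a JSON round trip of the schema.
val t = MapType(StringType, IntegerType, valueContainsNull = false)
assert(DataType.fromJson(t.json) == t)
```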
What changes were proposed in this pull request?
In the PR, I propose to support schema in JSON format for the from_json() function in SQL, as it has already been implemented in the Scala DSL. The changes will allow specifying
MapType in SQL, which is impossible at the moment.
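Below is a rough sketch of the existing Scala DSL behaviour that the PR mirrors in SQL; the sample data is illustrative:

```scala
// Assuming an active SparkSession named `spark` (e.g. in spark-shell).
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq("""{"a":1}""").toDF("json")
// The Scala DSL already accepts a MapType schema for from_json.
df.select(from_json($"json", MapType(StringType, IntegerType))).show()
```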
How was this patch tested?
Added a couple of test cases to json-functions.sql.