
Conversation

MaxGekk (Member) commented May 31, 2018

What changes were proposed in this pull request?

In this PR, I propose to support schemas in JSON format for the from_json() function in SQL, as is already implemented in the Scala DSL, for example:

val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}

The changes allow specifying MapType in SQL, which is impossible at the moment:

select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}')

How was this patch tested?

Added a couple of test cases to json-functions.sql.

SparkQA commented May 31, 2018

Test build #91361 has finished for PR 21472 at commit 139ef7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def validateSchemaLiteral(exp: Expression): DataType = exp match {
  case Literal(s, StringType) =>
    try {
      DataType.fromJson(s.toString)
HyukjinKwon (Member) commented on Jun 1, 2018:

I don't think we should support the JSON format here. A DDL-formatted schema is preferred. JSON in functions.scala is supported for backward compatibility because the SQL function wasn't added first; after that, we added the SQL function with DDL-formatted schema support.
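
For context, the DDL-formatted form that is already supported in SQL can be exercised roughly like this (a sketch only, assuming an existing SparkSession named spark):

// DDL-formatted schema in SQL (already supported): "fieldname TYPE, ..."
spark.sql("""select from_json('{"a":1}', 'a INT')""").show()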

MaxGekk (Author) replied:

I believe we should support the JSON format because:

  • The functionality of SQL and the Scala (and other language) DSLs should be equal; otherwise we push users towards the Scala DSL because SQL has fewer features.
  • The feature allows saving/restoring a schema in JSON format. The customer's use case is to have data in JSON format plus meta info, including the schema, in JSON format too. A schema in JSON format gives them more opportunities for processing it in a programmatic way.
  • For now, the JSON format gives us more flexibility and allows MapType (and ArrayType) as the root type of the result of from_json.

HyukjinKwon (Member) replied:

Usually they should be consistent, but we don't necessarily have to add the obsolete functionality to new places just for consistency. I'm not sure how common it is to write a JSON literal as a schema via SQL. How do they get the metadata, and how do they insert it into SQL? Is that the only way to do it?

MaxGekk (Author) replied:

How do they get the metadata ...

Metadata is stored together with the data in a distributed FS and loaded by the standard facilities of the language.

and how do they insert it into SQL?

SQL statements are formed programmatically as strings, and the loaded schemas are inserted at particular positions in the string (you can think of it as quasiquotes in Scala). The formed SQL statements are sent to Spark via JDBC.
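
A hypothetical sketch of that workflow; the file path, column, and table names below are illustrative only:

// The schema JSON is read from storage and interpolated into a SQL string
// that is then sent to Spark (e.g. over JDBC). Names here are made up.
val schemaJson = scala.io.Source.fromFile("/path/to/schema.json").mkString.trim
val query = s"select from_json(payload, '$schemaJson') from events"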

Is that the only way to do it?

Probably it is possible to convert schemas from JSON format to DDL format, but:

  • Schema in DDL supports only StructType as the root type. It is not possible to specify MapType like in the test.

HyukjinKwon (Member) replied:

Schema in DDL supports only StructType as the root type. It is not possible to specify MapType like in the test.

Shall we add support for the type string itself with CatalystSqlParser.parseDataType too?
Also, are you able to use catalogString?
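
For reference, a minimal sketch of what those two pieces do (assuming the catalyst parser API is on the classpath):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

// catalogString renders a DataType as a bare type string ...
val typeString = MapType(StringType, IntegerType).catalogString  // "map<string,int>"
// ... and CatalystSqlParser.parseDataType parses such a string back into a DataType.
val parsed = CatalystSqlParser.parseDataType(typeString)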

MaxGekk (Author) replied:

Shall we add support for the type string itself with CatalystSqlParser.parseDataType too?

I will, but it won't fully solve the customer's problem.

Also, are you able to use catalogString?

I just checked that:

val schema = MapType(StringType, IntegerType).catalogString
val ds = spark.sql(
      s"""
        |select from_json('{"a":1}', '$schema')
      """.stripMargin)
ds.show()

and got this:

extraneous input '<' expecting {'SELECT', 'FROM', ...}(line 1, pos 3)

== SQL ==
map<string,int>
---^^^
; line 2 pos 7

The same happens with val schema = new StructType().add("a", IntegerType).catalogString:

== SQL ==
struct<a:int>
------^^^
; line 2 pos 7
org.apache.spark.sql.AnalysisException

Am I doing something wrong?

HyukjinKwon (Member) replied on Jun 1, 2018:

I mean adding type string support via CatalystSqlParser.parseDataType (like array<...> or map<...>) to from_json, so that it can support struct<a:int>, if I am not mistaken.

HyukjinKwon (Member) replied:

I mean, like what we do on the Python side:

try:
    # DDL format, "fieldname datatype, fieldname datatype".
    return from_ddl_schema(s)
except Exception as e:
    try:
        # For backwards compatibility, "integer", "struct<fieldname: datatype>" and etc.
        return from_ddl_datatype(s)
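
A rough Scala analogue of that fallback (a sketch only, not the PR's actual code):

import scala.util.control.NonFatal
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{DataType, StructType}

def parseSchemaString(schema: String): DataType =
  try {
    // DDL format: "fieldname datatype, fieldname datatype"
    StructType.fromDDL(schema)
  } catch {
    case NonFatal(_) =>
      // Bare data type strings such as "integer" or "struct<a:int>"
      CatalystSqlParser.parseDataType(schema)
  }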

maropu (Member) replied:

If possible, I like @HyukjinKwon's approach. If I remember correctly, we keep the JSON schema format only for backward compatibility. In future major releases, I think we could possibly drop the support.

MaxGekk (Author) replied:

@maropu @HyukjinKwon Please, have a look at the PR: #21550

-- from_json - schema in json format
select from_json('{"a":1}', '{"type":"struct","fields":[{"name":"a","type":"integer", "nullable":true}]}');
select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}');

Member commented:

To make the output file changes smaller, can you add the new tests at the end of the file?

AmplabJenkins commented:

Can one of the admins verify this patch?

MaxGekk (Author) commented Jun 15, 2018

@HyukjinKwon Should I close this, or is there a chance that the PR will be merged to keep SQL consistent with Scala, Python, etc.?

HyukjinKwon (Member) commented Jun 15, 2018

Let's leave this closed for now. I understand there can be a use case for this now, but let's wait and see whether more compelling cases come up in the future. Currently, I am not positive on this. I assume you are unblocked anyway(?).

FWIW, this was also something that me, @maropu and a few (or a single?) committer(s) implicitly agreed upon, if I remember correctly. I tried to find the PR or JIRA but failed to find it. I think it's legitimate to ask where the discussion was made if you feel you need to be sure about this. Please let me know; I will try to find it again.

maropu (Member) commented Jun 15, 2018

+1

MaxGekk closed this on Jun 15, 2018
MaxGekk (Author) commented Jun 16, 2018

@HyukjinKwon @maropu I am trying to use DDL instead of JSON for the schema specification. I cannot find how to specify nullability, for example:

  • valueContainsNull for MapType
  • nullable for StructField
  • containsNull for ArrayType

Is DDL really equal to JSON format?
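
To illustrate the difference, a small sketch (the JSON literal is the one from the tests above; the DDL-style type string has no place for the nullability flags):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// The JSON format carries explicit nullability flags:
val fromJson = DataType.fromJson(
  """{"type":"map","keyType":"string","valueType":"integer","valueContainsNull":false}""")
// The type string cannot express valueContainsNull, which defaults to true:
val fromDdl = CatalystSqlParser.parseDataType("map<string,int>")
// fromJson != fromDdl, because the two differ in valueContainsNull.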

HyukjinKwon (Member) commented:

It's not equal, but I mean it's preferred. Does nullability matter in your case, and does our Jackson parser properly handle the nullability?

MaxGekk deleted the from_json-sql-schema branch on August 17, 2019 13:33