
Conversation

MaxGekk (Member) commented May 31, 2018

What changes were proposed in this pull request?

In this PR, I propose to support schemas in JSON format for the from_json() function in SQL, as is already implemented in the Scala DSL, for example:

val dataType = try {
  DataType.fromJson(schema)
} catch {
  case NonFatal(_) => StructType.fromDDL(schema)
}

The changes allow specifying MapType in SQL, which is impossible at the moment:

select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}')

How was this patch tested?

Added a couple of test cases to json-functions.sql.

SparkQA commented May 31, 2018

Test build #91361 has finished for PR 21472 at commit 139ef7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def validateSchemaLiteral(exp: Expression): DataType = exp match {
  case Literal(s, StringType) =>
    try {
      DataType.fromJson(s.toString)
HyukjinKwon (Member) commented on Jun 1, 2018:

I don't think we should support the JSON format here. A DDL-formatted schema is preferred. JSON in functions.scala is supported for backward compatibility because the SQL function wasn't added first; after that, we added the SQL function with DDL-formatted schema support.
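
For context, the DDL-formatted form that is already supported in SQL can be exercised roughly like this (a sketch only, assuming an existing SparkSession named spark):

// DDL-formatted schema in SQL (already supported): "fieldname TYPE, ..."
spark.sql("""select from_json('{"a":1}', 'a INT')""").show()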

MaxGekk (Author) replied:

I believe we should support the JSON format because:

  • The functionality of SQL and the Scala (and other language) DSLs should be equal; otherwise we push users towards the Scala DSL because SQL has fewer features.
  • The feature allows saving/restoring a schema in JSON format. The customer's use case is to have data in JSON format plus meta info, including the schema, in JSON format too. A schema in JSON format gives them more opportunities for processing it in a programmatic way.
  • For now, the JSON format gives us more flexibility and allows MapType (and ArrayType) as the root type of the result of from_json.

HyukjinKwon (Member) replied:

Usually they should be consistent, but we don't necessarily have to add the obsolete functionality to new places just for consistency. I'm not sure how common it is to write a JSON literal as a schema via SQL. How do they get the metadata, and how do they insert it into SQL? Is that the only way to do it?

MaxGekk (Author) replied:

How do they get the metadata ...

Metadata is stored together with the data in a distributed FS and loaded by the standard facilities of the language.

and how do they insert it into SQL?

SQL statements are formed programmatically as strings, and the loaded schemas are inserted at particular positions in the string (you can think of it as quasiquotes in Scala). The formed SQL statements are sent to Spark via JDBC.
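
A hypothetical sketch of that workflow; the file path, column, and table names below are illustrative only:

// The schema JSON is read from storage and interpolated into a SQL string
// that is then sent to Spark (e.g. over JDBC). Names here are made up.
val schemaJson = scala.io.Source.fromFile("/path/to/schema.json").mkString.trim
val query = s"select from_json(payload, '$schemaJson') from events"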

Is that the only way to do it?

Probably it is possible to convert schemas from JSON format to DDL format, but:

  • Schema in DDL supports only StructType as the root type. It is not possible to specify MapType like in the test.

HyukjinKwon (Member) replied:

Schema in DDL supports only StructType as the root type. It is not possible to specify MapType like in the test.

Shall we add support for the type string itself with CatalystSqlParser.parseDataType too?
Also, are you able to use catalogString?
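
For reference, a minimal sketch of what those two pieces do (assuming the catalyst parser API is on the classpath):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

// catalogString renders a DataType as a bare type string ...
val typeString = MapType(StringType, IntegerType).catalogString  // "map<string,int>"
// ... and CatalystSqlParser.parseDataType parses such a string back into a DataType.
val parsed = CatalystSqlParser.parseDataType(typeString)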

MaxGekk (Author) replied:

Shall we add support for the type string itself with CatalystSqlParser.parseDataType too?

I will, but it won't fully solve the customer's problem.

Also, are you able to use catalogString?

I just checked that:

val schema = MapType(StringType, IntegerType).catalogString
val ds = spark.sql(
      s"""
        |select from_json('{"a":1}', '$schema')
      """.stripMargin)
ds.show()

and got this:

extraneous input '<' expecting {'SELECT', 'FROM', ...}(line 1, pos 3)

== SQL ==
map<string,int>
---^^^
; line 2 pos 7

The same happens with val schema = new StructType().add("a", IntegerType).catalogString:

== SQL ==
struct<a:int>
------^^^
; line 2 pos 7
org.apache.spark.sql.AnalysisException

Am I doing something wrong?

HyukjinKwon (Member) replied on Jun 1, 2018:

I mean adding type string support via CatalystSqlParser.parseDataType (like array<...> or map<...>) to from_json, so that it can support struct<a:int>, if I am not mistaken.

HyukjinKwon (Member) replied:

I mean, like what we do on the Python side:

try:
    # DDL format, "fieldname datatype, fieldname datatype".
    return from_ddl_schema(s)
except Exception as e:
    try:
        # For backwards compatibility, "integer", "struct<fieldname: datatype>" and etc.
        return from_ddl_datatype(s)
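
A rough Scala analogue of that fallback (a sketch only, not the PR's actual code):

import scala.util.control.NonFatal
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.{DataType, StructType}

def parseSchemaString(schema: String): DataType =
  try {
    // DDL format: "fieldname datatype, fieldname datatype"
    StructType.fromDDL(schema)
  } catch {
    case NonFatal(_) =>
      // Bare data type strings such as "integer" or "struct<a:int>"
      CatalystSqlParser.parseDataType(schema)
  }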

maropu (Member) replied:

If possible, I like @HyukjinKwon's approach. If I remember correctly, we keep the JSON schema format only for backward compatibility. In future major releases, I think we could possibly drop the support.

MaxGekk (Author) replied:

@maropu @HyukjinKwon Please, have a look at the PR: #21550

-- from_json - schema in json format
select from_json('{"a":1}', '{"type":"struct","fields":[{"name":"a","type":"integer", "nullable":true}]}');
select from_json('{"a":1}', '{"type":"map", "keyType":"string", "valueType":"integer","valueContainsNull":false}');

Member commented:

To make the output file changes smaller, can you add the new tests at the end of the file?

AmplabJenkins commented:

Can one of the admins verify this patch?

MaxGekk (Author) commented Jun 15, 2018

@HyukjinKwon Should I close this, or is there a chance that the PR will be merged to keep SQL consistent with Scala, Python, etc.?

HyukjinKwon (Member) commented Jun 15, 2018

Let's leave this closed for now. I understand there can be a use case for this now, but let's wait and see whether more compelling cases come up in the future. Currently, I am not positive on this. I assume you are unblocked anyway(?).

FWIW, this was also something that me, @maropu and a few (or a single?) committer(s) implicitly agreed upon, if I remember correctly. I tried to find the PR or JIRA but failed to find it. I think it's legitimate to ask where the discussion was made if you feel you need to be sure about this. Please let me know; I will try to find it again.

maropu (Member) commented Jun 15, 2018

+1

MaxGekk closed this on Jun 15, 2018
MaxGekk (Author) commented Jun 16, 2018

@HyukjinKwon @maropu I am trying to use DDL instead of JSON for the schema specification. I cannot find how to specify nullability, for example:

  • valueContainsNull for MapType
  • nullable for StructField
  • containsNull for ArrayType

Is DDL really equal to JSON format?
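
To illustrate the difference, a small sketch (the JSON literal is the one from the tests above; the DDL-style type string has no place for the nullability flags):

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.DataType

// The JSON format carries explicit nullability flags:
val fromJson = DataType.fromJson(
  """{"type":"map","keyType":"string","valueType":"integer","valueContainsNull":false}""")
// The type string cannot express valueContainsNull, which defaults to true:
val fromDdl = CatalystSqlParser.parseDataType("map<string,int>")
// fromJson != fromDdl, because the two differ in valueContainsNull.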

HyukjinKwon (Member) commented:

It's not equal, but I mean it's preferred. Does nullability matter in your case, and does our Jackson parser properly handle the nullability?

MaxGekk deleted the from_json-sql-schema branch on August 17, 2019 13:33