[SPARK-24391][SQL] Support arrays of any types by from_json #21439
Conversation
Test build #91189 has finished for PR 21439 at commit
Test build #91191 has finished for PR 21439 at commit
better to add tests in

Can we also accept primitive arrays in
    // can generate incorrect files if values are missing in columns declared as non-nullable.
    val nullableSchema = if (forceNullableSchema) schema.asNullable else schema
    val unpackArray: Boolean = options.get("unpackArray").map(_.toBoolean).getOrElse(false)
private? (This is not related to this PR though; nullableSchema can also be private?)
Can you make the option unpackArray case-insensitive?
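A minimal, standalone sketch of how the option lookup could be made case-insensitive, assuming Spark's `CaseInsensitiveMap` from `org.apache.spark.sql.catalyst.util` (the same wrapper appears in a diff further down); the `options` value below is illustrative, not taken from the PR:

```
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Illustrative options map; in the expression the real map comes from the caller.
val options: Map[String, String] = Map("UNPACKARRAY" -> "true")

// Wrap the raw options so key lookups ignore case, then read the flag with a default.
val caseInsensitiveOptions = CaseInsensitiveMap(options)
val unpackArray: Boolean =
  caseInsensitiveOptions.get("unpackArray").map(_.toBoolean).getOrElse(false)
```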
If we add this new option here, I feel we'd better document it somewhere (e.g., sql/functions.scala)
Thank you @maropu for your review of the PR.
What kind of tests would you expect in
I believe it should be implemented in another PR because the changes required for
Test build #91350 has finished for PR 21439 at commit
retest this please.
Test build #91356 has finished for PR 21439 at commit
    override def checkInputDataTypes(): TypeCheckResult = nullableSchema match {
      case _: StructType | ArrayType(_: StructType, _) | _: MapType =>
      case ArrayType(_: StructType, _) if unpackArray =>
Even if `unpackArray` is false, the next branch in line 558 still does `super.checkInputDataTypes()` for any `ArrayType`.
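A standalone sketch (not the Spark code; the simplified schema types are hypothetical) of the fall-through being pointed out: when the guard on the first case is false, an array schema still matches the next, more general branch. It can be pasted into a Scala REPL:

```
// Hypothetical simplified schema types, only for illustrating the match behavior.
sealed trait Schema
case object StructSchema extends Schema
case class ArraySchema(elementType: Schema) extends Schema

def check(schema: Schema, unpackArray: Boolean): String = schema match {
  // Guarded branch: taken only when unpackArray is true.
  case ArraySchema(StructSchema) if unpackArray => "unpack each element as a row"
  // Generic branch: still accepts any array when the guard above fails.
  case StructSchema | _: ArraySchema => "accepted by the generic branch"
}

// With unpackArray = false, an array of structs falls through to the generic branch.
println(check(ArraySchema(StructSchema), unpackArray = false))
```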
    private def makeArrayRootConverter(at: ArrayType): JsonParser => Seq[InternalRow] = {
      val elemConverter = makeConverter(at.elementType)
      (parser: JsonParser) => parseJsonToken[Seq[InternalRow]](parser, at) {
        case START_ARRAY => Seq(InternalRow(convertArray(parser, elemConverter)))
In line 87:

    val array = convertArray(parser, elementConverter)
    // Here, as we support reading top level JSON arrays and take every element
    // in such an array as a row, this case is possible.
    if (array.numElements() == 0) {
      Nil
    } else {
      array.toArray[InternalRow](schema).toSeq
    }

Should we also follow this?
The code in line 87 returns null for the JSON input `[]` if the schema is `StructType(StructField("a", IntegerType) :: Nil)`. Let me explain why we should return null in that case: we extract a struct from the array, and if the array is empty there is nothing to extract, so we return null for that nothing.
In the case when the schema is `ArrayType(...)`, I believe we should return an empty array for the empty JSON array `[]`.
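A hypothetical spark-shell sketch of the two cases described above; the expected results in the comments follow the reasoning of this comment, they are not output captured from the PR:

```
scala> import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.types._
scala> val df = Seq("[]").toDF("a")

scala> // Struct schema: nothing can be extracted from an empty array, so the result is null.
scala> df.select(from_json($"a", StructType(StructField("a", IntegerType) :: Nil))).show()

scala> // Array schema: the natural result is an empty array rather than null.
scala> df.select(from_json($"a", ArrayType(IntegerType))).show()
```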
    val nullableSchema = if (forceNullableSchema) schema.asNullable else schema
    private val caseInsensitiveOptions = CaseInsensitiveMap(options)
    private val unpackArray: Boolean = {
Why do we need this? Can you add comments about it?
    val output = InternalRow(1) :: Nil
    checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId, true), output)
    checkEvaluation(
      JsonToStructs(schema, Map("unpackArray" -> "true"), Literal(input), gmtId, true),
Add a case for `unpackArray` as false.
IIUC that's because we need SQL parser tests for these kinds of SQL-related functionality.
Test build #91668 has finished for PR 21439 at commit
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
Test build #93734 has finished for PR 21439 at commit
Is there anything that blocks the PR for now?

@gatorsmile @HyukjinKwon May I ask you to look at the PR one more time?

@HyukjinKwon Is there any chance the PR will be merged, or should I close it?

@gatorsmile Could you look at the PR, please?
    case START_ARRAY => Seq(InternalRow(convertArray(parser, elemConverter)))
    case START_OBJECT if at.elementType.isInstanceOf[StructType] =>
      // This handles the case when an input JSON object is a structure but
      // the specified schema is an array of structures. In that case, the input JSON is
Could you add an example here, like what we did in `makeStructRootConverter`?
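A hypothetical spark-shell sketch of the case the comment above refers to: the input is a single JSON object, but the requested schema is an array of structs, so the object is treated as an array containing that one object (the schema and column names are assumptions for illustration):

```
scala> import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.types._
scala> val df = Seq("""{"a": 1}""").toDF("c")

scala> // The single object becomes the only element of the resulting array of structs.
scala> df.select(from_json($"c", ArrayType(StructType(StructField("a", IntegerType) :: Nil)))).show()
```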
    select from_json('{"c1":[1, 2, 3]}', schema_of_json('{"c1":[0]}'));

    -- from_json - array type
    select from_json('[1, 2, 3]', 'array<int>');
Add more cases?

    select from_json('[3, null, 4]', 'array<int>')
    select from_json('[3, "str", 4]', 'array<int>')
added
LGTM
Test build #94655 has finished for PR 21439 at commit
HyukjinKwon left a comment
LGTM
retest this please
Test build #94659 has finished for PR 21439 at commit
retest this please.

LGTM too.
Test build #94666 has finished for PR 21439 at commit
retest this please
Test build #94677 has finished for PR 21439 at commit
retest this please
Test build #94680 has finished for PR 21439 at commit
Merged to master.
I think the R side has not been updated for this yet. @huaxingao would you like to do that?
Sure. I will work on it. Thanks for letting me know. @viirya |
The PR removes the restriction on element types of an array used as the root type in `from_json`. Currently, the function can handle only arrays of structs; even an array of primitive types is disallowed. The PR allows arrays of any types currently supported by the JSON datasource. Here is an example with an array of a primitive type:
```
scala> import org.apache.spark.sql.functions._
scala> val df = Seq("[1, 2, 3]").toDF("a")
scala> val schema = new ArrayType(IntegerType, false)
scala> val arr = df.select(from_json($"a", schema))
scala> arr.printSchema
root
|-- jsontostructs(a): array (nullable = true)
| |-- element: integer (containsNull = true)
```
and the result of converting the JSON string to the `ArrayType`:
```
scala> arr.show
+----------------+
|jsontostructs(a)|
+----------------+
| [1, 2, 3]|
+----------------+
```
I added a few positive and negative tests:
- array of primitive types
- array of arrays
- array of structs
- array of maps
Closes apache#21439 from MaxGekk/from_json-array.
Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?

The PR removes the restriction on element types of an array used as the root type in `from_json`; arrays of any types currently supported by the JSON datasource are now allowed (see the description and example above).

How was this patch tested?

I added a few positive and negative tests: arrays of primitive types, arrays of arrays, arrays of structs, and arrays of maps.