
Conversation

@MaxGekk (Member) commented May 26, 2018

What changes were proposed in this pull request?

The PR removes a restriction on element types of the array type that exists in from_json for the root type. Currently, the function can handle only arrays of structs; even an array of primitive types is disallowed. The PR allows arrays of any type currently supported by the JSON datasource. Here is an example with an array of a primitive type:

scala> import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.types._
scala> val df = Seq("[1, 2, 3]").toDF("a")
scala> val schema = new ArrayType(IntegerType, false)
scala> val arr = df.select(from_json($"a", schema))
scala> arr.printSchema
root
 |-- jsontostructs(a): array (nullable = true)
 |    |-- element: integer (containsNull = true)

and the result of converting the JSON string to the ArrayType:

scala> arr.show
+----------------+
|jsontostructs(a)|
+----------------+
|       [1, 2, 3]|
+----------------+
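
The conversion is not limited to primitive element types. As a minimal sketch (assuming the same spark-shell session as above; the column name b and the field name x are made up for illustration), an array of structs works the same way:

scala> val df2 = Seq("""[{"x": 1}, {"x": 2}]""").toDF("b")
scala> val structSchema = ArrayType(new StructType().add("x", IntegerType))
scala> df2.select(from_json($"b", structSchema)).show()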

How was this patch tested?

I added a few positive and negative tests:

  • array of primitive types
  • array of arrays
  • array of structs
  • array of maps

@SparkQA commented May 26, 2018

Test build #91189 has finished for PR 21439 at commit b601a93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 27, 2018

Test build #91191 has finished for PR 21439 at commit 02a97ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented May 28, 2018

better to add tests in json-functions.sql?

@maropu (Member) commented May 29, 2018

Can we also accept primitive arrays in to_json?

// can generate incorrect files if values are missing in columns declared as non-nullable.
val nullableSchema = if (forceNullableSchema) schema.asNullable else schema

val unpackArray: Boolean = options.get("unpackArray").map(_.toBoolean).getOrElse(false)
Review comment (Member):

private? (This is not related to this pr though, nullableSchema also can be private?)

Review comment (Member):

Can you make the option unpackArray case-insensitive?

Review comment (Member):

If we add this new option here, I feel we'd better document it somewhere (e.g., sql/functions.scala).

@MaxGekk (Member, Author) commented May 30, 2018

Thank you @maropu for your review of the PR.

better to add tests in json-functions.sql?

What kind of tests would you expect in json-functions.sql? Presumably tests that are different from the ones added to JsonExpressionsSuite.scala.

Can we also accept primitive arrays in to_json?

I believe it should be implemented in another PR because the changes required for to_json don't intersect with this PR.

@gatorsmile (Member):

cc @gengliangwang

@SparkQA commented May 31, 2018

Test build #91350 has finished for PR 21439 at commit 9d0230a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member):

retest this please.

@MaxGekk (Member, Author) commented May 31, 2018

@maropu For now it is impossible to specify a schema for from_json in SQL if it is not a StructType, so I cannot add the json-functions.sql tests yet. PR #21472 solves the problem.

@SparkQA commented May 31, 2018

Test build #91356 has finished for PR 21439 at commit 9d0230a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override def checkInputDataTypes(): TypeCheckResult = nullableSchema match {
  case _: StructType | ArrayType(_: StructType, _) | _: MapType =>
  case ArrayType(_: StructType, _) if unpackArray =>
Review comment (Member):

Even if unpackArray is false, the next branch at line 558 still does super.checkInputDataTypes() for any ArrayType.
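
A generic, hedged sketch of the fall-through described here (the method and names are made up; this is not the PR's code): when the if unpackArray guard is false, matching simply proceeds to the next case, so an ArrayType schema is still handled by the following branch.

import org.apache.spark.sql.types._

def describe(dt: DataType, unpackArray: Boolean): String = dt match {
  case ArrayType(_: StructType, _) if unpackArray => "unpacked array of structs"
  case _: ArrayType => "handled by this branch when the guard above is false"
  case _ => "other"
}

// With unpackArray = false the first case's guard fails, so the second case matches.
describe(ArrayType(new StructType().add("a", IntegerType)), unpackArray = false)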

private def makeArrayRootConverter(at: ArrayType): JsonParser => Seq[InternalRow] = {
  val elemConverter = makeConverter(at.elementType)
  (parser: JsonParser) => parseJsonToken[Seq[InternalRow]](parser, at) {
    case START_ARRAY => Seq(InternalRow(convertArray(parser, elemConverter)))
Review comment (Member):

In line 87:

        val array = convertArray(parser, elementConverter)
        // Here, as we support reading top level JSON arrays and take every element
        // in such an array as a row, this case is possible.
        if (array.numElements() == 0) {
          Nil
        } else {
          array.toArray[InternalRow](schema).toSeq
        }

Should we also follow this?

Review comment (Member, Author):

The code at line 87 returns null for the JSON input [] if the schema is StructType(StructField("a", IntegerType) :: Nil). To explain why we should return null in that case: we extract a struct from the array, and if the array is empty there is nothing to extract, so we return null.

When the schema is ArrayType(...), I believe we should return an empty array for the empty JSON array [].
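
To make the distinction concrete, a minimal sketch (spark-shell style; the column name a is chosen for illustration, and the comments state the behavior argued for above, not verified output):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val empty = Seq("[]").toDF("a")  // assumes spark.implicits._ is in scope, e.g. in spark-shell

// Struct schema: nothing to extract from the empty array, so the result should be null.
empty.select(from_json($"a", StructType(StructField("a", IntegerType) :: Nil))).show()

// Array schema: the empty JSON array should come back as an empty array, not null.
empty.select(from_json($"a", ArrayType(IntegerType))).show()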

val nullableSchema = if (forceNullableSchema) schema.asNullable else schema

private val caseInsensitiveOptions = CaseInsensitiveMap(options)
private val unpackArray: Boolean = {
Review comment (Member):

Why do we need this? Can you add comments about it?
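
A hedged guess at how the truncated snippet above might continue, including the comment the reviewer asks for (a sketch only, not the PR's actual code; the wrapper class is made up):

import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

class UnpackArrayOptionSketch(options: Map[String, String]) {  // hypothetical wrapper class
  private val caseInsensitiveOptions = CaseInsensitiveMap(options)

  // When true, a top-level JSON array is unpacked so that each of its elements becomes
  // its own row; when false, the whole array is returned as a single value.
  val unpackArray: Boolean =
    caseInsensitiveOptions.get("unpackArray").map(_.toBoolean).getOrElse(false)
}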

val output = InternalRow(1) :: Nil
checkEvaluation(JsonToStructs(schema, Map.empty, Literal(input), gmtId, true), output)
checkEvaluation(
  JsonToStructs(schema, Map("unpackArray" -> "true"), Literal(input), gmtId, true),
Review comment (Member):

add case for unpackArray as false

@maropu (Member) commented Jun 3, 2018

What kind of tests would you expect in json-functions.sql? Presumably tests that are different
from the ones added to JsonExpressionsSuite.scala.

IIUC that's because we need SQL parser tests for these kinds of SQL related functionality.
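
To illustrate the difference, a hedged sketch (assuming a spark-shell session; the SQL query mirrors the one added later to json-functions.sql, and as noted above it only works once non-struct DDL schemas are supported): the first call goes through the Scala DSL only, while the second exercises the SQL parser, which is what json-functions.sql covers.

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, IntegerType}

val df = Seq("[1, 2, 3]").toDF("a")  // assumes spark.implicits._ is in scope
df.select(from_json($"a", ArrayType(IntegerType))).show()         // DSL path, no SQL parsing

spark.sql("select from_json('[1, 2, 3]', 'array<int>')").show()   // SQL parser path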

@SparkQA commented Jun 11, 2018

Test build #91668 has finished for PR 21439 at commit e321e37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Jun 13, 2018

@maropu I need the changes from PR #21550 (or from #21472) to write SQL tests for ArrayType.

MaxGekk added 3 commits June 15, 2018 00:51
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
@SparkQA commented Jul 29, 2018

Test build #93734 has finished for PR 21439 at commit 021350b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Jul 30, 2018

Is there anything blocking the PR right now?

@MaxGekk (Member, Author) commented Aug 3, 2018

@gatorsmile @HyukjinKwon May I ask you to look at the PR one more time?

@MaxGekk (Member, Author) commented Aug 5, 2018

@HyukjinKwon Is there any chance the PR will be merged, or should I close it?

@MaxGekk (Member, Author) commented Aug 10, 2018

@gatorsmile Could you take a look at the PR, please?

case START_ARRAY => Seq(InternalRow(convertArray(parser, elemConverter)))
case START_OBJECT if at.elementType.isInstanceOf[StructType] =>
  // This handles the case when an input JSON object is a structure but
  // the specified schema is an array of structures. In that case, the input JSON is
Review comment (Member):

Could you add an example here, like what we did in makeStructRootConverter?

select from_json('{"c1":[1, 2, 3]}', schema_of_json('{"c1":[0]}'));

-- from_json - array type
select from_json('[1, 2, 3]', 'array<int>');
Review comment (Member):

Add more cases?
select from_json('[3, null, 4]', 'array<int>')
select from_json('[3, "str", 4]', 'array<int>')

Review comment (Member, Author):

added

@gatorsmile (Member):

LGTM

@SparkQA commented Aug 13, 2018

Test build #94655 has finished for PR 21439 at commit 74a7799.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) left a review comment:

LGTM

@HyukjinKwon (Member):

retest this please

@SparkQA commented Aug 13, 2018

Test build #94659 has finished for PR 21439 at commit 74a7799.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Aug 13, 2018

retest this please.

@viirya (Member) commented Aug 13, 2018

LGTM too.

@SparkQA commented Aug 13, 2018

Test build #94666 has finished for PR 21439 at commit 74a7799.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Aug 13, 2018

Test build #94677 has finished for PR 21439 at commit 74a7799.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

retest this please

@SparkQA commented Aug 13, 2018

Test build #94680 has finished for PR 21439 at commit 74a7799.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

Merged to master.

@asfgit asfgit closed this in ab06c25 Aug 13, 2018
@viirya (Member) commented Aug 13, 2018

I think the R side is not updated for this yet. @huaxingao would you like to do that?

@huaxingao (Contributor):

Sure. I will work on it. Thanks for letting me know. @viirya

jzhuge pushed a commit to jzhuge/spark that referenced this pull request Mar 7, 2019
Closes apache#21439 from MaxGekk/from_json-array.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
@MaxGekk MaxGekk deleted the from_json-array branch August 17, 2019 13:33
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019