[SPARK-23173][SQL] Avoid creating corrupt parquet files when loading data from JSON #20694
Conversation
ok to test

Test build #87776 has finished for PR 20694 at commit
| test("from_json missing fields") { | ||
| val conf = SQLConf.get | ||
| for (forceJsonNullableSchema <- Seq(false, true)) { | ||
| conf.setConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA, forceJsonNullableSchema) |
We have to revert the conf to its original value.
Thus, move this test to org.apache.spark.sql.execution.datasources.json.JsonSuite. Then, we can use SQLConf there.
I tried moving the test, but it is not as easy as it may seem. org.apache.spark.sql.execution.datasources.json.JsonSuite lacks checkEvaluation that I'm using here. In general, org.apache.spark.sql.execution.datasources.json.JsonSuite seems to be meant for things such as spark.read.json() and not the from_json() function, so the test would be misplaced there. I think that it's better to keep the test in JsonExpressionsSuite and revert the config.
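For context, a minimal sketch of what reverting the conf in place could look like (an illustration under the assumptions of the test excerpt above, which already has a SQLConf handle named conf; this is not the PR's final code):

  // Save the entry's current value, then restore it no matter how the
  // test body exits, so later tests see the original configuration.
  val original = conf.getConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA)
  try {
    conf.setConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA, true)
    // ... run the assertions from the test here ...
  } finally {
    conf.setConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA, original)
  }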
LGTM except one minor comment in the test case.

Test build #88121 has finished for PR 20694 at commit

retest this please

Test build #88126 has finished for PR 20694 at commit
val schema = JsonToStructs(jsonSchema, Map.empty, Literal.create(input, StringType), gmtId)
  .dataType
val schemaToCompare = if (forceJsonNullableSchema) jsonSchema.asNullable else jsonSchema
assert(schemaToCompare == schema);
Nit: ; is useless.
| | "c": "foo" | ||
| |} | ||
| |""" | ||
| .stripMargin |
Nit: the style.
| | "a": 1, | ||
| | "c": "foo" | ||
| |} | ||
| |""" |
The same here.
LGTM. Thanks! Merged to master/2.3. I resolved the style issues when I merged the code.
What changes were proposed in this pull request?
The from_json() function accepts an additional parameter through which the user specifies the schema. The issue is that the specified schema might not be compatible with the data. In particular, the JSON data might be missing values for fields declared as non-nullable in the schema. The from_json() function does not verify the data against such errors, and when data with missing fields is sent to the Parquet encoder, there is no verification there either. The end result is a corrupt Parquet file.
To avoid such corruption, make sure that all fields in the user-specified schema are set to be nullable.
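Conceptually, this amounts to the following (a sketch with hypothetical field names; Spark's internal asNullable, visible in the test excerpt above, additionally recurses into nested structs, arrays, and maps):

  import org.apache.spark.sql.types._

  // A user-specified schema that declares non-nullable fields.
  val userSchema = StructType(Seq(
    StructField("a", IntegerType, nullable = false),
    StructField("c", StringType, nullable = false)))

  // Top-level equivalent of forcing nullability on every field.
  val forced = StructType(userSchema.fields.map(_.copy(nullable = true)))
  assert(forced.forall(_.nullable))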
Since this changes the behavior of a public function, we need to include it in release notes.
The behavior can be reverted by setting spark.sql.fromJsonForceNullableSchema=false.
How was this patch tested?
Added two new tests.
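To make the behavior concrete, a hedged end-to-end sketch (field names and the output path are illustrative, assuming a spark-shell session with a SparkSession named spark):

  import org.apache.spark.sql.functions.from_json
  import org.apache.spark.sql.types._
  import spark.implicits._

  val schema = StructType(Seq(
    StructField("a", IntegerType, nullable = false),
    StructField("c", StringType, nullable = false)))

  // Field "a" is missing from the input even though the schema
  // declares it non-nullable.
  val df = Seq("""{"c": "foo"}""").toDF("json")
    .select(from_json($"json", schema).as("parsed"))

  // With the new default (spark.sql.fromJsonForceNullableSchema=true),
  // parsed.a comes back null and the write produces a valid file;
  // setting the flag to false restores the pre-fix behavior.
  df.select("parsed.a", "parsed.c").write.parquet("/tmp/spark-23173-demo")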