Conversation

@mswit-databricks
Contributor

What changes were proposed in this pull request?

The from_json() function accepts an optional parameter where the user can specify a schema. The issue is that the specified schema might not be compatible with the data: in particular, the JSON data might be missing values for fields declared as non-nullable in the schema. The from_json() function does not validate the data against such errors, and when data with missing fields is sent to the parquet encoder, there is no verification there either. The end result is a corrupt parquet file.

To avoid corruption, make sure that all fields in the user-specified schema are set to nullable.
Since this changes the behavior of a public function, we need to include it in the release notes.
The behavior can be reverted by setting spark.sql.fromJsonForceNullableSchema=false.

How was this patch tested?

Added two new tests.
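As background, the fix boils down to recursively marking every field of the user-supplied schema nullable, the transformation Spark exposes as DataType.asNullable. A minimal, self-contained Scala sketch of that idea, using a toy schema model rather than Spark's actual StructType/StructField classes (all names here are illustrative):

```scala
// A toy model of a SQL schema, standing in for Spark's StructType/StructField.
sealed trait DType
case object IntT extends DType
case object StringT extends DType
case class Field(name: String, dataType: DType, nullable: Boolean)
case class StructT(fields: List[Field]) extends DType

// Recursively mark every field, including fields of nested structs, nullable.
def asNullable(dt: DType): DType = dt match {
  case StructT(fields) =>
    StructT(fields.map(f => f.copy(dataType = asNullable(f.dataType), nullable = true)))
  case leaf => leaf
}

// A user-specified schema with non-nullable fields, including a nested one.
val userSchema = StructT(List(
  Field("a", IntT, nullable = false),
  Field("b", StructT(List(Field("c", StringT, nullable = false))), nullable = false)))

val safeSchema = asNullable(userSchema)
assert(safeSchema == StructT(List(
  Field("a", IntT, nullable = true),
  Field("b", StructT(List(Field("c", StringT, nullable = true))), nullable = true))))
```

In Spark itself the equivalent operation is jsonSchema.asNullable, which is exactly what the new tests compare against.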

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Feb 28, 2018

Test build #87776 has finished for PR 20694 at commit 1cd1919.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("from_json missing fields") {
  val conf = SQLConf.get
  for (forceJsonNullableSchema <- Seq(false, true)) {
    conf.setConf(SQLConf.FROM_JSON_FORCE_NULLABLE_SCHEMA, forceJsonNullableSchema)
Member

We have to revert the config to the original value.

Thus, move this test to org.apache.spark.sql.execution.datasources.json.JsonSuite. Then, we can use SQLConf

Contributor Author

I tried moving the test, but it is not as easy as it may seem: org.apache.spark.sql.execution.datasources.json.JsonSuite lacks the checkEvaluation helper that I'm using here. In general, org.apache.spark.sql.execution.datasources.json.JsonSuite seems to be meant for things such as spark.read.json() rather than the from_json() function, so the test would be misplaced there. I think it's better to keep the test in JsonExpressionsSuite and revert the config.
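The save-and-restore pattern that the discussion above is about (what helpers like withSQLConf do) can be sketched in isolation. Here a mutable map stands in for SQLConf, and withConf is a hypothetical name for the helper:

```scala
import scala.collection.mutable

// A stand-in for SQLConf: just a mutable key-value store.
val conf = mutable.Map[String, String]()

// Set `key` to `value` for the duration of `body`, then restore the previous
// state even if `body` throws -- so one test's config cannot leak into another.
def withConf[T](key: String, value: String)(body: => T): T = {
  val previous = conf.get(key)
  conf(key) = value
  try body
  finally previous match {
    case Some(v) => conf(key) = v
    case None    => conf.remove(key)
  }
}

conf("spark.sql.fromJsonForceNullableSchema") = "true"
withConf("spark.sql.fromJsonForceNullableSchema", "false") {
  // Inside the block the override is visible.
  assert(conf("spark.sql.fromJsonForceNullableSchema") == "false")
}
// After the block the original value is back.
assert(conf("spark.sql.fromJsonForceNullableSchema") == "true")
```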

@gatorsmile
Member

LGTM except one minor comment in the test case.

@SparkQA

SparkQA commented Mar 9, 2018

Test build #88121 has finished for PR 20694 at commit 5aec68d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with PlanTestBase

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Mar 9, 2018

Test build #88126 has finished for PR 20694 at commit 5aec68d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with PlanTestBase

val schema = JsonToStructs(jsonSchema, Map.empty, Literal.create(input, StringType), gmtId)
  .dataType
val schemaToCompare = if (forceJsonNullableSchema) jsonSchema.asNullable else jsonSchema
assert(schemaToCompare == schema);
Member

Nit: the semicolon is unnecessary.

| "c": "foo"
|}
|"""
.stripMargin
Member

Nit: the style.

| "a": 1,
| "c": "foo"
|}
|"""
Member

The same here.

@gatorsmile
Member

gatorsmile commented Mar 9, 2018

LGTM

Thanks! Merged to master/2.3

I resolved the style issues when I merged the code.

@asfgit asfgit closed this in 2ca9bb0 Mar 9, 2018
asfgit pushed a commit that referenced this pull request Mar 9, 2018
…data from JSON

Author: Michał Świtakowski <[email protected]>

Closes #20694 from mswit-databricks/SPARK-23173.

(cherry picked from commit 2ca9bb0)
Signed-off-by: gatorsmile <[email protected]>
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
…data from JSON

Author: Michał Świtakowski <[email protected]>

Closes apache#20694 from mswit-databricks/SPARK-23173.

(cherry picked from commit 2ca9bb0)
Signed-off-by: gatorsmile <[email protected]>