[SPARK-21610][SQL] Corrupt records are not handled properly when creating a dataframe from a file #18865
Conversation
cc @gatorsmile @cloud-fan Can you help trigger Jenkins for this? Thanks.

ok to test
```scala
(file: PartitionedFile) => {
  val parser = new JacksonParser(actualSchema, parsedOptions)
  // SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
```
Does this bug apply to CSV too?
I think so. But he is a beginner I'm mentoring to contribute to Spark, so we will keep this change focused on JSON. We may deal with CSV later. Thanks.
Hm, I understand it fixes the issue described in the JIRA, but won't this introduce casting attempts for all columns when the requested schema is empty? I think this is rather a band-aid fix. I mean, if the `actualSchema` selects only a few columns out of many, it would introduce a similar problem again.
I think there are two issues:
- When the required schema is empty. This is why the Jenkins test fails. We're working on fixing it.
- When `actualSchema` selects only a few columns. We noticed that: the column `_corrupt_record` is different when the selected columns are different. But currently we treat it as designed behavior. Not sure if this is an issue we need to fix.
> The column `_corrupt_record` is different when the selected columns are different

If `_corrupt_record` is designed to have different values for different selected columns, it may make sense to set `_corrupt_record` to null if no columns are selected.
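To make the "different values for different selected columns" behavior concrete, here is a hypothetical sketch (the field names, schema, and path are invented for illustration; `spark` is an active session):

```scala
import org.apache.spark.sql.types._

// Suppose one input line is {"a": 1, "b": "oops"} while the schema declares
// `b` as an integer, so the line is corrupt only when `b` is actually parsed.
val schema = new StructType()
  .add("a", IntegerType)
  .add("b", IntegerType)
  .add("_corrupt_record", StringType)

val df = spark.read.schema(schema).json("/tmp/example.json") // hypothetical path

// Column pruning parses only `a`: the bad line is NOT flagged as corrupt.
df.select("a", "_corrupt_record").show()

// This projection parses `b` as well: the bad line IS flagged as corrupt.
df.select("a", "b", "_corrupt_record").show()
```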
Yeah, I agree. With the current behavior, it is unavoidable to have some strange queries with `_corrupt_record`.
I'd suggest, as in #18865 (comment), that we document that `_corrupt_record` in the CSV and JSON data sources is a derived column and can be incorrect if not used with other columns.
We need to let users know that `_corrupt_record` is a column derived from the other columns and cannot be selected alone in a query.
@viirya I created issue https://issues.apache.org/jira/browse/SPARK-21610 and need to select the field `_corrupt_record` alone. This is possible with Spark 2.2 (if a dataframe is created from an RDD), and it would be great to keep this behaviour in future versions of Spark.
My use case is the following: a Spark job reads JSON with an input schema and will:
- save records that match the input schema in Parquet format
- save "corrupt records" (invalid JSON, or records that do not match the input schema) to text files in a separate folder

Basically, I want:
- a folder with clean data in Parquet format
- another folder with "corrupt records". I can then analyze the corrupt records and, for instance, tell partners that they are sending invalid data. This enables a clean data pipeline that separates valid records from corrupt records.
To get valid and corrupt records, I write:

```scala
val validRecords = df.filter(col("_corrupt_record").isNull)
  .drop("_corrupt_record")
val corruptRecords = df.filter(col("_corrupt_record").isNotNull)
  .select("_corrupt_record")
```
Your usage scenario makes sense to me. The contents of `_corrupt_record` depend on the fields the parser parsed. The workaround is that you can save your output to the cache or a physical table:

```scala
val df = dfFromFile.cache()
df.filter($"_corrupt_record".isNull).drop("_corrupt_record").show()
df.filter($"_corrupt_record".isNotNull).select("_corrupt_record").show()
```

My current suggestion is to capture the empty actual schema and issue an error with a reasonable workaround message. Users can at least know what happened and how to fix the issue.
@dm-tran I think the current change can support your use cases. When only `_corrupt_record` is selected, it has the same effect as selecting all columns, i.e. a record is recognized as corrupt if any column is in an invalid format.
However, it seems to counter the designed behavior that lets `_corrupt_record` depend on the selected columns, as we discussed in previous comments.
I think @gatorsmile's suggestion should be good.
@gatorsmile @viirya I also think that @gatorsmile's suggestion looks good. Thanks for your replies!
cc @HyukjinKwon

Test build #80350 has finished for PR 18865 at commit

retest this please.

@HyukjinKwon @cloud-fan Could you help trigger Jenkins? Thanks.

Oh, sorry. Looks like Jenkins is running the tests now.

Test build #80375 has finished for PR 18865 at commit
Could we issue a better error message in such a scenario?
I think it makes sense to issue an error with a good, helpful message when users only select `_corrupt_record`.
Gentle ping @viirya, @gatorsmile: made a minor change to throw a reasonable workaround message.
```scala
// SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
// the derived `actualSchema` is empty and the `_corrupt_record` are all null for all rows.
// When users requires only `_corrupt_record`, we assume that the corrupt records are required
// for all json fields, i.g., all items in dataSchema.
```
This comment is wrong now. We can get rid of it, as the following exception is self-explanatory.
```scala
test("SPARK-21610: Corrupt records are not handled properly when creating a dataframe " +
  "from a file") {
  val tempDir = Utils.createTempDir()
```
We can use `withTempPath`.
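A minimal sketch of how the test could be restructured with `withTempPath` (assuming the suite mixes in Spark's `SQLTestUtils`, which provides it and cleans the directory up afterwards; the test body is elided):

```scala
test("SPARK-21610: Corrupt records are not handled properly when creating a dataframe " +
  "from a file") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // write sample JSON lines to `path`, read them back with an explicit
    // schema, and assert on the `_corrupt_record` behavior
  }
}
```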
Test build #81334 has finished for PR 18865 at commit

Thanks to @viirya's suggestion, the redundant comment is removed and `withTempPath` is used.

Test build #81339 has finished for PR 18865 at commit

retest this please.

Test build #81340 has finished for PR 18865 at commit

Test build #81451 has finished for PR 18865 at commit
```markdown
## Upgrading From Spark SQL 2.2 to 2.3

- The queries which select only `spark.sql.columnNameOfCorruptRecord` column are disallowed now. Notice that the queries which have only the column after column pruning (e.g. filtering on the column followed by a counting operation) are also disallowed. If you want to select only the corrupt records, you should cache or save the Dataset and DataFrame before running such queries.
```
nit: cache or save the underlying Dataset and DataFrame ...
```scala
if (requiredSchema.length == 1 &&
  requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
  throw new AnalysisException(
    s"'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without other " +
```
This line and the following line are concatenated and may be too long. Add a `\n` at the end of this line?
```scala
throw new AnalysisException(
  s"'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without other " +
  "data columns, because its content is completely derived from the data columns parsed.\n" +
  "If you want to select corrupt records only, cache or save the Dataset " +
```
Add a `\n` here too?
```scala
requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
  throw new AnalysisException(
    s"'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without other " +
    "data columns, because its content is completely derived from the data columns parsed.\n" +
```
Add one more sentence, something like: "Even if your query does not appear to select only this column, it is also disallowed when, after column pruning, it does not involve parsing any data fields (e.g., filtering on the column followed by a counting operation), because such queries can produce incorrect results."
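Putting the review suggestions together, the guard could end up reading roughly like this (a sketch only, not the merged code; the exact message wording is illustrative):

```scala
if (requiredSchema.length == 1 &&
    requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
  throw new AnalysisException(
    "Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the\n" +
    "referenced columns only include the internal corrupt record column (named\n" +
    s"${parsedOptions.columnNameOfCorruptRecord} by default). Instead, you can cache or\n" +
    "save the parsed results and then send the same query.")
}
```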
Left a few more comments on the message. Otherwise LGTM.

@viirya, thank you so much for taking a look and for your time.

Seems fine to me too.

Thank you for the review, @HyukjinKwon. cc @gatorsmile

Test build #81561 has finished for PR 18865 at commit
docs/sql-programming-guide.md (outdated)

```markdown
## Upgrading From Spark SQL 2.2 to 2.3

- The queries which select only `spark.sql.columnNameOfCorruptRecord` column are disallowed now. Notice that the queries which have only the column after column pruning (e.g. filtering on the column followed by a counting operation) are also disallowed. If you want to select only the corrupt records, you should cache or save the underlying Dataset and DataFrame before running such queries.
```
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
```scala
  }
}

test("SPARK-21610: Corrupt records are not handled properly when creating a dataframe " +
```
Could you include both negative cases I posted above? Also, please include the workaround in the test case; that ensures future code changes will not break it.
Sure, I'll update the PR.
@gatorsmile Those negative cases and the workaround are already added in …
```scala
assert(msg.contains(expectedErrorMsg))
// negative cases
msg = intercept[AnalysisException] {
  spark.read.schema(schema).json(path).select("_corrupt_record").show()
```
You already have the one using `collect()`. No need to do it here.
```scala
  spark.read.schema(schema).json(path).select("_corrupt_record").collect()
}.getMessage
assert(msg.contains(expectedErrorMsg))
// negative cases
```
move this to line 2049. Thanks!
```scala
// workaround
val df = spark.read.schema(schema).json(path).cache()
assert(df.filter($"_corrupt_record".isNotNull).count() == 1)
assert(df.filter($"_corrupt_record".isNull).count() == 2)
```
Please also add another one:

```scala
checkAnswer(
  spark.read.schema(schema).json(path).select("_corrupt_record"),
  Row(....
```
| "If you want to select corrupt records only, cache or save the Dataset\n" + | ||
| "before executing queries, as this parses all fields under the hood. For example: \n" + | ||
| "df.cache()\n" + | ||
| s"""df.select("${parsedOptions.columnNameOfCorruptRecord}")""" |
How about also improving this based on the one we changed in `sql-programming-guide.md`? Thanks!
Test build #81591 has finished for PR 18865 at commit

Test build #81603 has finished for PR 18865 at commit

cc @gatorsmile Please take another look when you have time. I've already updated. Thanks!

Thanks! Merged to master. @jmchung Could you submit a follow-up PR for CSV? Thanks!

@gatorsmile Sure, I'll make a follow-up PR for CSV.
What changes were proposed in this pull request?

When the `requiredSchema` only contains `_corrupt_record`, the derived `actualSchema` is empty and `_corrupt_record` is null for all rows. This PR captures the above situation and raises an exception with a reasonable workaround message so that users can know what happened and how to fix the query.

How was this patch tested?

Added test case.