Conversation

@jmchung (Contributor) commented Aug 7, 2017

What changes were proposed in this pull request?

echo '{"field": 1}
{"field": 2}
{"field": "3"}' >/tmp/sample.json
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("field", ByteType)
  .add("_corrupt_record", StringType)

val file = "/tmp/sample.json"

val dfFromFile = spark.read.schema(schema).json(file)

scala> dfFromFile.show(false)
+-----+---------------+
|field|_corrupt_record|
+-----+---------------+
|1    |null           |
|2    |null           |
|null |{"field": "3"} |
+-----+---------------+

scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
res1: Long = 0

scala> dfFromFile.filter($"_corrupt_record".isNull).count()
res2: Long = 3

When the requiredSchema only contains _corrupt_record, the derived actualSchema is empty and _corrupt_record is null for all rows. This PR detects that situation and raises an exception with a reasonable workaround message so that users know what happened and how to fix the query.
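
A rough sketch of the guard this adds in the JSON read path (approximate; the exact code and message of the merged patch appear in the review discussion below):

    // If only the corrupt record column is requested, nothing would be parsed
    // and the column would always be null, so fail fast with a workaround hint.
    if (requiredSchema.length == 1 &&
        requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
      throw new AnalysisException(
        s"'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without " +
        "other data columns; cache or save the Dataset first as a workaround.")
    }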

How was this patch tested?

Added test case.

@jmchung closed this Aug 7, 2017
@jmchung reopened this Aug 7, 2017
@viirya (Member) commented Aug 7, 2017

cc @gatorsmile @cloud-fan Can you help trigger Jenkins for this? Thanks.

@cloud-fan (Contributor)

ok to test


(file: PartitionedFile) => {
val parser = new JacksonParser(actualSchema, parsedOptions)
// SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
Contributor

does this bug apply to CSV too?

Member

I think so. But he is a beginner I'm mentoring to contribute to Spark, so we will keep this change focused on JSON. We may deal with CSV later. Thanks.

Member

Hm, I understand it fixes the issue described in the JIRA, but won't this introduce casting attempts for all columns when the requested schema is empty? I think this is rather a band-aid fix. I mean, if the actualSchema has only a few columns selected out of many, it'd introduce a similar problem again.

@viirya (Member) commented Aug 8, 2017

I think there are two issues:

  1. When the required schema is empty. This is why the Jenkins test fails; we're working on fixing it.

  2. When actualSchema selects only a few columns. We noticed that: the content of _corrupt_record differs depending on which columns are selected. But currently we treat this as designed behavior, and I'm not sure it is an issue we need to fix (see the sketch below).
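
To illustrate point 2 with the sample file from the PR description (behavior before this fix):

    // Selecting a data column alongside _corrupt_record parses `field`,
    // so the third record ({"field": "3"}) is reported as corrupt:
    dfFromFile.select("field", "_corrupt_record").show()

    // Selecting only _corrupt_record parses no data columns,
    // so _corrupt_record is null for every row (the reported bug):
    dfFromFile.select("_corrupt_record").show()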

Contributor

The content of _corrupt_record differs depending on which columns are selected

If _corrupt_record is designed to have different values for different selected columns, it may make sense to set _corrupt_record to null if no columns are selected.

Member

Yeah, I agree. With the current behavior, it is unavoidable to have some strange queries involving _corrupt_record.

As suggested in #18865 (comment), we should document that _corrupt_record in the CSV and JSON data sources is a derived column and can be incorrect if not used with other columns.

@dm-tran

We need to let users know that _corrupt_record is a derived column from other columns and cannot be selected alone in a query.

@viirya I created issue https://issues.apache.org/jira/browse/SPARK-21610 and need to select the field "_corrupt_record" alone. This is possible with Spark 2.2 (if a DataFrame is created from an RDD), and it would be great to keep this behaviour in future versions of Spark.

My use case is the following one: a spark job reads JSON with an input schema, and will:

  • save records that match the input schema in parquet format
  • save "corrupt records" (invalid JSON or records that do match the input schema) to text files, in a separate folder.

Basically, I want:

  • a folder with clean data in parquet format
  • another folder with "corrupt records". I can then analyze corrupt records and for instance tell partners that they are sending invalid data. This enables a clean data pipeline that separates valid records from corrupt records

To get valid and corrupt records, I write:

      val validRecords = df.filter(col("_corrupt_record").isNull)
        .drop("_corrupt_record")

      val corruptRecords = df.filter(col("_corrupt_record").isNotNull)
        .select("_corrupt_record")
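
For reference, a minimal sketch of this pipeline combined with the cache workaround discussed below (the paths and inputSchema are hypothetical placeholders; inputSchema must include a "_corrupt_record" StringType field, as in the repro above):

    import org.apache.spark.sql.functions.col

    val df = spark.read.schema(inputSchema).json("/data/incoming").cache()

    // valid records, written without the derived column
    df.filter(col("_corrupt_record").isNull)
      .drop("_corrupt_record")
      .write.mode("overwrite").parquet("/data/clean")

    // corrupt records kept as raw text for later analysis
    df.filter(col("_corrupt_record").isNotNull)
      .select("_corrupt_record")
      .write.mode("overwrite").text("/data/corrupt")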

Member

Your usage scenario makes sense to me. The content of _corrupt_record depends on the fields passed to our parser. The workaround is to cache your output or save it to a physical table.

    val df = dfFromFile.cache()
    df.filter($"_corrupt_record".isNull).drop("_corrupt_record").show()
    df.filter($"_corrupt_record".isNotNull).select("_corrupt_record").show()

My current suggestion is to detect the empty actual schema and issue an error with a reasonable workaround message, so users at least know what happened and how to fix the issue.

Member

@dm-tran I think the current change can support your use case. When only _corrupt_record is selected, the effect is the same as selecting all columns, i.e. a record is recognized as corrupt if any column is in an invalid format.

However, that seems to counter the designed behavior of letting _corrupt_record depend on the selected columns, as we discussed in previous comments.

I think @gatorsmile's suggestion should be good.

@dm-tran

@gatorsmile @viirya I also think that @gatorsmile's suggestion looks good. Thanks for your replies!

@cloud-fan (Contributor)

cc @HyukjinKwon

@SparkQA commented Aug 7, 2017

Test build #80350 has finished for PR 18865 at commit 54a1a15.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Aug 8, 2017

retest this please.

@viirya (Member) commented Aug 8, 2017

@HyukjinKwon @cloud-fan Could you help trigger Jenkins? Thanks.

@viirya (Member) commented Aug 8, 2017

Oh, sorry. Looks like Jenkins is running the tests now.

@SparkQA commented Aug 8, 2017

Test build #80375 has finished for PR 18865 at commit 1172584.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

Could we issue a better error message in such a scenario?

@viirya (Member) commented Aug 31, 2017

I think it makes sense to issue an error with a good, helpful message when users select only _corrupt_record without other columns.

@jmchung (Contributor Author) commented Sep 2, 2017

Gentle ping @viirya, @gatorsmile: made a minor change to throw a reasonable workaround message.

// SPARK-21610: when the `requiredSchema` only contains `_corrupt_record`,
// the derived `actualSchema` is empty and the `_corrupt_record` is null for all rows.
// When users require only `_corrupt_record`, we assume that the corrupt records are required
// for all JSON fields, e.g., all items in dataSchema.
Member

This comment is wrong now. We can get rid of it, as the following exception is self-explanatory.


test("SPARK-21610: Corrupt records are not handled properly when creating a dataframe " +
"from a file") {
val tempDir = Utils.createTempDir()
Member

We can use withTempPath.
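
A minimal sketch of what that might look like (assuming the suite mixes in SQLTestUtils, which provides withTempPath and deletes the directory after the body finishes, and that spark.implicits._ is imported):

    test("SPARK-21610: Corrupt records are not handled properly when creating a dataframe " +
      "from a file") {
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        // write the sample records, then run the assertions against `path`
        Seq("""{"field": 1}""", """{"field": 2}""", """{"field": "3"}""")
          .toDF("value").write.text(path)
        // e.g. assertions on spark.read.schema(schema).json(path) go here
      }
    }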

@SparkQA commented Sep 2, 2017

Test build #81334 has finished for PR 18865 at commit 2decc00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jmchung (Contributor Author) commented Sep 2, 2017

Following @viirya's suggestion, the redundant comment has been removed and withTempPath is now used in the test case.

@SparkQA commented Sep 2, 2017

Test build #81339 has finished for PR 18865 at commit 5b0f656.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) commented Sep 2, 2017

retest this please.

@SparkQA commented Sep 2, 2017

Test build #81340 has finished for PR 18865 at commit 5b0f656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 6, 2017

Test build #81451 has finished for PR 18865 at commit ea5a447.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

## Upgrading From Spark SQL 2.2 to 2.3

- The queries which select only `spark.sql.columnNameOfCorruptRecord` column are disallowed now. Notice that the queries which have only the column after column pruning (e.g. filtering on the column followed by a counting operation) are also disallowed. If you want to select only the corrupt records, you should cache or save the Dataset and DataFrame before running such queries.

Member

nit: cache or save the underlying Dataset and DataFrame ...

if (requiredSchema.length == 1 &&
requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
throw new AnalysisException(
s"'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without other " +
Member

This line and the following line are concatenated and may be too long. Add a \n at the end of this line?

throw new AnalysisException(
s"'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without other " +
"data columns, because its content is completely derived from the data columns parsed.\n" +
"If you want to select corrupt records only, cache or save the Dataset " +
Member

Add a \n here too?

requiredSchema.head.name == parsedOptions.columnNameOfCorruptRecord) {
throw new AnalysisException(
s"'${parsedOptions.columnNameOfCorruptRecord}' cannot be selected alone without other " +
"data columns, because its content is completely derived from the data columns parsed.\n" +
@viirya (Member) Sep 8, 2017

Add one more sentence, like: even if your query does not appear to select only this column, if after column pruning it does not involve parsing any data fields (e.g., filtering on the column followed by a count), it can produce incorrect results and is therefore disallowed.

@viirya (Member) commented Sep 8, 2017

Left a few more comments on the message. Otherwise LGTM.

@jmchung (Contributor Author) commented Sep 8, 2017

@viirya, thank you so much for taking a look and your time.

@HyukjinKwon (Member)

Seems fine to me too.

@jmchung (Contributor Author) commented Sep 8, 2017

Thank you for the review, @HyukjinKwon.

cc @gatorsmile
Could you review this again when you have some time? Thanks!

@SparkQA commented Sep 8, 2017

Test build #81561 has finished for PR 18865 at commit f7d5a5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


## Upgrading From Spark SQL 2.2 to 2.3

- The queries which select only `spark.sql.columnNameOfCorruptRecord` column are disallowed now. Notice that the queries which have only the column after column pruning (e.g. filtering on the column followed by a counting operation) are also disallowed. If you want to select only the corrupt records, you should cache or save the underlying Dataset and DataFrame before running such queries.
Member

Since Spark 2.3, queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then
`df.filter($"_corrupt_record".isNotNull).count()`.
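
Rendered as code, that note looks roughly like this (using the schema and file from the PR description):

    // Disallowed since Spark 2.3: the query references only the corrupt record column.
    // spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()

    // Workaround: cache (or save) the parsed result first, then query it.
    val df = spark.read.schema(schema).json(file).cache()
    df.filter($"_corrupt_record".isNotNull).count()
    df.select("_corrupt_record").show()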

}
}

test("SPARK-21610: Corrupt records are not handled properly when creating a dataframe " +
Member

Could you include both negative cases I posted above?

Also include the workaround in the test case? That ensures future code changes will not break it.

Contributor Author

Sure, I'll update the PR.

@jmchung (Contributor Author) commented Sep 9, 2017

@gatorsmile those negative cases and the workaround have been added in JsonSuite.

assert(msg.contains(expectedErrorMsg))
// negative cases
msg = intercept[AnalysisException] {
spark.read.schema(schema).json(path).select("_corrupt_record").show()
Member

You already have the one using collect(). No need to do it here.

spark.read.schema(schema).json(path).select("_corrupt_record").collect()
}.getMessage
assert(msg.contains(expectedErrorMsg))
// negative cases
Member

move this to line 2049. Thanks!

// workaround
val df = spark.read.schema(schema).json(path).cache()
assert(df.filter($"_corrupt_record".isNotNull).count() == 1)
assert(df.filter($"_corrupt_record".isNull).count() == 2)
Member

Please also add another one

    checkAnswer(
      spark.read.schema(schema).json(path).select("_corrupt_record"),
      Row(....

"If you want to select corrupt records only, cache or save the Dataset\n" +
"before executing queries, as this parses all fields under the hood. For example: \n" +
"df.cache()\n" +
s"""df.select("${parsedOptions.columnNameOfCorruptRecord}")"""
Member

How about also improving this based on the one we changed in sql-programming-guide.md? Thanks!

@SparkQA commented Sep 9, 2017

Test build #81591 has finished for PR 18865 at commit 3fc83f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 10, 2017

Test build #81603 has finished for PR 18865 at commit 643ddf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jmchung (Contributor Author) commented Sep 10, 2017

cc @gatorsmile Please take another look when you have time. I've already updated the PR. Thanks!

@gatorsmile (Member)

Thanks! Merged to master.

@jmchung Could you submit a follow-up PR for CSV? Thanks!

@asfgit closed this in 6273a71 Sep 11, 2017
@viirya (Member) commented Sep 11, 2017

@jmchung (Contributor Author) commented Sep 11, 2017

@gatorsmile Sure, I'll make a follow-up PR for CSV.
Many thanks for everyone's feedback on this patch; I really learned a lot from it.
