[SPARK-30530][SQL] Fix filter pushdown for bad CSV records #27239
Conversation
Test build #116867 has finished for PR 27239 at commit
while (i < requiredSchema.length) {
  try {
    if (!skipRow) {
      row(i) = valueConverters(i).apply(getToken(tokens, i))
if the first column is corrupted, and the predicate is first_col is null, what will happen?
There are 3 cases:
1. Univocity parser is not able to parse its input, for example it faced a wrong Unicode symbol. In that case, it returns null in tokens, and BadRecordException will be raised here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala, lines 229 to 232 in 4e50f02:
throw BadRecordException(
  () => getCurrentInput,
  () => None,
  new RuntimeException("Malformed CSV record"))
2. Univocity parser returns null in the first token. In this case, we will try to convert null to the desired type according to requiredSchema. Most likely, the conversion raises an exception which will be converted to BadRecordException here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala, line 277 in 4e50f02:
badRecordException = badRecordException.orElse(Some(e))
and here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala, line 286 in 4e50f02:
throw BadRecordException(
2.1. If the conversion doesn't fail, the is null filter will be applied to the value and the row could be passed to the upper layer.
3. Univocity parser returns a valid string at index 0 in tokens but the conversion fails with some exception here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala, line 267 in 4e50f02:
row(i) = valueConverters(i).apply(getToken(tokens, i))
Similar situation to case 2: the exception will be handled and transformed to BadRecordException.
The new implementation with filters pushdown does not change the behavior in those cases.
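A minimal, self-contained sketch of cases 2 and 2.1 (the converter and predicate below are hypothetical stand-ins, not the actual UnivocityParser or pushed-down filter code): if the corrupted first column arrives as a null token and its conversion succeeds, a first_col is null predicate matches and the row can be passed on; if the conversion throws, the record is treated as malformed.

object NullFirstColumnSketch {
  // Assumed converter for the first column: tolerates null, otherwise parses an Int.
  val convertFirstCol: String => Any = {
    case null => null     // case 2.1: conversion of the null token succeeds
    case s    => s.toInt  // parsing may throw; see case 3
  }

  // Stand-in for the pushed-down "first_col is null" predicate.
  def firstColIsNull(value: Any): Boolean = value == null

  def main(args: Array[String]): Unit = {
    val tokens: Array[String] = Array(null, "42")  // corrupted first column
    try {
      val converted = convertFirstCol(tokens(0))
      if (firstColIsNull(converted)) {
        println("predicate matched; the row can be passed to the upper layer")
      }
    } catch {
      case e: Exception =>
        println(s"conversion failed, record is treated as malformed: ${e.getMessage}")
    }
  }
}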
thanks, merging to master!
// However, we still have chance to parse some of the tokens, by adding extra null tokens in
// the tail if the number is smaller, or by dropping extra tokens if the number is larger.
val checkedTokens = if (parsedSchema.length > tokens.length) {
checkedTokens = if (parsedSchema.length > tokens.length) {
Do we need this checkedTokens now?
It seems not. The if can be replaced by:
var badRecordException: Option[Throwable] = if (tokens.length != parsedSchema.length) {
  // If the number of tokens doesn't match the schema, we should treat it as a malformed record.
  Some(new RuntimeException("Malformed CSV record"))
} else None

Let me do that in a follow-up PR.
Oops, I already did at #27287. Let me address this comment there.
var skipRow = false
while (i < requiredSchema.length) {
  try {
    if (!skipRow) {
nit:
if (skipRow) {
  row.setNullAt(i)
} else {
  row(i) = valueConverters(i).apply(getToken(tokens, i))
  if (csvFilters.skipRow(row, i)) {
    skipRow = true
  }
}

      row.setNullAt(i)
    }
  } catch {
    case NonFatal(e) =>
Previously we relied on nulls already existing in the array. Now we rely on java.lang.ArrayIndexOutOfBoundsException. I don't particularly like this approach, but I'm okay with it as it does simplify the code.
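A small illustrative sketch of that behavior (names simplified; not the actual Spark code): when there are fewer tokens than required fields, indexing past the end of tokens throws ArrayIndexOutOfBoundsException, which the NonFatal handler records as the bad-record cause.

import scala.util.control.NonFatal

object ShortTokensSketch {
  def main(args: Array[String]): Unit = {
    val tokens = Array("a")    // the record produced only one token
    val requiredFields = 3     // but three fields are required
    var badRecordException: Option[Throwable] = None

    var i = 0
    while (i < requiredFields) {
      try {
        val token = tokens(i)  // throws ArrayIndexOutOfBoundsException for i >= tokens.length
        // ... convert token and store it into the row ...
      } catch {
        case NonFatal(e) =>
          badRecordException = badRecordException.orElse(Some(e))
      }
      i += 1
    }

    // Prints Some(java.lang.ArrayIndexOutOfBoundsException: ...)
    println(badRecordException)
  }
}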
What changes were proposed in this pull request?
In the PR, I propose to fix the bug reported in SPARK-30530. The CSV datasource returns invalid records in the case when parsedSchema is shorter than the number of tokens returned by the UniVocity parser. In that case, UnivocityParser.convert() always throws BadRecordException independently of the result of applying filters.
For the described case, I propose to save the exception in badRecordException and continue value conversion according to parsedSchema. If a bad record doesn't pass the filters, convert() returns an empty Seq; otherwise it throws badRecordException.
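A rough, simplified sketch of that flow (the helper names and types below are illustrative placeholders, not the real UnivocityParser internals): conversion continues after the token-count mismatch is recorded, a row rejected by the pushed-down filters yields an empty Seq, and otherwise the saved exception is thrown.

import scala.util.control.NonFatal

object ConvertSketch {
  type Row = Array[Any]

  // Hypothetical stand-in for UnivocityParser.convert() after the fix.
  def convert(
      tokens: Array[String],
      parsedSchemaLength: Int,
      converters: Array[String => Any],
      skipRowFilter: (Row, Int) => Boolean): Seq[Row] = {
    val row: Row = new Array[Any](converters.length)

    // Save the error instead of throwing right away, and keep converting values.
    var badRecordException: Option[Throwable] =
      if (tokens.length != parsedSchemaLength) {
        Some(new RuntimeException("Malformed CSV record"))
      } else None

    var skipRow = false
    var i = 0
    while (i < converters.length) {
      try {
        if (!skipRow) {
          row(i) = converters(i).apply(tokens(i))
          if (skipRowFilter(row, i)) skipRow = true
        }
      } catch {
        case NonFatal(e) =>
          badRecordException = badRecordException.orElse(Some(e))
      }
      i += 1
    }

    if (skipRow) {
      Seq.empty            // the bad record doesn't pass the filters
    } else {
      badRecordException match {
        case Some(e) => throw e   // surfaced as BadRecordException by the caller
        case None    => Seq(row)
      }
    }
  }
}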
Why are the changes needed?
It fixes the bug reported in the JIRA ticket.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a new test from the JIRA ticket.