[SPARK-30530][SQL] Fix filter pushdown for bad CSV records #27239
Conversation
Test build #116867 has finished for PR 27239 at commit
while (i < requiredSchema.length) {
  try {
    if (!skipRow) {
      row(i) = valueConverters(i).apply(getToken(tokens, i))
if the first column is corrupted, and the predicate is first_col is null, what will happen?
There are 3 cases:
1. Univocity parser is not able to parse its input, for example it faced a wrong Unicode symbol. In that case, it returns null in tokens, and BadRecordException will be raised here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala, lines 229 to 232 in 4e50f02:
throw BadRecordException(
  () => getCurrentInput,
  () => None,
  new RuntimeException("Malformed CSV record"))
2. Univocity parser returns null in the first token. In this case, we will try to convert null to the desired type according to requiredSchema. Most likely, the conversion raises an exception which will be converted to BadRecordException here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala, line 277 in 4e50f02:
badRecordException = badRecordException.orElse(Some(e))
and here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala, line 286 in 4e50f02:
throw BadRecordException(
2.1. If the conversion doesn't fail, the is null filter will be applied to the value and the row could be passed to the upper layer.
3. Univocity parser returns a valid string at index 0 in tokens but the conversion fails with some exception here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala, line 267 in 4e50f02:
row(i) = valueConverters(i).apply(getToken(tokens, i))
Similar situation to case 2: the exception will be handled and transformed to BadRecordException.
The new implementation with filters pushdown does not change the behavior in those cases.
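A minimal, self-contained sketch of cases 2 and 2.1 (the converter and predicate below are hypothetical stand-ins, not the actual UnivocityParser or pushed-down filter code): if the corrupted first column arrives as a null token and its conversion succeeds, a first_col is null predicate matches and the row can be passed on; if the conversion throws, the record is treated as malformed.

object NullFirstColumnSketch {
  // Assumed converter for the first column: tolerates null, otherwise parses an Int.
  val convertFirstCol: String => Any = {
    case null => null     // case 2.1: conversion of the null token succeeds
    case s    => s.toInt  // parsing may throw; see case 3
  }

  // Stand-in for the pushed-down "first_col is null" predicate.
  def firstColIsNull(value: Any): Boolean = value == null

  def main(args: Array[String]): Unit = {
    val tokens: Array[String] = Array(null, "42")  // corrupted first column
    try {
      val converted = convertFirstCol(tokens(0))
      if (firstColIsNull(converted)) {
        println("predicate matched; the row can be passed to the upper layer")
      }
    } catch {
      case e: Exception =>
        println(s"conversion failed, record is treated as malformed: ${e.getMessage}")
    }
  }
}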
thanks, merging to master!
// However, we still have chance to parse some of the tokens, by adding extra null tokens in
// the tail if the number is smaller, or by dropping extra tokens if the number is larger.
val checkedTokens = if (parsedSchema.length > tokens.length) {
checkedTokens = if (parsedSchema.length > tokens.length) {
Do we need this checkedTokens now?
It seems not. The if can be replaced by:
var badRecordException: Option[Throwable] = if (tokens.length != parsedSchema.length) {
  // If the number of tokens doesn't match the schema, we should treat it as a malformed record.
  Some(new RuntimeException("Malformed CSV record"))
} else None

Let me do that in a follow-up PR.
Oops, I already did at #27287. Let me address this comment there.
var skipRow = false
while (i < requiredSchema.length) {
  try {
    if (!skipRow) {
nit:
if (skipRow) {
  row.setNullAt(i)
} else {
  row(i) = valueConverters(i).apply(getToken(tokens, i))
  if (csvFilters.skipRow(row, i)) {
    skipRow = true
  }
}

      row.setNullAt(i)
    }
  } catch {
    case NonFatal(e) =>
Previously we relied on nulls already existing in the array. Now we rely on java.lang.ArrayIndexOutOfBoundsException. I don't particularly like this approach, but I'm okay with it as it does simplify the code.
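A small illustrative sketch of that behavior (names simplified; not the actual Spark code): when there are fewer tokens than required fields, indexing past the end of tokens throws ArrayIndexOutOfBoundsException, which the NonFatal handler records as the bad-record cause.

import scala.util.control.NonFatal

object ShortTokensSketch {
  def main(args: Array[String]): Unit = {
    val tokens = Array("a")    // the record produced only one token
    val requiredFields = 3     // but three fields are required
    var badRecordException: Option[Throwable] = None

    var i = 0
    while (i < requiredFields) {
      try {
        val token = tokens(i)  // throws ArrayIndexOutOfBoundsException for i >= tokens.length
        // ... convert token and store it into the row ...
      } catch {
        case NonFatal(e) =>
          badRecordException = badRecordException.orElse(Some(e))
      }
      i += 1
    }

    // Prints Some(java.lang.ArrayIndexOutOfBoundsException: ...)
    println(badRecordException)
  }
}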
What changes were proposed in this pull request?
In the PR, I propose to fix the bug reported in SPARK-30530. The CSV datasource returns invalid records in the case when parsedSchema is shorter than the number of tokens returned by the UniVocity parser. In that case, UnivocityParser.convert() always throws BadRecordException independently of the result of applying filters.
For the described case, I propose to save the exception in badRecordException and continue value conversion according to parsedSchema. If a bad record doesn't pass the filters, convert() returns an empty Seq; otherwise it throws badRecordException.
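A rough, simplified sketch of that flow (the helper names and types below are illustrative placeholders, not the real UnivocityParser internals): conversion continues after the token-count mismatch is recorded, a row rejected by the pushed-down filters yields an empty Seq, and otherwise the saved exception is thrown.

import scala.util.control.NonFatal

object ConvertSketch {
  type Row = Array[Any]

  // Hypothetical stand-in for UnivocityParser.convert() after the fix.
  def convert(
      tokens: Array[String],
      parsedSchemaLength: Int,
      converters: Array[String => Any],
      skipRowFilter: (Row, Int) => Boolean): Seq[Row] = {
    val row: Row = new Array[Any](converters.length)

    // Save the error instead of throwing right away, and keep converting values.
    var badRecordException: Option[Throwable] =
      if (tokens.length != parsedSchemaLength) {
        Some(new RuntimeException("Malformed CSV record"))
      } else None

    var skipRow = false
    var i = 0
    while (i < converters.length) {
      try {
        if (!skipRow) {
          row(i) = converters(i).apply(tokens(i))
          if (skipRowFilter(row, i)) skipRow = true
        }
      } catch {
        case NonFatal(e) =>
          badRecordException = badRecordException.orElse(Some(e))
      }
      i += 1
    }

    if (skipRow) {
      Seq.empty            // the bad record doesn't pass the filters
    } else {
      badRecordException match {
        case Some(e) => throw e   // surfaced as BadRecordException by the caller
        case None    => Seq(row)
      }
    }
  }
}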
Why are the changes needed?
It fixes the bug reported in the JIRA ticket.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a new test from the JIRA ticket.