[SPARK-24329][SQL] Remove comments filtering before parsing of CSV files #21380
Conversation
Test build #90889 has finished for PR 21380 at commit
```diff
 def getPartialResult(): Option[InternalRow] = {
   try {
-    Some(convert(checkedTokens))
+    convert(checkedTokens).headOption
```
@MaxGekk is this change related?
Yes, it is. I changed the return type of the convert() method from InternalRow to Seq[InternalRow] to catch the cases when the uniVocity parser returns nulls (comments and empty lines). As a consequence, I had to change this function too, because it returns the Option required by the BadRecordException exception. It is safe because the Seq can either be empty or contain only one element. And I thought it was better to modify the body of getPartialResult() than the places where BadRecordException is handled.
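The shape of that change can be sketched as follows. This is a simplified, hypothetical model (the names `PartialResultSketch` and the `Map`-based stand-in for Spark's `InternalRow` are illustrative assumptions, not the actual Spark classes): a `convert()` that returns a `Seq` with at most one element collapses to the `Option` required by `BadRecordException` via `headOption`.

```scala
// Hedged sketch, not the real Spark implementation.
object PartialResultSketch {
  // Stand-in for Spark's InternalRow, for illustration only.
  type InternalRow = Map[String, Any]

  // convert() now returns a Seq: empty when the parser yields null
  // (comments / empty lines), otherwise exactly one row.
  def convert(tokens: Array[String]): Seq[InternalRow] =
    if (tokens.isEmpty) Seq.empty
    else Seq(Map("value" -> tokens.mkString(",")))

  // Before: Some(convert(tokens)); after: convert(tokens).headOption.
  // Safe because the Seq contains at most one element.
  def getPartialResult(tokens: Array[String]): Option[InternalRow] =
    convert(tokens).headOption
}
```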
@MaxGekk Maybe add this to the PR description.
```diff
 }

-    val filteredLines: Iterator[String] =
-      CSVUtils.filterCommentAndEmpty(linesWithoutHeader, options)
```
I'm fairly sure I put this here because Univocity had an issue with this before, IIRC. Wouldn't it be better to keep it just in case? I think we already do such things redundantly on the Spark side in a few places, to be safe.
I also think there is no harm in keeping this. BTW, does univocity still have the issue in v2.6.3?
Seems fixed, as the tests pass now. I hit a test failure here, if I remember correctly.
Probably you observed issues in old versions of the uniVocity parser, as @maropu wrote above. I would propose to remove the filtering until we hit cases where uniVocity's filter doesn't work as expected. Then we would submit an issue to uniVocity and revert the changes.

> I think we already do such things in Spark side redundantly to make sure in few places.

I looked at other places where we do the same, but this is the only place where we do the filtering directly before uniVocity.
I mean, for example, we apply both Parquet's record-level filters and Spark's own filters, even though the filters are pushed down.
@MaxGekk, it doesn't necessarily mean that we have tests for it. It was there from the start; when I tried to remove it, the tests broke. I expected them to break again, but they seem to pass now, so I'm just guessing that it's fixed.

Usually we trust the parser, but we should be careful when issues have been found before. I don't think we should make this case special, and I'm not seeing a meaningful improvement either.

One nit, BTW: the purpose of ignoreLeadingWhiteSpaceInRead and ignoreTrailingWhiteSpaceInRead is basically to trim the whitespace in the values, not to skip empty lines.
We need strong evidence/test cases to make sure the current uniVocity filtering works.
I wrote a test in PR #21394 which passes on the current implementation but fails on this PR. After this PR, lines consisting of multiple whitespaces are no longer ignored. To ignore such lines, ignoreLeadingWhiteSpace needs to be set to true. See https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L106-L110
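The behavioral difference discussed here can be modeled with a small sketch. This is not univocity itself; `WhitespaceLineSketch` and `isSkippedAsEmpty` are hypothetical names modeling the described behavior: a whitespace-only line counts as empty (and is skipped) only if leading whitespace is trimmed before the emptiness check.

```scala
// Hedged model of the described parser behavior, not the univocity code.
object WhitespaceLineSketch {
  // A line is skipped as empty only when, after optional trimming of
  // leading whitespace, nothing remains.
  def isSkippedAsEmpty(line: String, ignoreLeadingWhiteSpace: Boolean): Boolean = {
    val effective =
      if (ignoreLeadingWhiteSpace) line.dropWhile(_.isWhitespace) else line
    effective.isEmpty
  }
}
```

Under this model, a line of spaces is only skipped when `ignoreLeadingWhiteSpace` is true, which matches the test result described above: with the default (false), such lines survive and produce nulls.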
Oh, so it actually fixes an issue now, right? Will take a look soon. BTW, I think you can fold the changes from #21394 into this PR.
Actually, the test from #21394 shows the case where this PR changes behavior: empty lines consisting of multiple whitespaces, with ignoreLeadingWhiteSpace set to false (the default), produce nulls. The uniVocity parser can ignore lines of multiple whitespaces only when ignoreLeadingWhiteSpace (or ignoreTrailingWhiteSpace) is set to true.
So there is no combination of CSV options that preserves the default behavior of the current implementation. I would like to propose closing this PR and adding the test from #21394 to CSVSuite, to be sure we do not break the behavior described above.
What changes were proposed in this pull request?
Filtering of comments and whitespace is already performed by the uniVocity parser according to the parser settings:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
There is no need to repeat the same filtering before the uniVocity parser. In this PR, I propose to remove the filtering of whitespace and comments (the call of filterCommentAndEmpty) in the parseIterator method of the UnivocityParser object.
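For context, the pre-filtering being removed can be sketched roughly as follows. This is a simplified, hypothetical rendering (the object name `CommentFilterSketch` and the exact predicate are assumptions, modeled on what a filterCommentAndEmpty-style helper would do), not the actual CSVUtils code: drop blank or whitespace-only lines and lines whose first non-space character is the comment character.

```scala
// Hedged sketch of filterCommentAndEmpty-style pre-filtering,
// not the actual Spark CSVUtils implementation.
object CommentFilterSketch {
  def filterCommentAndEmpty(lines: Iterator[String], comment: Char): Iterator[String] =
    lines.filter { line =>
      val trimmed = line.trim
      // Keep the line only if it has content and is not a comment.
      trimmed.nonEmpty && trimmed.charAt(0) != comment
    }
}
```

The PR's point is that the parser settings linked above already instruct uniVocity to perform equivalent skipping, so running a filter like this beforehand is redundant work on each input line.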
How was this patch tested?
The changes were tested by CSVSuite