Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented May 21, 2018

What changes were proposed in this pull request?

Filtering of comments and whitespace has been performed by uniVocity parser already according to parser settings:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180

There is no need to repeat the same before uniVocity parser. In this PR, I propose to remove filtering of whitespaces and comments (call of filterCommentAndEmpty) in the parseIterator method of the UnivocityParser object.

How was this patch tested?

The changes were tested by CSVSuite

@SparkQA
Copy link

SparkQA commented May 21, 2018

Test build #90889 has finished for PR 21380 at commit 3652268.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def getPartialResult(): Option[InternalRow] = {
try {
Some(convert(checkedTokens))
convert(checkedTokens).headOption
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk is this change related?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is. I changed returned type of the convert() method from InternalRow to Seq[InternalRow] to catch the cases when uniVocity parser returns nulls (comments and empty lines). As a consequence of that, I have to change this function too because it returns Option which is required by the BadRecordException exception. It is safe because Seq can be either empty or contain only one element. And I though it is better to modify body of getPartialResult() than places where BadRecordException is handled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk Maybe add this in PR description

}

val filteredLines: Iterator[String] =
CSVUtils.filterCommentAndEmpty(linesWithoutHeader, options)
Copy link
Member

@HyukjinKwon HyukjinKwon May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel sure I put this because Univocity has an issue about this before IIRC. Wouldn't we better just keep this just in case? I think we already do such things in Spark side redundantly to make sure in few places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think we have no harm to put this. btw, univocity still has the issue in v2.6.3?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fixed as the tests pass now. I hit a test failure if I remember this correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, you observed issues in old versions of uniVocity parser as @maropu wrote above. I would propose to remove the filtering till we face to the cases when uniVocity's filter doesn't work as it is expected. So, we would submit an issue for uniVocity and revert the changes back.

I think we already do such things in Spark side redundantly to make sure in few places.

I looked at another places where we do the same but this is only the place where we do filtering directly before uniVocity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, for example, we do both Parquet's record-level filter and Spark's filter although they are pushed down.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk, It doesn't always mean that we have tests. Because it was there from the first place and I tried to remove it, then the tests were broken. I expected to be broken again but seems passed now. So, I'm just guessing that it's fixed.

Usually we trust but we should be careful if there were some issues found. I don't think we should make this case special. I am not seeing meaningful improvement either.

One nit is, BTW, the purpose of ignoreLeadingWhiteSpaceInRead and ignoreTrailingWhiteSpaceInRead are basically for trimming the whitespaces in the values not for skipping empty lines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need strong evidence/test cases to make sure the current uniVocity filtering works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a test in the PR: #21394 which is passed on the current implementation but fails on this PR. After this PR, lines with multiple whitespaces are not ignored. To ignore such lines, need to set ignoreLeadingWhiteSpace to true. See https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L106-L110

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, so now it actually fixes an issue, right? will take a look soon. BTW, I think you can fold the changes in #21394 into here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the test from #21394 shows the case when this PR has different behavior: empty lines consist of multiple whitespaces + ignoreLeadingWhiteSpace is false (which is by default) produces nulls. UniVocity parser can ignore lines with multiple whitespaces only when ignoreLeadingWhiteSpace (or ignoreLeadingWhiteSpace) is set to true.

So, there is no combination of CSV options that allow to have default behavior of current implementation. I would like to propose to close this PR and add the test from #21394 to CSVSuite to be sure we will not break the behavior described above.

@asfgit asfgit closed this in 13bedc0 May 24, 2018
@MaxGekk MaxGekk deleted the delete-comment-filtering branch August 17, 2019 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants