[SPARK-24329][SQL] Remove comments filtering before parsing of CSV files #21380
Conversation
Test build #90889 has finished for PR 21380 at commit
```diff
 def getPartialResult(): Option[InternalRow] = {
   try {
-    Some(convert(checkedTokens))
+    convert(checkedTokens).headOption
```
@MaxGekk is this change related?
Yes, it is. I changed the return type of the convert() method from InternalRow to Seq[InternalRow] to catch the cases when the uniVocity parser returns nulls (comments and empty lines). As a consequence, I had to change this function too, because it returns the Option required by the BadRecordException exception. It is safe because the Seq can either be empty or contain only one element. And I thought it was better to modify the body of getPartialResult() than the places where BadRecordException is handled.
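The shape of that change can be sketched as follows. This is a simplified, hypothetical model (the names `PartialResultSketch` and the `Map`-based stand-in for Spark's `InternalRow` are illustrative assumptions, not the actual Spark classes): a `convert()` that returns a `Seq` with at most one element collapses to the `Option` required by `BadRecordException` via `headOption`.

```scala
// Hedged sketch, not the real Spark implementation.
object PartialResultSketch {
  // Stand-in for Spark's InternalRow, for illustration only.
  type InternalRow = Map[String, Any]

  // convert() now returns a Seq: empty when the parser yields null
  // (comments / empty lines), otherwise exactly one row.
  def convert(tokens: Array[String]): Seq[InternalRow] =
    if (tokens.isEmpty) Seq.empty
    else Seq(Map("value" -> tokens.mkString(",")))

  // Before: Some(convert(tokens)); after: convert(tokens).headOption.
  // Safe because the Seq contains at most one element.
  def getPartialResult(tokens: Array[String]): Option[InternalRow] =
    convert(tokens).headOption
}
```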
@MaxGekk Maybe add this to the PR description.
```diff
 }

-    val filteredLines: Iterator[String] =
-      CSVUtils.filterCommentAndEmpty(linesWithoutHeader, options)
```
I'm fairly sure I put this here because Univocity had an issue with this before, IIRC. Wouldn't it be better to keep it just in case? I think we already do such things redundantly on the Spark side in a few places, to be safe.
I also think there is no harm in keeping this. BTW, does univocity still have the issue in v2.6.3?
Seems fixed, as the tests pass now. I hit a test failure here, if I remember correctly.
Probably you observed issues in old versions of the uniVocity parser, as @maropu wrote above. I would propose to remove the filtering until we hit cases where uniVocity's filter doesn't work as expected. Then we would submit an issue to uniVocity and revert the changes.

> I think we already do such things in Spark side redundantly to make sure in few places.

I looked at other places where we do the same, but this is the only place where we do the filtering directly before uniVocity.
I mean, for example, we apply both Parquet's record-level filters and Spark's own filters, even though the filters are pushed down.
@MaxGekk, it doesn't necessarily mean that we have tests for it. It was there from the start; when I tried to remove it, the tests broke. I expected them to break again, but they seem to pass now, so I'm just guessing that it's fixed.

Usually we trust the parser, but we should be careful when issues have been found before. I don't think we should make this case special, and I'm not seeing a meaningful improvement either.

One nit, BTW: the purpose of ignoreLeadingWhiteSpaceInRead and ignoreTrailingWhiteSpaceInRead is basically to trim the whitespace in the values, not to skip empty lines.
We need strong evidence/test cases to make sure the current uniVocity filtering works.
I wrote a test in PR #21394 which passes on the current implementation but fails on this PR. After this PR, lines consisting of multiple whitespaces are no longer ignored. To ignore such lines, ignoreLeadingWhiteSpace needs to be set to true. See https://github.com/uniVocity/univocity-parsers/blob/v2.6.3/src/main/java/com/univocity/parsers/csv/CsvParser.java#L106-L110
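The behavioral difference discussed here can be modeled with a small sketch. This is not univocity itself; `WhitespaceLineSketch` and `isSkippedAsEmpty` are hypothetical names modeling the described behavior: a whitespace-only line counts as empty (and is skipped) only if leading whitespace is trimmed before the emptiness check.

```scala
// Hedged model of the described parser behavior, not the univocity code.
object WhitespaceLineSketch {
  // A line is skipped as empty only when, after optional trimming of
  // leading whitespace, nothing remains.
  def isSkippedAsEmpty(line: String, ignoreLeadingWhiteSpace: Boolean): Boolean = {
    val effective =
      if (ignoreLeadingWhiteSpace) line.dropWhile(_.isWhitespace) else line
    effective.isEmpty
  }
}
```

Under this model, a line of spaces is only skipped when `ignoreLeadingWhiteSpace` is true, which matches the test result described above: with the default (false), such lines survive and produce nulls.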
Oh, so it actually fixes an issue now, right? Will take a look soon. BTW, I think you can fold the changes from #21394 into this PR.
Actually, the test from #21394 shows the case where this PR changes behavior: empty lines consisting of multiple whitespaces, with ignoreLeadingWhiteSpace set to false (the default), produce nulls. The uniVocity parser can ignore lines of multiple whitespaces only when ignoreLeadingWhiteSpace (or ignoreTrailingWhiteSpace) is set to true.
So there is no combination of CSV options that preserves the default behavior of the current implementation. I would like to propose closing this PR and adding the test from #21394 to CSVSuite, to be sure we do not break the behavior described above.
What changes were proposed in this pull request?
Filtering of comments and whitespace is already performed by the uniVocity parser according to the parser settings:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L178-L180
There is no need to repeat the same filtering before the uniVocity parser. In this PR, I propose to remove the filtering of whitespace and comments (the call of filterCommentAndEmpty) in the parseIterator method of the UnivocityParser object.
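For context, the pre-filtering being removed can be sketched roughly as follows. This is a simplified, hypothetical rendering (the object name `CommentFilterSketch` and the exact predicate are assumptions, modeled on what a filterCommentAndEmpty-style helper would do), not the actual CSVUtils code: drop blank or whitespace-only lines and lines whose first non-space character is the comment character.

```scala
// Hedged sketch of filterCommentAndEmpty-style pre-filtering,
// not the actual Spark CSVUtils implementation.
object CommentFilterSketch {
  def filterCommentAndEmpty(lines: Iterator[String], comment: Char): Iterator[String] =
    lines.filter { line =>
      val trimmed = line.trim
      // Keep the line only if it has content and is not a comment.
      trimmed.nonEmpty && trimmed.charAt(0) != comment
    }
}
```

The PR's point is that the parser settings linked above already instruct uniVocity to perform equivalent skipping, so running a filter like this beforehand is redundant work on each input line.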
How was this patch tested?
The changes were tested by CSVSuite