@justinuang justinuang commented Sep 20, 2018

What changes were proposed in this pull request?

CSVs with Windows-style CRLF ('\r\n') line endings don't work in multiline mode. They work fine in single-line mode because line separation is done by Hadoop, which handles all the different line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which detects '\r\n', '\r', or '\n' automatically, as Hadoop does in single-line mode.
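
To make the symptom concrete, here is a minimal plain-Python sketch. It is not Spark or Univocity code, and the sample data and variable names are made up for illustration. It shows how a reader that treats only '\n' as the record separator leaves a stray '\r' on the last column of each CRLF-terminated record, while a reader that accepts all three separators does not.

```python
# Hypothetical two-column CSV with Windows-style CRLF line endings.
data = "Test1,Test2\r\nhello,world\r\n"

# A reader that assumes '\n' is the only record separator keeps the '\r'
# at the end of every record, so it ends up in the last column.
naive_rows = [line.split(",") for line in data.split("\n") if line]

# str.splitlines() treats '\r\n', '\r', and '\n' (among other boundaries)
# as line breaks, so the last column comes out clean.
tolerant_rows = [line.split(",") for line in data.splitlines() if line]

print(naive_rows[1])
print(tolerant_rows[1])
```

The naive split does not account for quoted fields; it is only meant to show where the trailing '\r' comes from.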

How was this patch tested?

Unit test with a file with crlf line endings.

settings.setEmptyValue(emptyValueInRead)
settings.setMaxCharsPerColumn(maxCharsPerColumn)
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
settings.setLineSeparatorDetectionEnabled(true)
Member

The auto-detection mechanism is enabled for both multi-line and per-line modes. I guess newline detection has some overhead that is not needed in per-line mode. I would benchmark it in both modes (see CSVBenchmarks), and if the overhead in per-line mode is significant, I would not enable the option when multiLine is set to false.

settings.setEmptyValue(emptyValueInRead)
settings.setMaxCharsPerColumn(maxCharsPerColumn)
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
settings.setLineSeparatorDetectionEnabled(true)
Member

Yup, I would rather enable this only for multiline mode. Also, please describe what this configuration does in the PR description.

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

Also, please fix the PR title to be more descriptive. For instance, [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode.

@SparkQA

SparkQA commented Sep 23, 2018

Test build #96485 has finished for PR 22503 at commit 2f349d7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@justinuang justinuang changed the title from "[SPARK-25493] [SQL] Fix multiline crlf" to "[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode" Sep 24, 2018
@SparkQA

SparkQA commented Sep 24, 2018

Test build #96511 has started for PR 22503 at commit 67d11f1.

@justinuang
Author

It looks like a flake? Can someone retrigger it?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96511/console

@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

Mind explaining what setLineSeparatorDetectionEnabled does in the PR description as well?

settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)

if (multiLine) {
settings.setLineSeparatorDetectionEnabled(true)
Member

@MaxGekk MaxGekk Sep 25, 2018

Wouldn't it be simpler to write just settings.setLineSeparatorDetectionEnabled(multiLine) or settings.setLineSeparatorDetectionEnabled(multiLine == true)?

@SparkQA

SparkQA commented Sep 25, 2018

Test build #96556 has finished for PR 22503 at commit 67d11f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 25, 2018

Test build #96559 has finished for PR 22503 at commit 812e4c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

settings.setEmptyValue(emptyValueInRead)
settings.setMaxCharsPerColumn(maxCharsPerColumn)
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
settings.setLineSeparatorDetectionEnabled(multiLine)
Member

I would use multiLine == true.

Author

done

@HyukjinKwon
Member

Seems fine but I or someone else should take a closer look before getting this in.

@justinuang
Author

Sounds good, thanks guys =)

@justinuang
Author

What does it take to get this to be merged in?

@mccheah
Contributor

mccheah commented Oct 2, 2018

@HyukjinKwon is this ready to be merged in, or is there more feedback to be addressed?

@SparkQA

SparkQA commented Oct 2, 2018

Test build #96865 has finished for PR 22503 at commit 695f676.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

HyukjinKwon commented Oct 14, 2018

I haven't yet checked explicitly what setLineSeparatorDetectionEnabled does in the Univocity parser. Is this exactly the same behaviour as when we read via Hadoop's LineRecordReader? Also, how does it work with setLineSeparator? Essentially we should expose this option too (see #20877 (comment)).

@SparkQA

SparkQA commented Oct 14, 2018

Test build #97359 has finished for PR 22503 at commit 695f676.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@justinuang
Author

So Hadoop's LineReader looks like it handles CR, LF, CRLF:

https://github.com/apache/hadoop/blob/f90c64e6242facf38c2baedeeda42e4a8293e642/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L36

Univocity handles CR, LF, CRLF (the logic is a bit convoluted but it looks like they have the same behavior in that if they see a CR, they will look for a LF next):

https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/input/LineSeparatorDetector.java

I do agree we should expose the option of setLineSeparator, but regardless of that, the default behavior of handling CR, LF, CRLF should be the same between single line and multiline mode.
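
The CR-then-peek-for-LF behaviour described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, assuming detection stops at the first separator it finds. `detect_line_separator` is a hypothetical helper, not Univocity's or Hadoop's actual implementation, and it ignores quoting.

```python
def detect_line_separator(text, default="\n"):
    # Scan for the first CR or LF and report which separator it belongs to.
    for i, ch in enumerate(text):
        if ch == "\n":
            return "\n"
        if ch == "\r":
            # On CR, peek at the next character to tell CRLF from a bare CR.
            if i + 1 < len(text) and text[i + 1] == "\n":
                return "\r\n"
            return "\r"
    # No separator seen at all: fall back to the caller-supplied default.
    return default


print(repr(detect_line_separator("a,b\r\nc,d\r\n")))
```

A real parser would also have to skip separators that appear inside quoted fields, which this sketch does not attempt.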

@HyukjinKwon
Member

@justinuang, okay. Mind rebasing this please?

@justinuang
Author

done!

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97496 has finished for PR 22503 at commit 040047b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

looks good except one question

}
}

test("crlf line separators in multiline mode") {
Member

nit: -> SPARK-25493: crlf line separators in multiline mode

When a PR fixes a specific problem, let's add the JIRA prefix to the test name next time.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 1e6c1d8 Oct 19, 2018
@HyukjinKwon
Member

@justinuang, this might affect existing users' applications. Although this matches the behaviour of non-multiline mode, can we explicitly mention it in the migration guide?

cc @cloud-fan and @gatorsmile

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…iline mode

Closes apache#22503 from justinuang/fix-clrf-multiline.

Authored-by: Justin Uang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
@jonathanneo

jonathanneo commented Jul 2, 2019

@HyukjinKwon CSVs with Windows-style CR-LF ('\r\n') still don't work for me when using multi-line.

I am using Spark 2.4.3 and using the PySpark API.

File: test123-CRLF.zip

When I run the following:
dfCSV = sqlContext.read.format("csv").options(header="true", inferSchema="true", delimiter=",", encoding="UTF-8", escape='"', multiLine="true").load("test123-CRLF.csv")
print(dfCSV.first()) # print the first row

It returns:
Row(Test1='hello', Test2 ='world\r')

@MaxGekk
Member

MaxGekk commented Jul 2, 2019

@jonathanneo This was merged to the master, and will be released with Spark 3.0

@jonathanneo

@MaxGekk Thanks Max. Is there a known workaround in the meantime?
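
One possible stopgap, offered as an assumption rather than something confirmed in this thread, is to normalize the line endings before handing the file to Spark. The sketch below is plain Python, and `normalize_newlines` and the file paths are hypothetical. Note that it rewrites every CR and CRLF, including any inside quoted multi-line fields, reads the whole file into memory, and assumes an ASCII-compatible encoding such as UTF-8.

```python
def normalize_newlines(src_path, dst_path):
    # Read raw bytes so Python performs no implicit newline translation.
    with open(src_path, "rb") as src:
        raw = src.read()
    # Replace CRLF first, then any remaining bare CR, with LF.
    normalized = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(dst_path, "wb") as dst:
        dst.write(normalized)
```

For example, `normalize_newlines("test123-CRLF.csv", "test123-LF.csv")` would write an LF-only copy (the second file name is made up here) that can then be loaded in place of the original.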
