[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode #22503

justinuang · 2018-09-20T20:54:03Z

What changes were proposed in this pull request?

CSVs with Windows-style CRLF ('\r\n') line endings don't work in multiline mode. They work fine in single-line mode because line separation is done by Hadoop, which handles all of the different line separators. This PR fixes that by enabling Univocity's line separator detection in multiline mode, which automatically detects '\r\n', '\r', or '\n', as Hadoop does in single-line mode.

How was this patch tested?

Unit test with a file with CRLF line endings.

MaxGekk · 2018-09-21T13:25:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala

    settings.setEmptyValue(emptyValueInRead)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+    settings.setLineSeparatorDetectionEnabled(true)


The auto-detection mechanism is enabled for both multi-line and per-line mode. I guess detecting the line separator adds some overhead, which is not needed in per-line mode. I would benchmark it in both modes (see CSVBenchmarks), and if the overhead in per-line mode is significant, I would not enable the option when multiLine is set to false.

HyukjinKwon · 2018-09-23T05:41:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala

    settings.setEmptyValue(emptyValueInRead)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+    settings.setLineSeparatorDetectionEnabled(true)


Yup, I would rather enable this only for multiline mode. Also, please add what this configuration does in the PR description.

HyukjinKwon · 2018-09-23T05:41:10Z

ok to test

HyukjinKwon · 2018-09-23T05:51:04Z

Also, please fix the PR title to be more descriptive. For instance, [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode.

SparkQA · 2018-09-23T07:05:02Z

Test build #96485 has finished for PR 22503 at commit 2f349d7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-24T14:56:59Z

Test build #96511 has started for PR 22503 at commit 67d11f1.

justinuang · 2018-09-25T15:29:39Z

It looks like a flake? Can someone retrigger it?
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96511/console

HyukjinKwon · 2018-09-25T15:41:52Z

retest this please

HyukjinKwon · 2018-09-25T15:42:21Z

Mind explaining what setLineSeparatorDetectionEnabled does in the PR description as well?

MaxGekk · 2018-09-25T15:54:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala

    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+
+    if (multiLine) {
+      settings.setLineSeparatorDetectionEnabled(true)


Wouldn't it be simpler to write just settings.setLineSeparatorDetectionEnabled(multiLine) or settings.setLineSeparatorDetectionEnabled(multiLine == true)?

SparkQA · 2018-09-25T19:36:36Z

Test build #96556 has finished for PR 22503 at commit 67d11f1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-09-25T20:42:42Z

Test build #96559 has finished for PR 22503 at commit 812e4c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-09-26T02:25:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala

    settings.setEmptyValue(emptyValueInRead)
    settings.setMaxCharsPerColumn(maxCharsPerColumn)
    settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+    settings.setLineSeparatorDetectionEnabled(multiLine)


I would use multiLine == true.

HyukjinKwon · 2018-09-26T02:26:21Z

Seems fine but I or someone else should take a closer look before getting this in.

justinuang · 2018-09-26T14:48:46Z

Sounds good, thanks guys =)

justinuang · 2018-09-28T19:36:16Z

What does it take to get this to be merged in?

mccheah · 2018-10-02T17:50:13Z

@HyukjinKwon is this ready to be merged in, or is there more feedback to be addressed?

SparkQA · 2018-10-02T22:00:37Z

Test build #96865 has finished for PR 22503 at commit 695f676.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-14T08:06:51Z

ok to test

HyukjinKwon · 2018-10-14T08:09:22Z

I haven't yet checked exactly what setLineSeparatorDetectionEnabled does in the Univocity parser. Is this exactly the same behaviour as when we read it via Hadoop's LineRecordReader? Also, how does it work with setLineSeparator? Essentially we should expose this option too (see #20877 (comment)).

SparkQA · 2018-10-14T11:37:01Z

Test build #97359 has finished for PR 22503 at commit 695f676.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

justinuang · 2018-10-16T18:39:31Z

So Hadoop's LineReader looks like it handles CR, LF, CRLF:

https://github.com/apache/hadoop/blob/f90c64e6242facf38c2baedeeda42e4a8293e642/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L36

Univocity handles CR, LF, CRLF (the logic is a bit convoluted but it looks like they have the same behavior in that if they see a CR, they will look for a LF next):

https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/input/LineSeparatorDetector.java

I do agree we should expose the option of setLineSeparator, but regardless of that, the default behavior of handling CR, LF, CRLF should be the same between single line and multiline mode.
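
For illustration, here is a minimal Python sketch of the CR / LF / CRLF check described above (on seeing a CR, look at the next character for an LF). It is not a port of Hadoop's LineReader or of Univocity's LineSeparatorDetector; the function name detect_line_separator and its default argument are hypothetical, and the sketch ignores quoting entirely.

```python
def detect_line_separator(sample: str, default: str = "\n") -> str:
    # Scan the sample and return the first line separator found:
    # CRLF, a lone CR, or a lone LF. Fall back to `default` if none is present.
    for i, ch in enumerate(sample):
        if ch == "\n":
            return "\n"
        if ch == "\r":
            # A CR immediately followed by LF is a Windows-style CRLF;
            # otherwise treat the lone CR as the separator.
            if i + 1 < len(sample) and sample[i + 1] == "\n":
                return "\r\n"
            return "\r"
    return default


crlf = detect_line_separator("a,b\r\nc,d\r\n")
lf = detect_line_separator("a,b\nc,d\n")
cr = detect_line_separator("a,b\rc,d\r")
```

With detection of this kind, the same input is split identically regardless of which of the three conventions the file uses, which is the default behavior argued for above.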

HyukjinKwon · 2018-10-17T04:43:57Z

@justinuang, okay. Mind rebasing this please?

justinuang · 2018-10-17T15:42:18Z

done!

SparkQA · 2018-10-17T19:17:32Z

Test build #97496 has finished for PR 22503 at commit 040047b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/core/src/test/resources/test-data/cars-crlf.csv

HyukjinKwon

looks good except one question

HyukjinKwon · 2018-10-19T03:12:43Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala

    }
  }

+  test("crlf line separators in multiline mode") {


nit: -> SPARK-25493: crlf line separators in multiline mode

When a PR fixes a specific problem, let's add the JIRA prefix to the test name next time.

HyukjinKwon · 2018-10-19T03:12:56Z

Merged to master.

HyukjinKwon · 2018-10-25T05:54:19Z

@justinuang, this might affect existing users' applications. Although this matches the behaviour of non-multiline mode, can we explicitly mention it in the migration guide?

cc @cloud-fan and @gatorsmile

…iline mode

## What changes were proposed in this pull request?

CSVs with windows style crlf ('\r\n') don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which will detect '\r\n', '\r', or '\n' automatically as it is done by hadoop in single line mode.

## How was this patch tested?

Unit test with a file with crlf line endings.

Closes apache#22503 from justinuang/fix-clrf-multiline.

Authored-by: Justin Uang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

jonathanneo · 2019-07-02T08:17:38Z

@HyukjinKwon CSVs with Windows-style CR-LF ('\r\n') line endings still don't work for me in multi-line mode.

I am using Spark 2.4.3 and using the PySpark API.

File: test123-CRLF.zip

When I run the following:
dfCSV = sqlContext.read.format("csv").options(header="true", inferSchema="true", delimiter=",", encoding="UTF-8",escape='"', multiLine="true").load("test123-CRLF.csv")
print(dfCSV.first()) # print the first row

It returns:
Row(Test1='hello', Test2 ='world\r')
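
A plain-Python sketch (no Spark involved) of how a trailing '\r' like this can arise when CRLF data is split on a fixed '\n' instead of a detected separator. The helper split_rows and the sample string are hypothetical stand-ins for the attached file, and the naive comma split ignores quoting.

```python
# Two-line CSV with Windows-style CRLF line endings.
data = "Test1,Test2\r\nhello,world\r\n"


def split_rows(text: str, line_sep: str) -> list:
    # Split into lines on the given separator, drop empty lines,
    # then split each line on commas (no quote handling).
    return [line.split(",") for line in text.split(line_sep) if line]


# Splitting on LF only leaves the CR attached to the last field of every row,
# including the header row.
rows_lf = split_rows(data, "\n")

# Splitting on the actual separator gives clean values.
rows_crlf = split_rows(data, "\r\n")
```

In this sketch the last header field also keeps its '\r', so both the column name and the value of the final column are affected.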

MaxGekk · 2019-07-02T09:15:46Z

@jonathanneo This was merged to master and will be released with Spark 3.0.

jonathanneo · 2019-07-02T12:23:30Z

@MaxGekk Thanks Max. Is there a known workaround in the meantime?
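
One possible stopgap, offered as an assumption rather than something suggested in this thread, is to rewrite the file with LF line endings before loading it. The sketch below does this in plain Python; the function name normalize_crlf and the output file name test123-LF.csv are hypothetical. It assumes a single-byte-compatible encoding such as UTF-8, reads the whole file into memory, and also converts any CRLF embedded inside quoted multi-line fields, which may or may not be acceptable.

```python
import os
import tempfile


def normalize_crlf(src_path: str, dst_path: str) -> None:
    # Copy src to dst, replacing every CRLF byte pair with a single LF.
    # Note: this also rewrites CRLF sequences inside quoted fields.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read().replace(b"\r\n", b"\n"))


# Small self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test123-CRLF.csv")
    dst = os.path.join(tmp, "test123-LF.csv")
    with open(src, "wb") as f:
        f.write(b"Test1,Test2\r\nhello,world\r\n")
    normalize_crlf(src, dst)
    with open(dst, "rb") as f:
        normalized = f.read()
```

The normalized copy can then be passed to the same read call in place of the original file.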

MaxGekk reviewed Sep 21, 2018

HyukjinKwon reviewed Sep 23, 2018

justinuang changed the title ~~[SPARK-25493] [SQL] Fix multiline crlf~~ [SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode Sep 24, 2018

MaxGekk reviewed Sep 25, 2018

HyukjinKwon reviewed Sep 26, 2018

Justin Uang added 5 commits October 17, 2018 11:41

Fix multiline crlf

f317891

remove unnecessary line

05c2fcb

Only turn on line separator detection on multiline mode

aedfbd7

simplify setting line detection

2a2e65e

address cr

040047b

justinuang force-pushed the fix-clrf-multiline branch from 695f676 to 040047b Compare October 17, 2018 15:41

HyukjinKwon reviewed Oct 18, 2018

sql/core/src/test/resources/test-data/cars-crlf.csv

HyukjinKwon approved these changes Oct 18, 2018

HyukjinKwon reviewed Oct 19, 2018

asfgit closed this in 1e6c1d8 Oct 19, 2018

HyukjinKwon mentioned this pull request Apr 6, 2019

[SPARK-26108][SQL] Support custom lineSep in CSV datasource #23080

Closed
