-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode #22503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| settings.setEmptyValue(emptyValueInRead) | ||
| settings.setMaxCharsPerColumn(maxCharsPerColumn) | ||
| settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) | ||
| settings.setLineSeparatorDetectionEnabled(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The auto-detection mechanism is enabled for both - multi-line and per-line mode. I guess it has some overhead on detection of new lines which is not needed in per-line mode. I would benchmark it in both modes (see CSVBenchmarks), and if the overhead in per-line mode is significant, I would not enable the option when multiLine is set to false.
| settings.setEmptyValue(emptyValueInRead) | ||
| settings.setMaxCharsPerColumn(maxCharsPerColumn) | ||
| settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) | ||
| settings.setLineSeparatorDetectionEnabled(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I would rather enable this only for multiline mode. Also, please add what this configuration does in the PR description.
|
ok to test |
|
Also, please fix the PR title to be more descriptive. For instance, |
|
Test build #96485 has finished for PR 22503 at commit
|
|
Test build #96511 has started for PR 22503 at commit |
|
It looks like a flake? Can someone retrigger it? |
|
retest this please |
|
Mind explaining what |
| settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) | ||
|
|
||
| if (multiLine) { | ||
| settings.setLineSeparatorDetectionEnabled(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be simpler just settings.setLineSeparatorDetectionEnabled(multiLine) or settings.setLineSeparatorDetectionEnabled(multiLine == true)?
|
Test build #96556 has finished for PR 22503 at commit
|
|
Test build #96559 has finished for PR 22503 at commit
|
| settings.setEmptyValue(emptyValueInRead) | ||
| settings.setMaxCharsPerColumn(maxCharsPerColumn) | ||
| settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER) | ||
| settings.setLineSeparatorDetectionEnabled(multiLine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would multiLine == true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
Seems fine but I or someone else should take a closer look before getting this in. |
|
Sounds good, thanks guys =) |
|
What does it take to get this to be merged in? |
|
@HyukjinKwon is this ready to be merged in, or is there more feedback to be addressed? |
|
Test build #96865 has finished for PR 22503 at commit
|
|
ok to test |
|
I haven't checked what |
|
Test build #97359 has finished for PR 22503 at commit
|
|
So Hadoop's LineReader looks like it handles CR, LF, CRLF: Univocity handles CR, LF, CRLF (the logic is a bit convoluted but it looks like they have the same behavior in that if they see a CR, they will look for a LF next): I do agree we should expose the option of |
|
@justinuang, okay. Mind rebasing this please? |
695f676 to
040047b
Compare
|
done! |
|
Test build #97496 has finished for PR 22503 at commit
|
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good except one question
| } | ||
| } | ||
|
|
||
| test("crlf line separators in multiline mode") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: -> SPARK-25493: crlf line separators in multiline mode
when a PR fixes a specific problem, let's add the jira prefix in the test name next time.
|
Merged to master. |
|
@justinuang, this might affect existing users application. Although this matches the behaviour to non-miltiline mode, can we explicitly mention it in migration guide? cc @cloud-fan and @gatorsmile |
…iline mode
## What changes were proposed in this pull request?
CSVs with windows style crlf ('\r\n') don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which will detect '\r\n', '\r', or '\n' automatically as it is done by hadoop in single line mode.
## How was this patch tested?
Unit test with a file with crlf line endings.
Closes apache#22503 from justinuang/fix-clrf-multiline.
Authored-by: Justin Uang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
|
@HyukjinKwon CSVs with windows style CR-LF ('\r\n') still doesn't work for me when using multi-line. I am using Spark 2.4.3 and using the PySpark API. File: test123-CRLF.zip When I run the following: It returns: |
|
@jonathanneo This was merged to the master, and will be released with Spark 3.0 |
|
@MaxGekk Thanks Max. Is there a known workaround in the meantime? |
What changes were proposed in this pull request?
CSVs with windows style crlf ('\r\n') don't work in multiline mode. They work fine in single line mode because the line separation is done by Hadoop, which can handle all the different types of line separators. This PR fixes it by enabling Univocity's line separator detection in multiline mode, which will detect '\r\n', '\r', or '\n' automatically as it is done by hadoop in single line mode.
How was this patch tested?
Unit test with a file with crlf line endings.