[SPARK-24945][SQL] Switching to uniVocity 2.7.2 #21892
|
@HyukjinKwon @maropu Please take a look at the PR. Is it valid to just count the number of lines returned by Hadoop LineReader if the required schema is empty, and not call the parser at all? Maybe there are some corner cases where the parser must be called? |
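To make the question concrete, here is a minimal sketch of the idea, not Spark's actual implementation. The object `EmptySchemaCount`, the naive `parseLine` helper, and the `requiredColumns` parameter are all hypothetical names introduced for illustration; a real CSV parser handles quoting and escaping, which this stand-in does not.

```scala
// Hypothetical sketch: skip the parser entirely when no columns are required.
object EmptySchemaCount {
  // Naive stand-in for a real CSV parser: splits on commas, no quoting support.
  def parseLine(line: String): Array[String] = line.split(",", -1)

  // Produces one row per input line. When requiredColumns is empty,
  // an empty row is emitted for each line and parseLine is never called.
  def rows(lines: Iterator[String], requiredColumns: Seq[Int]): Iterator[Seq[String]] = {
    val doParse: String => Seq[String] =
      if (requiredColumns.nonEmpty) { line =>
        val fields = parseLine(line)
        requiredColumns.map(i => fields(i))
      } else { _ =>
        Seq.empty[String]
      }
    lines.map(doParse)
  }
}
```

One corner case where this shortcut and a real parser can disagree is a quoted field that contains a line break: a line reader counts two lines, while a parser that handles quotes sees a single record. Malformed rows that a parser would drop are another.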
|
Test build #93665 has finished for PR 21892 at commit
|
HyukjinKwon
left a comment
Seems fine to me but I or someone else should take a close look before getting this in.
|
@MaxGekk @HyukjinKwon We are unable to merge this PR since the performance regression is very obvious. |
|
Ah, that looks like it is going to be addressed in #21909, if you are referring to the number for |
|
@HyukjinKwon We need to rerun the perf tests after #21909 is merged. We are also unable to accept the perf regression larger than |
|
Yeah, we should rerun. A 0.2% - 8% difference can be caused by a different environment, etc., given my past benchmark runs. I am not saying we should merge this now, but it seems fine because the big perf diff will be fixed in another PR and the other numbers look potentially caused by other factors. |
|
@HyukjinKwon I would suggest skipping this upgrade; then we can get a 3.5 times perf improvement for |
To clarify, we shouldn't merge this as is. It needs a close look after #21909, which avoids the Univocity parsing code path. How about this: after #21909, we rerun the benchmark, look at the numbers, and decide whether to reject this one or not. |
|
sounds good to me. |
|
Did anyone have a chance to test with the 2.7.3-SNAPSHOT build I released to see if the performance issue has been addressed? If it has then let me know and I'll release the final 2.7.3 build. |
|
@jbax Thanks for the info! ping @MaxGekk @HyukjinKwon |
|
@jbax I got the following exception on 2.7.3-SNAPSHOT (commit e51b0958a): This happened on a CSV file with 1000 columns with header and the set of selected indexes is empty. Our settings are: Here is the input file (3.5GB uncompressed) - test.csv.xz (you need to change extension): |
|
Thanks @MaxGekk I've fixed the error and also made the parser run faster than before when processing fields that were not selected in general. Can you please retest with the latest SNAPSHOT build and let me know how it goes? |
|
@jbax It became much faster: The |
|
Great! Let us wait for 2.7.3 build? @jbax When will it be released? |
  }
}

private val doParse = if (requiredSchema.nonEmpty) {
This change looks good to me, but do we need it in this PR? Shouldn't this fix be addressed in #21909?
Right, switching to 2.7.3 is valuable in itself. I will revert the changes and postpone them to #21909.
|
univocity-parsers-2.7.3 released. Thanks! |
|
Also, can you update the description? |
|
I opened a separate PR for switching to 2.7.3. Please take a look at #21969
What changes were proposed in this pull request?
In the PR, I propose to upgrade the uniVocity parser from 2.6.3 to 2.7.2. The recent version includes a fix for the SPARK-24645 issue. Here is the uniVocity bug report: uniVocity/univocity-parsers#250.
I removed the changes in UnivocityParser introduced by the commit bd32b50 but left the test from the commit.
How was this patch tested?
I tested with CSVSuite and by running CSVBenchmarks. The difference between 2.6.3 and 2.7.2 is 0.2% - 8%, except for the count() benchmark, where performance degrades by a factor of 3.8.
Before changes:
After: