[SPARK-24945][SQL] Switching to uniVocity 2.7.2 #21892

MaxGekk · 2018-07-27T13:46:21Z

What changes were proposed in this pull request?

In the PR, I propose to upgrade uniVocity parser from 2.6.3 to 2.7.2. The recent version includes a fix for the SPARK-24645 issue. Here is the bug report for uniVocity uniVocity/univocity-parsers#250.

I removed the changes in UnivocityParser introduced by the commit: bd32b50 but leaved the test from the commit.

How was this patch tested?

I tested by CSVSuite and by running CSVBenchmarks. The difference between 2.6.3 and 2.7.2 is 0.2% - 8% except a benchmark for count(). Performance degradation in the last case is x3.8.

Before changes:

Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           33336 / 34122          0.0      666727.0       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         90287 / 91713          0.0       90286.9       1.0X
Select 100 columns                          31826 / 36589          0.0       31826.4       2.8X
Select one column                           25738 / 25872          0.0       25737.9       3.5X
count()                                       6931 / 7269          0.1        6931.5      13.0X

after:

Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           34191 / 34332          0.0      683826.7       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         90446 / 91900          0.0       90446.1       1.0X
Select 100 columns                          34315 / 39895          0.0       34314.9       2.6X
Select one column                           27955 / 28125          0.0       27954.8       3.2X
count()                                     27713 / 27803          0.0       27712.8       3.3X

MaxGekk · 2018-07-27T13:48:18Z

@HyukjinKwon @maropu Please, take a look at the PR. Is it valid to just count number of lines returned by Hadoop LineReader if required schema is empty and do not call parser at all? Maybe there are some corner cases when parser must be called?

SparkQA · 2018-07-27T18:21:02Z

Test build #93665 has finished for PR 21892 at commit b116987.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon

Seems fine to me but I or someone else should take a close look before getting this in.

gatorsmile · 2018-07-30T07:03:46Z

@MaxGekk @HyukjinKwon We are unable to merge this PR since the performance regression is very obvious.

HyukjinKwon · 2018-07-30T07:09:44Z

Ah, that looks going to be addressed in #21909 if you refer the number for count() which's not going to execute the parsing path that this upgrade affects - here's what I meant by a close look.

gatorsmile · 2018-07-30T07:13:13Z

@HyukjinKwon We need to rerun the perf tests after #21909 is merged.

We are also unable to accept the perf regression larger than 3-5%. Based on the description, we still face a performance regression besides count()

HyukjinKwon · 2018-07-30T07:15:46Z

Yea, we should run. 0.2% - 8% can be made by different environment, etc given my past running benchmarks. I am not saying we should merge this now but seems fine because the big perf diff will be fixed in another PR and other numbers look potentially cased other factors.

gatorsmile · 2018-07-30T07:18:16Z

@HyukjinKwon I would suggest to skip this upgrade and then we can get 3.5 times perf improvement for count() .

HyukjinKwon · 2018-07-30T07:20:57Z

we can get 3.5 times perf improvement for count().

We shouldn't merge this as is for clarification. it needs a close look after #21909 which avoids the parsing code path by Univocity. How about this: after #21909, we should rerun the benchmark and see the numbers, and decide if we should reject this one or not.

gatorsmile · 2018-07-30T07:21:41Z

sounds good to me.

jbax · 2018-07-31T08:07:21Z

Did anyone have a chance to test with the 2.7.3-SNAPSHOT build I released to see if the performance issue has been addressed? If it has then let me know and I'll release the final 2.7.3 build.

gatorsmile · 2018-07-31T15:52:11Z

@jbax Thanks for the info!

ping @MaxGekk @HyukjinKwon

MaxGekk · 2018-07-31T19:32:58Z

@jbax I got the following exception on 2.7.3-SNAPSHOT (commit e51b0958a):

Internal state when error was thrown: line=20, column=20481, record=20, charIndex=82594, headers=[col0,..., col999]
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:369)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:673)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$7.apply(UnivocityParser.scala:333)
	...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 20480
	at com.univocity.parsers.common.ParserOutput.valueParsed(ParserOutput.java:316)
	at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:160)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:654)
	... 23 more

This happened on a CSV file with 1000 columns with header and the set of selected indexes is empty. Our settings are:

Parser Configuration: CsvParserSettings:
	Auto configuration enabled=true
	Autodetect column delimiter=false
	Autodetect quotes=false
	Column reordering enabled=true
	Delimiters for detection=null
	Empty value=
	Escape unquoted values=false
	Header extraction enabled=null
	Headers=null
	Ignore leading whitespaces=false
	Ignore leading whitespaces in quotes=false
	Ignore trailing whitespaces=false
	Ignore trailing whitespaces in quotes=false
	Input buffer size=128
	Input reading on separate thread=false
	Keep escape sequences=false
	Keep quotes=false
	Length of content displayed on error=-1
	Line separator detection enabled=false
	Maximum number of characters per column=-1
	Maximum number of columns=20480
	Normalize escaped line separators=true
	Null value=
	Number of records to read=all
	Processor=none
	Restricting data in exceptions=false
	RowProcessor error handler=null
	Selected fields=field selection: []
	Skip bits as whitespace=true
	Skip empty lines=true
	Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
	CsvFormat:
		Comment character=\0
		Field delimiter=,
		Line separator (normalized)=\n
		Line separator sequence=\n
		Quote character="
		Quote escape character=\
		Quote escape escape character=null

Here is the input file (3.5GB uncompressed) - test.csv.xz (you need to change extension):
test.csv.zip

jbax · 2018-08-01T06:27:21Z

Thanks @MaxGekk I've fixed the error and also made the parser run faster than before when processing fields that were not selected in general.

Can you please retest with the latest SNAPSHOT build and let me know how it goes?

MaxGekk · 2018-08-01T21:02:40Z

@jbax It became really faster:

Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           33411 / 33510          0.0      668211.4       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         88028 / 89311          0.0       88028.1       1.0X
Select 100 columns                          29010 / 32755          0.0       29010.1       3.0X
Select one column                           22936 / 22953          0.0       22936.5       3.8X
count()                                     22790 / 23143          0.0       22789.6       3.9X

The count() benchmark is still slower because I reverted the optimization for empty schema. Before we didn't call uniVocity's parseLine if the set of selected indexes is empty. In this PR, I call parseLine for empty set since the bug (uniVocity/univocity-parsers#250) has been fixed. It seems it performs similar to the case when only one column is selected. So, the overhead per line is around 15.5 milliseconds on my CPU.

gatorsmile · 2018-08-01T21:11:29Z

Great! Let us wait for 2.7.3 build? @jbax When will it be released?

maropu · 2018-08-02T02:05:50Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala

    }
  }

-  private val doParse = if (requiredSchema.nonEmpty) {


This change looks good to me though, we need this in this pr? This fix should be addressed in #21909?

Right, switching to 2.7.3 is valuable itself. I will revert the changes and postpone it to #21909 .

jbax · 2018-08-02T06:26:09Z

univocity-parsers-2.7.3 released. Thanks!

maropu · 2018-08-02T06:39:37Z

Also, can you update the description?

MaxGekk · 2018-08-02T12:13:19Z

I opened the separate PR for switching on 2.7.3. Please, take a look at #21969

MaxGekk added 2 commits July 27, 2018 13:59

Bumping version of uniVocity parser up to 2.7.2

7b569ae

Call uniVocity even the set of selected columns is empty

b116987

MaxGekk mentioned this pull request Jul 27, 2018

Inconsistent result of parseLine on empty set of selected indexes uniVocity/univocity-parsers#250

Closed

HyukjinKwon reviewed Jul 30, 2018

View reviewed changes

maropu reviewed Aug 2, 2018

View reviewed changes

MaxGekk mentioned this pull request Aug 2, 2018

[SPARK-24945][SQL] Switching to uniVocity 2.7.3 #21969

Closed

asfgit closed this in b3f2911 Aug 3, 2018

MaxGekk deleted the univocity-2_7_2 branch August 17, 2019 13:33

[SPARK-24945][SQL] Switching to uniVocity 2.7.2 #21892

[SPARK-24945][SQL] Switching to uniVocity 2.7.2 #21892

Uh oh!

Conversation

MaxGekk commented Jul 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

MaxGekk commented Jul 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SparkQA commented Jul 27, 2018

Uh oh!

HyukjinKwon left a comment

Choose a reason for hiding this comment

Uh oh!

gatorsmile commented Jul 30, 2018

Uh oh!

HyukjinKwon commented Jul 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gatorsmile commented Jul 30, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HyukjinKwon commented Jul 30, 2018

Uh oh!

gatorsmile commented Jul 30, 2018

Uh oh!

HyukjinKwon commented Jul 30, 2018

Uh oh!

gatorsmile commented Jul 30, 2018

Uh oh!

jbax commented Jul 31, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gatorsmile commented Jul 31, 2018

Uh oh!

MaxGekk commented Jul 31, 2018

Uh oh!

jbax commented Aug 1, 2018

Uh oh!

MaxGekk commented Aug 1, 2018

Uh oh!

gatorsmile commented Aug 1, 2018

Uh oh!

maropu Aug 2, 2018

Choose a reason for hiding this comment

Uh oh!

MaxGekk Aug 2, 2018

Choose a reason for hiding this comment

Uh oh!

jbax commented Aug 2, 2018

Uh oh!

maropu commented Aug 2, 2018

Uh oh!

MaxGekk commented Aug 2, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

MaxGekk commented Jul 27, 2018 •

edited

Loading

MaxGekk commented Jul 27, 2018 •

edited

Loading

HyukjinKwon commented Jul 30, 2018 •

edited

Loading

gatorsmile commented Jul 30, 2018 •

edited

Loading

jbax commented Jul 31, 2018 •

edited

Loading