Skip to content

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Jul 27, 2018

What changes were proposed in this pull request?

In the PR, I propose to upgrade uniVocity parser from 2.6.3 to 2.7.2. The recent version includes a fix for the SPARK-24645 issue. Here is the bug report for uniVocity uniVocity/univocity-parsers#250.

I removed the changes in UnivocityParser introduced by the commit: bd32b50 but leaved the test from the commit.

How was this patch tested?

I tested by CSVSuite and by running CSVBenchmarks. The difference between 2.6.3 and 2.7.2 is 0.2% - 8% except a benchmark for count(). Performance degradation in the last case is x3.8.

Before changes:

Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           33336 / 34122          0.0      666727.0       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         90287 / 91713          0.0       90286.9       1.0X
Select 100 columns                          31826 / 36589          0.0       31826.4       2.8X
Select one column                           25738 / 25872          0.0       25737.9       3.5X
count()                                       6931 / 7269          0.1        6931.5      13.0X

after:

Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           34191 / 34332          0.0      683826.7       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         90446 / 91900          0.0       90446.1       1.0X
Select 100 columns                          34315 / 39895          0.0       34314.9       2.6X
Select one column                           27955 / 28125          0.0       27954.8       3.2X
count()                                     27713 / 27803          0.0       27712.8       3.3X

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 27, 2018

@HyukjinKwon @maropu Please, take a look at the PR. Is it valid to just count number of lines returned by Hadoop LineReader if required schema is empty and do not call parser at all? Maybe there are some corner cases when parser must be called?

@SparkQA
Copy link

SparkQA commented Jul 27, 2018

Test build #93665 has finished for PR 21892 at commit b116987.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine to me but I or someone else should take a close look before getting this in.

@gatorsmile
Copy link
Member

@MaxGekk @HyukjinKwon We are unable to merge this PR since the performance regression is very obvious.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jul 30, 2018

Ah, that looks going to be addressed in #21909 if you refer the number for count() which's not going to execute the parsing path that this upgrade affects - here's what I meant by a close look.

@gatorsmile
Copy link
Member

gatorsmile commented Jul 30, 2018

@HyukjinKwon We need to rerun the perf tests after #21909 is merged.

We are also unable to accept the perf regression larger than 3-5%. Based on the description, we still face a performance regression besides count()

@HyukjinKwon
Copy link
Member

Yea, we should run. 0.2% - 8% can be made by different environment, etc given my past running benchmarks. I am not saying we should merge this now but seems fine because the big perf diff will be fixed in another PR and other numbers look potentially cased other factors.

@gatorsmile
Copy link
Member

@HyukjinKwon I would suggest to skip this upgrade and then we can get 3.5 times perf improvement for count() .

@HyukjinKwon
Copy link
Member

we can get 3.5 times perf improvement for count().

We shouldn't merge this as is for clarification. it needs a close look after #21909 which avoids the parsing code path by Univocity. How about this: after #21909, we should rerun the benchmark and see the numbers, and decide if we should reject this one or not.

@gatorsmile
Copy link
Member

sounds good to me.

@jbax
Copy link

jbax commented Jul 31, 2018

Did anyone have a chance to test with the 2.7.3-SNAPSHOT build I released to see if the performance issue has been addressed? If it has then let me know and I'll release the final 2.7.3 build.

@gatorsmile
Copy link
Member

@jbax Thanks for the info!

ping @MaxGekk @HyukjinKwon

@MaxGekk
Copy link
Member Author

MaxGekk commented Jul 31, 2018

@jbax I got the following exception on 2.7.3-SNAPSHOT (commit e51b0958a):

Internal state when error was thrown: line=20, column=20481, record=20, charIndex=82594, headers=[col0,..., col999]
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:369)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:673)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$7.apply(UnivocityParser.scala:333)
	...
Caused by: java.lang.ArrayIndexOutOfBoundsException: 20480
	at com.univocity.parsers.common.ParserOutput.valueParsed(ParserOutput.java:316)
	at com.univocity.parsers.csv.CsvParser.parseRecord(CsvParser.java:160)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:654)
	... 23 more

This happened on a CSV file with 1000 columns with header and the set of selected indexes is empty. Our settings are:

Parser Configuration: CsvParserSettings:
	Auto configuration enabled=true
	Autodetect column delimiter=false
	Autodetect quotes=false
	Column reordering enabled=true
	Delimiters for detection=null
	Empty value=
	Escape unquoted values=false
	Header extraction enabled=null
	Headers=null
	Ignore leading whitespaces=false
	Ignore leading whitespaces in quotes=false
	Ignore trailing whitespaces=false
	Ignore trailing whitespaces in quotes=false
	Input buffer size=128
	Input reading on separate thread=false
	Keep escape sequences=false
	Keep quotes=false
	Length of content displayed on error=-1
	Line separator detection enabled=false
	Maximum number of characters per column=-1
	Maximum number of columns=20480
	Normalize escaped line separators=true
	Null value=
	Number of records to read=all
	Processor=none
	Restricting data in exceptions=false
	RowProcessor error handler=null
	Selected fields=field selection: []
	Skip bits as whitespace=true
	Skip empty lines=true
	Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
	CsvFormat:
		Comment character=\0
		Field delimiter=,
		Line separator (normalized)=\n
		Line separator sequence=\n
		Quote character="
		Quote escape character=\
		Quote escape escape character=null

Here is the input file (3.5GB uncompressed) - test.csv.xz (you need to change extension):
test.csv.zip

@jbax
Copy link

jbax commented Aug 1, 2018

Thanks @MaxGekk I've fixed the error and also made the parser run faster than before when processing fields that were not selected in general.

Can you please retest with the latest SNAPSHOT build and let me know how it goes?

@MaxGekk
Copy link
Member Author

MaxGekk commented Aug 1, 2018

@jbax It became really faster:

Parsing quoted values:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
One quoted string                           33411 / 33510          0.0      668211.4       1.0X

Wide rows with 1000 columns:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Select 1000 columns                         88028 / 89311          0.0       88028.1       1.0X
Select 100 columns                          29010 / 32755          0.0       29010.1       3.0X
Select one column                           22936 / 22953          0.0       22936.5       3.8X
count()                                     22790 / 23143          0.0       22789.6       3.9X

The count() benchmark is still slower because I reverted the optimization for empty schema. Before we didn't call uniVocity's parseLine if the set of selected indexes is empty. In this PR, I call parseLine for empty set since the bug (uniVocity/univocity-parsers#250) has been fixed. It seems it performs similar to the case when only one column is selected. So, the overhead per line is around 15.5 milliseconds on my CPU.

@gatorsmile
Copy link
Member

Great! Let us wait for 2.7.3 build? @jbax When will it be released?

}
}

private val doParse = if (requiredSchema.nonEmpty) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks good to me though, we need this in this pr? This fix should be addressed in #21909?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, switching to 2.7.3 is valuable itself. I will revert the changes and postpone it to #21909 .

@jbax
Copy link

jbax commented Aug 2, 2018

univocity-parsers-2.7.3 released. Thanks!

@maropu
Copy link
Member

maropu commented Aug 2, 2018

Also, can you update the description?

@MaxGekk
Copy link
Member Author

MaxGekk commented Aug 2, 2018

I opened the separate PR for switching on 2.7.3. Please, take a look at #21969

@asfgit asfgit closed this in b3f2911 Aug 3, 2018
@MaxGekk MaxGekk deleted the univocity-2_7_2 branch August 17, 2019 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants