[SPARK-22484][DOC] Document PySpark DataFrame csv writer behavior when empty string used as quote #19814

Conversation
srowen left a comment:
The doc change seems OK, but I am not sure whether the behavior is intended. @HyukjinKwon do you know?
As far as I've seen, the univocity library does not support explicitly turning off quoting on the write side.
I think we should document this on the read side too if we are going to do it. Actually, I think this behavior is not intended, as the third-party library does not allow an empty string IIRC. Will take a closer look tomorrow.
Can we turn it off as documented? We could try to open an issue in Univocity if this functionality is not there and incorporate the change in Spark.
python/pyspark/sql/readwriter.py (outdated):

                 separator can be part of the value. If None is set, it uses the default
    -            value, ``"``. If you would like to turn off quotations, you need to set an
    -            empty string.
    +            value, ``"``. If empty string is set, it uses ``u0000``.
If there are doc changes to be done for the options here, let's make sure to change them all. A quick and easy way is to just grep and replace.
Are you referring to the read part? If so, I left it there intentionally. On the read side there is a test which covers this functionality and works as expected: CSVSuite.scala: test("SPARK-15585 turn off quotations"). If you mean something else, please highlight it so I can learn from it.
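For contrast with the write side, here is what "turning off quotations" means on the read side, sketched with Python's stdlib csv module rather than Spark's reader (an analogy, not the CSVSuite test itself): with quoting disabled, quote characters pass through as ordinary data.

```python
import csv
import io

# With quoting turned off on the read side, '"' is just another character
# in the data instead of a field delimiter.
raw = '"a",1\n"b",2\n'
rows = list(csv.reader(io.StringIO(raw), quoting=csv.QUOTE_NONE))
print(rows)  # [['"a"', '1'], ['"b"', '2']]
```

The quotes survive as literal characters in the parsed fields, which is the behavior the SPARK-15585 test exercises on Spark's side.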
Good point, I'll speak with the Univocity guys... Then the question is what we should do with this PR and the JIRA ticket. I'm a newbie, so any help/advice is highly appreciated :)
Seems like there is a good reason why this is not supported. Here is their opinion:
In this case I tend to suggest the original idea. Suggestions/ideas?
ok to test
Test build #84200 has finished for PR 19814 at commit
Seems like the job died because of a JVM internal issue:
Is it possible to restart it?
Yup, seems unrelated. It's fine. Let me get back soon with another look.
To be honest, I would like to suggest disallowing it. I just ran a few tests and it looks like we are still not able to read it back:

Empty quote:

    Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").option("quote", "").csv("tmp.csv")
    spark.read.option("multiLine", true).option("quote", "").csv("tmp.csv").collect()

If \u0000 really disables quoting when read, I think it should give the same results, but the output is different as above: it's Array(0, 98, 32) vs Array(0, 0, 98, 32).

Default quote:

    Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").csv("tmp.csv")
    spark.read.option("multiLine", true).csv("tmp.csv").collect()

Another quote:

    Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").option("quote", "!").csv("tmp.csv")
    spark.read.option("multiLine", true).option("quote", "!").csv("tmp.csv").collect()
I believe the Univocity parser does not describe this behaviour of ignoring quotes when it's set to \u0000.
Will ask the author about this.
OK. Explicit description not found. I've just tested it manually and taken a look at the source. It tries to match the character with the configured quote, and it never matches because \u0000 is the end-of-string character. All in all, I believe we have multiple problems:
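The matching problem described above can be illustrated with a minimal, hypothetical sketch (illustrative Python, not univocity's actual source): if a parser uses \u0000 as its internal end-of-input sentinel, a configured quote of \u0000 can never be matched as data.

```python
# Hypothetical quote-matching loop: '\x00' doubles as the end-of-input
# sentinel, so a configured quote of '\x00' is unreachable as data.
EOF_SENTINEL = "\x00"

def find_quote(text: str, quote: str) -> int:
    """Return the index of the first quote character, or -1."""
    for i, ch in enumerate(text + EOF_SENTINEL):
        if ch == EOF_SENTINEL:   # treated as end of input first...
            return -1            # ...so a '\x00' quote never matches
        if ch == quote:
            return i
    return -1

print(find_quote('say "hi"', '"'))       # 4: a normal quote is found
print(find_quote("say \x00hi", "\x00"))  # -1: the null quote never matches
```

This is only a sketch of the failure mode; the real parser's loop is more involved, but the sentinel collision is the same idea.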
python/pyspark/sql/readwriter.py (outdated):

                 separator can be part of the value. If None is set, it uses the default
    -            value, ``"``. If you would like to turn off quotations, you need to set an
    -            empty string.
    +            value, ``"``. If empty string is set, it uses ``u0000``.
If empty string -> If an empty string
and maybe a small additional comment for u0000 would be nicer, e.g., noting that it's the null character.
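For reference, u0000 is the Unicode null character, code point 0 — a real character, not the absence of a quote:

```python
# \u0000 (the null character) is simply code point 0.
null_char = "\u0000"
print(ord(null_char))        # 0
print(null_char == chr(0))   # True
print(len(null_char))        # 1: a real character, not "no quote"
```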
Fixing...
Okay, I am fine with the current change for now. This PR fixes the lie, BTW, and it seems difficult to fix the behaviour as intended. Will probably make a follow-up after a small talk with the author separately, if possible and required.
Suggested fix added. Happy to contribute to the follow-ups if there are opportunities. Thanks :)
Test build #84219 has finished for PR 19814 at commit
Test build #3992 has finished for PR 19814 at commit
Test build #84220 has finished for PR 19814 at commit
Please see the small discussion in
Merged to master. |
What changes were proposed in this pull request?
In the PySpark API documentation, DataFrame.write.csv() says that setting the quote parameter to an empty string turns off quoting. In fact, it uses the null character (u0000) as the quote.
This PR fixes the doc.
How was this patch tested?
Manual.