
Conversation

@gaborgsomogyi
Contributor

What changes were proposed in this pull request?

In the PySpark API documentation, DataFrame.write.csv() says that setting the quote parameter to an empty string turns off quoting. Instead, it uses the null character (u0000) as the quote.

This PR fixes the doc.
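The distinction matters in practice. As an illustration using Python's standard csv module (a hedged sketch of the general CSV quoting concept, not Spark or Univocity itself): a minimal-quoting writer wraps values containing the delimiter in quotes, while truly turning quoting off requires escaping instead.

```python
import csv
import io

# A value containing the delimiter: with minimal quoting the writer
# wraps it in quotes; with quoting turned off it must escape instead.
row = ["a,b", "c"]

quoted = io.StringIO()
csv.writer(quoted, quoting=csv.QUOTE_MINIMAL).writerow(row)
print(quoted.getvalue().strip())  # "a,b",c

unquoted = io.StringIO()
csv.writer(unquoted, quoting=csv.QUOTE_NONE, escapechar="\\").writerow(row)
print(unquoted.getvalue().strip())  # a\,b,c
```

Spark's behavior with quote="" corresponds to neither of these modes: it silently substitutes u0000 as the quote character, which is what this PR documents.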

How was this patch tested?

Manual.

cd python/docs
make html
open _build/html/pyspark.sql.html

@srowen
Member

The doc change seems OK, but I am not sure if the behavior is the intended behavior or not. @HyukjinKwon do you know?

@gaborgsomogyi
Contributor Author

As far as I've seen, the univocity library does not support explicitly turning off quoting on the write side.

@HyukjinKwon
Member

I think we should document this on the read side too if we are going to do it. Actually, I think this behavior is not intended, since the third-party library does not allow an empty string IIRC. Will take a closer look tomorrow.

@HyukjinKwon
Member

HyukjinKwon commented Nov 26, 2017

Can we turn quoting off as documented? We could open an issue in Univocity if this functionality is missing and then incorporate the change in Spark.

 separator can be part of the value. If None is set, it uses the default
-value, ``"``. If you would like to turn off quotations, you need to set an
-empty string.
+value, ``"``. If empty string is set, it uses ``u0000``.
@HyukjinKwon
Member

HyukjinKwon commented Nov 26, 2017

If there are doc changes to be done for these options, let's make sure to change all of them. A quick and easy way is just to grep and replace.

@gaborgsomogyi
Contributor Author

Are you referring to the read part? If so, I left it there intentionally. On the read side there is a test which covers this functionality, and it works as expected.

CSVSuite.scala:test("SPARK-15585 turn off quotations")

If you mean something else, please point it out so I can learn from it.

@gaborgsomogyi
Contributor Author

Good point, I'll speak with the Univocity guys... Then the question is what we should do with this PR and the Jira ticket. I'm a newbie, so any help/advice is highly appreciated :)

@gaborgsomogyi
Contributor Author

Seems like there is a good reason why this is not supported. Here is their opinion:

The writer will only put quotes when absolutely required (i.e. the value has a delimiter, line ending, or another quote in it). If you don't want to output quotes in these cases the parser won't be able to read the data back properly. It makes no sense to force the CSV writer to not use quotes.

If you really want to do this you'd be better off just writing your data directly to an instance of java.io.Writer (FileWriter, OutputStreamWriter, StringWriter, etc)
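The author's point can be sketched outside of Spark with Python's standard csv module (an illustration of the general CSV round-trip problem, not Univocity's code): if a value containing the delimiter is written without quotes or escapes, the reader splits it into the wrong number of fields.

```python
import csv
import io

# Two fields, one of which contains the delimiter.
row = ["a,b", "c"]

# What a writer with quoting forced off (and no escaping) would emit:
raw = ",".join(row)
print(raw)  # a,b,c

# Reading it back yields three fields instead of two.
fields = next(csv.reader(io.StringIO(raw)))
print(fields)  # ['a', 'b', 'c']
```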

In this case I tend to suggest the original idea. Suggestions/ideas?

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84200 has finished for PR 19814 at commit 821ee19.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

Seems like the job died because of a JVM internal issue:

*** glibc detected *** /usr/java/jdk1.8.0_60/bin/java: double free or corruption (out): 0x0000000100038720 ***

Is it possible to restart it?

@HyukjinKwon
Member

Yup, seems unrelated. It's fine. Let me come back soon with another look.

@HyukjinKwon
Member

HyukjinKwon commented Nov 27, 2017

To be honest, I would like to suggest disallowing it. I just ran a few tests, and it looks like we are still not able to read the data back:

Empty quote (\u0000):

Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").option("quote", "").csv("tmp.csv")
spark.read.option("multiLine", true).option("quote", "").csv("tmp.csv").collect()
Array[org.apache.spark.sql.Row] = Array([ a?,??b ], [b?,null], [ c?,??d ], [d?,null])

If \u0000 really disables quoting when read, I think it should give the same results when quote is another character for example:

Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").option("quote", "").csv("tmp.csv")
spark.read.option("multiLine", true).option("quote", "^").csv("tmp.csv").collect()
Array[org.apache.spark.sql.Row] = Array([ a?,?b ], [b?,null], [ c?,?d ], [d?,null])

but the output is different, as shown above.

It's Array(0, 98, 32) vs Array(0, 0, 98, 32) in "?b " vs "??b "

Default quote:

Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").csv("tmp.csv")
spark.read.option("multiLine", true).csv("tmp.csv").collect()
Array[org.apache.spark.sql.Row] =
Array([a
 a,b
b], [c
 c,d
d])

Another quote:

Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").option("quote", "!").csv("tmp.csv")
spark.read.option("multiLine", true).option("quote", "!").csv("tmp.csv").collect()
Array[org.apache.spark.sql.Row] =
Array([a
 a,b
b], [c
 c,d
d])

@HyukjinKwon
Member

I believe the Univocity parser does not document this behaviour of ignoring quotes when the quote is set to \u0000 while parsing?

@HyukjinKwon
Member

Will ask the author about this.

@gaborgsomogyi
Contributor Author

OK, I did not find an explicit description. I've just tested it manually and taken a look at the source. The parser tries to match each character against the configured quote, and it never matches because u0000 is the null (end-of-string) character.

All in all I believe we have multiple problems:

  1. The documentation lies. This PR is meant to fix that.
  2. Exceptions are thrown, or the output is not the same, when u0000 is used for writing and reading. For this we could disable the u0000 usage.
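The matching problem can be sketched with a toy model of a parser's quote comparison (a hedged simplification for illustration, not the actual Univocity source): ordinary text input essentially never contains the null character, so a quote of u0000 simply never matches, which is why it behaves like quoting turned off on the read side.

```python
# Toy model of quote matching: the parser compares each input
# character against the configured quote character.
quote = "\u0000"  # the null character Spark substitutes for ""
line = '"a,b",c'

# No character of normal CSV text equals \u0000, so the quote is
# effectively never recognized.
print(any(ch == quote for ch in line))  # False
```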

 separator can be part of the value. If None is set, it uses the default
-value, ``"``. If you would like to turn off quotations, you need to set an
-empty string.
+value, ``"``. If empty string is set, it uses ``u0000``.
Member

If empty string -> If an empty string

Member

And maybe a small additional comment for u0000 would be nicer, e.g., noting that it is the null character.

@gaborgsomogyi
Contributor Author

Fixing...

@HyukjinKwon
Member

HyukjinKwon commented Nov 27, 2017

Okay, I am fine with the current change for now. This PR fixes the incorrect documentation, BTW, and it seems difficult to fix the behaviour as intended.

Will probably make a follow-up after talking with the author separately, if possible and required.

@gaborgsomogyi
Contributor Author

Suggested fix added. Happy to contribute to the follow-ups if there are opportunities. Thanks :)

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84219 has finished for PR 19814 at commit 4bf4876.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2017

Test build #3992 has finished for PR 19814 at commit 821ee19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2017

Test build #84220 has finished for PR 19814 at commit 46efe32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Nov 28, 2017

Please see the small discussion in https://github.com/uniVocity/univocity-parsers/issues/215 for more information if anyone wonders about this PR.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 33d43bf Nov 28, 2017
@gaborgsomogyi gaborgsomogyi deleted the SPARK-22484 branch November 28, 2017 08:14