[SPARK-22484][DOC] Document PySpark DataFrame csv writer behavior when empty string used as quote #19814

Conversation
srowen left a comment:
The doc change seems OK, but I am not sure whether the behavior is intended. @HyukjinKwon do you know?
As far as I've seen, the univocity library does not support explicitly turning off quoting on the write side.
I think we should document this on the read side too if we are going to do it. Actually, I think this behavior is not intended, as the third-party library does not allow an empty string IIRC. Will take a closer look tomorrow.
Can we turn it off as documented? We could try to open an issue in Univocity if this functionality is not there and incorporate the change in Spark.
python/pyspark/sql/readwriter.py (outdated):

                 separator can be part of the value. If None is set, it uses the default
    -            value, ``"``. If you would like to turn off quotations, you need to set an
    -            empty string.
    +            value, ``"``. If empty string is set, it uses ``u0000``.
If there are doc changes to be done for the options here, let's make sure to change them all. A quick and easy way is to just grep and replace.
Are you referring to the read part? If so, I left it there intentionally. On the read side there is a test which covers this functionality and works as expected: CSVSuite.scala: test("SPARK-15585 turn off quotations"). If you mean something else, please highlight it so I can learn from it.
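For contrast with the write side, here is what "turning off quotations" means on the read side, sketched with Python's stdlib csv module rather than Spark's reader (an analogy, not the CSVSuite test itself): with quoting disabled, quote characters pass through as ordinary data.

```python
import csv
import io

# With quoting turned off on the read side, '"' is just another character
# in the data instead of a field delimiter.
raw = '"a",1\n"b",2\n'
rows = list(csv.reader(io.StringIO(raw), quoting=csv.QUOTE_NONE))
print(rows)  # [['"a"', '1'], ['"b"', '2']]
```

The quotes survive as literal characters in the parsed fields, which is the behavior the SPARK-15585 test exercises on Spark's side.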
Good point, I'll speak with the Univocity guys... Then the question is what we should do with this PR and the JIRA ticket. I'm a newbie, so any help/advice is highly appreciated :)
Seems like there is a good reason why this is not supported. Here is their opinion:
In this case I tend to suggest the original idea. Suggestions/ideas?
ok to test
Test build #84200 has finished for PR 19814 at commit
Seems like the job died because of a JVM internal issue:
Is it possible to restart it?
Yup, seems unrelated. It's fine. Let me get back soon with another look.
To be honest, I would like to suggest disallowing it. I just ran a few tests and it looks like we are still not able to read it back:

Empty quote:

    Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").option("quote", "").csv("tmp.csv")
    spark.read.option("multiLine", true).option("quote", "").csv("tmp.csv").collect()

If \u0000 really disables quoting when read, I think it should give the same results, but the output is different as above: it's Array(0, 98, 32) vs Array(0, 0, 98, 32).

Default quote:

    Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").csv("tmp.csv")
    spark.read.option("multiLine", true).csv("tmp.csv").collect()

Another quote:

    Seq(Tuple2("a\n a", "b \nb"), Tuple2("c\n c", "d \nd")).toDF.write.mode("overwrite").option("quote", "!").csv("tmp.csv")
    spark.read.option("multiLine", true).option("quote", "!").csv("tmp.csv").collect()
I believe the Univocity parser does not describe this behaviour of ignoring quotes when it's set to \u0000.
Will ask the author about this.
OK. Explicit description not found. I've just tested it manually and taken a look at the source. It tries to match the character with the configured quote, and it never matches because \u0000 is the end-of-string character. All in all, I believe we have multiple problems:
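The matching problem described above can be illustrated with a minimal, hypothetical sketch (illustrative Python, not univocity's actual source): if a parser uses \u0000 as its internal end-of-input sentinel, a configured quote of \u0000 can never be matched as data.

```python
# Hypothetical quote-matching loop: '\x00' doubles as the end-of-input
# sentinel, so a configured quote of '\x00' is unreachable as data.
EOF_SENTINEL = "\x00"

def find_quote(text: str, quote: str) -> int:
    """Return the index of the first quote character, or -1."""
    for i, ch in enumerate(text + EOF_SENTINEL):
        if ch == EOF_SENTINEL:   # treated as end of input first...
            return -1            # ...so a '\x00' quote never matches
        if ch == quote:
            return i
    return -1

print(find_quote('say "hi"', '"'))       # 4: a normal quote is found
print(find_quote("say \x00hi", "\x00"))  # -1: the null quote never matches
```

This is only a sketch of the failure mode; the real parser's loop is more involved, but the sentinel collision is the same idea.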
python/pyspark/sql/readwriter.py (outdated):

                 separator can be part of the value. If None is set, it uses the default
    -            value, ``"``. If you would like to turn off quotations, you need to set an
    -            empty string.
    +            value, ``"``. If empty string is set, it uses ``u0000``.
If empty string -> If an empty string
and maybe a small additional comment for u0000 would be nicer, e.g., noting that it's the null character.
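For reference, u0000 is the Unicode null character, code point 0 — a real character, not the absence of a quote:

```python
# \u0000 (the null character) is simply code point 0.
null_char = "\u0000"
print(ord(null_char))        # 0
print(null_char == chr(0))   # True
print(len(null_char))        # 1: a real character, not "no quote"
```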
Fixing...
Okay, I am fine with the current change for now. This PR fixes the lie, BTW, and it seems difficult to fix the behaviour as intended. Will probably make a follow-up after a small talk with the author separately, if possible and required.
Suggested fix added. Happy to contribute to the follow-ups if there are opportunities. Thanks :)
Test build #84219 has finished for PR 19814 at commit
Test build #3992 has finished for PR 19814 at commit
Test build #84220 has finished for PR 19814 at commit
Please see the small discussion in
Merged to master. |
What changes were proposed in this pull request?
In the PySpark API documentation, DataFrame.write.csv() says that setting the quote parameter to an empty string turns off quoting. In fact, it uses the null character (u0000) as the quote.
This PR fixes the doc.
How was this patch tested?
Manual.