[SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV #16611
I hesitated for a few weeks to submit this PR because of this problem. After thinking it over, I decided to submit it.
The test below:
.option("some-null-value-option", null) was testing setting an option to null. Because String was the only reference type among the overloads (the others, Double, Long and Boolean, are AnyVal and can't be null),
this was fine so far.
Now that Array[String] is added, the compiler gets confused between the two reference types, so we have to explicitly give null the type String.
Unless we do a runtime check, it seems we can't add overloaded versions taking any other reference type without fixing this test.
So, yes, this breaks users who were setting null without a type, but I wonder whether it is worth keeping this behaviour. To my knowledge, there are already a lot of APIs that do not allow an untyped null, e.g., functions.array(null).
I am still not quite sure about this. cc @rxin, could you please check this out if you have some time?
Do you think this is okay BTW @rxin ?
we can add an option unset method?
Sure, let me try.
(cc @falaki too.)
Test build #71499 has finished for PR 16611 at commit
Test build #71504 has finished for PR 16611 at commit
@HyukjinKwon, as I laid out in the JIRA, a major problem with this approach for specifying multiple options is that it won't work in DDL. What is wrong with having a numbered list? E.g.:
Rather than just submitting code, can you put down the interfaces concisely either in a doc or the PR description? As @falaki said, we need this to work in DDL too. It is possible to just extend the DDL syntax to support multiple values.
I didn't mean to leave out R and DDL syntax support for this.
Ah, sure. Let me give it a shot.
I just added DDL support with some more tests and updated the PR description. Could you please take another look and see if it makes sense?
Test build #71590 has finished for PR 16611 at commit
retest this please
Test build #71600 has finished for PR 16611 at commit
retest this please
Test build #71601 has finished for PR 16611 at commit
Test build #71649 has finished for PR 16611 at commit
Test build #71650 has finished for PR 16611 at commit
79482f7 to 3bb8753
Test build #71771 has finished for PR 16611 at commit
retest this please
Test build #71782 has finished for PR 16611 at commit
Test build #71788 has finished for PR 16611 at commit
@rxin, does that look okay to you? I am worried if
sounds okay to you.
I'd also support Seq in Scala.
For SQL, rather than "array", can we follow Python, e.g.
Sure, I will rebase and update.
28abf86 to d7b202e
Per 2f78cc7, I ran a build with Scala 2.10 as well.
Test build #73039 has finished for PR 16611 at commit
Test build #73041 has finished for PR 16611 at commit
Test build #73042 has finished for PR 16611 at commit
@rxin, does this sound okay to you?
Test build #73905 has finished for PR 16611 at commit
@rxin, please let me know if there is anything you are not sure of. I will double check. I am fine with closing too if you are not sure of the implementation for now.
9f7e679 to 29e28b2
Test build #75359 has finished for PR 16611 at commit
gentle ping @rxin
gentle ping ...
Test build #79040 has finished for PR 16611 at commit
retest this please
Test build #79041 has finished for PR 16611 at commit
gentle ping ...
be628fe to 4c1a012
Test build #81385 has finished for PR 16611 at commit
Hi @gatorsmile, WDYT about this PR? I was looking through my old PRs to close or update them. I wonder whether you think it is fine to go ahead with this one.
This sounds fine to me, but we have to split this PR into multiple smaller ones with more test cases. For example, we can start from the SQL interface. Do you know how other systems implement a similar feature?
Thanks @gatorsmile. Sure, let me open a smaller one and cc you. I know one reference in R:
> d <- "col1,col2
+ 1,3
+ 2,4"
> df <- read.csv(text=d, na.strings=c("3", "2"))
> df
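For readers more at home in Python, the idea behind the R reference above (several strings that all mean "null") can be sketched with the standard library alone. This is only an illustration under stated assumptions: `read_csv_with_nulls` is a hypothetical helper, not a Spark or pandas API, and it compares raw cell strings without any type inference.

```python
import csv
import io


def read_csv_with_nulls(text, null_values):
    """Parse CSV text and map every cell listed in null_values to None.

    Hypothetical helper for illustration only; cells stay plain strings.
    """
    nulls = set(null_values)
    reader = csv.DictReader(io.StringIO(text))
    return [
        {col: (None if cell in nulls else cell) for col, cell in row.items()}
        for row in reader
    ]


# Same input as the R session: two columns, two rows.
data = "col1,col2\n1,3\n2,4"

# Treat both "3" and "2" as null markers.
rows = read_csv_with_nulls(data, ["3", "2"])
for row in rows:
    print(row)
```

With this input, the `3` in `col2` of the first row and the `2` in `col1` of the second row are the cells replaced by `None`.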
What changes were proposed in this pull request?
This PR proposes two types of APIs as below:
Array as an option value in readers/writers
As a concrete use of this, multiple values for nullValue are handled here as well (SPARK-17878).
Python - list/tuple of any object that can be converted into a string
Scala - Seq[String]
Java - String[]
SQL - Python's list-like form of integer, decimal, string and boolean
Unsets an option in readers/writers
Scala
spark.read.unsetOption("optionKey") ...
Java
Python
R seems to require quite a few more changes, so it will be a follow-up.
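To make the two proposed behaviours concrete, here is a minimal toy sketch of an option holder in Python. It is not Spark's `DataFrameReader` and does not reflect how this PR stores or serializes values internally: the class `ReaderOptions` and the methods `unset_option` and `get` are hypothetical names chosen for illustration, and converting list elements with `str` is an assumption.

```python
class ReaderOptions:
    """Toy stand-in for a reader's option map (hypothetical, not Spark's API)."""

    def __init__(self):
        self._options = {}

    def option(self, key, value):
        # Assumption: a list or tuple is kept as a list of strings,
        # while any other value is kept as a single string.
        if isinstance(value, (list, tuple)):
            self._options[key] = [str(v) for v in value]
        else:
            self._options[key] = str(value)
        return self  # return self so calls can be chained

    def unset_option(self, key):
        # Removing a key that was never set is a no-op.
        self._options.pop(key, None)
        return self

    def get(self, key, default=None):
        return self._options.get(key, default)


opts = (
    ReaderOptions()
    .option("nullValue", ["NA", "-", 3])  # several null markers at once
    .option("sep", ";")
)
opts.unset_option("sep")  # "sep" reverts to being unset

print(opts.get("nullValue"))
print(opts.get("sep"))
```

An explicit unset method of this kind is one way to avoid passing an untyped null to clear an option, which is the concern raised earlier in the review.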
How was this patch tested?
Unit tests in each suite and manually.