
Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jan 17, 2017

What changes were proposed in this pull request?

This PR proposes two types of APIs as below:

Array as an option value in readers/writers

As a concrete use of this, support for multiple values in nullValue is handled here together (SPARK-17878).

Python - a list/tuple of objects that can be converted into strings

spark.read.format('csv') \
    .option("nullValue", ['Tom', 'Joe']) \
    ...

Scala - Seq[String]

spark.read.format("csv")
  .option("nullValue", Seq("2012", "Tesla", "null"))
  ...

Java - String[]

spark.read().format("csv")
  .option("nullValue", new String[]{"", "null", "NA"})
  ...

SQL - a Python-list-like form of integers, decimals, strings and booleans

CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (nullValue [2012, 1.1, 'null'], ...)

Unsetting an option in readers/writers

Scala

spark.read.unsetOption("optionKey")
  ...

Java

spark.read().unsetOption("optionKey")
  ...

Python

spark.read.unsetOption("optionKey")
  ...
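
For completeness, a minimal end-to-end Scala sketch that puts the two proposed pieces together. Both the Seq[String] option value and unsetOption exist only in this PR's branch (they are not in released Spark), and the file path below is a hypothetical placeholder:

import org.apache.spark.sql.SparkSession

object ProposedOptionApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("option-api-sketch").getOrCreate()

    // Proposed in this PR: an option whose value is a Seq[String].
    val reader = spark.read
      .format("csv")
      .option("header", "true")
      .option("nullValue", Seq("", "null", "NA"))

    // Proposed in this PR: unset the option again so this read falls back to the
    // datasource's default null handling.
    val df = reader
      .unsetOption("nullValue")
      .load("/path/to/cars.csv")  // hypothetical path

    df.show()
    spark.stop()
  }
}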

In the case of R, it seems to require quite a few more changes. That will be a follow-up.

How was this patch tested?

Unit tests in each suite and manually.

Member Author
@HyukjinKwon HyukjinKwon Jan 17, 2017

I hesitated to submit this PR for a few weeks because of this problem. After thinking it over, I decided to submit it.

The test below:

 .option("some-null-value-option", null) 

was testing setting null. Because String was the only reference type among the option value types (the others are Double, Long and Boolean, which are AnyVal and cannot be null), it was fine so far.

Now that Array[String] is added, the compiler gets confused between the two reference types, so we have to explicitly ascribe the type String to null. Unless we do runtime checking, it seems we can't add overloaded versions taking any other reference type without fixing this test.

So, yes, this breaks it if any users were setting null without a type, but I wonder whether it is worth keeping that behaviour. There are already a lot of APIs that do not allow a bare null without a type, e.g. functions.array(null), to my knowledge.
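
To make the ambiguity concrete, here is a minimal, dependency-free Scala sketch. Reader below is a hypothetical stand-in (not Spark's actual DataFrameReader); it only illustrates that once a Seq[String] overload sits next to the String one, a bare null argument no longer resolves to a single alternative:

object OptionOverloadSketch {
  // Two overloads mirroring the situation described above: the existing String-valued
  // option plus the newly proposed Seq[String]-valued one.
  class Reader {
    def option(key: String, value: String): Reader = { println(s"String option: $key -> $value"); this }
    def option(key: String, value: Seq[String]): Reader = { println(s"Seq option: $key -> $value"); this }
  }

  def main(args: Array[String]): Unit = {
    val reader = new Reader
    // reader.option("some-null-value-option", null)        // does not compile: ambiguous overloaded reference
    reader.option("some-null-value-option", null: String)   // compiles: the ascription picks the String overload
    reader.option("nullValue", Seq("", "null", "NA"))       // the proposed Seq[String] overload
  }
}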

Member Author

I am still not entirely sure about this. cc @rxin, could you please check this out if you have some time?

Member Author

Do you think this is okay, BTW, @rxin?

Contributor

we can add an option unset method?

Member Author

Sure, let me try.

@HyukjinKwon
Member Author

(cc @falaki too.)

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71499 has finished for PR 16611 at commit 387e723.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71504 has finished for PR 16611 at commit bcb23b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki
Contributor

falaki commented Jan 17, 2017

@HyukjinKwon, as I laid out in the JIRA, a major problem with this approach for specifying multiple values is that it won't work in DDL. What is wrong with having a numbered list, e.g. nullValue1, nullValue2, and so on?
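
Purely to illustrate the numbered-key idea, here is a dependency-free Scala sketch; the nullValue1/nullValue2 keys and the helper are hypothetical, not something Spark implements. The point is that every option value stays a plain string, so the same keys would work unchanged in the reader API and in a DDL OPTIONS clause, and the datasource could collect them back into an ordered list of null tokens:

object NumberedNullValueSketch {
  // Gather hypothetical nullValue1, nullValue2, ... entries from a plain string-to-string
  // options map back into an ordered list of null tokens.
  def collectNullValues(options: Map[String, String]): Seq[String] =
    options.collect { case (k, v) if k.matches("nullValue\\d+") => (k.stripPrefix("nullValue").toInt, v) }
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  def main(args: Array[String]): Unit = {
    val options = Map("header" -> "true", "nullValue1" -> "NA", "nullValue2" -> "null")
    println(collectNullValues(options))  // the two null tokens: NA and null
  }
}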

@rxin
Contributor

rxin commented Jan 17, 2017

Rather than just submitting code, can you put down the interfaces concisely, either in a doc or in the PR description? As @falaki said, we need this to work in DDL too. It is possible to just extend the DDL syntax to support multiple values.

@HyukjinKwon
Member Author

I didn't mean not to support this in R or in the DDL syntax.

@HyukjinKwon
Member Author

Ah, sure. Let me give it a shot.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV [WIP][SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV Jan 17, 2017
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV Jan 18, 2017
@HyukjinKwon
Member Author

I just added DDL support with some more tests and fixed the PR description. Could you please take another look and see if it makes sense?

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71590 has finished for PR 16611 at commit 563f943.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71600 has finished for PR 16611 at commit 563f943.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71601 has finished for PR 16611 at commit 563f943.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71649 has finished for PR 16611 at commit 196a45e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71650 has finished for PR 16611 at commit 79482f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71771 has finished for PR 16611 at commit 3bb8753.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71782 has finished for PR 16611 at commit 3bb8753.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71788 has finished for PR 16611 at commit 28abf86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@rxin, does that look okay to you? I am not sure whether

SQL - array-like form of integer, decimal, string and boolean

sounds okay to you.

Contributor

I'd also support Seq in Scala.

@rxin
Contributor

rxin commented Feb 16, 2017

For SQL, rather than "array", can we follow Python, e.g.

CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (nullValue ['NA', 'null'], ...)

@HyukjinKwon
Member Author

Sure, I will rebase and update.

@HyukjinKwon
Member Author

Per 2f78cc7, I ran a build with Scala 2.10 as well.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73039 has finished for PR 16611 at commit d7b202e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73041 has finished for PR 16611 at commit 60c7e25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73042 has finished for PR 16611 at commit 2f78cc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@rxin, does this sound okay to you?

@SparkQA

SparkQA commented Mar 4, 2017

Test build #73905 has finished for PR 16611 at commit 9f7e679.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@rxin, please let me know if there is anything you are not sure of. I will double check. I am fine with closing too if you are not sure of the implementation for now.

@SparkQA

SparkQA commented Mar 29, 2017

Test build #75359 has finished for PR 16611 at commit 29e28b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

gentle ping @rxin

@HyukjinKwon
Member Author

gentle ping ...

@SparkQA

SparkQA commented Jul 2, 2017

Test build #79040 has finished for PR 16611 at commit be628fe.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jul 2, 2017

Test build #79041 has finished for PR 16611 at commit be628fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

gentle ping ...

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81385 has finished for PR 16611 at commit 4c1a012.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hi @gatorsmile, WDYT about this PR? I was looking through my old PRs to close or update, and I wonder whether you think this one looks fine to go ahead with.

@gatorsmile
Member

This sounds fine to me, but we have to split this PR into multiple smaller ones with more test cases. For example, we can start from the SQL interface.

Do you know how other systems implement a similar feature?

@HyukjinKwon
Member Author

Thanks @gatorsmile. Sure, let me open a smaller one and cc you.

I know of one reference in R:

> d <- "col1,col2
+ 1,3
+ 2,4"
> df <- read.csv(text=d, na.strings=c("3", "2"))
> df
  col1 col2
1    1   NA
2   NA    4
