
Conversation

@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jan 17, 2017

What changes were proposed in this pull request?

This PR proposes two types of APIs as below:

Array as an option value in readers/writers

As a concrete use of this, support for multiple values in nullValue is handled here together (SPARK-17878).

Python - a list/tuple of objects that can be converted into strings

spark.read.format('csv') \
    .option("nullValue", ['Tom', 'Joe']) \
    ...

Scala - Seq[String]

spark.read.format("csv")
  .option("nullValue", Seq("2012", "Tesla", "null"))
  ...

Java - String[]

spark.read().format("csv")
  .option("nullValue", new String[]{"", "null", "NA"})
  ...

SQL - a Python-list-like form of integers, decimals, strings and booleans

CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (nullValue [2012, 1.1, 'null'], ...)

Unsetting an option in readers/writers

Scala

spark.read.unsetOption("optionKey")
  ...

Java

spark.read().unsetOption("optionKey")
  ...

Python

spark.read.unsetOption("optionKey")
  ...
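
For completeness, a minimal end-to-end Scala sketch that puts the two proposed pieces together. Both the Seq[String] option value and unsetOption exist only in this PR's branch (they are not in released Spark), and the file path below is a hypothetical placeholder:

import org.apache.spark.sql.SparkSession

object ProposedOptionApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("option-api-sketch").getOrCreate()

    // Proposed in this PR: an option whose value is a Seq[String].
    val reader = spark.read
      .format("csv")
      .option("header", "true")
      .option("nullValue", Seq("", "null", "NA"))

    // Proposed in this PR: unset the option again so this read falls back to the
    // datasource's default null handling.
    val df = reader
      .unsetOption("nullValue")
      .load("/path/to/cars.csv")  // hypothetical path

    df.show()
    spark.stop()
  }
}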

In the case of R, it seems to require quite a few more changes. That will be a follow-up.

How was this patch tested?

Unit tests in each suite and manually.

Member Author
@HyukjinKwon HyukjinKwon Jan 17, 2017

I hesitated to submit this PR for a few weeks because of this problem. After thinking it over, I decided to submit it.

The test below:

 .option("some-null-value-option", null) 

was testing setting null. Because String was the only reference type among the option value types (the others are Double, Long and Boolean, which are AnyVal and cannot be null), it was fine so far.

Now that Array[String] is added, the compiler gets confused between the two reference types, so we have to explicitly ascribe the type String to null. Unless we do runtime checking, it seems we can't add overloaded versions taking any other reference type without fixing this test.

So, yes, this breaks it if any users were setting null without a type, but I wonder whether it is worth keeping that behaviour. There are already a lot of APIs that do not allow a bare null without a type, e.g. functions.array(null), to my knowledge.
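
To make the ambiguity concrete, here is a minimal, dependency-free Scala sketch. Reader below is a hypothetical stand-in (not Spark's actual DataFrameReader); it only illustrates that once a Seq[String] overload sits next to the String one, a bare null argument no longer resolves to a single alternative:

object OptionOverloadSketch {
  // Two overloads mirroring the situation described above: the existing String-valued
  // option plus the newly proposed Seq[String]-valued one.
  class Reader {
    def option(key: String, value: String): Reader = { println(s"String option: $key -> $value"); this }
    def option(key: String, value: Seq[String]): Reader = { println(s"Seq option: $key -> $value"); this }
  }

  def main(args: Array[String]): Unit = {
    val reader = new Reader
    // reader.option("some-null-value-option", null)        // does not compile: ambiguous overloaded reference
    reader.option("some-null-value-option", null: String)   // compiles: the ascription picks the String overload
    reader.option("nullValue", Seq("", "null", "NA"))       // the proposed Seq[String] overload
  }
}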

Member Author

I am still not entirely sure about this. cc @rxin, could you please check this out if you have some time?

Member Author

Do you think this is okay, BTW, @rxin?

Contributor

we can add an option unset method?

Member Author

Sure, let me try.

@HyukjinKwon
Member Author

(cc @falaki too.)

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71499 has finished for PR 16611 at commit 387e723.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2017

Test build #71504 has finished for PR 16611 at commit bcb23b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@falaki
Contributor

falaki commented Jan 17, 2017

@HyukjinKwon, as I laid out in the JIRA, a major problem with this approach for specifying multiple values is that it won't work in DDL. What is wrong with having a numbered list, e.g. nullValue1, nullValue2, and so on?
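
Purely to illustrate the numbered-key idea, here is a dependency-free Scala sketch; the nullValue1/nullValue2 keys and the helper are hypothetical, not something Spark implements. The point is that every option value stays a plain string, so the same keys would work unchanged in the reader API and in a DDL OPTIONS clause, and the datasource could collect them back into an ordered list of null tokens:

object NumberedNullValueSketch {
  // Gather hypothetical nullValue1, nullValue2, ... entries from a plain string-to-string
  // options map back into an ordered list of null tokens.
  def collectNullValues(options: Map[String, String]): Seq[String] =
    options.collect { case (k, v) if k.matches("nullValue\\d+") => (k.stripPrefix("nullValue").toInt, v) }
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  def main(args: Array[String]): Unit = {
    val options = Map("header" -> "true", "nullValue1" -> "NA", "nullValue2" -> "null")
    println(collectNullValues(options))  // the two null tokens: NA and null
  }
}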

@rxin
Contributor

rxin commented Jan 17, 2017

Rather than just submitting code, can you put down the interfaces concisely, either in a doc or in the PR description? As @falaki said, we need this to work in DDL too. It is possible to just extend the DDL syntax to support multiple values.

@HyukjinKwon
Member Author

I didn't mean not to support this in R or in the DDL syntax.

@HyukjinKwon
Member Author

Ah, sure. Let me give it a shot.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV [WIP][SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV Jan 17, 2017
@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV [SPARK-17967][SPARK-17878][SQL][PYTHON] Support for array as an option for datasources and for multiple values in nullValue in CSV Jan 18, 2017
@HyukjinKwon
Member Author

I just added DDL support with some more tests and fixed the PR description. Could you please take another look and see if it makes sense?

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71590 has finished for PR 16611 at commit 563f943.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71600 has finished for PR 16611 at commit 563f943.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jan 18, 2017

Test build #71601 has finished for PR 16611 at commit 563f943.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71649 has finished for PR 16611 at commit 196a45e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71650 has finished for PR 16611 at commit 79482f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2017

Test build #71771 has finished for PR 16611 at commit 3bb8753.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71782 has finished for PR 16611 at commit 3bb8753.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 22, 2017

Test build #71788 has finished for PR 16611 at commit 28abf86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@rxin, does that look okay to you? I am not sure whether

SQL - array-like form of integer, decimal, string and boolean

sounds okay to you.

Contributor

I'd also support Seq in Scala.

@rxin
Contributor

rxin commented Feb 16, 2017

For SQL, rather than "array", can we follow Python, e.g.

CREATE TEMPORARY TABLE tableA USING csv
OPTIONS (nullValue ['NA', 'null'], ...)

@HyukjinKwon
Member Author

Sure, I will rebase and update.

@HyukjinKwon
Member Author

Per 2f78cc7, I ran a build with Scala 2.10 as well.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73039 has finished for PR 16611 at commit d7b202e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73041 has finished for PR 16611 at commit 60c7e25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 17, 2017

Test build #73042 has finished for PR 16611 at commit 2f78cc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@rxin, does this sound okay to you?

@SparkQA

SparkQA commented Mar 4, 2017

Test build #73905 has finished for PR 16611 at commit 9f7e679.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

@rxin, please let me know if there is anything you are not sure of. I will double check. I am fine with closing too if you are not sure of the implementation for now.

@SparkQA

SparkQA commented Mar 29, 2017

Test build #75359 has finished for PR 16611 at commit 29e28b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

gentle ping @rxin

@HyukjinKwon
Member Author

gentle ping ...

@SparkQA

SparkQA commented Jul 2, 2017

Test build #79040 has finished for PR 16611 at commit be628fe.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

retest this please

@SparkQA

SparkQA commented Jul 2, 2017

Test build #79041 has finished for PR 16611 at commit be628fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

gentle ping ...

@SparkQA

SparkQA commented Sep 4, 2017

Test build #81385 has finished for PR 16611 at commit 4c1a012.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Hi @gatorsmile, WDYT about this PR? I was looking through my old PRs to close or update, and I wonder whether you think this one looks fine to go ahead with.

@gatorsmile
Member

This sounds fine to me, but we have to split this PR into multiple smaller ones with more test cases. For example, we can start from the SQL interface.

Do you know how other systems implement a similar feature?

@HyukjinKwon
Member Author

Thanks @gatorsmile. Sure, let me open a smaller one and cc you.

I know of one reference in R:

> d <- "col1,col2
+ 1,3
+ 2,4"
> df <- read.csv(text=d, na.strings=c("3", "2"))
> df
  col1 col2
1    1   NA
2   NA    4
