[SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc #13727
Conversation
```scala
 * @since 1.4.0
 */
def load(): DataFrame = {
  val dataSource =
```
Deduped.
Test build #60679 has finished for PR 13727 at commit
Test build #60683 has finished for PR 13727 at commit
The following test case is for SPARK-16007. I can confirm this PR already fixes the issue.

```scala
test("schema checking") {
  val schema: StructType = new StructType().add("s", "string")
  assert(spark.read.schema(schema).csv().schema === schema)
  assert(spark.read.schema(schema).json().schema === schema)
  assert(spark.read.schema(schema).parquet().schema === schema)
  assert(spark.read.schema(schema).text().schema === schema)
  assert(spark.read.schema(schema).orc().schema === schema)
}
```

Since the ORC data source must be used with Hive support enabled, you can comment out the last line, or move it to another test case in Hive. Please let me know if anything is needed. Thanks!
@gatorsmile Thanks for testing this! I updated my tests in this PR to test these rigorously. However, since this PR is about testing the common behavior (e.g. whether all of them respect the user schema or not), I have not added tests for the case where there are no paths AND no user schema; the behavior will be source-specific.
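For illustration, a minimal sketch (not from the PR; it assumes a ScalaTest suite with an active `spark` session) of how such source-specific checks might look:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.{StringType, StructType}

// text() has a fixed schema, so even with no paths and no user schema
// it can return an empty DataFrame with a single (value: string) column.
assert(spark.read.text().schema === new StructType().add("value", StringType))

// parquet() must infer its schema from files, so with no paths and no
// user schema, schema inference is expected to fail.
intercept[AnalysisException] {
  spark.read.parquet()
}
```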
Test build #60696 has finished for PR 13727 at commit
Test build #60695 has finished for PR 13727 at commit
@tdas Sure, will do it soon. I might submit it after this is merged. Thanks!
```scala
// This method ensures that calls that explicitly need the single-argument variant work; see SPARK-16009.
csv(Seq(path): _*)
}
```
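For context, a minimal sketch of the overload pattern this line supports (a simplified stand-in, not the actual `DataFrameReader` source):

```scala
// Simplified stand-in illustrating why an explicit single-argument
// overload exists alongside the varargs one (SPARK-16009).
class Reader {
  // Single-argument overload keeps `reader.csv(path)` source-compatible
  // for callers that bind to a one-arg signature.
  def csv(path: String): Unit = csv(Seq(path): _*)

  // Varargs overload handles zero or more paths.
  def csv(paths: String*): Unit =
    println(s"loading ${paths.length} path(s): ${paths.mkString(", ")}")
}

// `new Reader().csv("data.csv")` resolves to the single-arg overload,
// which forwards to the varargs one.
```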
This can be addressed in a follow-up, but I don't think we should duplicate docs, because they are going to get out of sync. I'd have one canonical copy and have the others link to it.
A few comments. Overall LGTM.
Test build #60747 has finished for PR 13727 at commit
@marmbrus I deduped the docs. Linking to a specific method using Scaladoc was hard to use and did not work in the Java docs, so I just wrote "See docs on the other overloaded method." I also fixed the doc formatting.
Test build #60846 has finished for PR 13727 at commit
Test build #60847 has finished for PR 13727 at commit
Test build #60850 has finished for PR 13727 at commit
LGTM. Merging to master and 2.0. Thanks!
…f DataFrameReader.text/csv/json/parquet/orc

## What changes were proposed in this pull request?

Issues with the current reader behavior:
- `text()` without args returns an empty DF with no columns -> inconsistent; it is expected that text will always return a DF with a `value` string field.
- `textFile()` without args fails with an exception because of the above reason; it expected the DF returned by `text()` to have a `value` field.
- `orc()` does not have varargs, inconsistent with the others.
- `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
- user-specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)

The solution I am implementing is to do the following:
- For each format, there will be a single-argument method and a varargs method. For json, parquet, csv, and text, this means adding `json(string)`, etc. For orc, this means adding `orc(varargs)`.
- Remove the special handling of `text()`, `csv()`, etc. that returns an empty dataframe with no fields. Instead, pass the empty sequence of paths on to the data source, and let each data source handle it right. For example, the text data source should return an empty DF with schema (value: string).
- Deduped docs and fixed their formatting.

## How was this patch tested?

Added new unit tests for Scala and Java.

Author: Tathagata Das <[email protected]>

Closes #13727 from tdas/SPARK-15982.

(cherry picked from commit b99129c)
Signed-off-by: Shixiong Zhu <[email protected]>
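To illustrate the harmonized API, a usage sketch (assuming a Spark 2.0 session; the file paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("reader-demo")
  .master("local[*]")
  .getOrCreate()

// text() with no paths now returns an empty DF with the fixed
// (value: string) schema instead of a DF with no columns.
spark.read.text().printSchema()  // root |-- value: string (nullable = true)

// Every format now has both a single-argument and a varargs overload.
val one  = spark.read.text("/tmp/a.txt")                 // single path
val many = spark.read.text("/tmp/a.txt", "/tmp/b.txt")   // varargs
```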
Tonight, I will submit a PR for test cases. Thanks!
@gatorsmile Actually, I already added tests in this PR that cover the scenario where the schema is not provided. The only one that is not really tested is ORC, as it cannot be run in DataFrameReaderWriterSuite. So could you add that test to the ORC-related test suites, if at all needed?
@tdas Sure, I have multiple related test cases I wrote a few days ago. They might also be useful; you can judge whether they are needed or not. Let me merge your latest changes. :)
```scala
// from user schema
val expData = Seq[String](null, null, null)
testRead(spark.read.schema(userSchema).parquet(), Seq.empty, userSchema)
testRead(spark.read.schema(userSchema).parquet(dir), expData, userSchema)
```
@tdas ORC behaves differently. When the user-specified schema does not match the physical schema, it simply stops and reports an exception. Do you think that behavior is better than returning null for all the rows?
Let's document that as a test for now.
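A hypothetical sketch of such a test (the suite, `dir`, and the concrete exception type are assumptions, not from the PR):

```scala
import org.apache.spark.sql.types.StructType

test("ORC reader fails on user schema mismatch") {
  val userSchema = new StructType().add("s", "string")
  // Unlike Parquet, which returns nulls for non-matching columns, ORC
  // stops and reports an exception when the user-specified schema does
  // not match the physical schema. The exact exception type may vary.
  intercept[Exception] {
    spark.read.schema(userSchema).orc(dir).collect()
  }
}
```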
Sure, will do it. Thanks!
```diff
  */
 def load(path: String): DataFrame = {
-  option("path", path).load()
+  load(Seq(path): _*) // force invocation of `load(...varargs...)`
```
With this change, `path` is no longer available in the options. This makes it hard (impossible?) for non-file-based data sources (those not implementing `FileFormat`) to use `load(...)`.

For example, for Elasticsearch we use:

```scala
sqlContext.read.format("org.elasticsearch.spark.sql").load(resource)
```

I do not think this can be implemented anymore now?
```scala
sqlContext.read.option("path", resource).format("org.elasticsearch.spark.sql").load()
```

Can you try this?
I believe that works as expected (I am running into some other issues now, but they seem unrelated). However, from a DSL perspective this is not very pretty?
It will also break users' code in an upgrade.
Yeah, this is a bad breaking change.
Do you want me to fix it?
If you can!
Sure, will do it soon. Thanks!
#### What changes were proposed in this pull request?

koertkuipers identified that PR #13727 changed the behavior of the `load` API: after the change, the `load` API does not add the value of `path` into the `options`. Thank you!

This PR adds the `path` option back to the `load()` API in `DataFrameReader`, if and only if the user specifies exactly one `path` in the `load` call. For example, users can see the `path` option after the following API call:

```Scala
spark.read
  .format("parquet")
  .load("/test")
```

#### How was this patch tested?

Added test cases.

Author: gatorsmile <[email protected]>

Closes #13933 from gatorsmile/optionPath.

(cherry picked from commit 25520e9)
Signed-off-by: Reynold Xin <[email protected]>
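A minimal sketch (a stand-in, not the actual `DataFrameReader` source) of the restored single-path special case:

```scala
// Stand-in illustrating the logic restored by #13933: the "path" option
// is populated only when exactly one path is passed to load().
def load(paths: String*): Unit = {
  val extraOptions: Map[String, String] =
    if (paths.length == 1) Map("path" -> paths.head) else Map.empty
  println(s"options: $extraOptions, paths: ${paths.mkString(", ")}")
}
```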