
Conversation

tdas (Contributor) commented Jun 17, 2016

What changes were proposed in this pull request?

Issues with the current reader behavior:

  • text() without args returns an empty DF with no columns -> inconsistent; text() is expected to always return a DF with a `value` string field.
  • textFile() without args fails with an exception for the above reason: it expects the DF returned by text() to have a `value` field (an illustrative repro of both issues follows this list).
  • orc() does not take varargs, which is inconsistent with the other methods.
  • json(single-arg) was removed, but that caused source compatibility issues - SPARK-16009
  • the user-specified schema was not respected when text()/csv()/... were used with no args - SPARK-16007
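To make the first two issues concrete, here is an illustrative repro of the pre-fix behavior (against a build without this patch):

```scala
spark.read.text().schema   // empty schema with no columns, instead of (value: string)
spark.read.textFile()      // throws, because it expects the DataFrame
                           // returned by text() to have a `value` column
```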

The solution I am implementing is to do the following.

  • For each format, there will be a single-argument method and a vararg method. For json, parquet, csv, and text, this means adding json(string), etc. For orc, this means adding orc(varargs). (A sketch follows this list.)
  • Remove the special handling of text(), csv(), etc. that returns an empty DataFrame with no fields. Instead, pass the empty sequence of paths on to the data source and let each data source handle it correctly. For example, the text data source should return an empty DF with schema (value: string).
  • Deduped docs and fixed their formatting.
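As a rough sketch of the first two bullets (method bodies simplified; see the PR diff for the actual code):

```scala
// Each format gets a vararg method plus a single-argument overload;
// the single-arg overload preserves source compatibility (SPARK-16009).
def csv(path: String): DataFrame = csv(Seq(path): _*)
def csv(paths: String*): DataFrame = format("csv").load(paths: _*)

// orc() gains a vararg overload to match the other formats.
def orc(paths: String*): DataFrame = format("orc").load(paths: _*)
```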

How was this patch tested?

Added new unit tests for Scala and Java.

tdas (Contributor, author) commented Jun 17, 2016

@marmbrus @rxin

* @since 1.4.0
*/
def load(): DataFrame = {
val dataSource =
tdas (Contributor, author) commented on the diff above:

deduped.


SparkQA commented Jun 17, 2016

Test build #60679 has finished for PR 13727 at commit 960048d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 17, 2016

Test build #60683 has finished for PR 13727 at commit 3384473.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:

The following test case is for SPARK-16007. I can confirm this PR already fixes the issue.

  test("schema checking") {
    val schema: StructType = new StructType().add("s", "string")
    assert(spark.read.schema(schema).csv().schema === schema)
    assert(spark.read.schema(schema).json().schema === schema)
    assert(spark.read.schema(schema).parquet().schema === schema)
    assert(spark.read.schema(schema).text().schema === schema)
    assert(spark.read.schema(schema).orc().schema === schema)
  }

Since the ORC data source must be used with Hive support enabled, you can comment out the last line, or move it to another test case in a Hive suite.

Please let me know if anything is needed. Thanks!

@tdas tdas changed the title [SPARK-15982][SPARK-16009] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc [SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc Jun 17, 2016
tdas (Contributor, author) commented Jun 17, 2016

@gatorsmile Thanks for testing this! I updated the tests in this PR to cover these rigorously. However, since this PR is about testing the common behavior (e.g. whether all of the formats respect the user schema or not), I have not added tests for the case where there are no paths AND no user schema. That behavior will be source-specific: parquet/json/csv should throw an error, whereas text should not. Could you make a PR that tests these in CSVSuite, etc. if they are not already tested?
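A hedged sketch of what such source-specific tests could look like (suite placement and exact failure modes are assumptions, not tests from this PR):

```scala
test("no paths and no user-specified schema") {
  // parquet/json/csv cannot infer a schema from zero paths, so they should fail
  intercept[AnalysisException] { spark.read.parquet() }
  intercept[AnalysisException] { spark.read.json() }
  intercept[AnalysisException] { spark.read.csv() }
  // text always has the fixed schema (value: string), so it should succeed
  assert(spark.read.text().schema === new StructType().add("value", "string"))
}
```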


SparkQA commented Jun 17, 2016

Test build #60696 has finished for PR 13727 at commit bb52410.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 17, 2016

Test build #60695 has finished for PR 13727 at commit 3150b01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gatorsmile (Member) commented:

@tdas Sure, will do it soon. I might submit it after this is merged. Thanks!

// This method ensures that calls that explicitly need a single argument work, see SPARK-16009
csv(Seq(path): _*)
}

A reviewer (Contributor) commented on the diff above:

This can be addressed in a follow-up, but I don't think we should duplicate docs because they are going to get out of sync. I'd have one canonical one and have the other link to it.

marmbrus (Contributor) commented:

A few comments. Overall LGTM.


SparkQA commented Jun 18, 2016

Test build #60747 has finished for PR 13727 at commit 29524b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

tdas (Contributor, author) commented Jun 20, 2016

@marmbrus I deduped the docs. Linking to a specific method using scaladoc was hard to use and did not work in the Java docs, so I just wrote "See docs on the other overloaded method." And I fixed the doc formatting.
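For illustration, the deduped doc style might look like this (doc wording approximate, not the exact comment from the PR):

```scala
/**
 * Loads a JSON file and returns the results as a `DataFrame`.
 * See the documentation on the other overloaded `json()` method for more details.
 */
def json(path: String): DataFrame = {
  // This method ensures that calls that explicitly need a single argument work, see SPARK-16009
  json(Seq(path): _*)
}
```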


SparkQA commented Jun 20, 2016

Test build #60846 has finished for PR 13727 at commit 24174f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 20, 2016

Test build #60847 has finished for PR 13727 at commit 3498bd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 20, 2016

Test build #60850 has finished for PR 13727 at commit 2539a94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zsxwing (Member) commented Jun 20, 2016

LGTM. Merging to master and 2.0. Thanks!

@asfgit asfgit closed this in b99129c Jun 20, 2016
asfgit pushed a commit that referenced this pull request Jun 20, 2016
[SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc


Author: Tathagata Das <[email protected]>

Closes #13727 from tdas/SPARK-15982.

(cherry picked from commit b99129c)
Signed-off-by: Shixiong Zhu <[email protected]>
gatorsmile (Member) commented:

Tonight, I will submit a PR for the test cases. Thanks!

tdas (Contributor, author) commented Jun 21, 2016

@gatorsmile Actually, I already added tests in this PR that cover the scenario where the schema is not provided. The only one that is not really tested is ORC, as it cannot be run in the DataFrameReaderWriterSuite. So could you add that test in the ORC-related test suites, if at all needed?

gatorsmile (Member) commented:

@tdas Sure, I have multiple related test cases that I wrote a few days ago. They might also be useful; you can judge whether they are needed or not. Let me merge your latest changes. : )

// from user schema
val expData = Seq[String](null, null, null)
testRead(spark.read.schema(userSchema).parquet(), Seq.empty, userSchema)
testRead(spark.read.schema(userSchema).parquet(dir), expData, userSchema)
gatorsmile (Member) commented on the diff above:

@tdas ORC behaves differently. When the user-specified schema does not match the physical schema, it simply stops and reports an exception. Do you think that behavior is better than returning null for all the rows?

tdas (Contributor, author) replied:

Let's document that as a test for now.
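Per the exchange above, a hedged sketch of documenting the current ORC behavior as a test (the exception type and variable names are assumptions):

```scala
test("ORC: mismatched user-specified schema fails instead of returning nulls") {
  // Unlike parquet, ORC stops with an exception when the user-specified
  // schema does not match the physical schema of the files.
  intercept[Exception] {
    spark.read.schema(userSchema).orc(dir).collect()
  }
}
```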

gatorsmile (Member) replied:

Sure, will do it. Thanks!

 */
 def load(path: String): DataFrame = {
-  option("path", path).load()
+  load(Seq(path): _*) // force invocation of `load(...varargs...)`
koertkuipers (Contributor) commented on the diff above:

With this change, `path` is no longer available in the options. This makes it hard (impossible?) for non-file-based data sources (those not implementing FileFormat) to use load(...).

For example, for elasticsearch we use:

sqlContext.read.format("org.elasticsearch.spark.sql").load(resource)

I do not think this can be implemented anymore now?

Reply (Member):

sqlContext.read.option("path", resource).format("org.elasticsearch.spark.sql").load()

Can you try this?

Reply (Contributor):

I believe that works as expected (I am running into some other issues now, but they seem unrelated). However, from a DSL perspective this is not very pretty?

Reply (Contributor):

It will also break users' code in an upgrade.

Reply (Contributor):

Yeah, this is a bad breaking change.

Reply (Member):

Do you want me to fix it?

Reply (Contributor):

If you can!

Reply (Member):

Sure, will do it soon. Thanks!

asfgit pushed a commit that referenced this pull request Jun 28, 2016
#### What changes were proposed in this pull request?
koertkuipers identified that PR #13727 changed the behavior of the `load` API: after the change, the `load` API no longer adds the value of `path` into the `options`. Thank you!

This PR adds the `path` option back to the `load()` API in `DataFrameReader`, if and only if the user specifies exactly one `path` in the `load` call. For example, users can see the `path` option after the following API call,
```Scala
spark.read
  .format("parquet")
  .load("/test")
```

#### How was this patch tested?
Added test cases.

Author: gatorsmile <[email protected]>

Closes #13933 from gatorsmile/optionPath.

(cherry picked from commit 25520e9)
Signed-off-by: Reynold Xin <[email protected]>
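A minimal sketch of the rule #13933 describes (names such as `extraOptions` and `loadWithOptions` are illustrative, not the exact Spark internals):

```scala
def load(paths: String*): DataFrame = {
  // Re-add `path` to the options if and only if exactly one path is given,
  // so non-file-based sources (e.g. the elasticsearch connector) keep working.
  val updatedOptions =
    if (paths.length == 1) extraOptions + ("path" -> paths.head) else extraOptions
  loadWithOptions(updatedOptions, paths) // hypothetical internal helper
}
```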
