[SPARK-15982][SPARK-16009][SPARK-16007][SQL] Harmonize the behavior of DataFrameReader.text/csv/json/parquet/orc #13727
Conversation
```scala
 * @since 1.4.0
 */
def load(): DataFrame = {
  val dataSource =
```
Deduped.
Test build #60679 has finished for PR 13727 at commit
Test build #60683 has finished for PR 13727 at commit
The following test case is for SPARK-16007. I can confirm this PR already fixes the issue.

```scala
test("schema checking") {
  val schema: StructType = new StructType().add("s", "string")
  assert(spark.read.schema(schema).csv().schema === schema)
  assert(spark.read.schema(schema).json().schema === schema)
  assert(spark.read.schema(schema).parquet().schema === schema)
  assert(spark.read.schema(schema).text().schema === schema)
  assert(spark.read.schema(schema).orc().schema === schema)
}
```

Since the ORC data source must be used with Hive support enabled, you can comment out the last line, or move it to another test case in Hive. Please let me know if anything is needed. Thanks!
@gatorsmile Thanks for testing this! I updated my tests in this PR to test these rigorously. However, since this PR is about testing the common behavior (e.g. whether all of them respect the user schema or not), I have not added tests for the case where there are no paths AND no user schema; the behavior will be source-specific.
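For illustration, a minimal sketch (not from the PR; it assumes a ScalaTest suite with an active `spark` session) of how such source-specific checks might look:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.{StringType, StructType}

// text() has a fixed schema, so even with no paths and no user schema
// it can return an empty DataFrame with a single (value: string) column.
assert(spark.read.text().schema === new StructType().add("value", StringType))

// parquet() must infer its schema from files, so with no paths and no
// user schema, schema inference is expected to fail.
intercept[AnalysisException] {
  spark.read.parquet()
}
```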
Test build #60696 has finished for PR 13727 at commit
Test build #60695 has finished for PR 13727 at commit
@tdas Sure, will do it soon. I might submit it after this is merged. Thanks!
```scala
// This method ensures that calls that explicitly need the single-argument variant work; see SPARK-16009.
csv(Seq(path): _*)
}
```
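For context, a minimal sketch of the overload pattern this line supports (a simplified stand-in, not the actual `DataFrameReader` source):

```scala
// Simplified stand-in illustrating why an explicit single-argument
// overload exists alongside the varargs one (SPARK-16009).
class Reader {
  // Single-argument overload keeps `reader.csv(path)` source-compatible
  // for callers that bind to a one-arg signature.
  def csv(path: String): Unit = csv(Seq(path): _*)

  // Varargs overload handles zero or more paths.
  def csv(paths: String*): Unit =
    println(s"loading ${paths.length} path(s): ${paths.mkString(", ")}")
}

// `new Reader().csv("data.csv")` resolves to the single-arg overload,
// which forwards to the varargs one.
```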
This can be addressed in a follow-up, but I don't think we should duplicate docs, because they are going to get out of sync. I'd have one canonical copy and have the others link to it.
A few comments. Overall LGTM.
Test build #60747 has finished for PR 13727 at commit
@marmbrus I deduped the docs. Linking to a specific method using Scaladoc was hard to use and did not work in the Java docs, so I just wrote "See docs on the other overloaded method." I also fixed the doc formatting.
Test build #60846 has finished for PR 13727 at commit
Test build #60847 has finished for PR 13727 at commit
Test build #60850 has finished for PR 13727 at commit
LGTM. Merging to master and 2.0. Thanks!
…f DataFrameReader.text/csv/json/parquet/orc

## What changes were proposed in this pull request?

Issues with the current reader behavior:
- `text()` without args returns an empty DF with no columns -> inconsistent; it is expected that text will always return a DF with a `value` string field.
- `textFile()` without args fails with an exception because of the above reason; it expected the DF returned by `text()` to have a `value` field.
- `orc()` does not have varargs, inconsistent with the others.
- `json(single-arg)` was removed, but that caused source compatibility issues - [SPARK-16009](https://issues.apache.org/jira/browse/SPARK-16009)
- user-specified schema was not respected when `text/csv/...` were used with no args - [SPARK-16007](https://issues.apache.org/jira/browse/SPARK-16007)

The solution I am implementing is to do the following:
- For each format, there will be a single-argument method and a varargs method. For json, parquet, csv, and text, this means adding `json(string)`, etc. For orc, this means adding `orc(varargs)`.
- Remove the special handling of `text()`, `csv()`, etc. that returns an empty dataframe with no fields. Instead, pass the empty sequence of paths on to the data source, and let each data source handle it right. For example, the text data source should return an empty DF with schema (value: string).
- Deduped docs and fixed their formatting.

## How was this patch tested?

Added new unit tests for Scala and Java.

Author: Tathagata Das <[email protected]>

Closes #13727 from tdas/SPARK-15982.

(cherry picked from commit b99129c)
Signed-off-by: Shixiong Zhu <[email protected]>
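To illustrate the harmonized API, a usage sketch (assuming a Spark 2.0 session; the file paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("reader-demo")
  .master("local[*]")
  .getOrCreate()

// text() with no paths now returns an empty DF with the fixed
// (value: string) schema instead of a DF with no columns.
spark.read.text().printSchema()  // root |-- value: string (nullable = true)

// Every format now has both a single-argument and a varargs overload.
val one  = spark.read.text("/tmp/a.txt")                 // single path
val many = spark.read.text("/tmp/a.txt", "/tmp/b.txt")   // varargs
```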
Tonight, I will submit a PR for test cases. Thanks!
@gatorsmile Actually, I already added tests in this PR that cover the scenario where the schema is not provided. The only one that is not really tested is ORC, as it cannot be run in DataFrameReaderWriterSuite. So could you add that test to the ORC-related test suites, if at all needed?
@tdas Sure, I have multiple related test cases I wrote a few days ago. They might also be useful; you can judge whether they are needed or not. Let me merge your latest changes. :)
```scala
// from user schema
val expData = Seq[String](null, null, null)
testRead(spark.read.schema(userSchema).parquet(), Seq.empty, userSchema)
testRead(spark.read.schema(userSchema).parquet(dir), expData, userSchema)
```
@tdas ORC behaves differently. When the user-specified schema does not match the physical schema, it simply stops and reports an exception. Do you think that behavior is better than returning null for all the rows?
Let's document that as a test for now.
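A hypothetical sketch of such a test (the suite, `dir`, and the concrete exception type are assumptions, not from the PR):

```scala
import org.apache.spark.sql.types.StructType

test("ORC reader fails on user schema mismatch") {
  val userSchema = new StructType().add("s", "string")
  // Unlike Parquet, which returns nulls for non-matching columns, ORC
  // stops and reports an exception when the user-specified schema does
  // not match the physical schema. The exact exception type may vary.
  intercept[Exception] {
    spark.read.schema(userSchema).orc(dir).collect()
  }
}
```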
Sure, will do it. Thanks!
```diff
  */
 def load(path: String): DataFrame = {
-  option("path", path).load()
+  load(Seq(path): _*) // force invocation of `load(...varargs...)`
```
With this change, `path` is no longer available in the options. This makes it hard (impossible?) for non-file-based data sources (those not implementing `FileFormat`) to use `load(...)`.

For example, for Elasticsearch we use:

```scala
sqlContext.read.format("org.elasticsearch.spark.sql").load(resource)
```

I do not think this can be implemented anymore now?
```scala
sqlContext.read.option("path", resource).format("org.elasticsearch.spark.sql").load()
```

Can you try this?
I believe that works as expected (I am running into some other issues now, but they seem unrelated). However, from a DSL perspective this is not very pretty?
It will also break users' code in an upgrade.
Yeah, this is a bad breaking change.
Do you want me to fix it?
If you can!
Sure, will do it soon. Thanks!
#### What changes were proposed in this pull request?

koertkuipers identified that PR #13727 changed the behavior of the `load` API: after the change, the `load` API does not add the value of `path` into the `options`. Thank you!

This PR adds the `path` option back to the `load()` API in `DataFrameReader`, if and only if the user specifies exactly one `path` in the `load` call. For example, users can see the `path` option after the following API call:

```Scala
spark.read
  .format("parquet")
  .load("/test")
```

#### How was this patch tested?

Added test cases.

Author: gatorsmile <[email protected]>

Closes #13933 from gatorsmile/optionPath.

(cherry picked from commit 25520e9)
Signed-off-by: Reynold Xin <[email protected]>
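A minimal sketch (a stand-in, not the actual `DataFrameReader` source) of the restored single-path special case:

```scala
// Stand-in illustrating the logic restored by #13933: the "path" option
// is populated only when exactly one path is passed to load().
def load(paths: String*): Unit = {
  val extraOptions: Map[String, String] =
    if (paths.length == 1) Map("path" -> paths.head) else Map.empty
  println(s"options: $extraOptions, paths: ${paths.mkString(", ")}")
}
```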