-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-27453] Pass partitionBy as options in DataFrameWriter #24365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Add to whitelist |
|
test this please |
|
Test build #104568 has finished for PR 24365 at commit
|
| .internal() | ||
| .doc("Whether to pass the partitionBy columns as options in DataFrameWriter." + | ||
| " Data source V1 now silently drops partitionBy columns for non-file-format sources;" + | ||
| " turning the flag on provides a way for these sources to see these partitionBy columns.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the format in this file should be "text " + "more text" instead of "text" + " more text"
| " Data source V1 now silently drops partitionBy columns for non-file-format sources;" + | ||
| " turning the flag on provides a way for these sources to see these partitionBy columns.") | ||
| .booleanConf | ||
| .createWithDefault(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the default be false to be backward compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth on what to recommend here. I think we need to balance the fact that this is a behavior change with the confusion that a user will experience when options they specify are silently dropped.
I recommended going with true because: a) I don't know any V1 sources that validate options where this change would break an existing program and b) the only time behavior changes is when someone is specifying a (silently dropped) partitionBy clause.
Thoughts?
|
test this please |
|
jenkins retest this please |
|
retest this please |
|
Add to whitelist |
|
Jenkins, add to whitelist |
|
|
||
| val LEGACY_PASS_PARTITION_BY_AS_OPTIONS = | ||
| buildConf("spark.sql.legacy.sources.write.passPartitionByAsOptions") | ||
| .internal() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this must be internal?
I see some of the previous ones with LEGACY_ prefix are internal but before them there are a few externals: LEGACY_REPLACE_DATABRICKS_SPARK_AVRO_ENABLED and LEGACY_SIZE_OF_NULL. So what is the reason defining this as internal?
I assume if we expect the users to set it it then it must be external.
What is your (and the others) opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to document this flag for users. This is just in case we break some existing production workloads. For other users, they don't need to know anything about this flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I see. Thanks for the explanation.
|
Test build #104633 has finished for PR 24365 at commit
|
|
LGTM. Merging to master and 2.4 |
Pass partitionBy columns as options and feature-flag this behavior. A new unit test. Closes #24365 from liwensun/partitionby. Authored-by: liwensun <[email protected]> Signed-off-by: Tathagata Das <[email protected]> (cherry picked from commit 26ed65f) Signed-off-by: Tathagata Das <[email protected]>
## What changes were proposed in this pull request? Pass partitionBy columns as options and feature-flag this behavior. ## How was this patch tested? A new unit test. Closes apache#24365 from liwensun/partitionby. Authored-by: liwensun <[email protected]> Signed-off-by: Tathagata Das <[email protected]>
…TIONS ## What changes were proposed in this pull request? In PR apache#24365, we pass in the partitionBy columns as options in `DataFrameWriter`. To make this change less intrusive for a patch release, we added a feature flag `LEGACY_PASS_PARTITION_BY_AS_OPTIONS` with the default to be false. For 3.0, we should just do the correct behavior for DSV1, i.e., always passing partitionBy as options, and remove this legacy feature flag. ## How was this patch tested? Existing tests. Closes apache#24784 from liwensun/SPARK-27453-default. Authored-by: liwensun <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
Pass partitionBy columns as options and feature-flag this behavior. A new unit test. Closes apache#24365 from liwensun/partitionby. Authored-by: liwensun <[email protected]> Signed-off-by: Tathagata Das <[email protected]> (cherry picked from commit 26ed65f) Signed-off-by: Tathagata Das <[email protected]>
Pass partitionBy columns as options and feature-flag this behavior. A new unit test. Closes apache#24365 from liwensun/partitionby. Authored-by: liwensun <[email protected]> Signed-off-by: Tathagata Das <[email protected]> (cherry picked from commit 26ed65f) Signed-off-by: Tathagata Das <[email protected]>
Pass partitionBy columns as options and feature-flag this behavior. A new unit test. Closes apache#24365 from liwensun/partitionby. Authored-by: liwensun <[email protected]> Signed-off-by: Tathagata Das <[email protected]> (cherry picked from commit 26ed65f) Signed-off-by: Tathagata Das <[email protected]>
What changes were proposed in this pull request?
Pass partitionBy columns as options and feature-flag this behavior.
How was this patch tested?
A new unit test.