# [SPARK-38237][SQL][SS] Introduce a new config to require all cluster keys on Aggregate (#35552)
In `SQLConf` (hunk `@@ -407,6 +407,16 @@ object SQLConf {`), the new config is added right after the preceding entry:

```diff
       .booleanConf
       .createWithDefault(true)
 
+  val REQUIRE_ALL_CLUSTER_KEYS_FOR_AGGREGATE =
```
**Contributor (author) commented:** I picked a similar config name, with a similar description, to the config above (…).
```diff
+    buildConf("spark.sql.aggregate.requireAllClusterKeys")
+      .internal()
+      .doc("When true, aggregate operator requires all the clustering keys as the hash partition" +
+        " keys from child. This is to avoid data skews which can lead to significant " +
+        "performance regression if shuffles are eliminated.")
+      .version("3.3.0")
+      .booleanConf
+      .createWithDefault(false)
+
   val RADIX_SORT_ENABLED = buildConf("spark.sql.sort.enableRadixSort")
     .internal()
     .doc("When true, enable use of radix sort when possible. Radix sort is much faster but " +
```
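The config's doc string mentions that clustering on fewer keys can cause data skew. The following is an illustrative sketch in plain Python (not Spark code) of why hash-partitioning on a subset of the clustering keys can skew partition sizes, while partitioning on all keys spreads a hot key's rows out; the dataset and key names are made up for the example:

```python
# Illustrative only -- plain Python, not Spark. Simulates hash partitioning on a
# subset of the clustering keys (`user` only) vs. all keys (`user`, `day`).
import zlib

NUM_PARTITIONS = 8

# Skewed dataset: one "hot" user accounts for most rows.
rows = [("hot_user", day) for day in range(800)] + \
       [(f"user_{i}", i % 30) for i in range(200)]

def stable_hash(key):
    # Deterministic hash (Python's built-in str hash is randomized per run).
    return zlib.crc32(repr(key).encode())

def partition_sizes(keys):
    sizes = [0] * NUM_PARTITIONS
    for k in keys:
        sizes[stable_hash(k) % NUM_PARTITIONS] += 1
    return sizes

subset = partition_sizes([user for user, _ in rows])             # cluster on `user` only
all_keys = partition_sizes([(user, day) for user, day in rows])  # cluster on both keys

print("subset keys :", subset)    # one partition holds all 800 hot-user rows
print("all keys    :", all_keys)  # hot-user rows spread across partitions
```

With only `user` as the partition key, every hot-user row lands in a single partition; with both keys, the same rows hash to distinct values and distribute roughly evenly, which is the skew-avoidance the config trades shuffle elimination for.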
In `StreamExecution` (hunk `@@ -287,6 +287,9 @@ abstract class StreamExecution(`), the config is force-disabled for streaming queries:

```diff
     // Disable cost-based join optimization as we do not want stateful operations
     // to be rearranged
     sparkSessionForStream.conf.set(SQLConf.CBO_ENABLED.key, "false")
+    // Disable any config affecting the required child distribution of stateful operators.
+    // Please read through the NOTE on the classdoc of HashClusteredDistribution for details.
+    sparkSessionForStream.conf.set(SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_AGGREGATE.key, "false")
```
**Contributor (author) commented:** This is super important. The new config should never be set to `true` here until we fix the fundamental problem with backward compatibility, since stateful operators would follow the changed output partitioning as well.
```diff
 
     updateStatusMessage("Initializing sources")
     // force initialization of the logical plan so that the sources can be created
```
**Review comment:** This and the lines below basically restore the implementation of `HashClusteredDistribution`.
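To make the config's effect concrete, here is a hypothetical sketch in plain Python (not Spark's actual `Partitioning`/`Distribution` API) of the compatibility check the flag toggles: whether an existing hash partitioning of the child satisfies an aggregate clustering on `cluster_keys`. The function name and exact matching rule are assumptions for illustration:

```python
# Hypothetical sketch of the check toggled by spark.sql.aggregate.requireAllClusterKeys.
def satisfies(cluster_keys, child_partition_keys, require_all_cluster_keys):
    if require_all_cluster_keys:
        # Config on: child must be hash-partitioned on exactly all clustering keys.
        return list(child_partition_keys) == list(cluster_keys)
    # Config off (the default): partitioning on any non-empty subset of the
    # clustering keys is accepted, so an existing shuffle can be reused.
    return (len(child_partition_keys) > 0
            and set(child_partition_keys) <= set(cluster_keys))

# GROUP BY a, b with the child already partitioned on just `a`:
print(satisfies(["a", "b"], ["a"], require_all_cluster_keys=False))  # True: shuffle eliminated
print(satisfies(["a", "b"], ["a"], require_all_cluster_keys=True))   # False: re-shuffle on (a, b)
```

Under the default (`false`), partitioning on `a` alone is accepted and the shuffle is avoided, which is exactly the skew-prone case the new config lets users opt out of.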