[SPARK-19587][SQL] bucket sorting columns should not be picked from partition columns #16931
Conversation
Test build #72889 has finished for PR 16931 at commit
tejasapatil left a comment:

LGTM. I have a minor comment, but it's orthogonal to this PR.
```diff
-     n <- numBuckets
-   } yield {
+   numBuckets.map { n =>
      require(n > 0 && n < 100000, "Bucket number must be greater than 0 and less than 100000.")
```
Orthogonal to your PR: this means Spark supports bucket counts in the range [1, 99999]. Any reason to have such a low value for the upper bound?

Also, I don't think this code gets executed if the bucketed table is written via SQL. The only check I can see is when we create `BucketSpec`, but it covers the lower bound only (spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala, line 138 in 4d4d0de): `if (numBuckets <= 0) {`. It would be better to move this check into `BucketSpec` creation to be consistent across the codebase.
yea we should move this check to `BucketSpec` for consistency.

About the upper bound: we just picked a value that should be big enough. In practice I don't think users will set large bucket numbers; this is just a sanity check.
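For illustration, a minimal sketch of what folding both bounds into `BucketSpec` could look like. The field names follow the `BucketSpec` definition in `interface.scala` referenced above, but the combined check and its message here are assumptions, not the merged code:

```scala
import org.apache.spark.sql.AnalysisException

// Hypothetical sketch: validate both bounds at BucketSpec creation time,
// so every write path (DataFrame API and SQL) hits the same check.
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]) {
  if (numBuckets <= 0 || numBuckets >= 100000) {
    throw new AnalysisException(
      s"Number of buckets should be greater than 0 but less than 100000. Got `$numBuckets`")
  }
}
```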
Cool. I will submit a PR for that change once you land this one
or you could do that change right here
feel free to submit one :)
Test build #72902 has finished for PR 16931 at commit
LGTM

thanks for the review, merging to master!
```diff
-   test("write bucketed data with the overlapping bucketBy and partitionBy columns") {
+   test("write bucketed data with the overlapping bucketBy/sortBy and partitionBy columns") {
      intercept[AnalysisException](df.write
```
Not related to this PR, but I think we should move most of these test cases to the sql package; let me try to do that. Only the ORC format is Hive-only.
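As a rough sketch, the renamed test could exercise both overlaps. The `df` fixture and the column names `i`, `j`, `k` are assumptions; the merged test may differ:

```scala
test("write bucketed data with the overlapping bucketBy/sortBy and partitionBy columns") {
  // Bucket column "j" overlaps with the partition columns: rejected.
  intercept[AnalysisException](df.write
    .partitionBy("i", "j")
    .bucketBy(8, "j", "k")
    .saveAsTable("t"))

  // Sort column "i" overlaps with the partition columns: also rejected after this PR.
  intercept[AnalysisException](df.write
    .partitionBy("i", "j")
    .bucketBy(8, "k")
    .sortBy("i")
    .saveAsTable("t"))
}
```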
Another late LGTM
What changes were proposed in this pull request?
We will throw an exception if bucket columns are part of partition columns; this should also apply to sort columns.
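For a concrete (hypothetical) illustration of the behavior, with column and table names invented for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("bucket-demo").getOrCreate()
val df = spark.range(100).selectExpr("id AS i", "id % 5 AS j", "id % 3 AS k")

// Already rejected: bucket column "j" is also a partition column.
df.write.partitionBy("j").bucketBy(4, "j").saveAsTable("t1")               // AnalysisException

// With this PR, a sort column that is also a partition column is rejected too.
df.write.partitionBy("j").bucketBy(4, "k").sortBy("j").saveAsTable("t2")   // AnalysisException
```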
This PR also moves the checking logic from `DataFrameWriter` to `PreprocessTableCreation`, which is the central place for checking and normalization.
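Roughly, the centralized check amounts to something like the helper below. The helper name and error message are invented for illustration; the real rule also normalizes column names against the table schema:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.catalog.BucketSpec

// Hypothetical helper: reject any bucketing or sorting column that is
// also a partition column.
def assertNoOverlap(partitionCols: Seq[String], bucketSpec: BucketSpec): Unit = {
  val partitionSet = partitionCols.toSet
  (bucketSpec.bucketColumnNames ++ bucketSpec.sortColumnNames).foreach { col =>
    if (partitionSet.contains(col)) {
      throw new AnalysisException(
        s"bucketing/sorting column '$col' should not be part of partition columns " +
          s"'${partitionCols.mkString(", ")}'")
    }
  }
}
```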
How was this patch tested?

Updated the existing test.