[SPARK-19587][SQL] bucket sorting columns should not be picked from partition columns #16931
Conversation
Test build #72889 has finished for PR 16931 at commit
tejasapatil left a comment:

LGTM. I have a minor comment, but it's orthogonal to this PR.
```diff
-     n <- numBuckets
-   } yield {
+   numBuckets.map { n =>
      require(n > 0 && n < 100000, "Bucket number must be greater than 0 and less than 100000.")
```
Orthogonal to your PR: this means Spark supports bucket counts in the range [1, 99999]. Any reason to have such a low value for the upper bound?

Also, I don't think this code gets executed if the bucketed table is written via SQL. The only check I can see is when we create `BucketSpec`, but it covers the lower bound only (spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala, line 138 in 4d4d0de): `if (numBuckets <= 0) {`. It would be better to move this check into `BucketSpec` creation to be consistent across the codebase.
yea we should move this check to `BucketSpec` for consistency.

About the upper bound: we just picked a value that should be big enough. In practice I don't think users will set large bucket numbers; this is just a sanity check.
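For illustration, a minimal sketch of what folding both bounds into `BucketSpec` could look like. The field names follow the `BucketSpec` definition in `interface.scala` referenced above, but the combined check and its message here are assumptions, not the merged code:

```scala
import org.apache.spark.sql.AnalysisException

// Hypothetical sketch: validate both bounds at BucketSpec creation time,
// so every write path (DataFrame API and SQL) hits the same check.
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]) {
  if (numBuckets <= 0 || numBuckets >= 100000) {
    throw new AnalysisException(
      s"Number of buckets should be greater than 0 but less than 100000. Got `$numBuckets`")
  }
}
```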
Cool. I will submit a PR for that change once you land this one
or you could do that change right here
feel free to submit one :)
Test build #72902 has finished for PR 16931 at commit
LGTM

thanks for the review, merging to master!
```diff
-   test("write bucketed data with the overlapping bucketBy and partitionBy columns") {
+   test("write bucketed data with the overlapping bucketBy/sortBy and partitionBy columns") {
      intercept[AnalysisException](df.write
```
Not related to this PR, but I think we should move most of these test cases to the sql package; let me try to do that. Only the ORC format is Hive-only.
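As a rough sketch, the renamed test could exercise both overlaps. The `df` fixture and the column names `i`, `j`, `k` are assumptions; the merged test may differ:

```scala
test("write bucketed data with the overlapping bucketBy/sortBy and partitionBy columns") {
  // Bucket column "j" overlaps with the partition columns: rejected.
  intercept[AnalysisException](df.write
    .partitionBy("i", "j")
    .bucketBy(8, "j", "k")
    .saveAsTable("t"))

  // Sort column "i" overlaps with the partition columns: also rejected after this PR.
  intercept[AnalysisException](df.write
    .partitionBy("i", "j")
    .bucketBy(8, "k")
    .sortBy("i")
    .saveAsTable("t"))
}
```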
Another late LGTM
What changes were proposed in this pull request?
We will throw an exception if bucket columns are part of partition columns; this should also apply to sort columns.
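For a concrete (hypothetical) illustration of the behavior, with column and table names invented for the example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("bucket-demo").getOrCreate()
val df = spark.range(100).selectExpr("id AS i", "id % 5 AS j", "id % 3 AS k")

// Already rejected: bucket column "j" is also a partition column.
df.write.partitionBy("j").bucketBy(4, "j").saveAsTable("t1")               // AnalysisException

// With this PR, a sort column that is also a partition column is rejected too.
df.write.partitionBy("j").bucketBy(4, "k").sortBy("j").saveAsTable("t2")   // AnalysisException
```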
This PR also moves the checking logic from `DataFrameWriter` to `PreprocessTableCreation`, which is the central place for checking and normalization.
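Roughly, the centralized check amounts to something like the helper below. The helper name and error message are invented for illustration; the real rule also normalizes column names against the table schema:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.catalog.BucketSpec

// Hypothetical helper: reject any bucketing or sorting column that is
// also a partition column.
def assertNoOverlap(partitionCols: Seq[String], bucketSpec: BucketSpec): Unit = {
  val partitionSet = partitionCols.toSet
  (bucketSpec.bucketColumnNames ++ bucketSpec.sortColumnNames).foreach { col =>
    if (partitionSet.contains(col)) {
      throw new AnalysisException(
        s"bucketing/sorting column '$col' should not be part of partition columns " +
          s"'${partitionCols.mkString(", ")}'")
    }
  }
}
```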
How was this patch tested?

Updated the existing test.