
Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

We throw an exception if bucket columns are part of the partition columns; this should also apply to sort columns.

This PR also moves the checking logic from DataFrameWriter to PreprocessTableCreation, which is the central place for checking and normalization (see the sketch below).
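
To make the rule concrete, here is a minimal, self-contained sketch of the overlap check this PR enforces. The method name, parameter names, and error messages are illustrative, and the real rule in PreprocessTableCreation throws AnalysisException rather than IllegalArgumentException:

```scala
// Minimal sketch of the check this PR enforces: neither bucket columns nor
// sort columns may appear among the partition columns. Names and messages are
// placeholders; a plain IllegalArgumentException keeps the sketch self-contained.
def checkColumnOverlap(
    partitionCols: Seq[String],
    bucketCols: Seq[String],
    sortCols: Seq[String]): Unit = {
  def assertNoOverlap(cols: Seq[String], kind: String): Unit = {
    cols.find(c => partitionCols.contains(c)).foreach { col =>
      throw new IllegalArgumentException(
        s"$kind column '$col' should not be part of partition columns " +
          partitionCols.mkString("[", ", ", "]"))
    }
  }
  assertNoOverlap(bucketCols, "bucketing")
  // New in this PR: sort columns get the same treatment as bucket columns.
  assertNoOverlap(sortCols, "sorting")
}
```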

How was this patch tested?

Updated an existing test.

@cloud-fan
Contributor Author

cc @tejasapatil @gatorsmile

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72889 has finished for PR 16931 at commit 56b1b18.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@tejasapatil left a comment


LGTM. I have a minor comment but it's orthogonal to this PR.

-     n <- numBuckets
-   } yield {
+   numBuckets.map { n =>
      require(n > 0 && n < 100000, "Bucket number must be greater than 0 and less than 100000.")
Contributor


Orthogonal to your PR: this means Spark supports bucket counts in the range [1, 99999]. Any reason to have such a low value for the upper bound?

Also, I don't think this code gets executed if the bucketed table is written via SQL. The only check I can see is when we create BucketSpec, but it's for the lower bound only. This check should only be present in BucketSpec creation, to be consistent across the codebase.

Contributor Author


Yeah, we should move this check to BucketSpec for consistency.

About the upper bound, we just picked a value that should be big enough. In practice I don't think users will set large bucket numbers; this is just a sanity check.
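
For reference, the follow-up being discussed would amount to moving the bound into BucketSpec itself, so both the DataFrame and SQL write paths are validated. A rough sketch of the idea (not the actual follow-up patch; field names follow the real BucketSpec):

```scala
// Sketch only: validate the bucket count where BucketSpec is constructed so
// every code path that builds one hits the same check. Bounds and message
// mirror the require shown in the diff above.
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]) {
  require(numBuckets > 0 && numBuckets < 100000,
    s"Bucket number must be greater than 0 and less than 100000. Got $numBuckets.")
}
```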

Contributor


Cool. I will submit a PR for that change once you land this one.

Contributor


Or you could do that change right here.

Contributor Author


feel free to submit one :)

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72902 has finished for PR 16931 at commit e21d8ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Feb 15, 2017

LGTM

@cloud-fan
Contributor Author

Thanks for the review, merging to master!

@asfgit closed this in 8b75f8c Feb 15, 2017

test("write bucketed data with the overlapping bucketBy and partitionBy columns") {
intercept[AnalysisException](df.write
test("write bucketed data with the overlapping bucketBy/sortBy and partitionBy columns") {
Member


Not related to this PR, but I think we should move most of these test cases to the sql package; let me try to do it. Only the ORC format tests are Hive-only.

@gatorsmile
Member

Another late LGTM

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 16, 2017
…artition columns


Author: Wenchen Fan <[email protected]>

Closes apache#16931 from cloud-fan/bucket.