[SPARK-17729] [SQL] Enable creating hive bucketed tables #15300
Conversation
Test build #66112 has finished for PR 15300 at commit

cc @hvanhovell, @cloud-fan for review

Test build #66140 has finished for PR 15300 at commit
"currently does NOT populate bucketed output which is compatible with Hive."

    if (hadoopConf.get(enforceBucketingConfig, "false").toBoolean ||
        hadoopConf.get(enforceSortingConfig, "false").toBoolean) {
Are the default values (false) for these two configs safe? If the user isn't aware of them, they could insert incompatible data into a bucketed Hive table.
@viirya: Even right now on trunk, if you try to insert data into a bucketed table it will just work without producing bucketed output. I don't want to break that for existing users by making these default to true. The eventual goal is to not have these configs at all, and for Spark to always produce data adhering to the table's bucketing spec (without breaking existing pipelines).
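To make the defaults discussed above concrete, here is a minimal plain-Scala sketch (not Spark's actual code; `EnforceBucketingCheck` and its method names are hypothetical) of how the two real Hive configs gate inserts: an unset key defaults to `"false"`, meaning the user has not asked for bucketing guarantees, so the insert proceeds.

```scala
// Sketch only: simplified stand-in for the gating logic in InsertIntoHiveTable.
// The config key names are real Hive settings; the object and its API are hypothetical.
object EnforceBucketingCheck {
  val EnforceBucketingConfig = "hive.enforce.bucketing"
  val EnforceSortingConfig   = "hive.enforce.sorting"

  // Mirrors hadoopConf.get(key, "false").toBoolean: a missing key means
  // the user does NOT care about bucketing/sorting guarantees.
  def enforced(conf: Map[String, String]): Boolean =
    conf.getOrElse(EnforceBucketingConfig, "false").toBoolean ||
      conf.getOrElse(EnforceSortingConfig, "false").toBoolean

  // The PR rejects inserts into bucketed tables when enforcement is on,
  // because Spark cannot yet produce Hive-compatible bucketed output.
  def insertAllowed(conf: Map[String, String], tableIsBucketed: Boolean): Boolean =
    !tableIsBucketed || !enforced(conf)
}
```

With both configs unset (the default), inserts into bucketed tables keep working exactly as they do on trunk today.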
@hvanhovell, @cloud-fan: Can you please review this PR?
ok, now I have time to work on it. Do you have a plan/design for bucketed Hive tables? Because Spark and Hive have different hash implementations, we need a way to distinguish native bucketed Hive tables (which use the Hive hash) from Spark-written bucketed Hive tables (which use the Spark hash).
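The incompatibility described above can be illustrated with a small self-contained sketch. Hive assigns a row to bucket `(hash & Int.MaxValue) % numBuckets`; if two engines disagree on the hash function, the same key can land in different buckets, so a reader assuming one layout would prune or join incorrectly against the other. Note the two hash functions below are stand-ins for illustration, not the real Hive hash or Spark's Murmur3.

```scala
// Illustration only (not Spark or Hive source).
object BucketIdDemo {
  // Hive-style bucket id: non-negative hash modulo bucket count.
  def bucketId(hash: Int, numBuckets: Int): Int =
    (hash & Int.MaxValue) % numBuckets

  // Stand-in for a "Hive hash": plain Java String.hashCode.
  def hiveStyleHash(key: String): Int = key.hashCode

  // Stand-in for a "Spark hash": a differently seeded mixing function.
  // (Spark actually uses Murmur3 for its native bucketing; this is only
  // here to show that a different hash yields a different bucket layout.)
  def sparkStyleHash(key: String): Int =
    key.foldLeft(0x3c074a61)((acc, c) => (acc * 31 + c.toInt) ^ (acc >>> 16))
}
```

This is why the table metadata must record *which* hash produced the bucket files: bucket ids computed by one function are meaningless to a reader applying the other.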
@cloud-fan: Thanks for reaching out. I wanted to ship this internally within Facebook for one of our internal use cases, so I didn't maintain / follow up on this PR. I will put out a design over the weekend so that we can discuss it. We want to have all the changes pushed upstream so that everyone benefits and we don't have to maintain a fork.
@cloud-fan : I have linked a proposal in https://issues.apache.org/jira/browse/SPARK-19256. |
Since trunk has diverged a lot since this PR was created, I closed this and created a fresh one at #17644.
What changes were proposed in this pull request?

Hive allows inserting data into a bucketed table without guaranteeing bucketed- and sorted-ness, based on these two configs: `hive.enforce.bucketing` and `hive.enforce.sorting`.

What does this PR achieve?

Inserts into a bucketed Hive table are allowed only when `hive.enforce.bucketing=false` and `hive.enforce.sorting=false`, which means the user does NOT care about bucketing guarantees.

Changes done in this PR:
- `HiveClientImpl` / `MetastoreRelation` now populates the bucketing information in the hive `Table` object
- `InsertIntoHiveTable` allows inserts to a bucketed table only if both `hive.enforce.bucketing` and `hive.enforce.sorting` are `false`

Ability to create bucketed tables will enable adding test cases to Spark while I add pieces to make Spark support hive bucketing (eg. #15229, #15047, #15040)
How was this patch tested?

Verified that `SHOW CREATE TABLE` works for hive bucketed tables.