[SPARK-17729] [SQL] Enable creating hive bucketed tables #15300

tejasapatil · 2016-09-29T17:55:39Z

What changes were proposed in this pull request?

Hive allows inserting data into a bucketed table without guaranteeing bucketing or sortedness, depending on two configs: hive.enforce.bucketing and hive.enforce.sorting.

What does this PR achieve?

Spark will disallow users from writing outputs to Hive bucketed tables by default (given that the output won't adhere to Hive's semantics).
If the user still wants to write to a Hive bucketed table, the only resort is to set hive.enforce.bucketing=false and hive.enforce.sorting=false, which means the user does NOT care about bucketing guarantees.

Changes done in this PR:

Extract the table's bucketing information in HiveClientImpl
While writing table info to the metastore, MetastoreRelation now populates the bucketing information in the Hive Table object
InsertIntoHiveTable allows inserts into a bucketed table only if both hive.enforce.bucketing and hive.enforce.sorting are false
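
The insert rule in the last item can be sketched in isolation. This is a minimal illustration, not the code in this PR: a plain `Map` stands in for the Hadoop configuration, and `BucketedInsertGate` and `checkInsertAllowed` are hypothetical names. Only the two config keys and their "false" fallback come from the PR itself.

```scala
// Sketch only: a plain Map stands in for the Hadoop configuration, and
// checkInsertAllowed is a hypothetical helper, not Spark's actual code.
object BucketedInsertGate {
  val EnforceBucketing = "hive.enforce.bucketing"
  val EnforceSorting = "hive.enforce.sorting"

  /** Returns Right(()) when the insert may proceed, Left(message) otherwise. */
  def checkInsertAllowed(
      conf: Map[String, String],
      tableIsBucketed: Boolean): Either[String, Unit] = {
    // A missing key is read as "false", mirroring the fallback used in the
    // reviewed snippet from InsertIntoHiveTable.scala.
    def flag(key: String): Boolean = conf.getOrElse(key, "false").toBoolean

    if (tableIsBucketed && (flag(EnforceBucketing) || flag(EnforceSorting))) {
      Left("Output would not be bucketed/sorted in a Hive-compatible way; " +
        s"set $EnforceBucketing=false and $EnforceSorting=false to insert anyway")
    } else {
      Right(())
    }
  }
}
```

Returning an `Either` instead of throwing is a choice made for this sketch so the rule is easy to exercise on its own; it says nothing about how the real operator reports the failure.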

The ability to create bucketed tables will enable adding test cases to Spark while I add the pieces that make Spark support Hive bucketing (e.g. #15229, #15047, #15040)

How was this patch tested?

Added test for creating bucketed and sorted table.
Added test to ensure that INSERTs fail if strict bucket / sort is enforced
Added test to ensure that INSERTs can go through if strict bucket / sort is NOT enforced
Added test to validate that bucketing information shows up in output of DESC FORMATTED
Added test to ensure that SHOW CREATE TABLE works for hive bucketed tables
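
For readers unfamiliar with what the bucketing information looks like in DDL, the following is a rough sketch of rendering it as the Hive `CLUSTERED BY ... SORTED BY ... INTO n BUCKETS` clause. `BucketInfo`, `SortColumn`, and `BucketingDdl` are hypothetical names invented for this illustration; they are not the classes touched by this PR, and the exact text produced by SHOW CREATE TABLE may differ.

```scala
// Hypothetical model of a table's bucketing metadata; names are illustrative.
final case class SortColumn(name: String, ascending: Boolean = true)

final case class BucketInfo(
    numBuckets: Int,
    bucketColumns: Seq[String],
    sortColumns: Seq[SortColumn] = Nil)

object BucketingDdl {
  /** Renders the bucketing part of a Hive CREATE TABLE statement. */
  def clause(info: BucketInfo): String = {
    require(info.numBuckets > 0, "numBuckets must be positive")
    require(info.bucketColumns.nonEmpty, "at least one bucket column is required")

    val clustered = info.bucketColumns.mkString("CLUSTERED BY (", ", ", ")")

    // SORTED BY is optional: a table can be bucketed without being sorted.
    val sorted =
      if (info.sortColumns.isEmpty) ""
      else info.sortColumns.map { c =>
        val direction = if (c.ascending) "ASC" else "DESC"
        s"${c.name} $direction"
      }.mkString(" SORTED BY (", ", ", ")")

    s"$clustered$sorted INTO ${info.numBuckets} BUCKETS"
  }
}
```

For example, `BucketingDdl.clause(BucketInfo(8, Seq("user_id"), Seq(SortColumn("ts"))))` yields `CLUSTERED BY (user_id) SORTED BY (ts ASC) INTO 8 BUCKETS`.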

SparkQA · 2016-09-29T19:17:29Z

Test build #66112 has finished for PR 15300 at commit 4b0b7b4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2016-09-30T00:57:30Z

cc @hvanhovell , @cloud-fan for review

SparkQA · 2016-09-30T02:54:25Z

Test build #66140 has finished for PR 15300 at commit 1369332.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-10-03T07:21:23Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala

+          "currently does NOT populate bucketed output which is compatible with Hive."
+
+        if (hadoopConf.get(enforceBucketingConfig, "false").toBoolean ||
+          hadoopConf.get(enforceSortingConfig, "false").toBoolean) {


Are the default values (false) for these two configs safe? If the user isn't aware of them, they could insert incompatible data into a bucketed Hive table.

@viirya : Even right now on trunk, if you try to insert data into a bucketed table, it will just work without producing bucketed output. I don't want to break that for existing users by making these true. The eventual goal is to not have these configs at all: Spark should always produce data adhering to the table's bucketing spec (without breaking existing pipelines).

tejasapatil · 2016-10-06T22:40:19Z

@hvanhovell , @cloud-fan : Can you please review this PR ?

tejasapatil · 2016-10-26T00:52:01Z

@hvanhovell , @cloud-fan : Can you please review this PR ?

cloud-fan · 2017-01-14T03:55:52Z

OK, now I have time to work on it. Do you have a plan/design for bucketed Hive tables? Because Spark and Hive have different hash implementations, we need a way to distinguish a native bucketed Hive table (uses the Hive hash) from a Spark-written bucketed Hive table (uses the Spark hash).
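
To make the hash concern concrete, here is a small sketch of how the same key can land in different buckets under two hash schemes. It rests on stated assumptions rather than on either project's source: the Hive-style function assumes an INT key hashes to its own value with the sign bit masked off, and the Spark-style function assumes a Murmur3 hash with seed 42 followed by a non-negative modulus, using Scala's `MurmurHash3` as a stand-in. `BucketIdSketch` and its method names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object BucketIdSketch {
  // Assumption: Hive-style hashing of an INT key is the value itself,
  // with the sign bit masked off before taking the modulus.
  def hiveStyleBucket(key: Int, numBuckets: Int): Int =
    (key & Int.MaxValue) % numBuckets

  // Assumption: Spark-style hashing is Murmur3 (seed 42) followed by a
  // non-negative modulus. Scala's MurmurHash3 is only a stand-in here.
  def sparkStyleBucket(key: Int, numBuckets: Int, seed: Int = 42): Int = {
    val hash = MurmurHash3.finalizeHash(MurmurHash3.mix(seed, key), 4)
    Math.floorMod(hash, numBuckets)
  }
}

object BucketIdDemo extends App {
  // Print both bucket ids for a few keys; rows where they differ are the
  // ones a reader expecting the other scheme would look for in the wrong file.
  (0 until 5).foreach { key =>
    val hive = BucketIdSketch.hiveStyleBucket(key, 8)
    val spark = BucketIdSketch.sparkStyleBucket(key, 8)
    println(s"key=$key hiveStyle=$hive sparkStyle=$spark")
  }
}
```

Whatever the exact functions are, the point is the same: a reader that assumes one scheme cannot prune or join on files written with the other, so the table metadata has to record which scheme produced the data.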

tejasapatil · 2017-01-14T04:26:21Z

@cloud-fan : Thanks for reaching out. I wanted to ship this internally within Facebook for one of our internal use cases, so I didn't maintain or follow up on this PR. I will put out a design over the weekend so that we can discuss it. We want to have all the changes pushed upstream so that everyone benefits and we don't have to maintain a fork.

tejasapatil · 2017-01-17T05:58:29Z

@cloud-fan : I have linked a proposal in https://issues.apache.org/jira/browse/SPARK-19256.

tejasapatil · 2017-04-15T23:46:16Z

Since trunk has diverged a lot since this PR was created, I closed this one and created a fresh one at: #17644

Enable creating hive bucketed tables

4b0b7b4

tejasapatil mentioned this pull request Sep 29, 2016

[SPARK-17654] [SQL] Propagate bucketing information for Hive tables to planner #15229

Closed

fix test case failures

1369332

viirya reviewed Oct 3, 2016

tejasapatil mentioned this pull request Jan 18, 2017

[SPARK-12539][SQL] support writing bucketed table #10498

Closed

tejasapatil closed this Apr 15, 2017

tejasapatil deleted the SPARK-17729_create_bucketed_table branch April 15, 2017 23:43
