@tejasapatil
Contributor

@tejasapatil tejasapatil commented Sep 29, 2016

What changes were proposed in this pull request?

Hive allows inserting data into a bucketed table without guaranteeing that the output is bucketed and sorted, depending on two configs: hive.enforce.bucketing and hive.enforce.sorting.

What does this PR achieve?

  • Spark will disallow users from writing output to Hive bucketed tables by default, given that the output won't adhere to Hive's bucketing semantics.
  • If a user still wants to write to a Hive bucketed table, the only resort is to set hive.enforce.bucketing=false and hive.enforce.sorting=false, which means the user does NOT care about bucketing guarantees.

Changes done in this PR:

  • Extract table's bucketing information in HiveClientImpl
  • While writing table info to metastore, MetastoreRelation now populates the bucketing information in the hive Table object
  • InsertIntoHiveTable allows inserts to bucketed table only if both hive.enforce.bucketing and hive.enforce.sorting are false
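
The insert rule in the last bullet can be sketched in isolation as follows. This is a minimal illustration only, not the code in this PR: `BucketInfo`, `BucketedInsertGuard`, and `check` are hypothetical names, and a plain `Map` stands in for the Hadoop configuration.

```scala
// Hypothetical, simplified stand-in for a table's bucketing metadata.
final case class BucketInfo(
    numBuckets: Int,
    bucketColumns: Seq[String],
    sortColumns: Seq[String])

object BucketedInsertGuard {
  val EnforceBucketing = "hive.enforce.bucketing"
  val EnforceSorting = "hive.enforce.sorting"

  /**
   * Returns Right(()) when the insert may proceed and Left(message) when it must be rejected.
   * A plain Map stands in for the Hadoop configuration (an assumption for this sketch).
   */
  def check(conf: Map[String, String], bucketInfo: Option[BucketInfo]): Either[String, Unit] = {
    // A missing key is treated as "false", mirroring the default used in the diff under review.
    def enabled(key: String): Boolean = conf.getOrElse(key, "false").toBoolean

    bucketInfo match {
      case Some(_) if enabled(EnforceBucketing) || enabled(EnforceSorting) =>
        Left(
          s"Output would not follow Hive's bucketing semantics; set $EnforceBucketing=false " +
            s"and $EnforceSorting=false to insert anyway.")
      case _ =>
        // Either the table is not bucketed, or the user opted out of both guarantees.
        Right(())
    }
  }
}
```

Non-bucketed tables are never affected by the two configs in this sketch; only the combination of a bucketed table and at least one enforced guarantee is rejected.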

The ability to create bucketed tables will enable adding test cases to Spark while I add pieces to make Spark support Hive bucketing (e.g. #15229, #15047, #15040).

How was this patch tested?

  • Added test for creating bucketed and sorted table.
  • Added test to ensure that INSERTs fail if strict bucket / sort is enforced
  • Added test to ensure that INSERTs can go through if strict bucket / sort is NOT enforced
  • Added test to validate that bucketing information shows up in output of DESC FORMATTED
  • Added test to ensure that SHOW CREATE TABLE works for hive bucketed tables
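
For orientation, the bucketing clause that such tests would expect to see in a Hive table definition could be rendered roughly as below. This is a sketch under assumptions: `SortColumn`, `BucketingSpec`, and `BucketingDdl` are hypothetical names and not classes from this PR; only the `CLUSTERED BY ... SORTED BY ... INTO n BUCKETS` shape follows Hive's DDL syntax.

```scala
// Hypothetical model of a table's bucketing spec, for illustration only.
final case class SortColumn(name: String, ascending: Boolean = true)

final case class BucketingSpec(
    numBuckets: Int,
    bucketColumns: Seq[String],
    sortColumns: Seq[SortColumn] = Nil)

object BucketingDdl {
  /** Renders the bucketing part of a Hive CREATE TABLE statement. */
  def clause(spec: BucketingSpec): String = {
    val clustered = spec.bucketColumns.mkString("CLUSTERED BY (", ", ", ")")
    // SORTED BY is optional in Hive DDL, so it is omitted when no sort columns are set.
    val sorted =
      if (spec.sortColumns.isEmpty) ""
      else
        spec.sortColumns
          .map(c => s"${c.name} ${if (c.ascending) "ASC" else "DESC"}")
          .mkString(" SORTED BY (", ", ", ")")
    s"$clustered$sorted INTO ${spec.numBuckets} BUCKETS"
  }
}
```

A spec with 8 buckets on `user_id`, sorted by `ts`, would render as `CLUSTERED BY (user_id) SORTED BY (ts ASC) INTO 8 BUCKETS`.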

@SparkQA

SparkQA commented Sep 29, 2016

Test build #66112 has finished for PR 15300 at commit 4b0b7b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

cc @hvanhovell , @cloud-fan for review

@SparkQA

SparkQA commented Sep 30, 2016

Test build #66140 has finished for PR 15300 at commit 1369332.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"currently does NOT populate bucketed output which is compatible with Hive."

if (hadoopConf.get(enforceBucketingConfig, "false").toBoolean ||
hadoopConf.get(enforceSortingConfig, "false").toBoolean) {
Member

Are the default values (false) for these two configs safe? If a user isn't aware of them, they could insert incompatible data into a bucketed Hive table.

Contributor Author

@viirya : Even right now on trunk, if you try to insert data into a bucketed table, it will just work without producing bucketed output. I don't want to break that for existing users by making these true. The eventual goal is to drop these configs and have Spark always produce data adhering to the table's bucketing spec (without breaking existing pipelines).

@tejasapatil
Contributor Author

@hvanhovell , @cloud-fan : Can you please review this PR ?

@tejasapatil
Contributor Author

@hvanhovell , @cloud-fan : Can you please review this PR ?

@cloud-fan
Contributor

OK, now I have time to work on it. Do you have a plan/design for bucketed Hive tables? Because Spark and Hive have different hash implementations, we need a way to distinguish a native bucketed Hive table (which uses the Hive hash) from a Spark-written bucketed Hive table (which uses the Spark hash).
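
To make the hash mismatch concrete, here is a rough sketch for a single INT bucketing column. It assumes, as far as I understand, that Hive hashes an INT to its own value and picks the bucket with `(hash & Integer.MAX_VALUE) % numBuckets`, while Spark uses a Murmur3-based hash (seed 42) with a positive modulo. The Spark side below is only a stand-in built on Scala's standard-library `MurmurHash3`, not Spark's own implementation; `BucketIds` and both method names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object BucketIds {
  /** Hive-style bucket id for an INT key: the hash is the value itself, masked to be non-negative. */
  def hiveStyleBucket(key: Int, numBuckets: Int): Int =
    (key & Int.MaxValue) % numBuckets

  /**
   * Murmur3-style bucket id, used here as a stand-in for Spark's hash.
   * The seed (42) and the use of the standard-library MurmurHash3 are assumptions of this sketch.
   */
  def murmurStyleBucket(key: Int, numBuckets: Int): Int = {
    val hash = MurmurHash3.finalizeHash(MurmurHash3.mix(42, key), 4)
    // floorMod keeps the result in [0, numBuckets) even for negative hashes.
    java.lang.Math.floorMod(hash, numBuckets)
  }
}
```

Since the two functions generally assign the same key to different buckets, a reader that assumes one scheme cannot safely use bucket pruning or bucketed joins on files written with the other, which is why the table needs to record which hash produced it.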

@tejasapatil
Contributor Author

@cloud-fan : Thanks for reaching out. I wanted to ship this internally within Facebook for one of our internal use cases, so I didn't maintain or follow up on this PR. I will put out a design over the weekend so that we can discuss it. We want to have all the changes pushed upstream so that everyone benefits and we don't have to maintain a fork.

@tejasapatil
Contributor Author

@cloud-fan : I have linked a proposal in https://issues.apache.org/jira/browse/SPARK-19256.

@tejasapatil tejasapatil deleted the SPARK-17729_create_bucketed_table branch April 15, 2017 23:43
@tejasapatil
Contributor Author

Since trunk has diverged a lot since this PR was created, I closed this and created a fresh one at: #17644
