Conversation

@cloud-fan
Contributor

This PR adds bucketed-write support to Spark SQL. Users can specify bucketing columns, numBuckets, and sorting columns, with or without partition columns. For example:

df.write.partitionBy("year").bucketBy(8, "country").sortBy("amount").saveAsTable("sales")

When bucketing is used, we calculate a bucket id for each record and group the records by bucket id. For each group, we create a file with the bucket id in its name and write the data into it. For each bucket file, if sorting columns are specified, the data is sorted before being written.
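As a rough illustration of the bucket id computation (a minimal sketch, not the PR's actual code, which uses the dedicated Hash expression added by this patch):

// Minimal sketch: hash the bucketing columns' values and take a
// non-negative modulo over the bucket count. Illustrative only; the PR
// uses its own Hash expression rather than Scala's hashCode.
def bucketId(bucketColumnValues: Seq[Any], numBuckets: Int): Int = {
  val h = bucketColumnValues.hashCode()
  ((h % numBuckets) + numBuckets) % numBuckets  // non-negative modulo (pmod)
}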

Note that there may be multiple files for one bucket, as the data is distributed.

Currently we store the bucket metadata in the Hive metastore in a non-Hive-compatible way. We use a different bucketing hash function than Hive, so we can't be compatible anyway.

Limitations:

  • Can't write bucketed data without hive metastore.
  • Can't insert bucketed data into existing hive tables.

@cloud-fan
Contributor Author

cc @yhuai @nongli

@SparkQA

SparkQA commented Dec 28, 2015

Test build #48367 has finished for PR 10498 at commit 8cb2494.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Hash(children: Seq[Expression]) extends Expression
    • class TextOutputWriter(

@yhuai
Contributor

yhuai commented Dec 28, 2015

This one also includes #10435, right?

@nongli
Contributor

nongli commented Dec 28, 2015

@cloud-fan

currently we don't shuffle before writing partitioned data, which means the same partition's data can appear in different RDD blocks; that's why we have multiple files for one partition, and we will also have multiple files for one bucket. Is that safe?

This is safe, but how can we get into this state from a single write? There must have been a partitionBy before, right?

Hive supports bucketing without partitioning; should we support it?

Why not? Is this hard to support?

@rxin
Contributor

rxin commented Dec 28, 2015

BTW, on GitHub you can use square brackets to create a checklist, e.g.

- [ ] item a
- [ ] item b

becomes

  • item a
  • item b

@cloud-fan
Contributor Author

This one also includes #10435; we can merge that first.

@SparkQA

SparkQA commented Dec 29, 2015

Test build #48415 has finished for PR 10498 at commit a9dc997.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BucketSpec(

Contributor

Please also add the @Since annotations and comments; these are public APIs.

@cloud-fan cloud-fan changed the title [SPARK-12539][SQL][WIP] support writing bucketed table [SPARK-12539][SQL] support writing bucketed table Dec 30, 2015
Contributor

This method can be simplified to:

if (sortingColumns.isDefined) {
  require(numBuckets.isDefined, "sortBy must be used together with bucketBy")
}

for {
  n <- numBuckets
  cols <- normalizedBucketCols
} yield {
  require(n > 0, "Bucket number must be greater than 0.")
  BucketSpec(n, cols, normalizedSortCols)
}

(require throws IllegalArgumentException when the condition is not met.)

@SparkQA

SparkQA commented Dec 30, 2015

Test build #48489 has finished for PR 10498 at commit d2dc9b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

I'm a little worried here. It was a simple != operator for the non-bucket path before, but now it's a function call.

Contributor

I think this is fine. The rest of this path is much more expensive than this function call.

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48854 has finished for PR 10498 at commit d3200cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

Shall we put the bucket id at the very end, i.e. after the file extension, so that it's much easier to get the bucket id given a file name? e.g. part-r-00009-ea518ad4-455a-4431-b471-d24e03814677.gz.parquet.00002

cc @nongli

Contributor

I don't have a strong opinion here. Let's go either way for now and take another pass before shipping this. We should try this with Hive as well, just to get another data point.

Contributor

Having the bucket id at the very end could break other applications that rely on the file extension to recognize the file format, so we shouldn't do that.
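For illustration, a hypothetical naming helper for the extension-preserving option (the helper name and exact layout here are made up, not what the patch actually does):

// Hypothetical helper: embed the bucket id before the extension so that
// extension-based format detection keeps working.
def bucketedFileName(taskUniqueId: String, bucketId: Int, extension: String): String =
  f"part-r-$taskUniqueId%s-$bucketId%05d$extension%s"

// bucketedFileName("00009-ea518ad4-455a-4431-b471-d24e03814677", 2, ".gz.parquet")
// returns "part-r-00009-ea518ad4-455a-4431-b471-d24e03814677-00002.gz.parquet"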

@SparkQA

SparkQA commented Jan 6, 2016

Test build #48856 has finished for PR 10498 at commit 1afd3ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I'd remove this TODO.

@nongli
Contributor

nongli commented Jan 6, 2016

This looks good to me to merge.

@davies
Contributor

davies commented Jan 6, 2016

@cloud-fan Can we write a bucketed table without partitions?

Just saw you have a test case for that, but I didn't see you update DefaultWriterContainer; how can that work?

Contributor

Minor question: why do we support bucketing for the JSON writer? Bucketing can only be recognized by Spark SQL, and in that case Parquet is much more efficient.

As we embed CSV into Spark, I think we don't need to support bucketing for it either.

Contributor Author

Just because we can... I felt it was cheap to add bucketing support for JSON, so I went for it.
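For reference, a quick usage sketch (table names are illustrative); the same bucketing write path applies regardless of format:

df.write.bucketBy(8, "country").sortBy("amount").saveAsTable("bucketed_default")  // default format

df.write.format("json").bucketBy(8, "country").saveAsTable("bucketed_json")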

@rxin
Contributor

rxin commented Jan 7, 2016

I'm going to merge this. @cloud-fan, can you create a follow-up PR to address some of the comments above?

@asfgit asfgit closed this in 917d3fc Jan 7, 2016
asfgit pushed a commit that referenced this pull request Jan 11, 2016
address comments in #10498, especially #10498 (comment)

Author: Wenchen Fan <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <[email protected]>

Closes #10638 from cloud-fan/bucket-write.
@l15k4

l15k4 commented May 21, 2016

Guys, do you have a rough guess about when bucketing will be implemented for org.apache.spark.sql.DataFrameWriter#save?

@infinitymittal

infinitymittal commented Jan 13, 2017

Hi,

There is the limitation "Can't insert bucketed data into existing Hive tables." Are there any plans to relax it? I want to insert data into an already existing table using a query.

@cloud-fan Do we have a JIRA for this?

@tejasapatil
Contributor

@infinitymittal

infinitymittal commented Jan 13, 2017

@tejasapatil Thanks for the response. SPARK-17729 says "Spark still won't produce bucketed data as per Hive's bucketing guarantees". I want the data to be bucketed when written. Any further leads?

Just to be clear: even with "hive.enforce.bucketing" set to true, the data won't be written. Is that correct? I'm referencing pull request #15300's comment "Added test to ensure that INSERTs fail if strict bucket / sort is enforced".

@tejasapatil
Contributor

@infinitymittal : It will take time to have fully functional support added. I initiated a design proposal to get consensus on how this could be done: https://issues.apache.org/jira/browse/SPARK-19256

In Spark, "hive.enforce.bucketing" is not respected. #15300 won't guarantee that the data written adheres to Hive's bucketing spec so approach taken there is to fail in user sets configs to enforce bucketing. This will avoid wrong data being written when user is expecting correct outputs after setting "hive.enforce.bucketing" to true. The longer term plan is to get rid of these configs and always write properly bucketed data (hive 2.x follows this model).

@FelixKJose

@tejasapatil Is there any update on this, regarding "always write properly bucketed data (Hive 2.x follows this model)"? Does Spark provide this now, or is your PR ready to be merged into master?
