[SPARK-12539][SQL] support writing bucketed table #10498
Conversation
Test build #48367 has finished for PR 10498 at commit
This one also includes #10435, right?
This is safe, but how can we get into this state from a single write? There must have been a partitionBy before, right?
Why not? Is this hard to support?
BTW on GitHub you can use square brackets to create a checklist, e.g. `- [ ] item` becomes a checkbox.
This one also includes #10435, we can merge that first.
Test build #48415 has finished for PR 10498 at commit
Add the @Since annotations and comments. These are public APIs.
This method can be simplified to:

```scala
if (sortingColumns.isDefined) {
  require(numBuckets.isDefined, "sortBy must be used together with bucketBy")
}
for {
  n <- numBuckets
  cols <- normalizedBucketCols
} yield {
  require(n > 0, "Bucket number must be greater than 0.")
  BucketSpec(n, cols, normalizedSortCols)
}
```

(`require` throws `IllegalArgumentException` when the condition is not met.)
Test build #48489 has finished for PR 10498 at commit
I'm a little worried here. It was a simple != operator for the non-bucket path before, but now it's a function call.
I think this is fine. The rest of this path is much more expensive than this function call.
Test build #48854 has finished for PR 10498 at commit
Shall we put the bucket id at the very end, i.e. after the file extension, so that it's much easier to get the bucket id given a file name? e.g. part-r-00009-ea518ad4-455a-4431-b471-d24e03814677.gz.parquet.00002
cc @nongli
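A quick sketch of why that placement would make extraction trivial (hypothetical helper, assuming the suffix scheme proposed above):

```scala
// Hypothetical: read the bucket id back out of a file name under the
// proposed scheme, where the id is appended after the file extension.
def bucketIdFromName(fileName: String): Int =
  fileName.substring(fileName.lastIndexOf('.') + 1).toInt

// bucketIdFromName("part-r-00009-ea518ad4-455a-4431-b471-d24e03814677.gz.parquet.00002") == 2
```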
I don't have a strong opinion here. Let's go either way for now and take another pass before shipping this. We should try this with Hive as well, just to get another data point.
Having the bucket id at the very end could break other applications that rely on the file extension (to recognize the file format), so don't do that.
Test build #48856 has finished for PR 10498 at commit
I'd remove this TODO.
This looks good to me to merge.
@cloud-fan Can we write a bucketed table without partitions? I just saw you have a test case for that, but didn't see you update the DefaultWriterContainer; how can that work?
Minor question: why do we support bucketing for the JSON writer? Bucketing can only be recognized by Spark SQL, and in that case Parquet is much more efficient.
As we embed CSV into Spark, I think we don't need to support bucketing for that either.
just because we can... I felt it's cheap to add bucketing support for JSON so I went for it.
I'm going to merge this. @cloud-fan can you create a follow-up PR to address some of the comments above?
address comments in #10498, especially #10498 (comment)

Author: Wenchen Fan <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <[email protected]>

Closes #10638 from cloud-fan/bucket-write.
Guys do you have a rough guess about when bucketing is to be implemented for
Hi, there is the limitation of "Can't insert bucketed data into existing hive tables." Do we have any plans to relax this? I want to insert data using a query into an already existing table. @cloud-fan Do we have a JIRA for this?
@tejasapatil Thanks for the response. SPARK-17729 says "Spark still won't produce bucketed data as per Hive's bucketing guarantees". I want the data to be bucketed when written. Any further leads? Just to be clear, even with "hive.enforce.bucketing" set to true, the data won't be written. Is that correct? Referencing pull request 15300's comments: "Added test to ensure that INSERTs fail if strict bucket / sort is enforced".
@infinitymittal: It will take time to have fully functional support added. I had initiated a design proposal to get consensus on how this could be done: https://issues.apache.org/jira/browse/SPARK-19256 In Spark, "hive.enforce.bucketing" is not respected. #15300 won't guarantee that the data written adheres to Hive's bucketing spec, so the approach taken there is to fail if the user sets configs to enforce bucketing. This avoids wrong data being written when the user expects correct outputs after setting "hive.enforce.bucketing" to true. The longer-term plan is to get rid of these configs and always write properly bucketed data (Hive 2.x follows this model).
@tejasapatil Is there any update on this regarding "always write properly bucketed data (hive 2.x follows this model)"? Does Spark provide this, or is your PR ready to be merged into master?
This PR adds bucket write support to Spark SQL. Users can specify bucketing columns, numBuckets and sorting columns, with or without partition columns. For example:
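A minimal sketch of the API described here, assuming a DataFrame `df`; the column and table names are illustrative:

```scala
// Write df partitioned by "date", bucketed into 8 buckets by "user_id",
// and sorted within each bucket by "timestamp" (all names illustrative).
df.write
  .format("parquet")
  .partitionBy("date")
  .bucketBy(8, "user_id")
  .sortBy("timestamp")
  .saveAsTable("bucketed_table")
```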
When bucketing is used, we will calculate the bucket id for each record and group the records by bucket id. For each group, we will create a file with the bucket id in its name and write data into it. For each bucket file, if sorting columns are specified, the data will be sorted before writing.
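Conceptually, the bucket id is a non-negative hash of the bucketing columns modulo numBuckets. A sketch with a hypothetical helper (Spark's actual hash function differs from Hive's, as noted below):

```scala
// Hypothetical helper: map a hash of the bucketing columns to a bucket id
// in [0, numBuckets). The ((x % n) + n) % n form keeps the result
// non-negative even when the hash value is negative.
def bucketId(hashOfBucketColumns: Int, numBuckets: Int): Int =
  ((hashOfBucketColumns % numBuckets) + numBuckets) % numBuckets
```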
Note that there may be multiple files for one bucket, as the data is distributed.
Currently we store the bucket metadata in the Hive metastore in a non-Hive-compatible way. We use a different bucketing hash function than Hive, so we can't be compatible anyway.
Limitations:
- Can't insert bucketed data into existing hive tables.