-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-7097][SQL]: Partitioned tables should only consider referred partitions in query during size estimation for checking against autoBroadcastJoinThreshold #5668
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #30860 has finished for PR 5668 at commit
|
|
retest please |
|
Test build #30868 has finished for PR 5668 at commit
|
|
Test build #30919 has finished for PR 5668 at commit
|
|
Test build #30981 has started for PR 5668 at commit |
|
Test build #31139 has finished for PR 5668 at commit
|
|
Test build #32496 has finished for PR 5668 at commit
|
|
Test build #32592 has finished for PR 5668 at commit
|
…ns in query during size estimation for checking against autoBroadcastJoinThreshold
…ectly updated when the query is being run. This is because totalsize of partitions gets updated both when alter table is called as well as when insert into overwrite partition is called.
… because of hive test failures)
|
Test build #36258 has finished for PR 5668 at commit
|
|
Test build #36276 has finished for PR 5668 at commit
|
|
Looks like an old patch. @yhuai would you mind taking a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Test build #44983 has finished for PR 5668 at commit
|
This PR attempts to add support for better size estimation in case of partitioned tables so that only the referred partition's size are taken into consideration when testing against autoBroadCastJoinThreshold and deciding whether to create a broadcast join or shuffle hash join.
We can use the values that get stored in the hive metastore during alter table / insert into partition commands to estimate the size of each of the referred partitions.
In most cases, since both alter table query and 'insert into table partition <part=val> select * from .....' store the partition size in the metastore automatically, we expect to get the correct value of partition size. We could use Analyze table query as well in case there is some mismatch.