-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-16475][SQL] Broadcast hint for SQL Queries #16925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[SPARK-16475][SQL] Broadcast Hint for SQL Queries
| * | ||
| * @param enabled whether to overwrite existing data in the table. | ||
| * @param specificPartition only data in the specified partition will be overwritten. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this?
|
Test build #72870 has finished for PR 16925 at commit
|
|
Test build #72872 has finished for PR 16925 at commit
|
|
Test build #72874 has finished for PR 16925 at commit
|
|
Actually I'm going to completely rewrite this. I don't think the current implementation makes sense. |
|
@rxin . May I update that better or do you wnat to finish here? |
|
cc @dongjoon-hyun, @cloud-fan, @gatorsmile and @hvanhovell This should be ready for review. Note that the semantics is different from the earlier versions. |
| * Substitute Hints. | ||
| * - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters. | ||
| * | ||
| * In the case of broadcast hint, we find the frontier of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this block needs to be updated.
|
Except the doc you mentioned, it looks great to me, @rxin ! |
|
Test build #72879 has finished for PR 16925 at commit
|
|
|
||
| BRACKETED_COMMENT | ||
| : '/*' .*? '*/' -> channel(HIDDEN) | ||
| : '/*' ~[+] .*? '*/' -> channel(HIDDEN) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does ~[+] mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means except +.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/*+ is used hint rule in line 377.
| val withWindow = withDistinct.optionalMap(windows)(withWindows) | ||
|
|
||
| // Hint | ||
| withWindow.optionalMap(ctx.hint)(withHints) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we already import ctx._ at the beginning, to be consistent we can just write hint
| } | ||
|
|
||
| /** | ||
| * Removes all the hints. This must be executed after all the other hint rules are executed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we mention that, this can happen when users specify invalid hints and we will ignore it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea let me add that.
|
|
||
| hintStatement | ||
| : hintName=identifier | ||
| | hintName=identifier '(' parameters+=identifier parameters+=identifier ')' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this is to allow SqlBase.g4 accept more general rule for the future.
So, possibly, it's for reducing the change in SqlBase.g4 in order to add new hint rules.
| } | ||
|
|
||
| if ((plan fastEquals newNode) && recurse) { | ||
| newNode.mapChildren(child => applyBroadcastHint(child, toBroadcast)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the case of self-join, we may broadcast both side, is it expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine. Both being broadcastable doesn't mean we broadcast both.
| } | ||
|
|
||
| test("should work for subqueries") { | ||
| val relation = UnresolvedRelation(TableIdentifier("table"), Some("tableAlias")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: table("table").as("tableAlias")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They do different things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually they do the same thing. there's a pattern match in as. let me change it,
| Hint("STREAMTABLE", Seq("a", "b", "c"), table("t").select(star()))) | ||
|
|
||
| comparePlans( | ||
| parsePlan("SELECT /*+ INDEX(t emp_job_ix) */ * FROM t"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if users writing /*+ BROADCASTJOIN(a b) */? shall we treat it as /*+ BROADCASTJOIN(a, b) */?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Both are considered equally now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin Looks like we already support space as delimiter?
|
LGTM, pending jenkins |
|
Test build #72888 has finished for PR 16925 at commit
|
|
merging to master! |
|
the latest commit hasn't finished running tests yet ... but probably fine given the small change. |
|
Test build #72893 has finished for PR 16925 at commit
|
|
A late LGTM. : ) Normally, we do not encourage users to define hints inside a view. Users can add a BROADCAST hint when they define a persistent view. That will be stored in the catalog. The hint will take an effect, as long as this view is not cached. Users can also use a BROADCAST hint for the operators (e.g., INTERSECT) that will be rewritten to a Join in optimizer. sql("CREATE VIEW v AS SELECT /*+ BROADCAST(u) */ key FROM u")
sql("SELECT /*+ BROADCAST(t) */ * FROM t INTERSECT SELECT * FROM v") |
## What changes were proposed in this pull request? This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure. The hint syntax looks like the following: ``` SELECT /*+ BROADCAST(t) */ * FROM t ``` For broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that match the specified name. The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go past beyond any existing broadcast hints, subquery aliases. This rule happens before common table expressions. Note that there was an earlier patch in apache#14426. This is a rewrite of that patch, with different semantics and simpler test cases. ## How was this patch tested? Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite. Author: Reynold Xin <[email protected]> Author: Dongjoon Hyun <[email protected]> Closes apache#16925 from rxin/SPARK-16475-broadcast-hint.
|
|
||
| hintStatement | ||
| : hintName=identifier | ||
| | hintName=identifier '(' parameters+=identifier parameters+=identifier ')' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rule doesn't support like SELECT /*+ INDEX(a b c) */ * FROM t. To support it, it should be:
| hintName=identifier '(' parameters+=identifier parameters+=identifier* ')'
Is it intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it support?
INDEX(a, b, c)
?
I think ti's a bit weird to support space as the delimiter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we support INDEX(a b)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we? we should disallow it. want to submit a pr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw this was written by @dongjoon-hyun. I didn't change it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. let me prepare a tiny one.
## What changes were proposed in this pull request? A small update to apache#16925 1. Rename SubstituteHints -> ResolveHints to be more consistent with rest of the rules. 2. Added more documentation in the rule and be more defensive / future proof to skip views as well as CTEs. ## How was this patch tested? This pull request contains no real logic change and all behavior should be covered by existing tests. Author: Reynold Xin <[email protected]> Closes apache#16939 from rxin/SPARK-16475.
## What changes were proposed in this pull request? A small update to apache#16925 1. Rename SubstituteHints -> ResolveHints to be more consistent with rest of the rules. 2. Added more documentation in the rule and be more defensive / future proof to skip views as well as CTEs. ## How was this patch tested? This pull request contains no real logic change and all behavior should be covered by existing tests. Author: Reynold Xin <[email protected]> Closes apache#16939 from rxin/SPARK-16475.
What changes were proposed in this pull request?
This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure.
The hint syntax looks like the following:
For broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that match the specified name.
The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go past beyond any existing broadcast hints, subquery aliases. This rule happens before common table expressions.
Note that there was an earlier patch in #14426. This is a rewrite of that patch, with different semantics and simpler test cases.
How was this patch tested?
Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite.