Skip to content

Conversation

@rxin
Copy link
Contributor

@rxin rxin commented Feb 14, 2017

What changes were proposed in this pull request?

This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure.

The hint syntax looks like the following:

SELECT /*+ BROADCAST(t) */ * FROM t

For broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that match the specified name.

The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go past beyond any existing broadcast hints, subquery aliases. This rule happens before common table expressions.

Note that there was an earlier patch in #14426. This is a rewrite of that patch, with different semantics and simpler test cases.

How was this patch tested?

Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite.

*
* @param enabled whether to overwrite existing data in the table.
* @param specificPartition only data in the specified partition will be overwritten.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this?

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72870 has finished for PR 16925 at commit 318bc03.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Hint(name: String, parameters: Seq[String], child: LogicalPlan) extends UnaryNode
  • case class OverwriteOptions(

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72872 has finished for PR 16925 at commit c702e3e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72874 has finished for PR 16925 at commit a095df3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SubstituteHints(conf: CatalystConf) extends Rule[LogicalPlan]

@rxin rxin changed the title [SPARK-16475][SQL] Broadcast Hint for SQL Queries [SPARK-16475][SQL] Broadcast Hint for SQL Queries - WIP Feb 14, 2017
@rxin
Copy link
Contributor Author

rxin commented Feb 14, 2017

Actually I'm going to completely rewrite this. I don't think the current implementation makes sense.

@dongjoon-hyun
Copy link
Member

@rxin . May I update that better or do you wnat to finish here?
Anything is okay for me. Anyway, thank you for everything!

@rxin rxin changed the title [SPARK-16475][SQL] Broadcast Hint for SQL Queries - WIP [SPARK-16475][SQL] Broadcast Hint for SQL Queries Feb 14, 2017
@rxin
Copy link
Contributor Author

rxin commented Feb 14, 2017

cc @dongjoon-hyun, @cloud-fan, @gatorsmile and @hvanhovell This should be ready for review. Note that the semantics is different from the earlier versions.

* Substitute Hints.
* - BROADCAST/BROADCASTJOIN/MAPJOIN match the closest table with the given name parameters.
*
* In the case of broadcast hint, we find the frontier of
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block needs to be updated.

@dongjoon-hyun
Copy link
Member

Except the doc you mentioned, it looks great to me, @rxin !
It's great to have this in Spark SQL finally.

@rxin rxin changed the title [SPARK-16475][SQL] Broadcast Hint for SQL Queries [SPARK-16475][SQL] Broadcast hint for SQL Queries Feb 14, 2017
@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72879 has finished for PR 16925 at commit 617f8a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


BRACKETED_COMMENT
: '/*' .*? '*/' -> channel(HIDDEN)
: '/*' ~[+] .*? '*/' -> channel(HIDDEN)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does ~[+] mean?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means except +.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/*+ is used hint rule in line 377.

val withWindow = withDistinct.optionalMap(windows)(withWindows)

// Hint
withWindow.optionalMap(ctx.hint)(withHints)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we already import ctx._ at the beginning, to be consistent we can just write hint

}

/**
* Removes all the hints. This must be executed after all the other hint rules are executed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we mention that, this can happen when users specify invalid hints and we will ignore it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea let me add that.


hintStatement
: hintName=identifier
| hintName=identifier '(' parameters+=identifier parameters+=identifier ')'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this is to allow SqlBase.g4 accept more general rule for the future.
So, possibly, it's for reducing the change in SqlBase.g4 in order to add new hint rules.

}

if ((plan fastEquals newNode) && recurse) {
newNode.mapChildren(child => applyBroadcastHint(child, toBroadcast))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the case of self-join, we may broadcast both side, is it expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. Both being broadcastable doesn't mean we broadcast both.

}

test("should work for subqueries") {
val relation = UnresolvedRelation(TableIdentifier("table"), Some("tableAlias"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: table("table").as("tableAlias")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do different things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually they do the same thing. there's a pattern match in as. let me change it,

Hint("STREAMTABLE", Seq("a", "b", "c"), table("t").select(star())))

comparePlans(
parsePlan("SELECT /*+ INDEX(t emp_job_ix) */ * FROM t"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if users writing /*+ BROADCASTJOIN(a b) */? shall we treat it as /*+ BROADCASTJOIN(a, b) */?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Both are considered equally now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rxin Looks like we already support space as delimiter?

@cloud-fan
Copy link
Contributor

LGTM, pending jenkins

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72888 has finished for PR 16925 at commit 51a73d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SubstituteBroadcastHints(conf: CatalystConf) extends Rule[LogicalPlan]

@cloud-fan
Copy link
Contributor

merging to master!

@asfgit asfgit closed this in da7aef7 Feb 14, 2017
@rxin
Copy link
Contributor Author

rxin commented Feb 14, 2017

the latest commit hasn't finished running tests yet ... but probably fine given the small change.

@SparkQA
Copy link

SparkQA commented Feb 14, 2017

Test build #72893 has finished for PR 16925 at commit 0d42978.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

A late LGTM. : )

Normally, we do not encourage users to define hints inside a view. Users can add a BROADCAST hint when they define a persistent view. That will be stored in the catalog. The hint will take an effect, as long as this view is not cached.

Users can also use a BROADCAST hint for the operators (e.g., INTERSECT) that will be rewritten to a Join in optimizer.

sql("CREATE VIEW v AS SELECT /*+ BROADCAST(u) */ key FROM u")
sql("SELECT /*+ BROADCAST(t) */ * FROM t INTERSECT SELECT * FROM v")

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?
This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure.

The hint syntax looks like the following:
```
SELECT /*+ BROADCAST(t) */ * FROM t
```

For broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that match the specified name.

The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go past beyond any existing broadcast hints, subquery aliases. This rule happens before common table expressions.

Note that there was an earlier patch in apache#14426. This is a rewrite of that patch, with different semantics and simpler test cases.

## How was this patch tested?
Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite.

Author: Reynold Xin <[email protected]>
Author: Dongjoon Hyun <[email protected]>

Closes apache#16925 from rxin/SPARK-16475-broadcast-hint.

hintStatement
: hintName=identifier
| hintName=identifier '(' parameters+=identifier parameters+=identifier ')'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule doesn't support like SELECT /*+ INDEX(a b c) */ * FROM t. To support it, it should be:

| hintName=identifier '(' parameters+=identifier parameters+=identifier* ')'

Is it intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it support?

INDEX(a, b, c)

?

I think ti's a bit weird to support space as the delimiter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we support INDEX(a b)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we? we should disallow it. want to submit a pr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw this was written by @dongjoon-hyun. I didn't change it :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. let me prepare a tiny one.

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?
A small update to apache#16925

1. Rename SubstituteHints -> ResolveHints to be more consistent with rest of the rules.
2. Added more documentation in the rule and be more defensive / future proof to skip views as well as CTEs.

## How was this patch tested?
This pull request contains no real logic change and all behavior should be covered by existing tests.

Author: Reynold Xin <[email protected]>

Closes apache#16939 from rxin/SPARK-16475.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 16, 2017
## What changes were proposed in this pull request?
A small update to apache#16925

1. Rename SubstituteHints -> ResolveHints to be more consistent with rest of the rules.
2. Added more documentation in the rule and be more defensive / future proof to skip views as well as CTEs.

## How was this patch tested?
This pull request contains no real logic change and all behavior should be covered by existing tests.

Author: Reynold Xin <[email protected]>

Closes apache#16939 from rxin/SPARK-16475.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants