[SPARK-8077][SQL] Optimization for TreeNodes with large numbers of children #6673
For example large IN clauses
I'm not sure this works, as children is a def, and there is therefore no guarantee that the set of children is immutable. I need to think about it a bit more.
Regarding generating the lazy val childrenSet from children, given the lack of any guarantee that children produces an unchanging sequence: it looks like the intention is that children will not change?
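To make the concern above concrete, here is a minimal sketch of how a lazy val derived from a def can go stale. MutableNode, add, and StaleCacheSketch are hypothetical names invented for this illustration; they are not part of Catalyst, where children is expected not to change.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical node whose children can change after construction.
// Catalyst's TreeNode is not written this way; this only shows the hazard.
class MutableNode {
  private val buffer = ArrayBuffer[String]("a", "b")

  // A def is re-evaluated on every call, so it always reflects the buffer.
  def children: Seq[String] = buffer.toList

  // A lazy val is evaluated once, on first access, and then cached.
  lazy val childrenSet: Set[String] = children.toSet

  def add(child: String): Unit = buffer += child
}

object StaleCacheSketch {
  /** Returns (children sees "c", childrenSet sees "c"). */
  def run(): (Boolean, Boolean) = {
    val node = new MutableNode
    val sizeBefore = node.childrenSet.size // forces the lazy val: Set("a", "b")
    node.add("c")
    assert(sizeBefore == 2)
    (node.children.contains("c"), node.childrenSet.contains("c"))
  }
}
```

Under these assumptions run() returns (true, false): the cached set no longer agrees with children. The cache is only safe if children never changes for a given node.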
ok to test
Space after :.
how about naming it containsChild so that we can use it like containsChild(arg)?
Good idea.
Thanks
Mick
On 12 Jun 2015, at 05:34, Wenchen Fan wrote:
In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala #6673 (comment):
@@ -62,6 +62,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
/** Returns a Seq of the children of this node */
def children: Seq[BaseType]
+  lazy val childrenSet:Set[TreeNode[_]] = children.toSet
I think it's reasonable to assume
Test build #34737 has finished for PR 6673 at commit
Correct style and rename attribute
Add comment about required immutability of children
Test build #34762 has finished for PR 6673 at commit
Thanks! Merging to master.
…hildren
For example large IN clauses
Large IN clauses are parsed very slowly. For example, the SQL below (10K items in the IN list) takes 45-50s to parse.
s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"""
This is principally due to TreeNode, which repeatedly calls contains on children, where children in this case is a List that is 10K long. Each contains call is a linear scan, so in effect parsing for large IN clauses is O(N squared).
A lazily initialised Set based on children, used for contains, reduces parse time to around 2.5s.
Author: Michael Davies
Closes apache#6673 from MickDavies/SPARK-8077 and squashes the following commits:
38cd425 [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
d80103b [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
e6be8be [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
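The change described in the commit message can be sketched roughly as follows. This is a simplified illustration, not the actual TreeNode source: Node, Leaf, In, and ContainsChildSketch are hypothetical stand-ins, and only the name containsChild is taken from the review discussion. It assumes children never changes once a node is constructed.

```scala
// Simplified stand-in for a tree node (hypothetical; not Spark's TreeNode).
abstract class Node {
  /** Children of this node; assumed never to change after construction. */
  def children: Seq[Node]

  /**
   * Built once, on first use. A Set is also a function Node => Boolean,
   * so it can be called as containsChild(arg). Lookups are expected
   * constant time instead of a linear scan over children.
   */
  lazy val containsChild: Set[Node] = children.toSet
}

case class Leaf(name: String) extends Node {
  def children: Seq[Node] = Nil
}

// Models an expression such as: value IN (list...)
case class In(value: Node, list: Seq[Node]) extends Node {
  def children: Seq[Node] = value +: list
}

object ContainsChildSketch {
  /** Builds an IN node with n literal items named n1..n<n>. */
  def buildIn(n: Int): In =
    In(Leaf("ForeName"), (1 to n).map(i => Leaf("n" + i)).toList)

  /** Counts how many of the candidates are children of the given node. */
  def countChildren(node: Node, candidates: Seq[Node]): Int =
    candidates.count(node.containsChild)
}
```

With a List of N children, N calls to children.contains cost on the order of N squared comparisons, whereas N lookups in the cached set cost on the order of N hash probes after a one-time build. For example, ContainsChildSketch.countChildren(in, in.list) with in = ContainsChildSketch.buildIn(10000) performs 10,000 set lookups rather than 10,000 list scans.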