
Conversation

@MickDavies
Contributor

For example large IN clauses

Large IN clauses are parsed very slowly. For example, the SQL below (10K items in the IN list) takes 45-50s.

s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"""

This is principally due to TreeNode, which repeatedly calls contains on children, where children in this case is a List that is 10K long. In effect, parsing for large IN clauses is O(N²).
A lazily initialised Set built from children and used for these contains checks reduces the parse time to around 2.5s.
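
A minimal sketch of the idea, assuming a simplified TreeNode (the names mirror Catalyst's TreeNode, but the body is illustrative, not the Spark source):

```scala
abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
  /** Returns a Seq of the children of this node */
  def children: Seq[BaseType]

  // Before: each membership test scans the whole children Seq, so checking
  // every child during a transform is O(N) per test and O(N^2) overall:
  //   if (children contains arg) ...

  // After: a lazily initialised Set gives O(1) membership tests; it is built
  // once, on first use, so nodes with small child lists pay essentially nothing extra.
  lazy val childrenSet: Set[TreeNode[_]] = children.toSet
}
```

The win comes from the 10K-item IN list becoming a single node with roughly 10K children, as described above, so replacing repeated Seq scans with one cached Set lookup removes the quadratic term.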

@MickDavies
Contributor Author

I'm not sure this works, as children is a def, so there is no guarantee that the sequence of children is immutable. I need to think about it a bit more.
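
To illustrate the worry with a deliberately artificial, mutable node (hypothetical, not a Catalyst class): because a lazy val is computed only once, the cached Set would silently go stale if children ever returned different elements on a later call.

```scala
// Hypothetical mutable node, purely to show why the lazy val needs children to be stable.
class MutableNode(var kids: Seq[MutableNode]) {
  def children: Seq[MutableNode] = kids                    // a def, re-evaluated on every call
  lazy val childrenSet: Set[MutableNode] = children.toSet  // computed once, then frozen
}

val a = new MutableNode(Nil)
val b = new MutableNode(Nil)
val n = new MutableNode(Seq(a))
n.childrenSet.contains(a)  // true: the Set snapshots the children seen at first access
n.kids = Seq(b)            // children now report something different...
n.childrenSet.contains(b)  // ...but the cached Set still says false
```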

@MickDavies MickDavies closed this Jun 5, 2015
@MickDavies MickDavies reopened this Jun 7, 2015
@MickDavies
Contributor Author

Is it safe to generate lazy val childrenSet from children, given the lack of any guarantee that children produces an unchanging sequence? It looks like the intention is that children will not change?

@marmbrus
Contributor

ok to test

Contributor


Space after :.

Contributor


how about naming it containsChild so that we can use it like containsChild(arg)?

Contributor Author


Good idea.

Thanks

Mick

On 12 Jun 2015, at 05:34, Wenchen Fan [email protected] wrote:

In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala #6673 (comment):

@@ -62,6 +62,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
   /** Returns a Seq of the children of this node */
   def children: Seq[BaseType]

+  lazy val childrenSet:Set[TreeNode[_]] = children.toSet

how about naming it containsChild so that we can use it like containsChild(arg)?



@marmbrus
Contributor

I think it's reasonable to assume children will not change, as TreeNodes are generally expected to be immutable. I'd add this requirement to the method's Scaladoc though.
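
Putting the two review suggestions together, the change plausibly ends up looking something like this (a sketch reconstructed from the thread, not copied from the Spark source):

```scala
abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
  /**
   * Returns a Seq of the children of this node.
   * Children should not change; TreeNodes are expected to be immutable,
   * and containsChild below caches them on that assumption.
   */
  def children: Seq[BaseType]

  /** Lazily built set of children, so membership checks are O(1) instead of O(N). */
  lazy val containsChild: Set[TreeNode[_]] = children.toSet
}
```

Call sites can then write containsChild(arg) in place of children.contains(arg), since a Scala Set is itself a function from element to Boolean.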

@SparkQA

SparkQA commented Jun 12, 2015

Test build #34737 has finished for PR 6673 at commit e6be8be.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 12, 2015

Test build #34762 has finished for PR 6673 at commit 38cd425.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Thanks! Merging to master.

@asfgit asfgit closed this in 0c1b2df Jun 17, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
SPARK-8077: Optimization for TreeNodes with large numbers of children


Author: Michael Davies <[email protected]>

Closes apache#6673 from MickDavies/SPARK-8077 and squashes the following commits:

38cd425 [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
d80103b [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
e6be8be [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
