
Conversation

@MickDavies
Contributor

For example large IN clauses

Large IN clauses are parsed very slowly. For example, the SQL below (10K items in the IN list) takes 45-50s.

s"""SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"""

This is principally due to TreeNode, which repeatedly calls contains on children, where children in this case is a List that is 10K long. In effect, parsing for large IN clauses is O(N²).
A lazily initialised Set built from children and used for these contains checks reduces the parse time to around 2.5s.
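
A minimal sketch of the idea, assuming a simplified TreeNode (the names mirror Catalyst's TreeNode, but the body is illustrative, not the Spark source):

```scala
abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
  /** Returns a Seq of the children of this node */
  def children: Seq[BaseType]

  // Before: each membership test scans the whole children Seq, so checking
  // every child during a transform is O(N) per test and O(N^2) overall:
  //   if (children contains arg) ...

  // After: a lazily initialised Set gives O(1) membership tests; it is built
  // once, on first use, so nodes with small child lists pay essentially nothing extra.
  lazy val childrenSet: Set[TreeNode[_]] = children.toSet
}
```

The win comes from the 10K-item IN list becoming a single node with roughly 10K children, as described above, so replacing repeated Seq scans with one cached Set lookup removes the quadratic term.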

@MickDavies
Contributor Author

I'm not sure this works, as children is a def, so there is no guarantee that the sequence of children is immutable. I need to think about it a bit more.
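
To illustrate the worry with a deliberately artificial, mutable node (hypothetical, not a Catalyst class): because a lazy val is computed only once, the cached Set would silently go stale if children ever returned different elements on a later call.

```scala
// Hypothetical mutable node, purely to show why the lazy val needs children to be stable.
class MutableNode(var kids: Seq[MutableNode]) {
  def children: Seq[MutableNode] = kids                    // a def, re-evaluated on every call
  lazy val childrenSet: Set[MutableNode] = children.toSet  // computed once, then frozen
}

val a = new MutableNode(Nil)
val b = new MutableNode(Nil)
val n = new MutableNode(Seq(a))
n.childrenSet.contains(a)  // true: the Set snapshots the children seen at first access
n.kids = Seq(b)            // children now report something different...
n.childrenSet.contains(b)  // ...but the cached Set still says false
```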

@MickDavies MickDavies closed this Jun 5, 2015
@MickDavies MickDavies reopened this Jun 7, 2015
@MickDavies
Contributor Author

Is it safe to generate lazy val childrenSet from children, given the lack of any guarantee that children produces an unchanging sequence? It looks like the intention is that children will not change?

@marmbrus
Contributor

ok to test

Contributor


Space after :.

Contributor


how about naming it containsChild so that we can use it like containsChild(arg)?

Contributor Author


Good idea.

Thanks

Mick

On 12 Jun 2015, at 05:34, Wenchen Fan [email protected] wrote:

In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala #6673 (comment):

@@ -62,6 +62,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
   /** Returns a Seq of the children of this node */
   def children: Seq[BaseType]

+  lazy val childrenSet:Set[TreeNode[_]] = children.toSet

how about naming it containsChild so that we can use it like containsChild(arg)?



@marmbrus
Contributor

I think it's reasonable to assume children will not change, as TreeNodes are generally expected to be immutable. I'd add this requirement to the method's Scaladoc though.
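
Putting the two review suggestions together, the change plausibly ends up looking something like this (a sketch reconstructed from the thread, not copied from the Spark source):

```scala
abstract class TreeNode[BaseType <: TreeNode[BaseType]] {
  /**
   * Returns a Seq of the children of this node.
   * Children should not change; TreeNodes are expected to be immutable,
   * and containsChild below caches them on that assumption.
   */
  def children: Seq[BaseType]

  /** Lazily built set of children, so membership checks are O(1) instead of O(N). */
  lazy val containsChild: Set[TreeNode[_]] = children.toSet
}
```

Call sites can then write containsChild(arg) in place of children.contains(arg), since a Scala Set is itself a function from element to Boolean.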

@SparkQA

SparkQA commented Jun 12, 2015

Test build #34737 has finished for PR 6673 at commit e6be8be.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 12, 2015

Test build #34762 has finished for PR 6673 at commit 38cd425.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Thanks! Merging to master.

@asfgit asfgit closed this in 0c1b2df Jun 17, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
SPARK-8077: Optimization for TreeNodes with large numbers of children


Author: Michael Davies <[email protected]>

Closes apache#6673 from MickDavies/SPARK-8077 and squashes the following commits:

38cd425 [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
d80103b [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
e6be8be [Michael Davies] SPARK-8077: Optimization for TreeNodes with large numbers of children
