[SPARK-15888] [SQL] fix Python UDF with aggregate #13682

davies · 2016-06-15T07:01:25Z

What changes were proposed in this pull request?

After the ExtractPythonUDF rule was moved into the physical plan, Python UDFs no longer work on top of an aggregate: such UDFs cannot be evaluated before the aggregate and must be evaluated after it. This PR adds another rule that extracts these Python UDFs from the logical Aggregate and creates a Project on top of the Aggregate.
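To make the rewrite concrete, here is a minimal toy model of it in plain Python. It is not the Catalyst rule from this patch: the classes `Attr`, `AggCall`, `PyUDF`, `Aggregate`, `Project` and the function `extract_udf_from_aggregate` are hypothetical stand-ins, and the sketch assumes every argument of such a UDF is either an aggregate call or a grouping expression.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


# Toy expression nodes (hypothetical stand-ins, not Catalyst classes).
@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class AggCall:
    func: str
    child: "Expr"


@dataclass(frozen=True)
class PyUDF:
    name: str
    args: Tuple["Expr", ...]


Expr = Union[Attr, AggCall, PyUDF]


# Toy logical plan nodes.
@dataclass
class Aggregate:
    grouping: List[Expr]
    outputs: List[Tuple[str, Expr]]  # (alias, expression)


@dataclass
class Project:
    outputs: List[Tuple[str, Expr]]
    child: Aggregate


def computed_by_aggregate(e, agg):
    # An expression the Aggregate itself produces: an aggregate call
    # or one of its grouping expressions.
    return isinstance(e, AggCall) or e in agg.grouping


def is_udf_over_aggregate(e, agg):
    return isinstance(e, PyUDF) and any(
        computed_by_aggregate(a, agg) for a in e.args
    )


def extract_udf_from_aggregate(agg):
    """Move UDFs that consume aggregate results into a Project on top."""
    if not any(is_udf_over_aggregate(e, agg) for _, e in agg.outputs):
        return agg  # nothing to rewrite
    agg_outputs, proj_outputs = [], []
    for alias, e in agg.outputs:
        if is_udf_over_aggregate(e, agg):
            new_args = []
            for arg in e.args:
                # Assumption: each argument can be computed by the Aggregate.
                name = f"agg{len(agg_outputs)}"
                agg_outputs.append((name, arg))
                new_args.append(Attr(name))
            proj_outputs.append((alias, PyUDF(e.name, tuple(new_args))))
        else:
            # Other outputs stay in the Aggregate and are passed through.
            agg_outputs.append((alias, e))
            proj_outputs.append((alias, Attr(alias)))
    return Project(proj_outputs, Aggregate(agg.grouping, agg_outputs))


# A UDF f(key, sum(value)) sitting directly in the Aggregate's output list.
plan = Aggregate(
    grouping=[Attr("key")],
    outputs=[("t", PyUDF("f", (Attr("key"), AggCall("sum", Attr("value")))))],
)
rewritten = extract_udf_from_aggregate(plan)
```

After the rewrite, the Aggregate only produces plain aliased columns, and the UDF lives in the Project above it, reading those columns.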

How was this patch tested?

Added regression tests. The plan of the added test query looks like this:

== Parsed Logical Plan ==
'Project [<lambda>('k, 's) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
   +- LogicalRDD [key#5L, value#6]

== Analyzed Logical Plan ==
t: int
Project [<lambda>(k#17, s#22L) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
   +- LogicalRDD [key#5L, value#6]

== Optimized Logical Plan ==
Project [<lambda>(agg#29, agg#30L) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L]
   +- LogicalRDD [key#5L, value#6]

== Physical Plan ==
*Project [pythonUDF0#37 AS t#26]
+- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37]
   +- *HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L])
      +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200)
         +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L])
            +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35]
               +- Scan ExistingRDD[key#5L,value#6]
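Read bottom-up, the physical plan evaluates two UDFs below the aggregate, aggregates, and then evaluates the third UDF on the aggregate's output. The plain-Python sketch below mimics only that evaluation order. The three lambda bodies and the sample rows are made up for illustration, since the bodies of the UDFs in the regression test are not shown in this thread.

```python
from collections import defaultdict

# Hypothetical stand-ins for the three <lambda> UDFs in the plan.
key_udf = lambda k: k % 2        # <lambda>(key)
value_udf = lambda v: len(v)     # <lambda>(value)
top_udf = lambda k, s: k + s     # <lambda>(agg#29, agg#30L)

# Made-up (key, value) rows standing in for Scan ExistingRDD.
rows = [(1, "a"), (2, "bb"), (3, "ccc"), (4, "dddd")]

# Lower BatchEvalPython: evaluate the UDFs needed by the aggregate.
evaluated = [(key_udf(k), value_udf(v)) for k, v in rows]

# HashAggregate: group by the evaluated key and sum the evaluated value.
sums = defaultdict(int)
for k, v in evaluated:
    sums[k] += v

# Upper BatchEvalPython + Project: the UDF on top of the aggregate
# can only run here, once the grouped sums exist.
result = sorted(top_udf(k, s) for k, s in sums.items())
```

The last step is the one this PR makes possible: it needs the aggregate's output as input, so it cannot be folded into the lower evaluation stage.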

davies · 2016-06-15T07:03:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala

remove the ! for BatchEvalPythonExec

SparkQA · 2016-06-15T07:03:50Z

Test build #60561 has finished for PR 13682 at commit 4d5f075.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-15T08:40:46Z

Test build #60562 has finished for PR 13682 at commit 31b42eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-06-15T16:56:45Z

cc @gatorsmile @JoshRosen

gatorsmile · 2016-06-15T20:35:10Z

LGTM. Thank you for your fix!

davies · 2016-06-15T20:37:59Z

Merging this into master and 2.0, thanks!

## What changes were proposed in this pull request?

After the ExtractPythonUDF rule was moved into the physical plan, Python UDFs no longer work on top of an aggregate: such UDFs cannot be evaluated before the aggregate and must be evaluated after it. This PR adds another rule that extracts these Python UDFs from the logical Aggregate and creates a Project on top of the Aggregate.

## How was this patch tested?

Added regression tests. The plan of the added test query looks like this:

```
== Parsed Logical Plan ==
'Project [<lambda>('k, 's) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
   +- LogicalRDD [key#5L, value#6]

== Analyzed Logical Plan ==
t: int
Project [<lambda>(k#17, s#22L) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
   +- LogicalRDD [key#5L, value#6]

== Optimized Logical Plan ==
Project [<lambda>(agg#29, agg#30L) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L]
   +- LogicalRDD [key#5L, value#6]

== Physical Plan ==
*Project [pythonUDF0#37 AS t#26]
+- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37]
   +- *HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L])
      +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200)
         +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L])
            +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35]
               +- Scan ExistingRDD[key#5L,value#6]
```

Author: Davies Liu <[email protected]>

Closes #13682 from davies/fix_py_udf.

(cherry picked from commit 5389013)
Signed-off-by: Davies Liu <[email protected]>

davies force-pushed the fix_py_udf branch from 4d5f075 to bad0a2a Compare June 15, 2016 07:02

davies reviewed Jun 15, 2016

fix Python UDF with aggregate

31b42eb

davies force-pushed the fix_py_udf branch from bad0a2a to 31b42eb Compare June 15, 2016 07:04

asfgit closed this in 5389013 Jun 15, 2016
