[SPARK-24721][SQL] Extract Python UDFs at the end of optimizer #22104
Conversation
Test build #94746 has finished for PR 22104 at commit
512f4b6 to 86d3b0c (force-pushed)
Test build #94747 has finished for PR 22104 at commit

retest please
python/pyspark/sql/tests.py
Outdated
Add another test case for arrow-based pandas udf?
Makes sense. Will add.
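For reference, a minimal sketch of the kind of arrow-based pandas UDF test case being suggested (this is not the test that was actually added; the path and column names are assumptions, and it assumes an active SparkSession `spark` plus pyarrow installed):

```python
from pyspark.sql.functions import pandas_udf

# Scalar (arrow-based) pandas UDF: operates on a pandas.Series per batch.
@pandas_udf('long')
def plus_one(v):
    return v + 1

# Read through a file source so the UDF ends up over a scanned relation,
# which is the code path this PR is exercising.
df = spark.read.parquet('/tmp/example_data')       # hypothetical path
result = df.withColumn('x', plus_one(df['id']))    # hypothetical column name
result.collect()
```

A similar test can apply the UDF in a filter, since Python UDF filters over a file source are where the original issue surfaced.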
@icexelloss Do we face the same issue for DataSourceStrategy?

@gatorsmile Possibly, let me see if I can create a test case.

@gatorsmile Can you advise how to create a DataFrame with a data source? All my attempts end up triggering FileSourceStrategy, not DataSourceStrategy.
I think another way to fix this is to move the logic to

@icexelloss we can implement a dummy data source v1/v2 on the Scala side and scan them in a PySpark test with
There's an example, #21007, that implements something in Scala and uses it in a Python-side test, and there are some fake source examples in

Thanks @HyukjinKwon and @cloud-fan! I will take a look.
Test build #94820 has finished for PR 22104 at commit
This is probably another bug I found while testing this: if the input node to EvalPythonExec doesn't produce UnsafeRow, the cast here will fail.
I found this in testing: when I pass in a test data source scan node, which produces GenericInternalRow, an exception is thrown here.
I am happy to submit this as a separate patch if people think it's necessary.
Ok.. This seems to break existing tests. Need to look into it.
I got rid of the logic previously in FileSourceStrategy that excluded PythonUDF from the filters, in favor of this fix - I think this fix is cleaner.
python/pyspark/sql/tests.py
Outdated
@gatorsmile Added tests for file source, data source and data source v2. I might need to move the pandas_udf tests into another test suite because of the Arrow requirement :(
Created separate tests for pandas_udf under ScalarPandasUDFTests
This requires some discussion.
This is probably another bug I found while testing this: if the input node to EvalPythonExec doesn't produce UnsafeRow, the cast here will fail. I don't know whether we require data sources to produce unsafe rows; if not, then this is a problem.
I also don't know whether this will introduce an additional copy when the input is already UnsafeRow - it seems like UnsafeProjection should be smart enough to skip the copy, but I am not sure that's actually the case.
Friendly ping @cloud-fan. Do you think forcing an UnsafeProjection here to deal with non-unsafe rows from data sources is correct? Is there a way to know whether the child nodes output unsafe rows, so as to avoid an unnecessary unsafe projection here?
Ideally all the operators will produce UnsafeRow. If the data source does not produce UnsafeRow, Spark will make sure there is a project above it to produce UnsafeRow, so we don't need to worry about it here and can safely assume the input is always UnsafeRow.
Thanks! I will remove this then.
@cloud-fan Sorry, I don't think I am being very clear...
If the data source does not produce UnsafeRow, Spark will make sure there will be a project
above it to produce UnsafeRow
I don't think this is happening for datasource V2 right now:
(Code running in pyspark test)
datasource_v2_df = self.spark.read \
.format("org.apache.spark.sql.sources.v2.SimpleDataSourceV2") \
.load()
result = datasource_v2_df.withColumn('x', udf(lambda x: x, 'int')(datasource_v2_df['i']))
result.show()
The code above fails with:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow
at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:127)
at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1$$anonfun$5.apply(EvalPythonExec.scala:126)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1074)
I think this is an issue with DataSourceV2 that probably should be addressed in another PR (DataSourceV1 works fine). @cloud-fan WDYT?
Test build #94826 has finished for PR 22104 at commit

Test build #94822 has finished for PR 22104 at commit

Test build #94848 has finished for PR 22104 at commit

Tests pass now. This comment https://github.com/apache/spark/pull/22104/files#r210414941 requires some attention. @cloud-fan Do you think this is the right way to handle GenericInternalRow inputs here?
python/pyspark/sql/tests.py
Outdated
This wouldn't work if the test classes are not compiled. I think we had better make another test suite that skips the test if the test classes don't exist.
@HyukjinKwon I'm actually not sure how PySpark finds these classes or how to check for their existence - do you have an example?
I can probably try to check for the existence of sql/core/target/scala-2.11/test-classes
Hmm... I think this is a bit fragile because of things like "scala-2.11" (the Scala version can change).
It seems a bit overcomplicated to do this properly - when do we expect the PySpark tests to run without compiling the Scala test classes?
Added checks to skip the tests if the Scala test classes are not compiled.
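For illustration, a hedged sketch of what such a skip check might look like (the helper name, paths, and suite name here are assumptions, not the code actually added in the PR), avoiding a hardcoded Scala version:

```python
import glob
import os
import unittest

def _scala_test_classes_built():
    # Look for compiled Scala test classes under any scala-* directory,
    # so the check does not break when the Scala version changes.
    spark_home = os.environ.get("SPARK_HOME", ".")
    pattern = os.path.join(spark_home, "sql", "core", "target", "scala-*", "test-classes")
    return len(glob.glob(pattern)) > 0

@unittest.skipIf(not _scala_test_classes_built(),
                 "Scala test classes not compiled; skipping data source UDF tests")
class DataSourceUDFTests(unittest.TestCase):
    def test_udf_over_datasource(self):
        # The real tests scan the dummy Scala data sources and apply a Python UDF.
        pass
```

Globbing over `scala-*` addresses the concern above about hardcoding "scala-2.11".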
dcf07fb to 6b7445c (force-pushed)
Test build #95169 has finished for PR 22104 at commit

@HyukjinKwon I addressed the comments. Do you mind taking another look?

Test build #95171 has finished for PR 22104 at commit

retest this please

Let me take another look today or tomorrow.

Test build #95246 has finished for PR 22104 at commit
3f0a97a to 2325a4f (force-pushed)
Test build #95317 has finished for PR 22104 at commit

thanks, merging to master!

LGTM

Test build #95312 has finished for PR 22104 at commit

Thanks all for the review!
## What changes were proposed in this pull request?

The PR excludes Python UDF filters in FileSourceStrategy so that they don't cause the ExtractPythonUDF rule to throw an exception. It doesn't make sense to pass Python UDF filters to FileSourceStrategy anyway, because they cannot be used as push-down filters.

## How was this patch tested?

Added a new regression test.

Closes apache#22104 from icexelloss/SPARK-24721-udf-filter.
Authored-by: Li Jin <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Just realized that the PR title and description are not updated. @icexelloss can you update them? Thanks!

@cloud-fan Sure! Updated.
| Batch("Optimize Metadata Only Query", Once, OptimizeMetadataOnlyQuery(catalog)) :+ | ||
| Batch("Extract Python UDF from Aggregate", Once, ExtractPythonUDFFromAggregate) :+ | ||
| Batch("Extract Python UDFs", Once, | ||
| Seq(ExtractPythonUDFFromAggregate, ExtractPythonUDFs): _*) :+ |
It looks weird to add this rule to our optimizer batches. We need at least some comments in the code to explain the reason.
but we already have ExtractPythonUDFFromAggregate here...
## What changes were proposed in this pull request?

This is a regression introduced by apache#22104 at Spark 2.4.0. When we have a Python UDF in a subquery, we hit an exception:

```
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to org.apache.spark.sql.catalyst.expressions.PythonUDF
  at scala.collection.immutable.Stream.map(Stream.scala:414)
  at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:98)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815)
  ...
```

apache#22104 turned `ExtractPythonUDFs` from a physical rule into an optimizer rule. However, there is a difference between a physical rule and an optimizer rule: a physical rule always runs once, while an optimizer rule may be applied twice to a query tree even if the rule is located in a batch that only runs once. For a subquery, the `OptimizeSubqueries` rule runs the entire optimizer on the query plan inside the subquery. Later on, the subquery is turned into joins, and the optimizer rules are applied to it again. Unfortunately, the `ExtractPythonUDFs` rule is not idempotent: when it is applied twice to a query plan inside a subquery, it produces a malformed plan, because it extracts Python UDFs from Python exec plans.

This PR proposes 2 changes to be doubly safe:
1. `ExtractPythonUDFs` should skip Python exec plans, to make the rule idempotent
2. `ExtractPythonUDFs` should skip subqueries

## How was this patch tested?

A new test.

Closes apache#23248 from cloud-fan/python.
Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
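As a rough illustration only, a hypothetical PySpark repro of the kind of query the regression fix above targets (the exact query in the new test may differ; the UDF name is an assumption, and an active SparkSession `spark` is assumed):

```python
# Register a trivial Python UDF for use in SQL.
spark.range(10).createOrReplaceTempView("t")
spark.udf.register("py_identity", lambda x: x, "bigint")

# A Python UDF inside a subquery: before the fix, ExtractPythonUDFs could be
# applied twice to the subquery plan (once via OptimizeSubqueries, once after
# the subquery is rewritten into a join) and produce a malformed plan.
spark.sql("""
    SELECT id FROM t
    WHERE id IN (SELECT id FROM t WHERE py_identity(id) > 3)
""").show()
```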
…uery

## What changes were proposed in this pull request?

This PR backports #23248, which seems to have mistakenly not been backported. It fixes the same regression described above: a Python UDF in a subquery hits a ClassCastException because `ExtractPythonUDFs`, turned into an optimizer rule by #22104, is not idempotent. The fix makes `ExtractPythonUDFs` skip Python exec plans and skip subqueries.

## How was this patch tested?

A new test.

Closes #27960 from HyukjinKwon/backport-SPARK-26293.
Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
In #12127, we moved the ExtractPythonUDFs rule to the physical phase, while there was another option: do ExtractPythonUDFs at the end of the optimizer.
Currently we hit 2 issues when extracting Python UDFs at the physical phase:
This PR proposes to move ExtractPythonUDFs to the end of the optimizer.
How was this patch tested?
Add a new regression test
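For context, a minimal hypothetical sketch of the kind of query affected - a Python UDF used as a filter over a file-source scan, which cannot be pushed down (the path and names are assumptions, not the actual regression test; an active SparkSession `spark` is assumed):

```python
from pyspark.sql.functions import udf

# Write and re-read through a file source so FileSourceStrategy plans the scan.
spark.range(10).write.mode("overwrite").parquet("/tmp/spark_24721_demo")  # hypothetical path
df = spark.read.parquet("/tmp/spark_24721_demo")

# A Python UDF filter cannot be used as a push-down filter; previously this
# kind of query could make the ExtractPythonUDF rule throw.
is_even = udf(lambda x: x % 2 == 0, "boolean")
df.filter(is_even(df["id"])).show()
```

This mirrors the description above: Python UDF filters cannot serve as push-down filters, so FileSourceStrategy should never have to handle them.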