[SPARK-18634][PySpark][SQL] Corruption and Correctness issues with exploding Python UDFs #16120
Conversation
Test build #69572 has finished for PR 16120 at commit
Can you add WIP until this is ready?
OK. WIP added.
Force-pushed from a31432a to 44aaf39
Test build #69608 has finished for PR 16120 at commit
Force-pushed from 44aaf39 to a5594f7
Test build #69610 has started for PR 16120 at commit
retest this please.
Test build #69615 has finished for PR 16120 at commit
```scala
val qualifiedGeneratorOutput: Seq[Attribute] = qualifier.map { q =>
  // prepend the new qualifier to the existed one
  generatorOutput.map(a => a.withQualifier(Some(q)))
}.getOrElse(generatorOutput)
```
Shouldn't we qualify all the output attributes?
nvm.... this is actually better.
yeah, actually this is what Generate did to prepare its output.
BTW, I scanned the codebase and can't find any place that assigns the qualifier parameter for Generate; I only see None being passed for it.
@hvanhovell Do you know of any place where we specify a qualifier for a Generate?
I did the same thing when I was looking at this. There is one place: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala#L522
I think the approach you took in this PR is actually the correct one, and that the current code had a latent bug which would be triggered when both the child and the generator produce an attribute with the same name.
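To make the disambiguation concrete, here is a minimal sketch against the `Attribute` API used in the diff above (the `q` alias and the attribute names are illustrative, not taken from any specific plan):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

// Suppose the child and the generator both produce an attribute named "col".
val childCol = AttributeReference("col", IntegerType)()
val generatorOutput = Seq(AttributeReference("col", IntegerType)())

// With a LATERAL VIEW alias such as `q` (the one place AstBuilder sets a
// qualifier), the generator's attributes are qualified as `q.col`, so they
// no longer collide with the child's unqualified `col`.
val qualifiedGeneratorOutput = generatorOutput.map(a => a.withQualifier(Some("q")))
```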
nit: shall we make it a method or lazy val?
@cloud-fan making it a method seems better, but this is already merged. Should I submit a tiny follow-up?
@cloud-fan I think the current one should be ok too.
yea it's fine to leave it.
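As a side note for readers, here is a toy sketch of the method-vs-val trade-off being discussed (simplified types; this is not the actual `Generate` node):

```scala
// Toy stand-in for the plan node: plain strings instead of Catalyst Attributes.
case class GenerateToy(generatorOutput: Seq[String], qualifier: Option[String]) {
  // As a def, this is recomputed on each access and is not part of the
  // object's serialized state, unlike a val computed at construction time.
  def qualifiedGeneratorOutput: Seq[String] =
    qualifier.map(q => generatorOutput.map(a => s"$q.$a")).getOrElse(generatorOutput)
}

// GenerateToy(Seq("col"), Some("q")).qualifiedGeneratorOutput == Seq("q.col")
```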
LGTM - merging to master/2.1/2.0 (if possible). Thanks!
[SPARK-18634][PySpark][SQL] Corruption and Correctness issues with exploding Python UDFs
## What changes were proposed in this pull request?
As reported in the Jira, there are some weird issues with exploding Python UDFs in Spark SQL.
The following test code reproduces the problem. Note: the Jira reports that this code returns wrong results; however, when I tested it on the master branch, it throws an exception and therefore can't return any result.
```python
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import *
>>>
>>> df = spark.range(10)
>>>
>>> def return_range(value):
...     return [(i, str(i)) for i in range(value - 1, value + 1)]
...
>>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
...                                                     StructField("string_val", StringType())])))
>>>
>>> df.select("id", explode(range_udf(df.id))).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.
: java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)
	at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
	at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)
```
The cause of this issue is that in `ExtractPythonUDFs` we insert a `BatchEvalPythonExec` node to run Python UDFs in batch. `BatchEvalPythonExec` adds extra outputs (e.g., `pythonUDF0`) to the original plan. In the above case, the original `Range` has only one output, `id`. After `ExtractPythonUDFs`, the inserted `BatchEvalPythonExec` has two outputs, `id` and `pythonUDF0`.
Because `GenerateExec`'s output is fixed at the end of the analysis phase, in the above case it is the combination of `id` (the output of `Range`) and `col`. But in the planning phase, we replace `GenerateExec`'s child plan with `BatchEvalPythonExec`, which carries additional output attributes.
This causes no problem without whole-stage codegen, because during evaluation the additional attributes are projected out of `GenerateExec`'s final output.
However, since `GenerateExec` now supports whole-stage codegen, the framework feeds all of the child plan's outputs into `GenerateExec`. Then, when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes no longer matches the number of output variables in whole-stage codegen, which trips the assertion shown above.
To solve this, the patch gives `GenerateExec` only the generator's output after the analysis phase. `GenerateExec`'s full output is then the combination of its child plan's output and the generator's output, so when we change `GenerateExec`'s child, its output is still correct.
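A simplified sketch of that idea (plain strings stand in for Catalyst attributes, and the field names only mirror the discussion, not the exact `GenerateExec` signature):

```scala
// Simplified stand-ins for physical plan nodes.
case class ChildPlan(output: Seq[String])

case class GenerateSketch(join: Boolean, generatorOutput: Seq[String], child: ChildPlan) {
  // Derived on demand from the current child, never frozen at analysis time.
  def output: Seq[String] =
    if (join) child.output ++ generatorOutput else generatorOutput
}

// After analysis, the child is Range with a single output "id":
val analyzed = GenerateSketch(join = true, generatorOutput = Seq("col"),
  child = ChildPlan(Seq("id")))
// analyzed.output == Seq("id", "col")

// Planning swaps in BatchEvalPythonExec, which adds "pythonUDF0"; because
// output is derived from the current child, it stays consistent.
val planned = analyzed.copy(child = ChildPlan(Seq("id", "pythonUDF0")))
// planned.output == Seq("id", "pythonUDF0", "col")
```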
## How was this patch tested?
Added test cases to PySpark.
Author: Liang-Chi Hsieh <[email protected]>
Closes #16120 from viirya/fix-py-udf-with-generator.
(cherry picked from commit 3ba69b6)
Signed-off-by: Herman van Hovell <[email protected]>
Thank you! @hvanhovell
## What changes were proposed in this pull request?
I jumped the gun on merging #16120, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening.
## How was this patch tested?
Existing tests.
Author: Herman van Hovell <[email protected]>
Closes #16170 from hvanhovell/SPARK-18634.
(cherry picked from commit 381ef4e)
Signed-off-by: Herman van Hovell <[email protected]>