[SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs #28777

HyukjinKwon · 2020-06-10T07:35:29Z

What changes were proposed in this pull request?

This is another approach to fix the issue. See the previous try #28745. It was too invasive so I took more conservative approach.

This PR proposes to resolve grouping attributes separately first so it can be properly referred when FlatMapGroupsInPandas and FlatMapCoGroupsInPandas are resolved without ambiguity.

Previously,

from pyspark.sql.functions import *
df = spark.createDataFrame([[1, 1]], ["column", "Score"])
@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()

was failed as below:

pyspark.sql.utils.AnalysisException: "Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;"

because the unresolved COLUMN in FlatMapGroupsInPandas doesn't know which reference to take from the child projection.

After this fix, it resolves the child projection first with grouping keys and pass, to FlatMapGroupsInPandas, the attribute as a grouping key from the child projection that is positionally selected.

Why are the changes needed?

To resolve grouping keys correctly.

Does this PR introduce any user-facing change?

Yes,

from pyspark.sql.functions import *
df = spark.createDataFrame([[1, 1]], ["column", "Score"])
@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()

df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
df2 = spark.createDataFrame([(1, 1)], ("column", "value"))

df1.groupby("COLUMN").cogroup(
    df2.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, df1.schema).show()

Before:

pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;

pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input columns: [COLUMN, COLUMN, value, value];;
'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], <lambda>(column#9L, value#10L, column#13L, value#14L), [column#22L, value#23L]
:- Project [COLUMN#9L, column#9L, value#10L]
:  +- LogicalRDD [column#9L, value#10L], false
+- Project [COLUMN#13L, column#13L, value#14L]
   +- LogicalRDD [column#13L, value#14L], false

After:

+------+-----+
|column|Score|
+------+-----+
|     1|  0.5|
+------+-----+

+------+-----+
|column|value|
+------+-----+
|     2|    2|
+------+-----+

How was this patch tested?

Unittests were added and manually tested.

…ped and cogrouped pandas UDFs

SparkQA · 2020-06-10T11:23:30Z

Test build #123737 has finished for PR 28777 at commit c656ddb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-06-10T11:39:18Z

retest this please

SparkQA · 2020-06-10T17:53:26Z

Test build #123759 has finished for PR 28777 at commit c656ddb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

LGTM, this approach looks good

BryanCutler · 2020-06-10T22:55:13Z

merged to master, thanks @HyukjinKwon !

HyukjinKwon · 2020-06-11T00:26:16Z

Thanks guys!

HyukjinKwon · 2020-06-11T03:26:29Z

Actually, I believe this is a pretty conservative fix bug fix .. let me port this back to 3.0 as well... if requested, I don't mind porting back to 2.4 as well.

…he case sensitivity in grouped and cogrouped pandas UDFs ### What changes were proposed in this pull request? This is another approach to fix the issue. See the previous try #28745. It was too invasive so I took more conservative approach. This PR proposes to resolve grouping attributes separately first so it can be properly referred when `FlatMapGroupsInPandas` and `FlatMapCoGroupsInPandas` are resolved without ambiguity. Previously, ```python from pyspark.sql.functions import * df = spark.createDataFrame([[1, 1]], ["column", "Score"]) pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) def my_pandas_udf(pdf): return pdf.assign(Score=0.5) df.groupby('COLUMN').apply(my_pandas_udf).show() ``` was failed as below: ``` pyspark.sql.utils.AnalysisException: "Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;" ``` because the unresolved `COLUMN` in `FlatMapGroupsInPandas` doesn't know which reference to take from the child projection. After this fix, it resolves the child projection first with grouping keys and pass, to `FlatMapGroupsInPandas`, the attribute as a grouping key from the child projection that is positionally selected. ### Why are the changes needed? To resolve grouping keys correctly. ### Does this PR introduce _any_ user-facing change? Yes, ```python from pyspark.sql.functions import * df = spark.createDataFrame([[1, 1]], ["column", "Score"]) pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) def my_pandas_udf(pdf): return pdf.assign(Score=0.5) df.groupby('COLUMN').apply(my_pandas_udf).show() ``` ```python df1 = spark.createDataFrame([(1, 1)], ("column", "value")) df2 = spark.createDataFrame([(1, 1)], ("column", "value")) df1.groupby("COLUMN").cogroup( df2.groupby("COLUMN") ).applyInPandas(lambda r, l: r + l, df1.schema).show() ``` Before: ``` pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.; ``` ``` pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input columns: [COLUMN, COLUMN, value, value];; 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], <lambda>(column#9L, value#10L, column#13L, value#14L), [column#22L, value#23L] :- Project [COLUMN#9L, column#9L, value#10L] : +- LogicalRDD [column#9L, value#10L], false +- Project [COLUMN#13L, column#13L, value#14L] +- LogicalRDD [column#13L, value#14L], false ``` After: ``` +------+-----+ |column|Score| +------+-----+ | 1| 0.5| +------+-----+ ``` ``` +------+-----+ |column|value| +------+-----+ | 2| 2| +------+-----+ ``` ### How was this patch tested? Unittests were added and manually tested. Closes #28777 from HyukjinKwon/SPARK-31915-another. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: Bryan Cutler <[email protected]>

HyukjinKwon · 2020-06-11T03:55:09Z

Merged to branch-3.0 as well!

…he case sensitivity in grouped and cogrouped pandas UDFs ### What changes were proposed in this pull request? This is another approach to fix the issue. See the previous try apache#28745. It was too invasive so I took more conservative approach. This PR proposes to resolve grouping attributes separately first so it can be properly referred when `FlatMapGroupsInPandas` and `FlatMapCoGroupsInPandas` are resolved without ambiguity. Previously, ```python from pyspark.sql.functions import * df = spark.createDataFrame([[1, 1]], ["column", "Score"]) pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) def my_pandas_udf(pdf): return pdf.assign(Score=0.5) df.groupby('COLUMN').apply(my_pandas_udf).show() ``` was failed as below: ``` pyspark.sql.utils.AnalysisException: "Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;" ``` because the unresolved `COLUMN` in `FlatMapGroupsInPandas` doesn't know which reference to take from the child projection. After this fix, it resolves the child projection first with grouping keys and pass, to `FlatMapGroupsInPandas`, the attribute as a grouping key from the child projection that is positionally selected. ### Why are the changes needed? To resolve grouping keys correctly. ### Does this PR introduce _any_ user-facing change? Yes, ```python from pyspark.sql.functions import * df = spark.createDataFrame([[1, 1]], ["column", "Score"]) pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP) def my_pandas_udf(pdf): return pdf.assign(Score=0.5) df.groupby('COLUMN').apply(my_pandas_udf).show() ``` ```python df1 = spark.createDataFrame([(1, 1)], ("column", "value")) df2 = spark.createDataFrame([(1, 1)], ("column", "value")) df1.groupby("COLUMN").cogroup( df2.groupby("COLUMN") ).applyInPandas(lambda r, l: r + l, df1.schema).show() ``` Before: ``` pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.; ``` ``` pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input columns: [COLUMN, COLUMN, value, value];; 'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], <lambda>(column#9L, value#10L, column#13L, value#14L), [column#22L, value#23L] :- Project [COLUMN#9L, column#9L, value#10L] : +- LogicalRDD [column#9L, value#10L], false +- Project [COLUMN#13L, column#13L, value#14L] +- LogicalRDD [column#13L, value#14L], false ``` After: ``` +------+-----+ |column|Score| +------+-----+ | 1| 0.5| +------+-----+ ``` ``` +------+-----+ |column|value| +------+-----+ | 2| 2| +------+-----+ ``` ### How was this patch tested? Unittests were added and manually tested. Closes apache#28777 from HyukjinKwon/SPARK-31915-another. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: Bryan Cutler <[email protected]>

Resolve the grouping column properly per the case sensitivity in grou…

c656ddb

…ped and cogrouped pandas UDFs

probot-autolabeler bot added PYTHON SQL labels Jun 10, 2020

HyukjinKwon requested review from BryanCutler, cloud-fan and viirya June 10, 2020 07:35

cloud-fan approved these changes Jun 10, 2020

View reviewed changes

viirya approved these changes Jun 10, 2020

View reviewed changes

BryanCutler approved these changes Jun 10, 2020

View reviewed changes

BryanCutler closed this in 00d06ca Jun 10, 2020

HyukjinKwon deleted the SPARK-31915-another branch July 27, 2020 07:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs #28777

[SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs #28777

Uh oh!

HyukjinKwon commented Jun 10, 2020

Uh oh!

SparkQA commented Jun 10, 2020

Uh oh!

HyukjinKwon commented Jun 10, 2020

Uh oh!

SparkQA commented Jun 10, 2020

Uh oh!

BryanCutler left a comment

Uh oh!

BryanCutler commented Jun 10, 2020

Uh oh!

HyukjinKwon commented Jun 11, 2020

Uh oh!

HyukjinKwon commented Jun 11, 2020

Uh oh!

HyukjinKwon commented Jun 11, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs #28777

[SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs #28777

Uh oh!

Conversation

HyukjinKwon commented Jun 10, 2020

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

SparkQA commented Jun 10, 2020

Uh oh!

HyukjinKwon commented Jun 10, 2020

Uh oh!

SparkQA commented Jun 10, 2020

Uh oh!

BryanCutler left a comment

Choose a reason for hiding this comment

Uh oh!

BryanCutler commented Jun 10, 2020

Uh oh!

HyukjinKwon commented Jun 11, 2020

Uh oh!

HyukjinKwon commented Jun 11, 2020

Uh oh!

HyukjinKwon commented Jun 11, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants