[SPARK-24208][SQL] Fix attribute deduplication for FlatMapGroupsInPandas #21737

mgaido91 · 2018-07-09T15:35:18Z

What changes were proposed in this pull request?

A self-join on a dataset that contains a FlatMapGroupsInPandas node fails because of duplicate attributes. This happens because the dedupAttr rules do not handle this specific case.

This PR fixes the issue by adding handling for that case.
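
To illustrate why a self-join produces duplicate attributes, here is a minimal toy model in plain Python. It is not Spark code: `Attribute`, `Plan`, `new_attribute`, and `conflicting_attributes` are hypothetical names that only mimic the idea that every output attribute carries an expression ID, and that joining a plan with itself puts the same IDs on both sides.

```python
from dataclasses import dataclass
from itertools import count

# Hypothetical stand-in for an expression ID generator (not Spark's API).
_next_id = count()


@dataclass(frozen=True)
class Attribute:
    name: str
    expr_id: int


def new_attribute(name):
    """Create an attribute with a fresh, unique expression ID."""
    return Attribute(name, next(_next_id))


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node that only tracks its output attributes."""
    output: tuple

    @property
    def output_set(self):
        return frozenset(self.output)


def conflicting_attributes(left, right):
    """Attributes (same name and same ID) that appear on both join sides."""
    return left.output_set & right.output_set


# A node standing in for the result of a grouped pandas UDF.
grouped = Plan((new_attribute("key"), new_attribute("col")))

# Self-join: both sides are the same plan, so every attribute conflicts.
conflicts = conflicting_attributes(grouped, grouped)
print(sorted(a.name for a in conflicts))  # ['col', 'key']
```

In this sketch, a join between two independently built plans has no conflicts because each attribute gets its own ID; the problem only appears when the same plan instance is used on both sides.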

How was this patch tested?

Added unit tests and ran manual tests.

HyukjinKwon · 2018-07-09T15:54:29Z

@mgaido91, are you able to add a test on the Python side too?

mgaido91 · 2018-07-09T15:58:20Z

@HyukjinKwon sure, I am adding it, thanks.

SparkQA · 2018-07-09T16:20:41Z

Test build #92762 has finished for PR 21737 at commit 032fef0.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-09T17:37:58Z

Test build #92760 has finished for PR 21737 at commit 5f325a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-09T21:14:10Z

Test build #92764 has finished for PR 21737 at commit 11e9f0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-09T22:14:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

          (oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions)))

+        case oldVersion @ FlatMapGroupsInPandas(_, _, output, _)
+            if AttributeSet(output).intersect(conflictingAttributes).nonEmpty =>


Why not use oldVersion.outputSet?

cc @maryannxue Deduplicating conflicting attributes in this function is easily broken. In the long term, this is not the ideal way to handle it. We should consider a more fundamental fix.

@gatorsmile I agree with you. Moreover, there are other possible problems with having the same expressions (with the same exprId) in different parts of a tree (please see SPARK-24051). So in the long term we could probably add a specific rule to address this problem (extending/generalizing what I tried to do in SPARK-24051). What do you think?

We need to ensure all the expressions have unique IDs, instead of deduplicating them when we hit conflicts.
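
As a rough illustration of the deduplication step discussed in this thread, the following plain-Python sketch checks a node's output set against the conflicting attributes and, on overlap, rebuilds the node with fresh IDs. It is a toy model, not the Analyzer code: `Attr`, `FlatMapGroupsNode`, `new_instance`, and `dedup_right` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, replace
from itertools import count

# Hypothetical ID source for re-instanced attributes (not Spark's API).
_fresh_ids = count(100)


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int

    def new_instance(self):
        """Same name, fresh expression ID."""
        return replace(self, expr_id=next(_fresh_ids))


@dataclass(frozen=True)
class FlatMapGroupsNode:
    """Toy stand-in for a plan node that declares its own output."""
    output: tuple

    @property
    def output_set(self):
        return frozenset(self.output)


def dedup_right(right, conflicting):
    """Return a copy of `right` with fresh IDs if its output conflicts."""
    if right.output_set & conflicting:
        fresh = tuple(a.new_instance() for a in right.output)
        return replace(right, output=fresh)
    return right  # no overlap, nothing to rewrite


node = FlatMapGroupsNode((Attr("key", 1), Attr("col", 2)))

# Self-join: the right side conflicts with the left on every attribute.
deduped = dedup_right(node, node.output_set)
print([a.expr_id for a in deduped.output])  # [100, 101]
```

The guard uses the node's output set directly, which is the shape of the check suggested in the first comment. The longer-term idea raised afterwards, making IDs unique up front instead of repairing them on conflict, is not modeled here.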

SparkQA · 2018-07-10T10:29:43Z

Test build #92804 has finished for PR 21737 at commit a15949b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-07-10T10:34:12Z

retest this please

SparkQA · 2018-07-10T13:06:28Z

Test build #92812 has finished for PR 21737 at commit a15949b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-07-10T13:52:26Z

retest this please

SparkQA · 2018-07-10T17:47:07Z

Test build #92821 has finished for PR 21737 at commit a15949b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-07-11T16:27:54Z

LGTM

Thanks! Merged to master/2.3

A self-join on a dataset which contains a `FlatMapGroupsInPandas` fails because of duplicate attributes. This happens because we are not dealing with this specific case in our `dedupAttr` rules.

The PR fixes the issue by adding handling for this specific case.

Added UT + manual tests.

Author: Marco Gaido <[email protected]>
Author: Marco Gaido <[email protected]>

Closes #21737 from mgaido91/SPARK-24208.

(cherry picked from commit ebf4bfb)
Signed-off-by: Xiao Li <[email protected]>

gatorsmile · 2018-07-11T16:37:44Z

python/pyspark/sql/tests.py

                    'mixture.*aggregate function.*group aggregate pandas UDF'):
                df.groupby(df.id).agg(mean_udf(df.v), mean(df.v)).collect()

+    def test_self_join_with_pandas(self):


Just realized this test is in the wrong class. It should be moved to GroupedMapPandasUDFTests.

gatorsmile · 2018-07-11T16:40:28Z

sql/core/src/test/scala/org/apache/spark/sql/GroupedDatasetSuite.scala

    datasetWithUDF.unpersist(true)
  }
+
+  test("SPARK-24208: analysis fails on self-join with FlatMapGroupsInPandas") {


This test case should be rewritten and moved to AnalysisSuite

gatorsmile · 2018-07-11T16:42:00Z

@mgaido91 Since the 2.3.2 release will be out soon, I merged this fix to the 2.3 branch. Regarding the comments on the test cases, could you submit a follow-up PR?

HyukjinKwon · 2018-07-11T16:44:35Z

python/pyspark/sql/tests.py

+
+        df = self.spark.createDataFrame([Row(key=1, col='A'), Row(key=1, col='B'),
+                                         Row(key=2, col='C')])
+        dfWithPandas = df.groupBy('key').apply(dummy_pandas_udf)


nit: dfWithPandas -> df_with_pandas

[SPARK-24208][SQL] Fix attribute deduplication for FlatMapGroupsInPandas

5f325a4

add python test

032fef0

fix python style

11e9f0f

gatorsmile reviewed Jul 9, 2018

address comment

a15949b

asfgit closed this in ebf4bfb Jul 11, 2018

mgaido91 mentioned this pull request Jul 11, 2018

[WIP][SPARK-24051][SQL] Replace Aliases with the same exprId #21184

Closed

gatorsmile reviewed Jul 11, 2018

HyukjinKwon reviewed Jul 11, 2018
