[SPARK-29063][SQL] Modify fillValue approach to support joined dataframe #25768

xuanyuanking · 2019-09-12T03:22:49Z

What changes were proposed in this pull request?

Modify the approach in DataFrameNaFunctions.fillValue: the new implementation uses df.withColumns, which only addresses the columns that need to be filled. After this change, ambiguous fields are no longer detected for a joined DataFrame when the columns being filled are themselves unambiguous.
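For illustration, the idea can be approximated with the public API. This is only a sketch, not the code in DataFrameNaFunctions: `fillStringsSketch` is a hypothetical helper, it uses the public `withColumn` rather than the internal `withColumns`, it matches names case-insensitively instead of using the analyzer's resolver, and it assumes column names contain no dots or backticks. It is written to be pasted into spark-shell.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, lit}
import org.apache.spark.sql.types.StringType

// Hypothetical helper (not part of Spark): fill nulls in the requested
// string columns and leave every other column untouched.
def fillStringsSketch(df: DataFrame, value: String, cols: Seq[String]): DataFrame = {
  // Keep only the string fields the caller asked for.
  // Assumption: case-insensitive name matching stands in for the resolver.
  val targets = df.schema.fields.filter { f =>
    f.dataType == StringType && cols.exists(_.equalsIgnoreCase(f.name))
  }
  // Replace each target column in place; columns that are not targets are
  // never referenced, so an unrelated duplicate name is never resolved.
  targets.foldLeft(df) { (acc, f) =>
    acc.withColumn(f.name, coalesce(acc.col(f.name), lit(value)))
  }
}
```

Because only the target columns are referenced, a call such as `fillStringsSketch(df_join, "", Seq("f4"))` never has to resolve the duplicated `f2`.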

Why are the changes needed?

Before this change, when a joined table contains the same field name from both original tables, fillna fails even if you specify a subset that does not include the 'ambiguous' fields.

scala> val df1 = Seq(("f1-1", "f2", null), ("f1-2", null, null), ("f1-3", "f2", "f3-1"), ("f1-4", "f2", "f3-1")).toDF("f1", "f2", "f3")
scala> val df2 = Seq(("f1-1", null, null), ("f1-2", "f2", null), ("f1-3", "f2", "f4-1")).toDF("f1", "f2", "f4")
scala> val df_join = df1.alias("df1").join(df2.alias("df2"), Seq("f1"), joinType="left_outer")
scala> df_join.na.fill("", cols=Seq("f4"))

org.apache.spark.sql.AnalysisException: Reference 'f2' is ambiguous, could be: df1.f2, df2.f2.;

Does this PR introduce any user-facing change?

Yes. The fillna operation now succeeds and returns the correct result for a joined table.

How was this patch tested?

Local test and newly added UT.

xuanyuanking · 2019-09-12T03:24:50Z

cc @gatorsmile

gatorsmile · 2019-09-12T04:14:21Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala

+      (col.name, fillCol[T](col, value))
    }
-    df.select(projections : _*)
+    df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2))


When df has a duplicate column name, what is the behavior? Also, we need to add test cases to ensure the behaviors are consistent.

When we fill the duplicate column itself, we'll still get AnalysisException: Reference xx is ambiguous. Added test cases in 03305be.
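A minimal spark-shell sketch of the case described in this reply, reusing the shape of the data from the PR description. The variable names and the local SparkSession setup are assumptions made for the example; the failure itself is wrapped in `Try` so the snippet does not abort.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("fill-ambiguous").getOrCreate()
import spark.implicits._

val left = Seq(("f1-1", "f2", null: String), ("f1-2", null: String, null: String))
  .toDF("f1", "f2", "f3")
val right = Seq(("f1-1", null: String, null: String), ("f1-2", "f2", null: String))
  .toDF("f1", "f2", "f4")

// After the join, "f2" appears twice in the output schema.
val joined = left.join(right, Seq("f1"), "left_outer")

// Filling the duplicated column itself still has to resolve "f2",
// which is ambiguous, so the call is expected to fail.
val ambiguousFill = Try(joined.na.fill("", Seq("f2")))
println(ambiguousFill.isFailure)
```

Only filling a column whose name is unique in the joined schema, such as `f4`, is unblocked by this change.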

SparkQA · 2019-09-12T07:04:25Z

Test build #110498 has finished for PR 25768 at commit 3602807.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-17T16:47:16Z

Test build #110778 has finished for PR 25768 at commit 03305be.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-09-18T03:50:10Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala

+      (col.name, fillCol[T](col, value))
    }
-    df.select(projections : _*)
+    df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2))


@xuanyuanking, BTW, does this keep the order of columns? It seems the previous approach kept the column order of the input DataFrame, but now it looks like the order could change.

Yes, the order is kept. In the new approach we only pass in columns that are found among the existing fields, and withColumns replaces existing columns in place, so the original order is preserved.
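The ordering behavior can be checked with the public `withColumn`, which follows the same replace-in-place rule. This is a small spark-shell sketch; the DataFrame and its column names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("column-order").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", 2.0), (2, null: String, 3.0)).toDF("x", "y", "z")

// Replacing an existing column keeps its position in the schema.
val replaced = df.withColumn("y", coalesce(col("y"), lit("")))
println(replaced.columns.mkString(", "))

// Only a name that does not exist yet is appended at the end.
val appended = df.withColumn("w", lit(0))
println(appended.columns.mkString(", "))
```

Since fillValue only passes names that already exist in the schema, every column is replaced in place and none is appended.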

gatorsmile · 2019-09-20T17:09:45Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala

+      (col.name, fillCol[T](col, value))
    }
-    df.select(projections : _*)
+    df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2))


we can simplify the code.

Ah, thanks for the help.

gatorsmile · 2019-09-20T17:22:49Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala


    val columnEquals = df.sparkSession.sessionState.analyzer.resolver
-    val projections = df.schema.fields.map { f =>
+    val filledColumns = df.schema.fields.filter { f =>


We can also traverse df.logicalPlan.output to avoid calling withColumns, but it might not be a big deal here.

gatorsmile · 2019-09-20T17:23:12Z

LGTM pending Jenkins.

SparkQA · 2019-09-20T21:14:28Z

Test build #111081 has finished for PR 25768 at commit 59106dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-09-20T23:26:09Z

Merged to master.

xuanyuanking · 2019-09-21T02:38:31Z

Thanks for reviewing.

Modify fillValue approach to support joined dataframe

3602807

gatorsmile reviewed Sep 12, 2019

dongjoon-hyun added the SQL label Sep 12, 2019

Add test for checking ambiguous field

03305be

HyukjinKwon reviewed Sep 18, 2019

gatorsmile reviewed Sep 20, 2019

simplify

59106dc

gatorsmile reviewed Sep 20, 2019

HyukjinKwon approved these changes Sep 20, 2019

HyukjinKwon closed this in abc88de Sep 20, 2019

xuanyuanking deleted the SPARK-29063 branch September 21, 2019 02:38
