Skip to content

Conversation

@xuanyuanking
Copy link
Member

What changes were proposed in this pull request?

Modify the approach in DataFrameNaFunctions.fillValue, the new one uses df.withColumns which only address the columns need to be filled. After this change, there are no more ambiguous fileds detected for joined dataframe.

Why are the changes needed?

Before this change, when you have a joined table that has the same field name from both original table, fillna will fail even if you specify a subset that does not include the 'ambiguous' fields.

scala> val df1 = Seq(("f1-1", "f2", null), ("f1-2", null, null), ("f1-3", "f2", "f3-1"), ("f1-4", "f2", "f3-1")).toDF("f1", "f2", "f3")
scala> val df2 = Seq(("f1-1", null, null), ("f1-2", "f2", null), ("f1-3", "f2", "f4-1")).toDF("f1", "f2", "f4")
scala> val df_join = df1.alias("df1").join(df2.alias("df2"), Seq("f1"), joinType="left_outer")
scala> df_join.na.fill("", cols=Seq("f4"))

org.apache.spark.sql.AnalysisException: Reference 'f2' is ambiguous, could be: df1.f2, df2.f2.;

Does this PR introduce any user-facing change?

Yes, fillna operation will pass and give the right answer for a joined table.

How was this patch tested?

Local test and newly added UT.

@xuanyuanking
Copy link
Member Author

cc @gatorsmile

(col.name, fillCol[T](col, value))
}
df.select(projections : _*)
df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When df has a duplicate column name, what is the behavior? Also, we need to add test cases to ensure the behaviors are consistent.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we fill the duplicate column, we'll still get AnalysisException: Reference xx is ambiguous. Add test cases in 03305be.

@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110498 has finished for PR 25768 at commit 3602807.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 17, 2019

Test build #110778 has finished for PR 25768 at commit 03305be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

(col.name, fillCol[T](col, value))
}
df.select(projections : _*)
df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xuanyuanking, BTW, does this keep the order of columns? Seems previously the order of columns in its input DataFrame but seems now it can be changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the new approach, we only pass in the columns found in the existing fields, and withColumns will replace the existing columns with the original order.

(col.name, fillCol[T](col, value))
}
df.select(projections : _*)
df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can simplify the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for the help.


val columnEquals = df.sparkSession.sessionState.analyzer.resolver
val projections = df.schema.fields.map { f =>
val filledColumns = df.schema.fields.filter { f =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also traverse df.logicalPlan.output to avoid calling withColumns, but it might not be a big deal here.

@gatorsmile
Copy link
Member

LGTM pending Jenkins.

@SparkQA
Copy link

SparkQA commented Sep 20, 2019

Test build #111081 has finished for PR 25768 at commit 59106dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@xuanyuanking
Copy link
Member Author

Thanks for reviewing.

@xuanyuanking xuanyuanking deleted the SPARK-29063 branch September 21, 2019 02:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants