-
Couldn't load subscription status.
- Fork 28.9k
[SPARK-29063][SQL] Modify fillValue approach to support joined dataframe #25768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @gatorsmile |
| (col.name, fillCol[T](col, value)) | ||
| } | ||
| df.select(projections : _*) | ||
| df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When df has a duplicate column name, what is the behavior? Also, we need to add test cases to ensure the behaviors are consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we fill the duplicate column, we'll still get AnalysisException: Reference xx is ambiguous. Add test cases in 03305be.
|
Test build #110498 has finished for PR 25768 at commit
|
|
Test build #110778 has finished for PR 25768 at commit
|
| (col.name, fillCol[T](col, value)) | ||
| } | ||
| df.select(projections : _*) | ||
| df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xuanyuanking, BTW, does this keep the order of columns? Seems previously the order of columns in its input DataFrame but seems now it can be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in the new approach, we only pass in the columns found in the existing fields, and withColumns will replace the existing columns with the original order.
| (col.name, fillCol[T](col, value)) | ||
| } | ||
| df.select(projections : _*) | ||
| df.withColumns(fillColumnsInfo.map(_._1), fillColumnsInfo.map(_._2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can simplify the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for the help.
|
|
||
| val columnEquals = df.sparkSession.sessionState.analyzer.resolver | ||
| val projections = df.schema.fields.map { f => | ||
| val filledColumns = df.schema.fields.filter { f => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also traverse df.logicalPlan.output to avoid calling withColumns, but it might not be a big deal here.
|
LGTM pending Jenkins. |
|
Test build #111081 has finished for PR 25768 at commit
|
|
Merged to master. |
|
Thanks for reviewing. |
What changes were proposed in this pull request?
Modify the approach in
DataFrameNaFunctions.fillValue, the new one usesdf.withColumnswhich only address the columns need to be filled. After this change, there are no more ambiguous fileds detected for joined dataframe.Why are the changes needed?
Before this change, when you have a joined table that has the same field name from both original table, fillna will fail even if you specify a subset that does not include the 'ambiguous' fields.
Does this PR introduce any user-facing change?
Yes, fillna operation will pass and give the right answer for a joined table.
How was this patch tested?
Local test and newly added UT.