[SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicates #15427
Test build #66724 has finished for PR 15427 at commit
      attr
    } else {
      Alias(new First(attr).toAggregateExpression(), attr.name)()
      // We should keep the original exprId of the attribute.
Can you explain in the comment why we should keep the original exprId? Otherwise the comment is redundant with the code itself.
Test build #66790 has finished for PR 15427 at commit
My thoughts:
    // so we call filter instead of find.
    val cols = allColumns.filter(col => resolver(col.name, colName))
    if (cols.isEmpty) {
      throw new AnalysisException(
Dataset.drop is a no-op if the given name doesn't match any column. Should we follow it?
My thought is:
When a user mistakenly passes a wrong column name to Dataset.drop, the mistake is easy to spot.
With Dataset.dropDuplicates, it is harder to notice that duplicate rows are still there, so throwing an explicit exception seems more appropriate to me.
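The trade-off discussed here can be illustrated with a small plain-Scala model of the name lookup. This is only a sketch, not Spark's code: `Col`, `resolveAll`, `resolveFirst`, the case-insensitive `resolver`, and the use of `IllegalArgumentException` in place of `AnalysisException` are all assumptions made for illustration.

```scala
object ColumnResolutionSketch {
  // Hypothetical stand-in for a resolved column; not Spark's Attribute class.
  final case class Col(name: String, id: Int)

  // Case-insensitive name comparison, assumed here for illustration only.
  val resolver: (String, String) => Boolean = (a, b) => a.equalsIgnoreCase(b)

  // `find` stops at the first match, so later columns with the same name are missed.
  def resolveFirst(allColumns: Seq[Col], colName: String): Option[Col] =
    allColumns.find(col => resolver(col.name, colName))

  // `filter` returns every matching column, and an empty result fails loudly
  // instead of silently leaving duplicates in place.
  def resolveAll(allColumns: Seq[Col], colName: String): Seq[Col] = {
    val cols = allColumns.filter(col => resolver(col.name, colName))
    if (cols.isEmpty) {
      throw new IllegalArgumentException(
        s"Cannot resolve column name '$colName' among " +
          allColumns.map(_.name).mkString("(", ", ", ")"))
    }
    cols
  }

  def main(args: Array[String]): Unit = {
    val columns = Seq(Col("key", 1), Col("value", 2), Col("key", 3))
    println(resolveFirst(columns, "key")) // only the first "key" column
    println(resolveAll(columns, "key"))   // both "key" columns
    println(scala.util.Try(resolveAll(columns, "missing")).isFailure)
  }
}
```

In this model a misspelled name surfaces immediately as an exception, which matches the argument above for not treating it as a no-op.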
LGTM, merging to master!
Thanks for the review! @rxin @cloud-fan
## What changes were proposed in this pull request?
This PR fixes two issues with `Dataset.dropDuplicates`:
1. `Dataset.dropDuplicates` should consider all columns that share the given column name
Currently, `Dataset.dropDuplicates` takes only the first resolved attribute in the output that matches the given column name. When more than one column has that name, the other columns are put into the aggregation columns instead of the grouping columns.
2. `Dataset.dropDuplicates` should not change the output of the child plan
Currently, `Dataset.dropDuplicates` creates a new `Alias` with a new exprId. This causes a problem when we select columns through the original Dataset, as follows:

    // Assumes a SparkSession whose implicits are imported (needed for toDS()).
    val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
    // ds("_2") will cause an analysis exception
    ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])

Because both issues relate to `Dataset.dropDuplicates` and the code changes are small, they are submitted together as one PR.
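Why the exprId matters can be shown with a minimal plain-Scala model. It is a sketch under stated assumptions, not the Catalyst implementation: `Attr`, `Out`, `aggregateOutput`, and `canResolve` are hypothetical names, and resolution is reduced to matching on exprId alone.

```scala
import java.util.concurrent.atomic.AtomicLong

object ExprIdSketch {
  // Hypothetical stand-ins: an attribute is a name plus a unique expression id,
  // and an output column pairs a computed expression with the attribute it exposes.
  final case class Attr(name: String, exprId: Long)
  final case class Out(expr: String, attr: Attr)

  // Source of fresh ids, starting well above the ids used in the example.
  private val idSource = new AtomicLong(1000L)

  // Models the output of the aggregate: grouping columns pass through, and all
  // other columns become first(col) under an alias. `keepExprId` decides whether
  // the alias reuses the child's id or gets a fresh one.
  def aggregateOutput(child: Seq[Attr], groupingIds: Set[Long], keepExprId: Boolean): Seq[Out] =
    child.map { a =>
      if (groupingIds.contains(a.exprId)) Out(a.name, a)
      else {
        val id = if (keepExprId) a.exprId else idSource.incrementAndGet()
        Out(s"first(${a.name})", Attr(a.name, id))
      }
    }

  // A reference taken from the original Dataset resolves only if its id is still in the output.
  def canResolve(output: Seq[Out], ref: Attr): Boolean =
    output.exists(_.attr.exprId == ref.exprId)

  def main(args: Array[String]): Unit = {
    val child = Seq(Attr("_1", 1L), Attr("_2", 2L))
    val ref = child(1) // plays the role of ds("_2")
    println(canResolve(aggregateOutput(child, Set(1L), keepExprId = false), ref)) // false
    println(canResolve(aggregateOutput(child, Set(1L), keepExprId = true), ref))  // true
  }
}
```

In this model, giving the alias a fresh id makes the reference taken from the original Dataset unresolvable, while reusing the child's id leaves the output unchanged from the caller's point of view.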
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <[email protected]>
Closes apache#15427 from viirya/fix-dropduplicates.