[SPARK-17866][SPARK-17867][SQL] Fix Dataset.dropduplicates #15427

viirya · 2016-10-11T06:13:00Z

What changes were proposed in this pull request?

Two issues regarding Dataset.dropDuplicates:

1. Dataset.dropDuplicates should consider all columns with the same column name

Currently, Dataset.dropDuplicates finds only the first resolved attribute in the output that matches the given column name. When more than one column has that name, the other columns are put into the aggregation columns instead of the grouping columns.

2. Dataset.dropDuplicates should not change the output of the child plan

Dataset.dropDuplicates currently creates a new Alias with a new exprId. This causes a problem when we want to select the columns as follows:
```
val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
// ds("_2") will cause analysis exception
ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])
```

Because both issues concern Dataset.dropDuplicates and the code changes are small, they are submitted together as one PR.
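The second issue can be illustrated with a small model in plain Scala. This is only a sketch under stated assumptions, not Catalyst code: `Attr`, `outputWithFreshIds`, `outputKeepingIds` and `resolves` are hypothetical names, and the model assumes that a reference such as `ds("_2")` is matched against the plan output by its exprId rather than by its name.

```scala
import java.util.concurrent.atomic.AtomicLong

object ExprIdSketch {
  // Simplified stand-in for a resolved attribute: a column name plus a unique id.
  final case class Attr(name: String, exprId: Long)

  // Hypothetical id source; starts high so fresh ids never collide with the child's ids.
  private val nextId = new AtomicLong(1000L)

  // Before the fix (in this model): each non-key column is re-aliased under a fresh id.
  def outputWithFreshIds(child: Seq[Attr], keys: Set[String]): Seq[Attr] =
    child.map(a => if (keys.contains(a.name)) a else a.copy(exprId = nextId.incrementAndGet()))

  // After the fix (in this model): the alias reuses the original id.
  def outputKeepingIds(child: Seq[Attr], keys: Set[String]): Seq[Attr] =
    child.map(a => if (keys.contains(a.name)) a else a.copy(exprId = a.exprId))

  // Assumption: a column reference is matched by id, not by name.
  def resolves(ref: Attr, output: Seq[Attr]): Boolean =
    output.exists(_.exprId == ref.exprId)

  def main(args: Array[String]): Unit = {
    val child = Seq(Attr("_1", 1L), Attr("_2", 2L))
    val ref = child(1) // plays the role of ds("_2")
    println(resolves(ref, outputWithFreshIds(child, Set("_1")))) // false
    println(resolves(ref, outputKeepingIds(child, Set("_1"))))   // true
  }
}
```

In this model the reference to `_2` stops resolving as soon as the deduplicated output carries a fresh id for that column, which mirrors the analysis exception described above.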

How was this patch tested?

Jenkins tests.

SparkQA · 2016-10-11T08:31:25Z

Test build #66724 has finished for PR 15427 at commit dd6405c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-10-11T08:33:54Z

cc @cloud-fan @hvanhovell

rxin · 2016-10-11T18:41:00Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

        attr
      } else {
-        Alias(new First(attr).toAggregateExpression(), attr.name)()
+        // We should keep the original exprId of the attribute.


Can you explain in the comment why we should keep the original exprId? Otherwise this comment is redundant with the code itself.

SparkQA · 2016-10-12T05:37:40Z

Test build #66790 has finished for PR 15427 at commit 81339dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-10-12T09:35:26Z

My thoughts:

Dataset.dropDuplicates() should drop duplicates across all columns. The current implementation is wrong, and this PR fixes it.
Dataset.dropDuplicates(col: String): should it consider only the first column matching the given name, or all matched columns? Dataset.drop(col: String) also drops all matched columns, so the new behaviour seems reasonable. But we should be very careful, as this is a breaking change. cc @rxin

rxin · 2016-10-12T21:06:25Z

Dataset.dropDuplicates() should definitely drop duplicates for all columns.
Dataset.dropDuplicates(col: String) should also drop duplicates for all columns matching the name.
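The difference between the two behaviours discussed here can be sketched with plain Scala collections. This is a model under assumptions, not Spark's implementation: `Row`, `dedupBy`, `firstMatch` and `allMatches` are hypothetical names, and the model keeps the first row seen per key purely to make the result deterministic.

```scala
import scala.collection.mutable.LinkedHashMap

object DropDuplicatesSketch {
  type Row = Vector[Any]

  // Keep the first row seen for each distinct combination of values at the key indices.
  def dedupBy(rows: Seq[Row], keyIdx: Seq[Int]): Seq[Row] = {
    val seen = LinkedHashMap.empty[Seq[Any], Row]
    rows.foreach { r =>
      val key = keyIdx.map(i => r(i))
      if (!seen.contains(key)) seen(key) = r
    }
    seen.values.toSeq
  }

  // Old behaviour in this model: only the first column with the given name is a key.
  def firstMatch(schema: Seq[String], name: String): Seq[Int] =
    schema.indexOf(name) match {
      case -1 => Seq.empty
      case i  => Seq(i)
    }

  // New behaviour in this model: every column with the given name is a key.
  def allMatches(schema: Seq[String], name: String): Seq[Int] =
    schema.zipWithIndex.collect { case (n, i) if n == name => i }

  def main(args: Array[String]): Unit = {
    // Two columns share the name "a", as can happen after a join.
    val schema = Seq("a", "a", "b")
    val rows: Seq[Row] = Seq(Vector(1, 1, "x"), Vector(1, 2, "y"), Vector(1, 1, "z"))
    println(dedupBy(rows, firstMatch(schema, "a")).size) // 1
    println(dedupBy(rows, allMatches(schema, "a")).size) // 2
    println(dedupBy(rows, schema.indices).size)          // 3
  }
}
```

With only the first `a` column as the key, the row `(1, 2, "y")` is dropped even though it differs in the second `a` column. Using all matching columns keeps it.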

cloud-fan · 2016-10-13T01:51:37Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+      // so we call filter instead of find.
+      val cols = allColumns.filter(col => resolver(col.name, colName))
+      if (cols.isEmpty) {
        throw new AnalysisException(


Dataset.drop is a no-op if the given name doesn't match any column. Should we follow it?

My thought is:

When a user mistakenly gives a wrong column name to Dataset.drop, the mistake is easy to find.

But for Dataset.dropDuplicates, it might be harder to notice that duplicate rows are still there. So throwing an explicit exception looks more appropriate to me.
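The trade-off in this exchange can be sketched in plain Scala. The names below are hypothetical, the case-insensitive `resolver` is only a stand-in for whatever resolver the session uses, and `IllegalArgumentException` stands in for Spark's `AnalysisException` so that the sketch has no Spark dependency.

```scala
object ResolveColumnsSketch {
  // Stand-in resolver: case-insensitive name comparison (an assumption for this sketch).
  val resolver: (String, String) => Boolean = (a, b) => a.equalsIgnoreCase(b)

  // Collect every column index whose name matches (filter rather than find).
  private def matching(columns: Seq[String], colName: String): Seq[Int] =
    columns.zipWithIndex.collect { case (n, i) if resolver(n, colName) => i }

  // drop-style lookup: an unknown name simply matches nothing (a no-op).
  def resolveForDrop(columns: Seq[String], colName: String): Seq[Int] =
    matching(columns, colName)

  // dropDuplicates-style lookup: an unknown name fails loudly.
  def resolveForDropDuplicates(columns: Seq[String], colName: String): Seq[Int] = {
    val idx = matching(columns, colName)
    if (idx.isEmpty) {
      val available = columns.mkString(", ")
      throw new IllegalArgumentException(
        s"Cannot resolve column name '$colName' among ($available)")
    }
    idx
  }

  def main(args: Array[String]): Unit = {
    val columns = Seq("_1", "_2", "_1")
    println(resolveForDropDuplicates(columns, "_1")) // List(0, 2)
    println(resolveForDrop(columns, "_3"))           // List()
  }
}
```

A silent no-op in the second lookup would leave duplicate rows in place without any signal, which is the concern raised above.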

cloud-fan · 2016-10-13T05:28:21Z

LGTM, merging to master!

viirya · 2016-10-13T05:31:09Z

Thanks for review! @rxin @cloud-fan

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#15427 from viirya/fix-dropduplicates.

Fix Dataset.dropduplicates.

dd6405c

viirya mentioned this pull request Oct 11, 2016

[SPARK-5992][ML] Locality Sensitive Hashing #15148

Closed

rxin reviewed Oct 11, 2016

Add more comments.

81339dc

cloud-fan reviewed Oct 13, 2016

asfgit closed this in 064d665 Oct 13, 2016

zsxwing mentioned this pull request Jan 12, 2017

[SPARK-19065][SQL]Don't inherit expression id in dropDuplicates #16564

Closed

viirya deleted the fix-dropduplicates branch December 27, 2023 18:33
