Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Oct 11, 2016

What changes were proposed in this pull request?

Two issues regarding Dataset.dropduplicates:

  1. Dataset.dropDuplicates should consider the columns with same column name

    We find and get the first resolved attribute from output with the given column name in Dataset.dropDuplicates. When we have the more than one columns with the same name. Other columns are put into aggregation columns, instead of grouping columns.

  2. Dataset.dropDuplicates should not change the output of child plan

    We create new Alias with new exprId in Dataset.dropDuplicates now. However it causes problem when we want to select the columns as follows:

    val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
    // ds("_2") will cause analysis exception
    ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])
    

Because the two issues are both related to Dataset.dropduplicates and the code changes are not big, so submitting them together as one PR.

How was this patch tested?

Jenkins tests.

@SparkQA
Copy link

SparkQA commented Oct 11, 2016

Test build #66724 has finished for PR 15427 at commit dd6405c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Oct 11, 2016

cc @cloud-fan @hvanhovell

attr
} else {
Alias(new First(attr).toAggregateExpression(), attr.name)()
// We should keep the original exprId of the attribute.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why in the comment why we should keep the original exprId? Otherwise this comment is redundant with the code itself.

@SparkQA
Copy link

SparkQA commented Oct 12, 2016

Test build #66790 has finished for PR 15427 at commit 81339dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

My thoughts:

  1. Dataset.dropDuplicates() should drop duplicates for all columns, the current implementation is wrong, this PR fixed it.
  2. Dataset.dropDuplicates(col: String) should drop the first column matching the given name, or all matched columns? Dataset.drop(col: String) also drops all matched columns, the new behaviour seems reasonable. But we should be very careful, as this is a breaking change. cc @rxin

@rxin
Copy link
Contributor

rxin commented Oct 12, 2016

  1. Dataset.dropDuplicates() should definitely drop duplicates for all columns.
  2. Dataset.dropDuplicates(col: String) should also drop duplicates for all columns matching the name.

// so we call filter instead of find.
val cols = allColumns.filter(col => resolver(col.name, colName))
if (cols.isEmpty) {
throw new AnalysisException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataset.drop is a no-op if the given name doesn't match any column. Should we follow it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought is:

When an user mistakenly gives wrong column to Dataset.drop, it can be easily found out.

But for Dataset.dropDuplicates, it might be harder to figure out duplicate rows are still there. So to throw an explicit exception looks more proper to me.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 064d665 Oct 13, 2016
@viirya
Copy link
Member Author

viirya commented Oct 13, 2016

Thanks for review! @rxin @cloud-fan

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Two issues regarding Dataset.dropduplicates:

1. Dataset.dropDuplicates should consider the columns with same column name

    We find and get the first resolved attribute from output with the given column name in `Dataset.dropDuplicates`. When we have the more than one columns with the same name. Other columns are put into aggregation columns, instead of grouping columns.

2. Dataset.dropDuplicates should not change the output of child plan

    We create new `Alias` with new exprId in `Dataset.dropDuplicates` now. However it causes problem when we want to select the columns as follows:

        val ds = Seq(("a", 1), ("a", 2), ("b", 1), ("a", 1)).toDS()
        // ds("_2") will cause analysis exception
        ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int])

Because the two issues are both related to `Dataset.dropduplicates` and the code changes are not big, so submitting them together as one PR.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#15427 from viirya/fix-dropduplicates.
@viirya viirya deleted the fix-dropduplicates branch December 27, 2023 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants