[SPARK-7324][SQL] Add DataFrame.dropDuplicates #5870

kaka1992 · 2015-05-03T06:15:21Z

Similar to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
def dropDuplicates(): DataFrame
def dropDuplicates(subset: Seq[String]): DataFrame

AmplabJenkins · 2015-05-03T06:17:10Z

Can one of the admins verify this patch?

viirya · 2015-05-04T07:12:20Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

Isn't distinct the same as dropDuplicates for removing duplicate rows? If they are the same, which implementation is better: GroupedData or the Distinct node?

@viirya No, dropDuplicates removes rows that are duplicates in some columns, or in all columns (the default). The default version is the same as distinct.

Couldn't you also select a subset of columns and then call distinct?

If you do that, you can't get all of the columns back.
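The difference discussed in this thread can be sketched with plain Scala collections. This is only a model of the semantics, not Spark's implementation: rows are represented as Maps, the object and helper names are hypothetical, and keeping the first row seen per key is an assumption of the sketch.

```scala
// Plain-Scala model of the three operations (illustration only, not Spark code).
object DistinctVsDropDuplicates {
  type Row = Map[String, Any]

  val rows: Seq[Row] = Seq(
    Map("name" -> "a", "age" -> 1, "city" -> "x"),
    Map("name" -> "a", "age" -> 1, "city" -> "y"),
    Map("name" -> "b", "age" -> 2, "city" -> "z"),
    Map("name" -> "b", "age" -> 2, "city" -> "z")
  )

  // distinct: rows are duplicates only if they match in every column.
  def distinctRows(rs: Seq[Row]): Seq[Row] = rs.distinct

  // select(subset) then distinct: duplicates collapse, but the other columns are gone.
  def selectThenDistinct(rs: Seq[Row], subset: Seq[String]): Seq[Row] =
    rs.map(r => r.filter { case (k, _) => subset.contains(k) }).distinct

  // dropDuplicates(subset): one full row per distinct key.
  // Assumption of this sketch: the first row seen for each key is the one kept.
  def dropDuplicates(rs: Seq[Row], subset: Seq[String]): Seq[Row] = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[Seq[Any], Row]
    rs.foreach { r =>
      val key = subset.map(c => r(c))
      if (!seen.contains(key)) seen(key) = r
    }
    seen.values.toSeq
  }

  def main(args: Array[String]): Unit = {
    val subset = Seq("name", "age")
    println(distinctRows(rows).size)               // 3
    println(selectThenDistinct(rows, subset).size) // 2 rows, 2 columns each
    println(dropDuplicates(rows, subset).size)     // 2 rows, 3 columns each
  }
}
```

Only the last variant both deduplicates on the subset and keeps every column, which is the point kaka1992 makes above.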

kaka1992 · 2015-05-05T02:44:21Z

@viirya Please test this.

kaka1992 · 2015-05-06T01:58:58Z

@rxin Please test this.

rxin · 2015-05-06T02:04:17Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

this is just

groupBy(subset : _*).agg(columns.map(columnFirst) : _*)

(you might need to take the head and then vararg the tail)
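The "take the head and then vararg the tail" remark applies to any method shaped like agg(expr, exprs*), which needs one required argument followed by varargs. A minimal sketch of that pattern follows, using hypothetical stand-ins (`agg`, `columnFirst`, `firstPerGroup` are defined locally here and are not Spark's API), together with an in-memory model of what a group-by plus "first of each column" would compute.

```scala
// Illustration only: local stand-ins, not Spark's GroupedData API.
object GroupByFirstSketch {
  type Row = Map[String, Any]

  // Hypothetical method with one required argument plus varargs.
  // It simply returns the expressions it was given.
  def agg(expr: String, exprs: String*): Seq[String] = expr +: exprs

  // Hypothetical helper: names the "first value of this column" expression.
  def columnFirst(col: String): String = s"first($col)"

  // A whole Seq cannot be passed as `exprs: _*` alone because `expr` is required,
  // so split it into head and tail. This throws on an empty column list.
  def aggAll(columns: Seq[String]): Seq[String] = {
    val exprs = columns.map(columnFirst)
    agg(exprs.head, exprs.tail: _*)
  }

  // In-memory model of the result: group rows by the subset columns and
  // keep the first row of each group (group order is not guaranteed).
  def firstPerGroup(rows: Seq[Row], subset: Seq[String]): Seq[Row] =
    rows.groupBy(r => subset.map(c => r(c))).values.map(_.head).toSeq

  def main(args: Array[String]): Unit = {
    println(aggAll(Seq("name", "age")))
    val rows: Seq[Row] = Seq(
      Map("name" -> "a", "age" -> 1),
      Map("name" -> "a", "age" -> 2),
      Map("name" -> "b", "age" -> 3)
    )
    println(firstPerGroup(rows, Seq("name")).size)
  }
}
```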

rxin · 2015-05-06T02:05:44Z

Jenkins, test this please.

AmplabJenkins · 2015-05-06T02:07:10Z

Merged build triggered.

AmplabJenkins · 2015-05-06T02:07:16Z

Merged build started.

SparkQA · 2015-05-06T02:09:13Z

Test build #31935 has started for PR 5870 at commit b6f1879.

SparkQA · 2015-05-06T04:00:35Z

Test build #31935 has finished for PR 5870 at commit b6f1879.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-06T04:00:39Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-06T04:00:40Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31935/
Test PASSed.

kaka1992 · 2015-05-06T05:44:27Z

@rxin Please retest this.

rxin · 2015-05-06T06:28:42Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

We should also check whether subset.size == 0 or columns.size == 0 and, if so, simply return an empty data frame (there is one in SQLContext).
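The guard suggested here might look roughly like the following, again modelled on in-memory rows rather than on a real DataFrame. The object name, the explicit `columns` parameter, and the use of an empty Seq in place of SQLContext's empty data frame are all assumptions of this sketch.

```scala
// Illustration only: an in-memory model of the suggested early return.
object EmptyGuardSketch {
  type Row = Map[String, Any]

  def dropDuplicates(columns: Seq[String], rows: Seq[Row], subset: Seq[String]): Seq[Row] =
    if (subset.isEmpty || columns.isEmpty) {
      // Nothing to deduplicate on: return the empty result, as the review suggests.
      Seq.empty
    } else {
      // Otherwise keep the first row of each group keyed by the subset columns.
      rows.groupBy(r => subset.map(c => r(c))).values.map(_.head).toSeq
    }
}
```

Checking up front also avoids calling head on an empty list of column expressions in the aggregation step.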

adrian-wang · 2015-05-07T03:04:59Z

I think keeping the takeFirst parameter would make this easier to understand.

This should also close #5870
Author: Reynold Xin <[email protected]>
Closes #6066 from rxin/dropDups and squashes the following commits:
130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates
(cherry picked from commit b6bf4f7)
Signed-off-by: Michael Armbrust <[email protected]>

This should also close apache#5870
Author: Reynold Xin <[email protected]>
Closes apache#6066 from rxin/dropDups and squashes the following commits:
130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates

云峤 added 8 commits May 1, 2015 23:50

[SPARK-7294] ADD BETWEEN

d11d5b9

[SPARK-7294] ADD BETWEEN

baf839b

[SPARK-7294] ADD BETWEEN

7d62368

Merge remote-tracking branch 'remotes/upstream/master'

76f0c51

update pep8

f080f8d

undo

c6e49bc

undo

d6cc28d

update

aab51ef

update

b6f1879

viirya reviewed May 4, 2015

rxin reviewed May 6, 2015

Remove useless code.

571869e

rxin reviewed May 6, 2015

rxin mentioned this pull request May 11, 2015

[SPARK-7324][SQL] DataFrame.dropDuplicates #6066

Closed

Update

1de8791

asfgit closed this in b6bf4f7 May 12, 2015

6 participants