@kaka1992
Contributor

kaka1992 commented May 3, 2015

Similar to pandas' DataFrame.drop_duplicates: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
def dropDuplicates(): DataFrame
def dropDuplicates(subset: Seq[String]): DataFrame

@AmplabJenkins

Can one of the admins verify this patch?

Member

Isn't distinct the same as dropDuplicates for removing duplicate rows? If they are the same, which implementation is better: GroupedData or the Distinct node?

Contributor Author

@viirya No. dropDuplicates removes rows that are duplicates in a chosen subset of columns, or in all columns (the default). The default version is the same as distinct.

Member

Couldn't you also select a subset of columns and then call distinct?

Contributor Author

If you do that, the result no longer contains all the columns.
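
To make the difference in this thread concrete, here is a minimal sketch that models the semantics with plain Scala collections. It is not Spark code: the `Row` alias, the `DedupSemantics` object, and both helper functions are hypothetical names chosen for illustration, and "keep the first row seen" is a simplifying assumption about which duplicate survives.

```scala
// Plain-collection model of the semantics discussed in this thread; not Spark code.
object DedupSemantics {
  // A row is modelled as column name -> value (hypothetical representation).
  type Row = Map[String, Any]

  // Keep the first row seen for each distinct combination of the subset columns.
  // Every surviving row still carries all of its original columns.
  def dropDuplicates(rows: Seq[Row], subset: Seq[String]): Seq[Row] = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[Seq[Any], Row]
    for (row <- rows) {
      val key: Seq[Any] = subset.map(c => row(c))
      if (!seen.contains(key)) seen(key) = row
    }
    seen.values.toSeq
  }

  // Project onto the subset first, then remove duplicate rows.
  // The columns outside the subset are gone from the result.
  def selectThenDistinct(rows: Seq[Row], subset: Seq[String]): Seq[Row] =
    rows.map(row => subset.map(c => c -> row(c)).toMap).distinct

  // Sample data, invented for this example.
  val people: Seq[Row] = Seq(
    Map("name" -> "Ann", "age" -> 30, "city" -> "Oslo"),
    Map("name" -> "Ann", "age" -> 30, "city" -> "Rome"),
    Map("name" -> "Bob", "age" -> 25, "city" -> "Oslo")
  )
}
```

With this data, `DedupSemantics.dropDuplicates(people, Seq("name", "age"))` keeps two rows that still have `name`, `age`, and `city`, whereas `DedupSemantics.selectThenDistinct(people, Seq("name", "age"))` returns two rows that contain only `name` and `age`. Passing every column as the subset behaves like `distinct`.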

@kaka1992
Contributor Author

kaka1992 commented May 5, 2015

@viirya Please test this.

@kaka1992
Contributor Author

kaka1992 commented May 6, 2015

@rxin Please test this.

Contributor

This is just

groupBy(subset : _*).agg(columns.map(columnFirst) : _*)

(You might need to take the head and then pass the tail as varargs.)
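
The group-then-take-first idea in this comment can be modelled with plain Scala collections. This is a sketch under stated assumptions, not the patch itself: `GroupByFirstSketch`, `Row`, and `aggFirst` are hypothetical names, and `aggFirst` is only a stand-in for an aggregation whose signature takes one required argument followed by varargs (Spark's `agg(expr: Column, exprs: Column*)` has that shape), which is why `columns` is split into head and tail.

```scala
// Plain-collection model of "group by the subset, take the first value of every column".
object GroupByFirstSketch {
  type Row = Map[String, Any]

  // Hypothetical stand-in for an aggregation with a "one required argument
  // plus varargs" signature. For each group it reads the listed columns
  // from the group's first row.
  def aggFirst(groups: Iterable[Seq[Row]], col: String, cols: String*): Seq[Row] =
    groups.map(group => (col +: cols).map(c => c -> group.head(c)).toMap).toSeq

  // Assumes `columns` is non-empty; `columns.head` throws otherwise.
  def dropDuplicates(rows: Seq[Row], columns: Seq[String], subset: Seq[String]): Seq[Row] = {
    // Rows with equal values in the subset columns land in the same group.
    val groups = rows.groupBy(row => subset.map(c => row(c))).values
    // Take the head, then pass the tail as varargs.
    aggFirst(groups, columns.head, columns.tail: _*)
  }
}
```

The order of the returned rows follows the grouping map and should not be relied on; only the set of surviving rows is meaningful in this model.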

@rxin
Contributor

rxin commented May 6, 2015

Jenkins, test this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 6, 2015

Test build #31935 has started for PR 5870 at commit b6f1879.

@SparkQA

SparkQA commented May 6, 2015

Test build #31935 has finished for PR 5870 at commit b6f1879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31935/

@kaka1992
Contributor Author

kaka1992 commented May 6, 2015

@rxin Please retest this.

Contributor

We should also check whether subset.size == 0 or columns.size == 0 and, if so, simply return an empty DataFrame (there is one in SQLContext).
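
A guard of the kind suggested here might look like the following plain-collection sketch. It is not Spark code: `EmptyInputGuardSketch` and `Row` are hypothetical names, and an empty `Seq` stands in for the empty DataFrame the comment mentions. Whether the merged change handles empty input this way is not shown in this conversation.

```scala
// Plain-collection model of the suggested empty-input check.
object EmptyInputGuardSketch {
  type Row = Map[String, Any]

  def dropDuplicates(rows: Seq[Row], columns: Seq[String], subset: Seq[String]): Seq[Row] =
    if (subset.isEmpty || columns.isEmpty) {
      // Nothing to deduplicate on, or no columns to return: give back an empty result.
      Seq.empty
    } else {
      rows
        .groupBy(row => subset.map(c => row(c)))                  // one group per distinct key
        .values
        .map(group => columns.map(c => c -> group.head(c)).toMap) // first row of each group
        .toSeq
    }
}
```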

@adrian-wang
Contributor

I think keeping the takeFirst parameter would make this easier to understand.

asfgit closed this in b6bf4f7 May 12, 2015
asfgit pushed a commit that referenced this pull request May 12, 2015
This should also close #5870

Author: Reynold Xin

Closes #6066 from rxin/dropDups and squashes the following commits:

130692f [Reynold Xin] [SPARK-7324][SQL] DataFrame.dropDuplicates

(cherry picked from commit b6bf4f7)
Signed-off-by: Michael Armbrust
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015