[SPARK-2871] [PySpark] add key argument for max(), min() and top(n)
#2094
|
QA tests have started for PR 2094 at commit
|
|
QA tests have finished for PR 2094 at commit
|
python/pyspark/rdd.py
Outdated
nit - the builtin 'max'
I think cmp is the function used in max or min, so cmp is the default value for comp.
cmp may be used in max, but for this func the default is on line 829. either way, a minor nitpick.
Yes, using comp here is a bit confusing. The builtin min uses key, which would be more natural for Python programmers, but it would differ from the Scala API.
We already use key in Python where Scala uses Ordering, so I have changed it to key.
Also, I would like to add key to top(); it would be helpful, for example:
rdd.map(lambda x: (x, 1)).reduceByKey(add).top(20, key=itemgetter(1))
We already have ord in Scala. Should I add this in this PR?
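For readers without a cluster at hand, the count-then-rank pattern in the snippet above can be approximated in plain Python. This is only a sketch: `words` is made-up sample data, `collections.Counter` stands in for the per-key sum, and `heapq.nlargest` stands in for `top()`. None of it is PySpark code.

```python
from collections import Counter
from heapq import nlargest
from operator import itemgetter

# Hypothetical sample data standing in for the RDD's elements.
words = ["a", "b", "a", "c", "b", "a"]

# Counter plays the role of mapping each word to (word, 1) and summing per key.
counts = Counter(words)

# key=itemgetter(1) ranks (word, count) pairs by their count,
# which is what top(n, key=itemgetter(1)) is meant to express.
top2 = nlargest(2, counts.items(), key=itemgetter(1))
print(top2)
```

Without the `key`, the pairs would be ranked by the word first, which is rarely what a word count wants.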
|
are you planning to add tests for these? |
|
@mattf thank you for reviewing this. I think the doctests are enough; they already cover the cases w or w/o |
python/pyspark/rdd.py
Outdated
consider default of comp=min in arg list and test for comp is not min
same for max method
min and comp have different meanings:
>>> min(1, 2)
1
>>> cmp(1, 2)
-1
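The distinction can be checked directly. Note that `cmp` is a Python 2 builtin and does not exist in Python 3, so the sketch below defines an equivalent helper (the name `cmp` is reused purely for illustration) and uses `functools.cmp_to_key` to show how a comparator relates to a `key` function.

```python
from functools import cmp_to_key

def cmp(a, b):
    """Python 2 style comparator: returns -1, 0, or 1."""
    return (a > b) - (a < b)

# min returns one of its arguments; cmp returns an ordering signal.
smallest = min(1, 2)
ordering = cmp(1, 2)

# A comparator can be adapted into a key function when one is required.
values = [3, 1, 2]
largest = max(values, key=cmp_to_key(cmp))
print(smallest, ordering, largest)
```

So a default of `comp=min` would not be a drop-in replacement for a comparator argument: the two functions have different return contracts.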
|
agreed re doctest. i forgot it was in use. |
|
QA tests have started for PR 2094 at commit
|
|
QA tests have finished for PR 2094 at commit
|
|
QA tests have started for PR 2094 at commit
|
|
QA tests have started for PR 2094 at commit
|
|
QA tests have finished for PR 2094 at commit
|
|
QA tests have finished for PR 2094 at commit
|
|
I like this updated approach of using |
Title changed from "comp argument for RDD.max() and RDD.min()" to "key argument for max(), min() and top(n)"
|
I've merged this into master. Thanks! |
RDD.max(key=None)
param key: A function used to generate key for comparing
>>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
>>> rdd.max()
43.0
>>> rdd.max(key=str)
5.0
RDD.min(key=None)
Find the minimum item in this RDD.
param key: A function used to generate key for comparing
>>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
>>> rdd.min()
2.0
>>> rdd.min(key=str)
10.0
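The `key=str` results follow from string comparison: `'5.0'` sorts after `'43.0'` because `'5' > '4'`, and `'10.0'` sorts before `'2.0'` because `'1' < '2'`. One plausible way to get this behaviour from a reduce-style operation is to apply the two-argument builtin pairwise. The sketch below is plain Python and makes no claim to be the PR's code; `rdd_max` and `rdd_min` are hypothetical names, and a list stands in for the RDD.

```python
from functools import reduce

def rdd_max(items, key=None):
    """Pairwise reduction with the builtin max; `items` stands in for an RDD."""
    if key is None:
        return reduce(max, items)
    return reduce(lambda a, b: max(a, b, key=key), items)

def rdd_min(items, key=None):
    """Pairwise reduction with the builtin min; `items` stands in for an RDD."""
    if key is None:
        return reduce(min, items)
    return reduce(lambda a, b: min(a, b, key=key), items)

print(rdd_max([1.0, 5.0, 43.0, 10.0]))
print(rdd_max([1.0, 5.0, 43.0, 10.0], key=str))
print(rdd_min([2.0, 5.0, 43.0, 10.0], key=str))
```

The key function only decides which element wins each comparison; the value returned is always an original element, never the key.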
RDD.top(num, key=None)
Get the top N elements from an RDD.
Note: It returns the list sorted in descending order.
>>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
[12]
>>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
[6, 5]
>>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
[4, 3, 2]
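With `key=str` the elements are ranked as strings, so the order is `'4' > '3' > '2' > '12' > '10'`. A keyed top-N over partitioned data could be computed roughly as follows. This is a sketch of the general technique (a per-partition `heapq.nlargest` followed by a pairwise merge), written in plain Python under the assumption that lists of lists stand in for partitions; `top` here is a hypothetical free function, not the RDD method.

```python
from functools import reduce
from heapq import nlargest

def top(partitions, num, key=None):
    """Return the `num` largest elements across all partitions, in descending order."""
    # Keep only the best `num` candidates from each partition.
    per_partition = [nlargest(num, part, key=key) for part in partitions]

    def merge(a, b):
        # Combine two candidate lists and trim back to `num` elements.
        return nlargest(num, a + b, key=key)

    return reduce(merge, per_partition)

print(top([[10, 4, 2], [12, 3]], 3, key=str))
print(top([[2, 3], [4, 5, 6]], 2))
```

Trimming at every step keeps each intermediate list at most `num` elements long, so the merge never has to hold a whole partition.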
Author: Davies Liu
Closes apache#2094 from davies/cmp and squashes the following commits:
ccbaf25 [Davies Liu] add `key` to top()
ad7e374 [Davies Liu] fix tests
2f63512 [Davies Liu] change `comp` to `key` in min/max
dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
