
Conversation

@davies
Contributor

@davies davies commented Aug 22, 2014

RDD.max(key=None)

    Find the maximum item in this RDD.

    @param key: A function used to generate a key for comparing

    >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
    >>> rdd.max()
    43.0
    >>> rdd.max(key=str)
    5.0

RDD.min(key=None)

    Find the minimum item in this RDD.

    @param key: A function used to generate a key for comparing

    >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
    >>> rdd.min()
    2.0
    >>> rdd.min(key=str)
    10.0

RDD.top(num, key=None)

    Get the top N elements from an RDD.

    Note: It returns the list sorted in descending order.
    >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
    [12]
    >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
    [6, 5]
    >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
    [4, 3, 2]
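
An aside on what the key argument does: elements are compared by key(element), so rdd.max(key=str) picks the element whose string form is largest ('5.0' sorts above '43.0'). A key-based max is essentially a reduce that keeps the element with the larger key; the following is a sketch of the semantics, not necessarily the exact implementation in this patch:

    >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
    >>> rdd.reduce(lambda a, b: a if str(a) >= str(b) else b)  # same result as rdd.max(key=str)
    5.0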

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have started for PR 2094 at commit dd91e08.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have finished for PR 2094 at commit dd91e08.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


nit - the builtin 'max'

Contributor Author

I think cmp is the function used in max or min, so cmp is the default value for comp.


cmp may be used in max, but for this function the default is on line 829. Either way, it's a minor nitpick.

Contributor Author

Yes, using comp here is a bit confusing. The builtin min uses key, which would be better for Python programmers, but it would be different from the Scala API.
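
For reference, Python's builtin min and max take a key function rather than a comparator:

>>> min(['10', '2', '33'], key=int)
'2'
>>> min(['10', '2', '33'])
'10'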

cc @mateiz @rxin @JoshRosen

Contributor Author

We already use key in Python instead of Ordering in Scala, so I have changed it to key.

Also, I would like to add key to top(); it will be helpful, for example:

rdd.map(lambda x: (x, 1)).reduceByKey(add).top(20, key=itemgetter(1))

We already have ord in Scala. Should I add it in this PR?
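
A fuller, runnable version of that idea, assuming a SparkContext sc; the sample data and the shown output are made up for illustration:

>>> from operator import add, itemgetter
>>> words = sc.parallelize(["a", "b", "a", "c", "b", "a"])
>>> counts = words.map(lambda w: (w, 1)).reduceByKey(add)
>>> counts.top(2, key=itemgetter(1))  # rank (word, count) pairs by count
[('a', 3), ('b', 2)]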

@mattf

mattf commented Aug 23, 2014

are you planning to add tests for these?

@davies
Contributor Author

davies commented Aug 23, 2014

@mattf thank you for reviewing this. I think the doctests are enough; they cover the cases with and without comp. What kinds of tests should be added?


consider a default of comp=min in the arg list and a test for comp is not min

same for the max method

Contributor Author

min and cmp have different meanings:

>>> min(1, 2)
1
>>> cmp(1, 2)
-1

@mattf

mattf commented Aug 23, 2014

agreed re doctest. i forgot it was in use.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit 2f63512.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit 2f63512.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit ad7e374.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit ccbaf25.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit ccbaf25.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit ad7e374.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Epydoc renders docstrings + @params kind of oddly, but I don't think it's a big deal:

[screenshot of the Epydoc-rendered docstring]

In the long run, we might want to move to Sphinx, since that seems to be what's popular with most major Python projects.

@JoshRosen
Contributor

I like this updated approach of using key instead of a comparator, since that's a closer match to Python's min function. Can you update the PR's title and description to reflect this?

@davies davies changed the title [SPARK-2871] [PySpark] add comp argument for RDD.max() and RDD.min() [SPARK-2871] [PySpark] add key argument for max(), min() and top(n) Aug 24, 2014
@JoshRosen
Contributor

I've merged this into master. Thanks!

@asfgit asfgit closed this in db436e3 Aug 24, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

Author: Davies Liu <[email protected]>

Closes apache#2094 from davies/cmp and squashes the following commits:

ccbaf25 [Davies Liu] add `key` to top()
ad7e374 [Davies Liu] fix tests
2f63512 [Davies Liu] change `comp` to `key` in min/max
dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
@davies davies deleted the cmp branch September 15, 2014 22:16