
Conversation

@davies
Contributor

@davies davies commented Aug 22, 2014

RDD.histogram(buckets)

    Compute a histogram using the provided buckets. The buckets
    are all open to the right except for the last, which is closed.
    e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
    which means 1<=x<10, 10<=x<20, 20<=x<=50. On an input of 1
    and 50 we would have a histogram of 1,0,1.

    If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
    this can be switched from an O(log n) insertion to O(1) per
    element (where n = # buckets).

    Buckets must be sorted, must not contain any duplicates, and
    must have at least two elements.

    If `buckets` is a number, it will generate buckets that are
    evenly spaced between the minimum and maximum of the RDD. For
    example, if the min value is 0 and the max is 100, given buckets
    as 2, the resulting buckets will be [0,50) [50,100]. buckets must
    be at least 1. An exception is raised if the RDD contains
    infinity or NaN. If the elements in the RDD do not vary
    (max == min), a single bucket is always returned.

    It will return a tuple of buckets and histogram.

    >>> rdd = sc.parallelize(range(51))
    >>> rdd.histogram(2)
    ([0, 25, 50], [25, 26])
    >>> rdd.histogram([0, 5, 25, 50])
    ([0, 5, 25, 50], [5, 20, 26])
    >>> rdd.histogram([0, 15, 30, 45, 60], True)
    ([0, 15, 30, 45, 60], [15, 15, 15, 6])
    >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
    >>> rdd.histogram(("a", "b", "c"))
    (('a', 'b', 'c'), [2, 2])

closes #122, which is a duplicate.
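
For reference, the bucketing semantics described above (right-open buckets, with the last bucket closed) can be reproduced locally with a small pure-Python sketch. This is only an illustration of the documented behavior, not the actual rdd.py implementation, and local_histogram is a hypothetical helper:

    from bisect import bisect_right

    def local_histogram(values, buckets):
        # buckets [b0, ..., bn] define n intervals: [b0,b1), ..., [b(n-1),bn];
        # every interval is open on the right except the last, which is closed.
        counts = [0] * (len(buckets) - 1)
        for v in values:
            if v < buckets[0] or v > buckets[-1]:
                continue  # values outside the bucket range are skipped in this sketch
            if v == buckets[-1]:
                counts[-1] += 1  # last bucket is closed on the right
            else:
                counts[bisect_right(buckets, v) - 1] += 1
        return counts

    print(local_histogram(range(51), [0, 5, 25, 50]))  # [5, 20, 26]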

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have started for PR 2091 at commit 0e18a2d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have finished for PR 2091 at commit 0e18a2d.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


how about "number of buckets must be >= 1" ?

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2091 at commit d9a0722.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2091 at commit d9a0722.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Aug 23, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2091 at commit d9a0722.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2091 at commit d9a0722.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

These are excellent unit tests. I ran a coverage report with coverage.py and it reports essentially 100% coverage for the driver-side histogram code in rdd.py.

export PYTHONPATH=$SPARK_HOME/python
coverage run python/pyspark/tests.py TestRDDFunctions.test_histogram
coverage html

@davies
Contributor Author

davies commented Aug 24, 2014

@mateiz @JoshRosen I would like to change evenBuckets to even; the latter is meaningful enough and much shorter.

One concern is that we would then have different names in Scala/Java and Python for the same parameter. These languages are quite different, though: Java does not have named parameters, so changing the name does not break compatibility, and while Scala does have named parameters, we have three APIs for histogram() in this case, so it is unlikely that a user will call it like this:

rdd.histogram(buckets, evenBuckets=True)

In Python, buckets can already be an int, which is different from Scala/Java, so having evenBuckets as even would not be a surprise here either.

Here are my 2 cents, what do you think?

@JoshRosen
Contributor

Why do we even need evenBuckets? Can't we just check whether the buckets are evenly-spaced and automatically perform the optimization if they are? This only requires a linear scan through the buckets array. I suppose that we could run into problems if the buckets are evenly-spaced but small rounding / precision errors cause us to mark the buckets as non-even.

/cc @holdenk who implemented the DoublePairRDDFunctions.histogram() function.
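
For illustration, a tolerance-based linear check along those lines could look like the sketch below; looks_evenly_spaced is a hypothetical helper, not code from this PR or from Spark:

    def looks_evenly_spaced(buckets, rel_tol=1e-9):
        # Decide whether sorted bucket boundaries are (approximately) evenly
        # spaced, so the O(1) index computation could be used instead of a
        # binary search. The tolerance guards against small rounding errors.
        width = float(buckets[-1] - buckets[0]) / (len(buckets) - 1)
        return all(abs((b - a) - width) <= rel_tol * max(abs(width), 1.0)
                   for a, b in zip(buckets, buckets[1:]))

    print(looks_evenly_spaced([0, 10, 20, 30]))   # True
    print(looks_evenly_spaced([0, 10, 25, 50]))   # False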

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2091 at commit 84e85fa.

  • This patch merges cleanly.

@davies
Contributor Author

davies commented Aug 25, 2014

@JoshRosen I have removed evenBuckets and added more tests, including some cases for the str type.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2091 at commit 84e85fa.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2091 at commit a322f8a.

  • This patch merges cleanly.

@holdenk
Contributor

holdenk commented Aug 26, 2014

@JoshRosen Sure, doing a linear scan works; evenBuckets was there because the caller knows whether it is providing even buckets.

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2091 at commit a322f8a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mateiz
Contributor

mateiz commented Aug 26, 2014

When you guys merge this, please close #122 as well. You should just edit the pull request description to also say "closes #122", and then it will close it when it merges. Let the person there know that this is now implemented as well, since that was a separate implementation.

BTW regarding evenBuckets -- I think it's more or less an accident that we exposed it in Scala, so I'd leave it out here. One awkward thing is that the Scala one is probably not even precise; it's not super clear that the floating-point math will work out such that each bucket's start / end is exactly the same result you get when doing the O(1) bucket computation.
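
As a quick way to probe that concern, one can compare the O(1) index with a binary-search index on generated bucket edges; this is just an experiment sketch, not code from the PR, and floating-point arithmetic does not guarantee the two always agree:

    from bisect import bisect_right

    lo, hi, n = 0.0, 1.0, 7
    width = (hi - lo) / n
    buckets = [lo + i * width for i in range(n + 1)]

    for x in buckets[:-1]:
        fast = int((x - lo) / width)         # O(1) bucket index
        slow = bisect_right(buckets, x) - 1  # binary-search bucket index
        if fast != slow:
            print("mismatch at", x, fast, slow)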

@davies
Contributor Author

davies commented Aug 26, 2014

done.

@mattf

mattf commented Aug 26, 2014

lgtm

@JoshRosen
Contributor

LGTM. Merged into master and branch-1.1 (since it only adds new methods and doesn't modify any existing code).

asfgit pushed a commit that referenced this pull request Aug 26, 2014
RDD.histogram(buckets)

Author: Davies Liu <[email protected]>

Closes #2091 from davies/histgram and squashes the following commits:

a322f8a [Davies Liu] fix deprecation of e.message
84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
d9a0722 [Davies Liu] address comments
0e18a2d [Davies Liu] add histgram() API

(cherry picked from commit 3cedc4f)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 3cedc4f Aug 26, 2014
kayousterhout pushed a commit to kayousterhout/spark-1 that referenced this pull request Aug 27, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@davies davies deleted the histgram branch September 15, 2014 22:20
