
Conversation

@davies
Contributor

@davies davies commented Aug 22, 2014

RDD.histogram(buckets)

    Compute a histogram using the provided buckets. The buckets
    are all open to the right except for the last, which is closed.
    e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
    which means 1<=x<10, 10<=x<20, 20<=x<=50. On an input of 1
    and 50 we would have a histogram of 1,0,1.

    If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
    this can be switched from an O(log n) insertion to O(1) per
    element (where n = # buckets).

    Buckets must be sorted, must not contain any duplicates, and
    must have at least two elements.

    If `buckets` is a number, it will generate buckets that are
    evenly spaced between the minimum and maximum of the RDD. For
    example, if the min value is 0 and the max is 100, given buckets
    as 2, the resulting buckets will be [0,50) [50,100]. buckets must
    be at least 1. An exception is raised if the RDD contains
    infinity or NaN. If the elements in the RDD do not vary
    (max == min), a single bucket is always returned.

    It will return a tuple of buckets and histogram.

    >>> rdd = sc.parallelize(range(51))
    >>> rdd.histogram(2)
    ([0, 25, 50], [25, 26])
    >>> rdd.histogram([0, 5, 25, 50])
    ([0, 5, 25, 50], [5, 20, 26])
    >>> rdd.histogram([0, 15, 30, 45, 60], True)
    ([0, 15, 30, 45, 60], [15, 15, 15, 6])
    >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
    >>> rdd.histogram(("a", "b", "c"))
    (('a', 'b', 'c'), [2, 2])

closes #122, which is a duplicate.
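
For reference, the bucketing semantics described above (right-open buckets, with the last bucket closed) can be reproduced locally with a small pure-Python sketch. This is only an illustration of the documented behavior, not the actual rdd.py implementation, and local_histogram is a hypothetical helper:

    from bisect import bisect_right

    def local_histogram(values, buckets):
        # buckets [b0, ..., bn] define n intervals: [b0,b1), ..., [b(n-1),bn];
        # every interval is open on the right except the last, which is closed.
        counts = [0] * (len(buckets) - 1)
        for v in values:
            if v < buckets[0] or v > buckets[-1]:
                continue  # values outside the bucket range are skipped in this sketch
            if v == buckets[-1]:
                counts[-1] += 1  # last bucket is closed on the right
            else:
                counts[bisect_right(buckets, v) - 1] += 1
        return counts

    print(local_histogram(range(51), [0, 5, 25, 50]))  # [5, 20, 26]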

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have started for PR 2091 at commit 0e18a2d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have finished for PR 2091 at commit 0e18a2d.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


how about "number of buckets must be >= 1" ?

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2091 at commit d9a0722.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2091 at commit d9a0722.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor Author

davies commented Aug 23, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2091 at commit d9a0722.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2091 at commit d9a0722.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

These are excellent unit tests. I ran a coverage report with coverage.py and it reports essentially 100% coverage for the driver-side histogram code in rdd.py.

export PYTHONPATH=$SPARK_HOME/python
coverage run python/pyspark/tests.py TestRDDFunctions.test_histogram
coverage html

@davies
Contributor Author

davies commented Aug 24, 2014

@mateiz @JoshRosen I would like to change evenBuckets to even; the latter is meaningful enough and much shorter.

One concern is that we would then have different names in Scala/Java and Python for the same parameter. These languages are quite different, though: Java does not have named parameters, so changing the name does not break compatibility, and while Scala does have named parameters, we have three APIs for histogram() in this case, so it is unlikely that a user will call it like this:

rdd.histogram(buckets, evenBuckets=True)

In Python, buckets can already be an int, which is different from Scala/Java, so having evenBuckets as even would not be a surprise here either.

Here are my 2 cents, what do you think?

@JoshRosen
Contributor

Why do we even need evenBuckets? Can't we just check whether the buckets are evenly-spaced and automatically perform the optimization if they are? This only requires a linear scan through the buckets array. I suppose that we could run into problems if the buckets are evenly-spaced but small rounding / precision errors cause us to mark the buckets as non-even.

/cc @holdenk who implemented the DoublePairRDDFunctions.histogram() function.
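
For illustration, a tolerance-based linear check along those lines could look like the sketch below; looks_evenly_spaced is a hypothetical helper, not code from this PR or from Spark:

    def looks_evenly_spaced(buckets, rel_tol=1e-9):
        # Decide whether sorted bucket boundaries are (approximately) evenly
        # spaced, so the O(1) index computation could be used instead of a
        # binary search. The tolerance guards against small rounding errors.
        width = float(buckets[-1] - buckets[0]) / (len(buckets) - 1)
        return all(abs((b - a) - width) <= rel_tol * max(abs(width), 1.0)
                   for a, b in zip(buckets, buckets[1:]))

    print(looks_evenly_spaced([0, 10, 20, 30]))   # True
    print(looks_evenly_spaced([0, 10, 25, 50]))   # False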

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2091 at commit 84e85fa.

  • This patch merges cleanly.

@davies
Contributor Author

davies commented Aug 25, 2014

@JoshRosen I have removed evenBuckets and added more tests, including some cases for the str type.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 2091 at commit 84e85fa.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have started for PR 2091 at commit a322f8a.

  • This patch merges cleanly.

@holdenk
Contributor

holdenk commented Aug 26, 2014

@JoshRosen Sure, doing a linear scan works; evenBuckets was there because the caller knows whether it is providing even buckets.

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have finished for PR 2091 at commit a322f8a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mateiz
Contributor

mateiz commented Aug 26, 2014

When you guys merge this, please close #122 as well. You should just edit the pull request description to also say "closes #122", and then it will close it when it merges. Let the person there know that this is now implemented as well, since that was a separate implementation.

BTW regarding evenBuckets -- I think it's more or less an accident that we exposed it in Scala, so I'd leave it out here. One awkward thing is that the Scala one is probably not even precise; it's not super clear that the floating-point math will work out such that each bucket's start / end is exactly the same result you get when doing the O(1) bucket computation.
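
As a quick way to probe that concern, one can compare the O(1) index with a binary-search index on generated bucket edges; this is just an experiment sketch, not code from the PR, and floating-point arithmetic does not guarantee the two always agree:

    from bisect import bisect_right

    lo, hi, n = 0.0, 1.0, 7
    width = (hi - lo) / n
    buckets = [lo + i * width for i in range(n + 1)]

    for x in buckets[:-1]:
        fast = int((x - lo) / width)         # O(1) bucket index
        slow = bisect_right(buckets, x) - 1  # binary-search bucket index
        if fast != slow:
            print("mismatch at", x, fast, slow)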

@davies
Contributor Author

davies commented Aug 26, 2014

done.

@mattf

mattf commented Aug 26, 2014

lgtm

@JoshRosen
Contributor

LGTM. Merged into master and branch-1.1 (since it only adds new methods and doesn't modify any existing code).

asfgit pushed a commit that referenced this pull request Aug 26, 2014
RDD.histogram(buckets)

Author: Davies Liu <[email protected]>

Closes #2091 from davies/histgram and squashes the following commits:

a322f8a [Davies Liu] fix deprecation of e.message
84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
d9a0722 [Davies Liu] address comments
0e18a2d [Davies Liu] add histgram() API

(cherry picked from commit 3cedc4f)
Signed-off-by: Josh Rosen <[email protected]>
@asfgit asfgit closed this in 3cedc4f Aug 26, 2014
kayousterhout pushed a commit to kayousterhout/spark-1 that referenced this pull request Aug 27, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
@davies davies deleted the histgram branch September 15, 2014 22:20
