[SPARK-2871] [PySpark] add countApproxDistinct() API #2142
Conversation
Conflicts: python/pyspark/tests.py
@mateiz @JoshRosen this one was separated from #1791, please take a look at it. (@mateiz had already reviewed this part in that PR; sorry for the duplicated reviewing.)
Can you add a test to make sure that if you have 1000 non-distinct elements (i.e. the same element appearing 1000 times), this doesn't return ~1000?
Asking because I'm not sure how PySpark interacts with Java: if it goes through byte arrays, then the hashCode could be wrong, since Java hashes byte arrays by identity rather than by content.
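A hypothetical sketch of the regression test being requested, in plain Python without Spark (the 32-bit masking and variable names are assumptions for illustration, not code from this PR):

```python
# Sketch of the requested test: 1000 copies of the same element must
# collapse to one hash bucket, not ~1000, when hashing is content-based.
data = ["same-element"] * 1000

# Content-based hashing (what Python's hash() provides) collapses
# duplicates; identity-based hashing (Java's default for byte[]) would not.
distinct_hashes = {hash(x) & 0xFFFFFFFF for x in data}
print(len(distinct_hashes))  # -> 1
```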
Good point; it helped me find an invalid test case using a tuple, which is unpickled as an Object[] in Java, whose hashCode is not determined by its content. I changed it to set([]), which should behave similarly in Python and Java.
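A small illustration of the mismatch, in pure Python (here `id()` stands in for Java's identity-based hashCode; this is an analogy, not actual JVM behavior):

```python
# Equal tuples hash equally in Python because tuple hashing is content-based.
assert hash((1, 2, 3)) == hash((1, 2, 3))

# A Java Object[] has an identity-based hashCode, roughly analogous to
# Python's id(): two arrays with equal contents still hash differently.
a = [1, 2, 3]
b = [1, 2, 3]
assert a == b and id(a) != id(b)
```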
I changed it to do the hash calculation in Python, so it can support all hashable Python types.
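A minimal sketch of what "hash calculation in Python" could look like (the helper name and the masking are assumptions; the idea is just to give every hashable Python object a content-based hash in a space the JVM side can share):

```python
def portable_hash_32(obj):
    """Hypothetical helper: fold Python's hash() into a non-negative
    32-bit value, so any hashable Python object (tuples, strings,
    frozensets, ...) gets a content-based hash usable on the JVM side."""
    return hash(obj) & 0xFFFFFFFF

# Works for any hashable type, and equal values agree:
assert portable_hash_32((1, "a")) == portable_hash_32((1, "a"))
assert 0 <= portable_hash_32(frozenset([1, 2])) < 2**32
```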
Why this change?
On 64-bit machines, the hash of a tuple is 64 bits (sys.maxint is 64-bit there).
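A quick way to check the hash width on a given interpreter (sys.maxint existed in Python 2; sys.maxsize is the Python 3 equivalent used here):

```python
import sys

# On a 64-bit build, sys.maxsize is 2**63 - 1 and hash() of a tuple
# ranges over the full machine word; on a 32-bit build it is 2**31 - 1.
print(sys.maxsize.bit_length())  # 63 on 64-bit builds, 31 on 32-bit
assert sys.maxsize.bit_length() in (31, 63)
```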
To check my understanding of the error-correction code:
I'm a bit confused about how the current correction term works:

    c = -sys.maxint * log(1 - float(c) / sys.maxint)

It looks like this is correcting for overestimates of the number of distinct elements by subtracting a term based on the collision probability. In general, won't collisions cause us to underestimate instead of overestimate? Maybe we should approach this by treating the true number of distinct items (k) as a random variable and figuring out the maximum likelihood estimator of k given an observation c. Before we consider that, though, I wonder whether we even need a correction term. Doesn't the Java implementation already account for this?
Note: I'm not a statistician; please correct me if I've gotten anything wrong.
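One way to read the correction term is as inverting the expected occupancy of the hash space: if k distinct items hash uniformly into M buckets, about M·(1 − e^(−k/M)) distinct hash values are observed, and solving that for k yields exactly the formula in question, which compensates for collisions (an *under*count of distinct hashes). A numeric sketch, where M and k are illustrative values rather than anything from the PR:

```python
import math

M = 2**31 - 1   # size of the hash space (sys.maxint on a 32-bit build)
k = 10_000_000  # true number of distinct items (illustrative)

# Expected distinct hash values observed when k items hash uniformly into M:
c = M * (1 - math.exp(-k / M))

# Inverting the occupancy formula recovers k; this is the correction
# c' = -M * log(1 - c / M) discussed above.
recovered = -M * math.log(1 - c / M)
assert abs(recovered - k) / k < 1e-9
```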
After discussing with @JoshRosen offline, we realized that no correction is needed in Python if Python and Java share the same hash space, so I changed the hash space (mapping to 2^32) and removed the correction. Jenkins, test this please.
Yeah, I don't think we need any special correction, because Java will use the hash codes chosen in Python (since Integer.hashCode is just the integer's value).
LGTM, so I've merged this in.
RDD.countApproxDistinct(relativeSD=0.05):
:: Experimental ::
Return the approximate number of distinct elements in the RDD.
The algorithm used is based on streamlib's implementation of
"HyperLogLog in Practice: Algorithmic Engineering of a State
of The Art Cardinality Estimation Algorithm", available
<a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.
This supports all types of objects that Pyrolite can serialize,
which is nearly all built-in types.
:param relativeSD: Relative accuracy. Smaller values create
    counters that require more space.
    It must be greater than 0.000017.
>>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
>>> 950 < n < 1050
True
>>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
>>> 18 < n < 22
True
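Relating relativeSD to memory: the standard HyperLogLog accuracy bound is roughly 1.04/sqrt(m) for m registers, so one can estimate the register count a given relativeSD implies. This is the textbook HLL relation, not code from this PR, and the helper name is hypothetical:

```python
import math

def hll_registers(relative_sd):
    """Hypothetical helper: number of HLL registers (rounded up to a
    power of two) needed for a target relative standard deviation,
    using the standard bound relative_sd ~= 1.04 / sqrt(m)."""
    m = (1.04 / relative_sd) ** 2
    return 2 ** math.ceil(math.log2(m))

print(hll_registers(0.05))  # -> 512 registers for the default 5% accuracy
```

Smaller relativeSD values blow up quickly (halving the error quadruples the register count), which is why the docstring warns that tighter accuracy costs more space.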
Author: Davies Liu <[email protected]>
Closes apache#2142 from davies/countApproxDistinct and squashes the following commits:
e20da47 [Davies Liu] remove the correction in Python
c38c4e4 [Davies Liu] fix doc tests
2ab157c [Davies Liu] fix doc tests
9d2565f [Davies Liu] add commments and link for hash collision correction
d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
ded624f [Davies Liu] calculate hash in Python
4cba98f [Davies Liu] add more tests
a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
e97e342 [Davies Liu] add countApproxDistinct()