@ScrapCodes
Member

No description provided.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

koertkuipers pushed a commit to tresata-opensource/spark that referenced this pull request Mar 4, 2014
Approximate distinct count

Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count the number of distinct elements and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting, and both take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
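
To make the accuracy/memory trade-off concrete, here is a minimal pure-Python sketch of the HyperLogLog idea. It is not stream-lib's implementation and not the Spark API: the class name `HyperLogLog`, the helper `approx_distinct`, the parameter `p`, and the choice of SHA-1 as the hash are all assumptions made for illustration. More registers (larger `p`) use more memory and give a smaller expected error.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (hypothetical; not stream-lib's class)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers (memory cost)
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the item's repr (an illustrative choice).
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        index = h >> width                      # top p bits pick a register
        rest = h & ((1 << width) - 1)           # remaining bits
        rank = width - rest.bit_length() + 1    # position of the leftmost 1-bit
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # Sketches built with the same p merge by taking register-wise maxima.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def approx_distinct(items, p=12):
    """Estimate the number of distinct items using 2**p registers."""
    sketch = HyperLogLog(p)
    for item in items:
        sketch.add(item)
    return sketch.count()
```

Because sketches merge by register-wise maximum, per-partition or per-key sketches can be combined without revisiting the data, which is what makes this kind of structure a natural fit for distributed counting.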
@AmplabJenkins

Merged build finished.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12992/

@mateiz
Contributor

mateiz commented Mar 4, 2014

Thanks, merged in 0.9 and master

asfgit pushed a commit that referenced this pull request Mar 5, 2014
Author: Prashant Sharma <[email protected]>

Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:

1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function

(cherry picked from commit 0283665)
Signed-off-by: Matei Zaharia <[email protected]>
@asfgit asfgit closed this in 0283665 Mar 5, 2014
@ScrapCodes ScrapCodes deleted the SPARK-1109/wrong-API-docs branch June 3, 2015 06:00
clockfly added a commit to clockfly/spark that referenced this pull request Sep 22, 2016
…which supports partial aggregation.

This is a cherry-pick of a feature from the open source master branch (hash: databricks/runtime@f003e0c).

## What changes were proposed in this pull request?

This PR implements aggregation function `percentile_approx`. Function `percentile_approx` returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column `col` at percentage 50% is the median value of column `col`.
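
As a reference point for that definition, the following is a small sketch of an exact percentile computed by rank on fully sorted data. The function name `exact_percentile` and the ceiling-based rank rule are assumptions for illustration; the interpolation or tie-breaking rule that Spark or Hive actually applies may differ.

```python
import math


def exact_percentile(values, percentage):
    """Return the smallest value with at least `percentage` of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be in [0.0, 1.0]")
    ordered = sorted(values)
    # 1-based rank of the requested percentile; clamp so that 0% maps to the minimum.
    rank = max(1, math.ceil(percentage * len(ordered)))
    return ordered[rank - 1]
```

For instance, `exact_percentile([5, 1, 4, 2, 3], 0.5)` picks the third-smallest value, the median. An approximate function trades this exactness for bounded memory, since it never needs to hold the whole sorted column.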

### Syntax:
```
# Returns percentile at a given percentage value. The approximation error can be reduced by increasing parameter accuracy, at the cost of memory.
percentile_approx(col, percentage [, accuracy])

# Returns an array of percentile values for the given array of percentages
percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy])
```

### Features:
1. This function supports partial aggregation.
2. The memory consumption is bounded. The larger the `accuracy` parameter, the smaller the approximation error. The default accuracy value is 10000, to match the Hive default setting. Choose a smaller value for a smaller memory footprint.
3. This function supports window function aggregation.
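
The shape of partial aggregation (summarize each partition, merge the summaries, then evaluate) can be sketched as below. This is a deliberately simple equi-depth sampling scheme, not the Greenwald-Khanna structure or the PR's code; the function names and the sampling step `k` are hypothetical. Each partition contributes a rank error of at most `k - 1`, so a smaller `k` costs more memory and gives a smaller error.

```python
def summarize_partition(values, k):
    """Keep every k-th value of the sorted partition as a (value, weight) sample."""
    ordered = sorted(values)
    samples = [(ordered[i], k) for i in range(k - 1, len(ordered), k)]
    remainder = len(ordered) % k
    if remainder:
        # The largest value stands in for the leftover items.
        samples.append((ordered[-1], remainder))
    return samples


def merge_summaries(left, right):
    """Merge two partial summaries; the result is again sorted by value."""
    return sorted(left + right)


def evaluate(summary, percentage):
    """Return the first sampled value whose cumulative weight reaches the target rank."""
    total = sum(weight for _, weight in summary)
    target = percentage * total
    seen = 0
    for value, weight in summary:
        seen += weight
        if seen >= target:
            return value
    raise ValueError("summary must not be empty")


# Usage: two partitions are summarized independently, then merged and queried.
evens = summarize_partition(range(0, 10000, 2), k=50)
odds = summarize_partition(range(1, 10000, 2), k=50)
approx_median = evaluate(merge_summaries(evens, odds), 0.5)
```

Only the small summaries cross partition boundaries, never the raw values, which is the property the feature list refers to as partial aggregation.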

### Example usages:
```
## Returns the 25th percentile value, with default accuracy
SELECT percentile_approx(col, 0.25) FROM table

## Returns an array of percentile values (25th, 50th, 75th), with default accuracy
SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table

## Returns the 25th percentile value, with custom accuracy value 100; a larger accuracy parameter yields a smaller approximation error
SELECT percentile_approx(col, 0.25, 100) FROM table

## Returns the 25th and 50th percentile values, with custom accuracy value 100
SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table
```

### NOTE:
1. The `percentile_approx` implementation differs from Hive's, so the result of the same query may differ slightly from Hive's. This implementation uses `QuantileSummaries` as the underlying probabilistic data structure, and mainly follows the paper `Space-efficient Online Computation of Quantile Summaries` by Michael Greenwald and Sanjeev Khanna (http://dx.doi.org/10.1145/375663.375670).
2. The current implementation of `QuantileSummaries` doesn't support automatic compression. This PR has a rule to do compression automatically at the caller side, but it may not be optimal.

## How was this patch tested?

Unit tests and SQL query tests.

## Acknowledgement
1. This PR's work is based on lw-lin's PR apache#14298, with improvements such as supporting partial aggregation and fixing an out-of-memory issue.

Author: Sean Zhong <seanzhongdatabricks.com>

Closes apache#14868 from clockfly/appro_percentile_try_2.

Author: Sean Zhong <[email protected]>

Closes apache#73 from clockfly/appro_percentile_branch_2.0.
robert3005 added a commit to robert3005/spark that referenced this pull request Jan 12, 2017
jlopezmalla pushed a commit to jlopezmalla/spark that referenced this pull request Nov 3, 2017
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
* Added Python API for mapr-streaming (kafka 0.9)

Signed-off-by: Rostyslav Sotnychenko <[email protected]>

(cherry picked from commit c7de39f)
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
…t instead (apache#73)

Novaclient's add_floating_ip function is deprecated; use the neutron client instead
hn5092 added a commit to hn5092/spark that referenced this pull request Nov 21, 2019
hn5092 pushed a commit to hn5092/spark that referenced this pull request Nov 29, 2019
hn5092 added a commit to hn5092/spark that referenced this pull request Nov 29, 2019
yuexingri pushed a commit to yuexingri/spark that referenced this pull request Dec 9, 2019
yuexingri pushed a commit to yuexingri/spark that referenced this pull request Dec 9, 2019
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025