SPARK-1109 wrong API docs for pyspark map function #73

ScrapCodes · 2014-03-04T13:02:57Z

No description provided.

AmplabJenkins · 2014-03-04T13:24:24Z

Merged build triggered.

AmplabJenkins · 2014-03-04T13:24:24Z

Merged build started.

Approximate distinct count

Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count the number of distinct elements and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting, and both take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
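To illustrate the accuracy/memory trade-off described above, here is a minimal, self-contained HyperLogLog sketch in plain Python. It is not the stream-lib implementation and not Spark's API: the function name `approx_distinct`, the precision parameter `p`, and the use of SHA-1 as the hash are all assumptions made for this example.

```python
import hashlib
import math


def approx_distinct(items, p=10):
    """Estimate the number of distinct items with a basic HyperLogLog.

    `p` is a hypothetical precision knob for this sketch: 2**p registers
    are kept, so a larger p costs more memory and gives a smaller error.
    """
    m = 1 << p                      # number of registers
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit hash value
        idx = h >> (64 - p)                     # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank

    alpha = 0.7213 / (1 + 1.079 / m)            # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    data = ["user-%d" % (i % 10000) for i in range(50000)]
    print(round(approx_distinct(data)))         # close to 10000, not exact
```

Duplicates never change the estimate, because each item always hashes to the same register and rank; only the register count bounds the memory used.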

AmplabJenkins · 2014-03-04T14:23:01Z

Merged build finished.

AmplabJenkins · 2014-03-04T14:23:02Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/12992/

mateiz · 2014-03-04T23:34:06Z

Thanks, merged in 0.9 and master

Author: Prashant Sharma <[email protected]>

Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:

1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function

(cherry picked from commit 0283665)
Signed-off-by: Matei Zaharia <[email protected]>

…which supports partial aggregation.

This is a cherry-pick of a feature on the open source master branch (hash: databricks/runtime@f003e0c).

## What changes were proposed in this pull request?

This PR implements the aggregation function `percentile_approx`. Function `percentile_approx` returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column `col` at percentage 50% is the median value of column `col`.

### Syntax:

```
# Returns percentile at a given percentage value. The approximation error can be reduced by increasing parameter accuracy, at the cost of memory.
percentile_approx(col, percentage [, accuracy])

# Returns percentile value array at given percentage value array
percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy])
```

### Features:

1. This function supports partial aggregation.
2. The memory consumption is bounded. The larger the `accuracy` parameter, the smaller the error. The default accuracy value is 10000, to match the Hive default setting. Choose a smaller value for a smaller memory footprint.
3. This function supports window function aggregation.

### Example usages:

```
## Returns the 25th percentile value, with default accuracy
SELECT percentile_approx(col, 0.25) FROM table

## Returns an array of percentile value (25th, 50th, 75th), with default accuracy
SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table

## Returns 25th percentile value, with custom accuracy value 100, larger accuracy parameter yields smaller approximation error
SELECT percentile_approx(col, 0.25, 100) FROM table

## Returns the 25th, and 50th percentile values, with custom accuracy value 100
SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table
```

### NOTE:

1. The `percentile_approx` implementation is different from Hive's, so the result returned for the same query may differ slightly from Hive's. This implementation uses `QuantileSummaries` as the underlying probabilistic data structure, and mainly follows the paper `Space-efficient Online Computation of Quantile Summaries` by Greenwald, Michael and Khanna, Sanjeev. (http://dx.doi.org/10.1145/375663.375670)
2. The current implementation of `QuantileSummaries` doesn't support automatic compression. This PR has a rule to do compression automatically at the caller side, but it may not be optimal.

## How was this patch tested?

Unit test, and SQL query test.

## Acknowledgement

1. This PR's work is based on lw-lin's PR apache#14298, with improvements like supporting partial aggregation and fixing an out-of-memory issue.

Author: Sean Zhong <seanzhongdatabricks.com>

Closes apache#14868 from clockfly/appro_percentile_try_2.

Author: Sean Zhong <[email protected]>

Closes apache#73 from clockfly/appro_percentile_branch_2.0.
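The percentile definition and the role of `accuracy` described above can be made concrete with a small plain-Python sketch. This is not the `QuantileSummaries` algorithm: `exact_percentile` simply sorts the data, and `within_rank_error` is a hypothetical helper that checks whether a candidate answer is acceptable. The assumption that `accuracy` bounds the rank error by roughly `N / accuracy` is made for this example and is not stated in the PR text.

```python
import math
from bisect import bisect_left, bisect_right


def exact_percentile(values, percentage):
    """Smallest value such that at least `percentage` of values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentage * len(ordered)))   # 1-based rank
    return ordered[rank - 1]


def within_rank_error(values, candidate, percentage, accuracy):
    """Return True if `candidate` sits within N / accuracy ranks of the target.

    Assumption for this sketch: an approximate answer is acceptable when its
    rank differs from percentage * N by at most N / accuracy.
    """
    ordered = sorted(values)
    n = len(ordered)
    target = percentage * n
    below = bisect_left(ordered, candidate)      # count of values < candidate
    at_or_below = bisect_right(ordered, candidate)  # count of values <= candidate
    eps = n / accuracy
    return below - eps <= target <= at_or_below + eps


if __name__ == "__main__":
    col = list(range(1, 101))                    # the values 1..100
    print(exact_percentile(col, 0.5))            # the median
    print(within_rank_error(col, 51, 0.5, 100))  # one rank off, tolerated
    print(within_rank_error(col, 60, 0.5, 100))  # ten ranks off, rejected
```

Under this reading, raising `accuracy` shrinks the tolerated rank window, which matches the PR's statement that a larger value gives a smaller error at the cost of memory.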

* Added Python API for mapr-streaming (kafka 0.9) Signed-off-by: Rostyslav Sotnychenko <[email protected]> (cherry picked from commit c7de39f)


SPARK-1109 wrong API docs for pyspark map function

1a55b58

asfgit closed this in 0283665 Mar 5, 2014

ScrapCodes deleted the SPARK-1109/wrong-API-docs branch June 3, 2015 06:00

robert3005 added a commit to robert3005/spark that referenced this pull request Jan 12, 2017

Merge pull request apache#73 from palantir/robertk/merge-upstream

24f060b

jlopezmalla pushed a commit to jlopezmalla/spark that referenced this pull request Nov 3, 2017

Removed mesos secret and mesos principal (apache#73)

fb09588

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019

Novaclient's add_floating_ip function is deprecated use neutron clien…

d14d8ff

…t instead (apache#73) Novaclient's add_floating_ip function is deprecated use neutron client instead

hn5092 added a commit to hn5092/spark that referenced this pull request Nov 21, 2019

apache#73 support limit offset

04caba5

hn5092 pushed a commit to hn5092/spark that referenced this pull request Nov 29, 2019

apache#80 handle data skew and other commits list below

72a232e

* apache#73 support limit offset

hn5092 added a commit to hn5092/spark that referenced this pull request Nov 29, 2019

apache#73 [follow up] support limit offset

0ed15bd

yuexingri pushed a commit to yuexingri/spark that referenced this pull request Dec 9, 2019

apache#73 support limit offset

8071455

yuexingri pushed a commit to yuexingri/spark that referenced this pull request Dec 9, 2019

apache#73 [follow up] support limit offset

847cdd7

maropu mentioned this pull request May 5, 2020

[SPARK-31590][SQL] Metadata-only queries should not include subquery in partition filters #28383

Closed

arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020

Spark 2.0.1 MAPR-streams Python API (apache#73)

ff0be9e

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025

[CARMEL-7266] Backport zeta support in viewpoint server (apache#73)

e12309d


3 participants
