SPARK-1109 wrong API docs for pyspark map function #73
Closed
Merged build triggered.
Merged build started.
koertkuipers
pushed a commit
to tresata-opensource/spark
that referenced
this pull request
Mar 4, 2014
Approximate distinct count

Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximately count the number of distinct elements and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting. Both functions take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
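To make the accuracy versus memory trade-off mentioned in the commit message above more concrete, here is a minimal from-scratch HyperLogLog sketch in Python. It is not stream-lib's implementation and not Spark's code. The class name `HyperLogLog`, the parameter `p` (number of index bits), and the use of SHA-1 as the hash are all assumptions made for illustration. A larger `p` means more registers (more memory) and a smaller expected error, roughly 1.04 / sqrt(2^p).

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (hypothetical, not stream-lib's API)."""

    def __init__(self, p=12):
        # p index bits give m = 2**p registers; the alpha formula below
        # is the usual approximation for m >= 128, hence p >= 7.
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Derive a 64-bit hash from the item's string form (an assumption;
        # any well-mixed 64-bit hash would do).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The top p bits select a register.
        idx = h >> (64 - self.p)
        # The remaining bits determine the rank: the 1-based position
        # of the leftmost 1-bit within those (64 - p) bits.
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Raw estimate from the harmonic mean of 2**register values.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting
        # while some registers are still empty.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return estimate
```

Adding the same element repeatedly never changes a register after the first insertion, which is why duplicates do not inflate the estimate. The result is an approximation, so callers should compare it against a tolerance rather than an exact value.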
Merged build finished.
All automated tests passed.
Contributor
Thanks, merged in 0.9 and master
asfgit
pushed a commit
that referenced
this pull request
Mar 5, 2014
Author: Prashant Sharma <[email protected]>

Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:

1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function

(cherry picked from commit 0283665)
Signed-off-by: Matei Zaharia <[email protected]>
clockfly
added a commit
to clockfly/spark
that referenced
this pull request
Sep 22, 2016
…which supports partial aggregation.

This is a cherry-pick of a feature on the open source master branch (hash: databricks/runtime@f003e0c).

## What changes were proposed in this pull request?

This PR implements the aggregation function `percentile_approx`. Function `percentile_approx` returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column `col` at percentage 50% is the median value of column `col`.

### Syntax:

```
# Returns percentile at a given percentage value. The approximation error can be reduced by increasing parameter accuracy, at the cost of memory.
percentile_approx(col, percentage [, accuracy])

# Returns percentile value array at given percentage value array
percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy])
```

### Features:

1. This function supports partial aggregation.
2. The memory consumption is bounded. The larger the `accuracy` parameter, the smaller the approximation error. The default accuracy value is 10000, to match the Hive default setting. Choose a smaller value for a smaller memory footprint.
3. This function supports window function aggregation.

### Example usages:

```
## Returns the 25th percentile value, with default accuracy
SELECT percentile_approx(col, 0.25) FROM table

## Returns an array of percentile values (25th, 50th, 75th), with default accuracy
SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table

## Returns the 25th percentile value, with custom accuracy value 100; a larger accuracy parameter yields a smaller approximation error
SELECT percentile_approx(col, 0.25, 100) FROM table

## Returns the 25th and 50th percentile values, with custom accuracy value 100
SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table
```

### NOTE:

1. The `percentile_approx` implementation is different from Hive's, so the result returned for the same query may be slightly different from Hive's. This implementation uses `QuantileSummaries` as the underlying probabilistic data structure, and mainly follows the paper `Space-efficient Online Computation of Quantile Summaries` by Greenwald, Michael and Khanna, Sanjeev (http://dx.doi.org/10.1145/375663.375670).
2. The current implementation of `QuantileSummaries` doesn't support automatic compression. This PR has a rule to do compression automatically at the caller side, but it may not be optimal.

## How was this patch tested?

Unit test, and SQL query test.

## Acknowledgement

1. This PR's work is based on lw-lin's PR apache#14298, with improvements like supporting partial aggregation and fixing an out-of-memory issue.

Author: Sean Zhong <seanzhongdatabricks.com>

Closes apache#14868 from clockfly/appro_percentile_try_2.

Author: Sean Zhong <[email protected]>

Closes apache#73 from clockfly/appro_percentile_branch_2.0.
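As a reference point for what `percentile_approx` approximates, the following Python sketch computes an exact percentile with the nearest-rank method over an in-memory list. This is an illustration only: it is not the `QuantileSummaries` algorithm, it has no `accuracy` parameter, and the function names `exact_percentile` and `exact_percentiles` are hypothetical. The nearest-rank convention is also an assumption; the commit message does not say which convention the approximation targets.

```python
import math


def exact_percentile(values, percentage):
    """Return the nearest-rank percentile of `values` at `percentage` in [0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    ordered = sorted(values)
    # Nearest rank: the smallest 1-based rank r such that at least
    # `percentage` of the values are at or below ordered[r - 1].
    rank = max(1, math.ceil(percentage * len(ordered)))
    return ordered[rank - 1]


def exact_percentiles(values, percentages):
    """Array form, mirroring the array(percentage1, ...) syntax in the SQL examples."""
    return [exact_percentile(values, p) for p in percentages]


col = list(range(1, 101))  # the integers 1..100
quartiles = exact_percentiles(col, [0.25, 0.5, 0.75])
```

An exact computation like this needs the whole column sorted in memory, which is the cost the bounded-memory approximate function is designed to avoid.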
robert3005
added a commit
to robert3005/spark
that referenced
this pull request
Jan 12, 2017
jlopezmalla
pushed a commit
to jlopezmalla/spark
that referenced
this pull request
Nov 3, 2017
jamesrgrinter
pushed a commit
to jamesrgrinter/spark
that referenced
this pull request
Apr 22, 2018
* Added Python API for mapr-streaming (kafka 0.9) Signed-off-by: Rostyslav Sotnychenko <[email protected]> (cherry picked from commit c7de39f)
bzhaoopenstack
pushed a commit
to bzhaoopenstack/spark
that referenced
this pull request
Sep 11, 2019
…t instead (apache#73)

Novaclient's add_floating_ip function is deprecated; use neutron client instead
hn5092
added a commit
to hn5092/spark
that referenced
this pull request
Nov 21, 2019
hn5092
pushed a commit
to hn5092/spark
that referenced
this pull request
Nov 29, 2019
* apache#73 support limit offset
hn5092
added a commit
to hn5092/spark
that referenced
this pull request
Nov 29, 2019
yuexingri
pushed a commit
to yuexingri/spark
that referenced
this pull request
Dec 9, 2019
yuexingri
pushed a commit
to yuexingri/spark
that referenced
this pull request
Dec 9, 2019
arjunshroff
pushed a commit
to arjunshroff/spark
that referenced
this pull request
Nov 24, 2020
turboFei
pushed a commit
to turboFei/spark
that referenced
this pull request
Nov 6, 2025
No description provided.