[SPARK-16283][SQL] Implement percentile_approx SQL function
#14298
Conversation
Test build #62670 has finished for PR 14298 at commit

Test build #62676 has finished for PR 14298 at commit

Test build #62678 has finished for PR 14298 at commit
@hvanhovell could you take a look at this? Thanks!
```scala
/**
 * Helper class to compute approximate quantile summary.
 * This implementation is based on the algorithm proposed in the paper:
```
why did you need to move the implementation around?
thanks!
Module sql/core depends on module sql/catalyst, but module sql/catalyst does not depend on module sql/core; so in order to enable expressions like PercentileApprox (which lives in sql/catalyst) to use the G-K algorithm, the algorithm should probably live in sql/catalyst too. As a reference, algorithms like HyperLogLog also live in sql/catalyst (please refer to HyperLogLogPlusPlus).
@thunterdb does this make sense?
Yes, this makes sense, thanks.
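For illustration only, the dependency direction described above can be sketched in sbt terms. The module names follow Spark's source layout, but the settings below are simplified assumptions, not Spark's actual build definition:

```scala
// Simplified sketch: sql/catalyst stands alone, and sql/core depends on it.
// Anything a catalyst expression needs (e.g. the G-K QuantileSummaries)
// must therefore live in catalyst, not core.
lazy val catalyst = project in file("sql/catalyst")

lazy val core = (project in file("sql/core"))
  .dependsOn(catalyst) // core -> catalyst, never the reverse
```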
# Conflicts:
#	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala
@cloud-fan could you also help review this? Thanks!
Test build #62706 has finished for PR 14298 at commit
Jenkins retest this please
```scala
 */
private def validatePercentilesLiteral(exp: Expression): (Seq[Double], Boolean) = {
  def withinRange(v: Double): Boolean = 0.0 <= v && v <= 1.0
  exp match {
```
You could also pattern match on the result of the exp.eval(). That would be way easier.
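As a rough sketch of that suggestion, outside Spark and with `Any` standing in for the value returned by `exp.eval()` (the object and method names below are assumptions for illustration, not the PR's actual code), matching on the evaluated result could look like:

```scala
// Hypothetical, simplified version of the reviewer's suggestion: evaluate the
// literal expression first, then pattern match on the resulting value instead
// of matching on expression tree shapes.
object PercentileValidation {
  private def withinRange(v: Double): Boolean = 0.0 <= v && v <= 1.0

  // `evaluated` stands in for the result of exp.eval(); returns the requested
  // percentiles plus a flag saying whether the input was an array.
  def validatePercentiles(evaluated: Any): (Seq[Double], Boolean) = evaluated match {
    case d: Double if withinRange(d) =>
      (Seq(d), false)
    case arr: Seq[_] if arr.nonEmpty && arr.forall {
        case d: Double => withinRange(d)
        case _         => false
      } =>
      (arr.map(_.asInstanceOf[Double]), true)
    case other =>
      throw new IllegalArgumentException(s"Invalid percentile value(s): $other")
  }
}
```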
Test build #62881 has finished for PR 14298 at commit
```scala
 * large numbers of rows where the regular percentile() UDAF might run out of memory.
 *
 * The input is a single double value or an array of double values representing the percentiles
 * requested. The output, corresponding to the input, is either an single double value or an
```
minor: an single -> a single
Test build #63105 has finished for PR 14298 at commit
@hvanhovell comments addressed. Please let me know when there's more to do!
```scala
// The number of intermediate outputs is highly relative to the actual data-set (an upper bound is
// (11/2e)log(2en), where e is the relativeError parameter, n is the number of items in the
// dataset) -- thus it's hard to allocate agg buffer in advance without knowing the size of
// inputs. Due to this reason, currently we don't support partial mode.
```
can you explain a bit more about this? AFAIK, Hive supports partial aggregation for percentile_approx. It looks to me that your implementation keeps the buffer data (QuantileSummaries) in this aggregate function object, instead of letting the aggregate operator manage it; I think that's the main reason why we can't support partial aggregation for percentile_approx.
> hive supports partial aggregate for percentile_approx
Hive's implementation computes approximate percentile values from a histogram, thus Hive supports partial aggregation (but makes no approximation guarantees).
> ... QuantileSummaries in this aggregate function object, instead of letting aggregate operator manage it, that's the main reason why we can't support partial aggregate
Yes that's quite right. QuantileSummaries has been implemented and well tested prior to this patch, so it'd be great if we can reuse that and put a QuantileSummaries instance directly into the aggregation buffer (in order to support partial aggregation). @cloud-fan any pointer on how to do that please?
Thanks!
to be clear, are you saying that our current algorithm to compute percentile_approx can't support partial aggregation fundamentally?
Fundamentally, the QuantileSummaries implementation does support RDD-style partial aggregation -- QuantileSummaries itself is the agg buffer of a partition's data at mappers, and multiple QuantileSummaries instances will be merged at reducers (please refer to StatFunctions.multipleApproxQuantiles).
Fundamentally, it should also support SparkSQL-style partial aggregation. I'm trying to reuse QuantileSummaries here; if there's no easy way to reuse it, I'm afraid we'll have to completely rewrite this G-K algorithm to support SparkSQL-style partial aggregation.
So is there any way we can directly put a QuantileSummaries instance into SparkSQL's agg buffer? Or do we have to break a QuantileSummaries instance up, say into an int + a double + a double[] + an int[] + an int[], which SparkSQL's agg buffer can manage?
I think putting the buffer object in the agg buffer row is better, but that needs to be well designed. Can you hold this PR for a while? We are discussing it internally.
Sure, this can wait. Thanks for the information!
Some updates: #14753 was created to support putting generic objects in the aggregation buffer.
@lw-lin
@clockfly
…which supports partial aggregation.

## What changes were proposed in this pull request?

This PR implements the aggregation function `percentile_approx`. Function `percentile_approx` returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column `col` at percentage 50% is the median value of column `col`.

### Syntax:
```
# Returns the percentile at a given percentage value. The approximation error can be reduced by increasing the accuracy parameter, at the cost of memory.
percentile_approx(col, percentage [, accuracy])

# Returns a percentile value array for a given percentage value array
percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy])
```

### Features:
1. This function supports partial aggregation.
2. The memory consumption is bounded. The larger the `accuracy` parameter, the smaller the error. The default accuracy value is 10000, to match Hive's default setting. Choose a smaller value for a smaller memory footprint.
3. This function supports window function aggregation.

### Example usages:
```
## Returns the 25th percentile value, with default accuracy
SELECT percentile_approx(col, 0.25) FROM table

## Returns an array of percentile values (25th, 50th, 75th), with default accuracy
SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table

## Returns the 25th percentile value with custom accuracy 100; a larger accuracy parameter yields a smaller approximation error
SELECT percentile_approx(col, 0.25, 100) FROM table

## Returns the 25th and 50th percentile values, with custom accuracy 100
SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table
```

### NOTE:
1. The `percentile_approx` implementation is different from Hive's, so the result returned for the same query may differ slightly from Hive's. This implementation uses `QuantileSummaries` as the underlying probabilistic data structure, and mainly follows the paper `Space-efficient Online Computation of Quantile Summaries` by Greenwald, Michael and Khanna, Sanjeev (http://dx.doi.org/10.1145/375663.375670).
2. The current implementation of `QuantileSummaries` doesn't support automatic compression. This PR has a rule to do compression automatically at the caller side, but it may not be optimal.

## How was this patch tested?

Unit test and SQL query test.

## Acknowledgement

1. This PR's work is based on lw-lin's PR apache#14298, with improvements like supporting partial aggregation and fixing an out-of-memory issue.

Author: Sean Zhong <[email protected]>

Closes apache#14868 from clockfly/appro_percentile_try_2.
## What changes were proposed in this pull request?

This patch implements the `percentile_approx` SQL function using Spark's implementation of the G-K algorithm:
- moves the G-K algorithm implementation (`QuantileSummaries` and related tests) from `sql/core` to `sql/catalyst`
- implements `percentile_approx` using the G-K algorithm

## How was this patch tested?