
Conversation

@viirya (Member) commented May 8, 2019

What changes were proposed in this pull request?

This PR adds max_by() and min_by() SQL aggregate functions.

Quoting from the [Presto docs](https://prestodb.github.io/docs/current/functions/aggregate.html#max_by):

max_by(x, y) → [same as x]
Returns the value of x associated with the maximum value of y over all input values.

min_by() works similarly.
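For illustration, a minimal usage sketch (an editor's example, not taken from the patch itself; the literal values mirror the built-in example shown later in this thread):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for illustration; getOrCreate() reuses an
// existing session (e.g. in spark-shell).
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// max_by(x, y) returns the x paired with the largest y ('b' here);
// min_by(x, y) returns the x paired with the smallest y ('a' here).
spark.sql(
  "SELECT max_by(x, y), min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y)"
).show()
```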

How was this patch tested?

Added tests.

@JoshRosen (Contributor) commented:

Hi @viirya,

Thanks for working on this!

I had a few quick questions:

  • Could you also implement min_by(x, y)?

    • It looks like you might be able to share most of the code except for replacing GreaterThan and greatest, so maybe this difference could be abstracted away via a shared abstract superclass (a sketch of this factoring follows this list).
  • Presto also has three-argument versions of max_by / min_by:

    max_by(x, y, n) → array<[same as x]>
    Returns n values of x associated with the n largest of all input values of y in descending order of y.

    I don't think we need to do this version now, especially since we can always add it in a separate followup PR (which is what Presto originally did: Implement max_by and min_by with an additional n parameter prestodb/presto#3620)

  • Were there any bugs in older Presto implementations that we might have replicated here? Or Presto tests for edge cases that we could emulate?
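To make the shared-superclass idea concrete, here is a toy, self-contained Scala sketch of that factoring. It is an editor's illustration with made-up names (ExtremumBy, MaxByLike, MinByLike), not the Catalyst DeclarativeAggregate code in this PR; only the comparison differs between the max and min variants.

```scala
// Toy sketch of the suggested shared-superclass factoring (not the actual
// Catalyst classes): the two variants share everything except the comparison.
abstract class ExtremumBy[V, O] {
  /** True if `candidate` should replace `current` as the tracked extremum. */
  protected def isBetter(candidate: O, current: O): Boolean

  /** Returns the value paired with the extremum ordering, ignoring null orderings. */
  def apply(rows: Seq[(V, Option[O])]): Option[V] =
    rows.foldLeft(Option.empty[(V, O)]) {
      case (acc, (_, None))                      => acc            // null ordering: skip
      case (None, (v, Some(o)))                  => Some((v, o))   // first non-null ordering
      case (acc @ Some((_, best)), (v, Some(o))) =>
        if (isBetter(o, best)) Some((v, o)) else acc
    }.map { case (value, _) => value }
}

class MaxByLike[V, O](implicit ord: Ordering[O]) extends ExtremumBy[V, O] {
  override protected def isBetter(candidate: O, current: O): Boolean = ord.gt(candidate, current)
}

class MinByLike[V, O](implicit ord: Ordering[O]) extends ExtremumBy[V, O] {
  override protected def isBetter(candidate: O, current: O): Boolean = ord.lt(candidate, current)
}

// new MaxByLike[String, Int].apply(Seq("a" -> Some(10), "b" -> Some(50), "c" -> Some(20)))
// returns Some("b"); the MinByLike variant returns Some("a").
```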

checkAnswer(
  sql("SELECT max_by(x, y) FROM VALUES (('a', null)), (('b', null)) AS tab(x, y)"),
Contributor commented:

This returns null because all values of the ordering column are null? That seems to match Presto behavior:

SELECT max_by(x, y) FROM (
  VALUES
    ('a', null),
    ('b', null)
) AS tab (x, y)

also returns null in Presto 👍

@JoshRosen (Contributor) commented May 8, 2019

This makes sense if you think of this function as being semantically equivalent to

SELECT first(x) FROM tab WHERE y = max(y)
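A runnable approximation of that equivalence, as an editor's sketch (it assumes a local SparkSession and only approximates the aggregate's tie and null handling):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("SELECT * FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y)")
  .createOrReplaceTempView("tab")

// Roughly what max_by(x, y) computes: an x whose y equals max(y).
// If every y were null, the filter would match nothing and first(x)
// would yield null, matching the behavior discussed above.
spark.sql("SELECT first(x) FROM tab WHERE y = (SELECT max(y) FROM tab)").show()  // 'b'
```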

@SparkQA commented May 8, 2019

Test build #105258 has finished for PR 24557 at commit 5c7e3c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class MaxBy(valueExpr: Expression, maxExpr: Expression) extends DeclarativeAggregate

@viirya (Member, Author) commented May 9, 2019

@JoshRosen Thanks for the review!

  • Could you also implement min_by(x, y)?

Yes. I originally planned to have a separate PR for it, but I'm fine with adding it here. A shared abstract superclass to share code sounds good.

  • Presto also has three-argument versions of max_by / min_by:

Agreed. We don't need the three-argument versions now. If we need them, we can add them in a follow-up.

  • Were there any bugs in older implementations of Presto version that we might have replicated here? Or Presto tests for edge-cases that we could emulate?
  • For using rows / structs as the ordering value, I also think it would work. I will add a few tests.
  • For null ordering values, I already have a few test cases. I checked Presto's results and they match. Let me double-check that we've covered the same edge cases.

@viirya (Member, Author) commented May 9, 2019

I've checked a few test cases regarding null values in Presto:

presto> select max_by(x, y) from ( values ('a', null), ('b', null), ('c', null) ) as t (x, y);
 _col0 
-------
 NULL  
(1 row)

Query 20190509_050643_00001_ww5mk, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]

presto> select max_by(x, y) from ( values ('a', null), ('b', null), ('c', 10) ) as t (x, y);
 _col0 
-------
 c     
(1 row)

Query 20190509_050655_00002_ww5mk, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

The results match the added test cases here.
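For reference, the same two checks expressed against Spark; this is an editor's sketch assuming a local SparkSession, mirroring the Presto queries above rather than quoting the exact test code in the patch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// All ordering values null: max_by returns null.
assert(spark.sql(
  "SELECT max_by(x, y) FROM VALUES (('a', null)), (('b', null)), (('c', null)) AS tab(x, y)"
).head().isNullAt(0))

// Only one non-null ordering value: its paired x ('c') wins.
assert(spark.sql(
  "SELECT max_by(x, y) FROM VALUES (('a', null)), (('b', null)), (('c', 10)) AS tab(x, y)"
).head().getString(0) == "c")
```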

About prestodb/presto#2040, it was caused by a null reference for the key field in Presto. That shouldn't apply in our case.

@viirya viirya changed the title [SPARK-27653][SQL] Add max_by() SQL aggregate function [SPARK-27653][SQL] Add max_by() and min_by() SQL aggregate functions May 9, 2019
@SparkQA commented May 9, 2019

Test build #105286 has finished for PR 24557 at commit 798f0fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented May 10, 2019

cc @cloud-fan @dongjoon-hyun

package org.apache.spark.sql.catalyst.expressions.aggregate

import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.dsl.expressions._
Contributor commented:

Since we import the expression DSL, can we use the DSL to build the expression trees in this file?

@viirya (Member, Author) commented:

Some can, like And and IsNull. Some can't, like CaseWhen and If.
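For readers unfamiliar with the Catalyst expression DSL, a small sketch of the difference being discussed; the attribute names here are made up for illustration and are not the actual buffer attributes in this file:

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// Made-up attributes standing in for aggregation buffer slots.
val oldOrdering = AttributeReference("oldOrdering", IntegerType)()
val newOrdering = AttributeReference("newOrdering", IntegerType)()

// Explicit constructors:
val explicitTree = And(IsNull(oldOrdering), IsNull(newOrdering))
// The same tree via the DSL:
val dslTree = oldOrdering.isNull && newOrdering.isNull

// CaseWhen / If still have to be constructed explicitly
// (though DSL operators can be used inside them):
val keepGreater = If(newOrdering > oldOrdering, newOrdering, oldOrdering)
```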

@viirya (Member, Author) commented:

I rewrote And and IsNull using the DSL.

Contributor commented:

Maybe we can add DSL support for CaseWhen and If. Not a blocker here.

@SparkQA commented May 11, 2019

Test build #105329 has finished for PR 24557 at commit 26b5a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor) left a comment:

LGTM!

override def checkInputDataTypes(): TypeCheckResult =
  TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

private lazy val ordering = AttributeReference("ordering", orderingExpr.dataType)()
Contributor commented:

nit: maxOrdering is more precise.

TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

private lazy val ordering = AttributeReference("ordering", orderingExpr.dataType)()
private lazy val value = AttributeReference("value", valueExpr.dataType)()
Contributor commented:

nit: similarly, valueWithMaxOrdering is more precise.

override protected def funcName: String = "max_by"

override protected def predicate(oldExpr: Expression, newExpr: Expression): Expression =
  GreaterThan(oldExpr, newExpr)
Contributor commented:

nit: oldExpr > newExpr

@cloud-fan (Contributor) commented:

LGTM except a few code style comments

@viirya (Member, Author) commented May 13, 2019

Note that we don't add them to `functions` for now. If needed, we can add them in a follow-up.

override def checkInputDataTypes(): TypeCheckResult =
  TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

// The attributes used to keep extremum (max or min) and associated aggregated values.
Contributor commented:

Ah, that's a good point. Shall we call it extremumOrdering then?

@viirya (Member, Author) commented:

Good for me. +1

@SparkQA commented May 13, 2019

Test build #105349 has finished for PR 24557 at commit dd1d9de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 13, 2019

Test build #105350 has finished for PR 24557 at commit 05f1767.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to master!

@cloud-fan cloud-fan closed this in d169b0a May 13, 2019
rdblue pushed a commit to rdblue/spark that referenced this pull request Jul 3, 2019

Closes apache#24557 from viirya/SPARK-27653.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d169b0a)
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019

Closes apache#24557 from viirya/SPARK-27653.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 26, 2019
### What changes were proposed in this pull request?

This is a follow-up of #24557 to fix `since` version.

### Why are the changes needed?

This is found during 3.0.0-preview preparation.
The version will be exposed in our SQL documentation like the following, so we had better fix it.
- https://spark.apache.org/docs/latest/api/sql/#array_min

### Does this PR introduce any user-facing change?

Yes. It's exposed via the `DESC FUNCTION EXTENDED` SQL command and the SQL doc, but this function is new in 3.0.0.

### How was this patch tested?

Manual.
```
spark-sql> DESC FUNCTION EXTENDED min_by;
Function: min_by
Class: org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
Usage: min_by(x, y) - Returns the value of `x` associated with the minimum value of `y`.
Extended Usage:
    Examples:
      > SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
       a

    Since: 3.0.0
```

Closes #26264 from dongjoon-hyun/SPARK-27653.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya viirya deleted the SPARK-27653 branch December 27, 2023 18:22