[SPARK-27653][SQL] Add max_by() and min_by() SQL aggregate functions #24557
Conversation
Hi @viirya, Thanks for working on this! I had a few quick questions:

```scala
checkAnswer(
  sql("SELECT max_by(x, y) FROM VALUES (('a', null)), (('b', null)) AS tab(x, y)"),
```
This returns null because all values of the ordering column are null? That seems to match Presto behavior:

```sql
SELECT max_by(x, y) FROM (
  VALUES
    ('a', null),
    ('b', null)
) AS tab (x, y)
```

also returns null in Presto 👍
This makes sense if you think of this function as being semantically equivalent to

```sql
SELECT first(x) FROM tab WHERE y = max(y)
```
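The null semantics discussed above can be modeled in plain Python (a hypothetical sketch, not Spark's implementation): rows whose ordering value is NULL are skipped, and if every ordering value is NULL the aggregate returns NULL.

```python
# Pure-Python model of max_by(x, y) null semantics (illustrative only).
def max_by(rows):
    """rows: iterable of (x, y) pairs; returns the x paired with the max y."""
    best = None  # (x, y) with the largest non-null y seen so far
    for x, y in rows:
        if y is None:
            continue  # NULL ordering values never win
        if best is None or y > best[1]:
            best = (x, y)
    return None if best is None else best[0]
```

With all-null ordering values (`[("a", None), ("b", None)]`) this returns `None`, matching the Presto behavior checked above.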
Test build #105258 has finished for PR 24557 at commit
@JoshRosen Thanks for the review!
Yes. I originally planned to have a separate PR for it, but I'm fine with adding it here. A shared abstract superclass to share code sounds good.
Agreed. We don't need three-argument versions now. If we need them, we can add them in a followup.
I've checked a few test cases regarding null values in Presto; the results match the added test cases here. As for prestodb/presto#2040, it is caused by a null reference for the key field in Presto, which shouldn't happen in our case.
Test build #105286 has finished for PR 24557 at commit
```scala
package org.apache.spark.sql.catalyst.expressions.aggregate

import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.dsl.expressions._
```
since we import the expression DSL, can we use DSL to build the expression tree in this file?
Some can, like `And` and `IsNull`. Some can't, like `CaseWhen` and `If`.
I rewrote `And` and `IsNull` using the DSL.
Maybe we can add DSL support for `CaseWhen` and `If`. Not a blocker here.
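To illustrate the kind of expression DSL being discussed, here is a hypothetical Python sketch (not Spark's actual `dsl.expressions` API — all class names here are illustrative): operator overloading lets combinators like `And` and `IsNull` read naturally, while control-flow nodes such as `If`/`CaseWhen` would still need explicit constructors.

```python
# Minimal expression-DSL sketch: methods/operators on a base class build
# tree nodes, so `a.is_null() & b.is_null()` constructs And(IsNull(a), IsNull(b)).
class Expr:
    def __and__(self, other):
        return Node("And", self, other)

    def is_null(self):
        return Node("IsNull", self)

class Node(Expr):
    def __init__(self, op, *children):
        self.op = op
        self.children = children

    def __repr__(self):
        return f"{self.op}({', '.join(map(repr, self.children))})"

class Attr(Expr):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

# And/IsNull read naturally via the DSL; something like If(cond, a, b)
# has no obvious operator, so it would stay an explicit constructor call.
a, b = Attr("a"), Attr("b")
tree = a.is_null() & b.is_null()
```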
Test build #105329 has finished for PR 24557 at commit
JoshRosen
left a comment
LGTM!
```scala
override def checkInputDataTypes(): TypeCheckResult =
  TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

private lazy val ordering = AttributeReference("ordering", orderingExpr.dataType)()
```
nit: `maxOrdering` is more precise.
```scala
  TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

private lazy val ordering = AttributeReference("ordering", orderingExpr.dataType)()
private lazy val value = AttributeReference("value", valueExpr.dataType)()
```
`valueWithMaxOrdering`
```scala
override protected def funcName: String = "max_by"

override protected def predicate(oldExpr: Expression, newExpr: Expression): Expression =
  GreaterThan(oldExpr, newExpr)
```
nit: `oldExpr > newExpr`
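The shared-superclass design reviewed here — MaxBy and MinBy differing only in their comparison predicate — can be sketched in plain Python (an illustrative model, not Spark's Catalyst code; `make_by_agg` and `replace_if` are invented names):

```python
import operator

# MaxBy and MinBy share one update loop; only the predicate deciding whether
# a new (value, ordering) pair replaces the buffered one differs.
def make_by_agg(replace_if):
    def agg(rows):
        buf_value, buf_ordering = None, None  # the aggregation buffer
        for x, y in rows:
            if y is None:
                continue  # NULL ordering values never update the buffer
            if buf_ordering is None or replace_if(y, buf_ordering):
                buf_value, buf_ordering = x, y
        return buf_value
    return agg

max_by = make_by_agg(operator.gt)  # replace when the new ordering is greater
min_by = make_by_agg(operator.lt)  # replace when the new ordering is smaller
```

Factoring the comparison out this way is why the reviewers only needed to override `predicate` (and `funcName`) in each concrete subclass.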
LGTM except a few code style comments

Note that we don't add them into `functions` currently. If needed, we can add them in a followup.
```scala
override def checkInputDataTypes(): TypeCheckResult =
  TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

// The attributes used to keep extremum (max or min) and associated aggregated values.
```
ah that's a good point. Shall we call it `extremumOrdering` then?
Good for me. +1
Test build #105349 has finished for PR 24557 at commit

Test build #105350 has finished for PR 24557 at commit

thanks, merging to master!
This PR goes to add `max_by()` and `min_by()` SQL aggregate functions. Quoting from the [Presto docs](https://prestodb.github.io/docs/current/functions/aggregate.html#max_by):

> max_by(x, y) → [same as x]
> Returns the value of x associated with the maximum value of y over all input values.

`min_by()` works similarly.

Added tests.

Closes apache#24557 from viirya/SPARK-27653.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d169b0a)
### What changes were proposed in this pull request?

This is a follow-up of #24557 to fix the `since` version.

### Why are the changes needed?

This was found during 3.0.0-preview preparation. The version will be exposed in our SQL document like the following. We had better fix this.
- https://spark.apache.org/docs/latest/api/sql/#array_min

### Does this PR introduce any user-facing change?

Yes. It's exposed at `DESC FUNCTION EXTENDED` SQL command and SQL doc, but this is new at 3.0.0.

### How was this patch tested?

Manual.

```
spark-sql> DESC FUNCTION EXTENDED min_by;
Function: min_by
Class: org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
Usage: min_by(x, y) - Returns the value of `x` associated with the minimum value of `y`.
Extended Usage:
    Examples:
      > SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
       a

Since: 3.0.0
```

Closes #26264 from dongjoon-hyun/SPARK-27653.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR goes to add `max_by()` and `min_by()` SQL aggregate functions.

Quoting from the Presto docs:

> max_by(x, y) → [same as x]
> Returns the value of x associated with the maximum value of y over all input values.

`min_by()` works similarly.

### How was this patch tested?

Added tests.