
Conversation

@viirya (Member) commented May 8, 2019

What changes were proposed in this pull request?

This PR adds max_by() and min_by() SQL aggregate functions.

Quoting from the [Presto docs](https://prestodb.github.io/docs/current/functions/aggregate.html#max_by):

max_by(x, y) → [same as x]
Returns the value of x associated with the maximum value of y over all input values.

min_by() works similarly.
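For illustration, a minimal usage sketch (an editor's example, not taken from the patch itself; the literal values mirror the built-in example shown later in this thread):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session for illustration; getOrCreate() reuses an
// existing session (e.g. in spark-shell).
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// max_by(x, y) returns the x paired with the largest y ('b' here);
// min_by(x, y) returns the x paired with the smallest y ('a' here).
spark.sql(
  "SELECT max_by(x, y), min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y)"
).show()
```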

How was this patch tested?

Added tests.

@JoshRosen (Contributor) commented:

Hi @viirya,

Thanks for working on this!

I had a few quick questions:

  • Could you also implement min_by(x, y)?

    • It looks like you might be able to share most of the code except for replacing GreaterThan and greatest, so maybe this difference could be abstracted away via a shared abstract superclass (a sketch of this factoring follows this list).
  • Presto also has three-argument versions of max_by / min_by:

    max_by(x, y, n) → array<[same as x]>
    Returns n values of x associated with the n largest of all input values of y in descending order of y.

    I don't think we need to do this version now, especially since we can always add it in a separate followup PR (which is what Presto originally did: Implement max_by and min_by with an additional n parameter prestodb/presto#3620)

  • Were there any bugs in older Presto implementations that we might have replicated here? Or Presto tests for edge cases that we could emulate?
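To make the shared-superclass idea concrete, here is a toy, self-contained Scala sketch of that factoring. It is an editor's illustration with made-up names (ExtremumBy, MaxByLike, MinByLike), not the Catalyst DeclarativeAggregate code in this PR; only the comparison differs between the max and min variants.

```scala
// Toy sketch of the suggested shared-superclass factoring (not the actual
// Catalyst classes): the two variants share everything except the comparison.
abstract class ExtremumBy[V, O] {
  /** True if `candidate` should replace `current` as the tracked extremum. */
  protected def isBetter(candidate: O, current: O): Boolean

  /** Returns the value paired with the extremum ordering, ignoring null orderings. */
  def apply(rows: Seq[(V, Option[O])]): Option[V] =
    rows.foldLeft(Option.empty[(V, O)]) {
      case (acc, (_, None))                      => acc            // null ordering: skip
      case (None, (v, Some(o)))                  => Some((v, o))   // first non-null ordering
      case (acc @ Some((_, best)), (v, Some(o))) =>
        if (isBetter(o, best)) Some((v, o)) else acc
    }.map { case (value, _) => value }
}

class MaxByLike[V, O](implicit ord: Ordering[O]) extends ExtremumBy[V, O] {
  override protected def isBetter(candidate: O, current: O): Boolean = ord.gt(candidate, current)
}

class MinByLike[V, O](implicit ord: Ordering[O]) extends ExtremumBy[V, O] {
  override protected def isBetter(candidate: O, current: O): Boolean = ord.lt(candidate, current)
}

// new MaxByLike[String, Int].apply(Seq("a" -> Some(10), "b" -> Some(50), "c" -> Some(20)))
// returns Some("b"); the MinByLike variant returns Some("a").
```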

checkAnswer(
  sql("SELECT max_by(x, y) FROM VALUES (('a', null)), (('b', null)) AS tab(x, y)"),
Contributor commented:

This returns null because all values of the ordering column are null? That seems to match Presto behavior:

SELECT max_by(x, y) FROM (
  VALUES
    ('a', null),
    ('b', null)
) AS tab (x, y)

also returns null in Presto 👍

@JoshRosen (Contributor) commented May 8, 2019

This makes sense if you think of this function as being semantically equivalent to

SELECT first(x) FROM tab WHERE y = max(y)
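A runnable approximation of that equivalence, as an editor's sketch (it assumes a local SparkSession and only approximates the aggregate's tie and null handling):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.sql("SELECT * FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y)")
  .createOrReplaceTempView("tab")

// Roughly what max_by(x, y) computes: an x whose y equals max(y).
// If every y were null, the filter would match nothing and first(x)
// would yield null, matching the behavior discussed above.
spark.sql("SELECT first(x) FROM tab WHERE y = (SELECT max(y) FROM tab)").show()  // 'b'
```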

@SparkQA commented May 8, 2019

Test build #105258 has finished for PR 24557 at commit 5c7e3c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class MaxBy(valueExpr: Expression, maxExpr: Expression) extends DeclarativeAggregate

@viirya (Member, Author) commented May 9, 2019

@JoshRosen Thanks for the review!

  • Could you also implement min_by(x, y)?

Yes. I originally planned to have a separate PR for it, but I'm fine with adding it here. A shared abstract superclass to share code sounds good.

  • Presto also has three-argument versions of max_by / min_by:

Agreed. We don't need the three-argument versions now. If we need them, we can add them in a follow-up.

  • Were there any bugs in older implementations of Presto version that we might have replicated here? Or Presto tests for edge-cases that we could emulate?
  • For using rows / structs as the ordering value, I also think it would work. I will add a few tests.
  • For null ordering values, I already have a few test cases. I checked Presto's results and they match. Let me double-check that we've covered the same edge cases.

@viirya (Member, Author) commented May 9, 2019

I've checked a few test cases regarding null values in Presto:

presto> select max_by(x, y) from ( values ('a', null), ('b', null), ('c', null) ) as t (x, y);
 _col0 
-------
 NULL  
(1 row)

Query 20190509_050643_00001_ww5mk, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:02 [0 rows, 0B] [0 rows/s, 0B/s]

presto> select max_by(x, y) from ( values ('a', null), ('b', null), ('c', 10) ) as t (x, y);
 _col0 
-------
 c     
(1 row)

Query 20190509_050655_00002_ww5mk, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

The results match the added test cases here.
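For reference, the same two checks expressed against Spark; this is an editor's sketch assuming a local SparkSession, mirroring the Presto queries above rather than quoting the exact test code in the patch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// All ordering values null: max_by returns null.
assert(spark.sql(
  "SELECT max_by(x, y) FROM VALUES (('a', null)), (('b', null)), (('c', null)) AS tab(x, y)"
).head().isNullAt(0))

// Only one non-null ordering value: its paired x ('c') wins.
assert(spark.sql(
  "SELECT max_by(x, y) FROM VALUES (('a', null)), (('b', null)), (('c', 10)) AS tab(x, y)"
).head().getString(0) == "c")
```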

About prestodb/presto#2040, it was caused by a null reference for the key field in Presto. That shouldn't apply in our case.

@viirya viirya changed the title [SPARK-27653][SQL] Add max_by() SQL aggregate function [SPARK-27653][SQL] Add max_by() and min_by() SQL aggregate functions May 9, 2019
@SparkQA commented May 9, 2019

Test build #105286 has finished for PR 24557 at commit 798f0fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member, Author) commented May 10, 2019

cc @cloud-fan @dongjoon-hyun

package org.apache.spark.sql.catalyst.expressions.aggregate

import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.dsl.expressions._
Contributor commented:

Since we import the expression DSL, can we use the DSL to build the expression trees in this file?

@viirya (Member, Author) commented:

Some can, like And and IsNull. Some can't, like CaseWhen and If.
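For readers unfamiliar with the Catalyst expression DSL, a small sketch of the difference being discussed; the attribute names here are made up for illustration and are not the actual buffer attributes in this file:

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// Made-up attributes standing in for aggregation buffer slots.
val oldOrdering = AttributeReference("oldOrdering", IntegerType)()
val newOrdering = AttributeReference("newOrdering", IntegerType)()

// Explicit constructors:
val explicitTree = And(IsNull(oldOrdering), IsNull(newOrdering))
// The same tree via the DSL:
val dslTree = oldOrdering.isNull && newOrdering.isNull

// CaseWhen / If still have to be constructed explicitly
// (though DSL operators can be used inside them):
val keepGreater = If(newOrdering > oldOrdering, newOrdering, oldOrdering)
```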

@viirya (Member, Author) commented:

I rewrote And and IsNull using the DSL.

Contributor commented:

Maybe we can add DSL support for CaseWhen and If. Not a blocker here.

@SparkQA commented May 11, 2019

Test build #105329 has finished for PR 24557 at commit 26b5a32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor) left a comment:

LGTM!

override def checkInputDataTypes(): TypeCheckResult =
  TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

private lazy val ordering = AttributeReference("ordering", orderingExpr.dataType)()
Contributor commented:

nit: maxOrdering is more precise.

TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

private lazy val ordering = AttributeReference("ordering", orderingExpr.dataType)()
private lazy val value = AttributeReference("value", valueExpr.dataType)()
Contributor commented:

nit: similarly, valueWithMaxOrdering is more precise.

override protected def funcName: String = "max_by"

override protected def predicate(oldExpr: Expression, newExpr: Expression): Expression =
  GreaterThan(oldExpr, newExpr)
Contributor commented:

nit: oldExpr > newExpr

@cloud-fan (Contributor) commented:

LGTM except a few code style comments

@viirya (Member, Author) commented May 13, 2019

Note that we don't add them to `functions` for now. If needed, we can add them in a follow-up.

override def checkInputDataTypes(): TypeCheckResult =
  TypeUtils.checkForOrderingExpr(orderingExpr.dataType, s"function $funcName")

// The attributes used to keep extremum (max or min) and associated aggregated values.
Contributor commented:

Ah, that's a good point. Shall we call it extremumOrdering then?

@viirya (Member, Author) commented:

Good for me. +1

@SparkQA commented May 13, 2019

Test build #105349 has finished for PR 24557 at commit dd1d9de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 13, 2019

Test build #105350 has finished for PR 24557 at commit 05f1767.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to master!

@cloud-fan cloud-fan closed this in d169b0a May 13, 2019
rdblue pushed a commit to rdblue/spark that referenced this pull request Jul 3, 2019

Closes apache#24557 from viirya/SPARK-27653.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d169b0a)
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Oct 15, 2019

Closes apache#24557 from viirya/SPARK-27653.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Oct 26, 2019
### What changes were proposed in this pull request?

This is a follow-up of #24557 to fix `since` version.

### Why are the changes needed?

This is found during 3.0.0-preview preparation.
The version will be exposed in our SQL documentation like the following, so we had better fix it.
- https://spark.apache.org/docs/latest/api/sql/#array_min

### Does this PR introduce any user-facing change?

Yes. It's exposed via the `DESC FUNCTION EXTENDED` SQL command and the SQL doc, but this function is new in 3.0.0.

### How was this patch tested?

Manual.
```
spark-sql> DESC FUNCTION EXTENDED min_by;
Function: min_by
Class: org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
Usage: min_by(x, y) - Returns the value of `x` associated with the minimum value of `y`.
Extended Usage:
    Examples:
      > SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
       a

    Since: 3.0.0
```

Closes #26264 from dongjoon-hyun/SPARK-27653.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@viirya viirya deleted the SPARK-27653 branch December 27, 2023 18:22