
Conversation

@dilipbiswal
Contributor

What changes were proposed in this pull request?

This PR is a rebased version of the original work (link) by @ptkool.

Please give credit to @ptkool for this work.

Description from original PR:
This pull request implements the EVERY and ANY aggregates.

How was this patch tested?

Testing was performed using unit tests, integration tests, and manual tests.

@dilipbiswal
Contributor Author

@gatorsmile I tried to implement the rewrites suggested in the original PR. It does not seem very straightforward to me. The basic issue is that we are unable to replace the aggregate expression with a scalar expression over aggregates. We only support a limited number of true aggregate expressions under a window.

For example, we are unable to rewrite:

select key, value, some(value) over(partition by key order by value) from src group by key, value

to

select key, value, coalesce(max(c1) == true, false) over(partition by key order by value) from src group by key, value

I tried a framework similar to ReplaceExpressions to replace the aggregate expressions. Please let me know what you think.

@SparkQA

SparkQA commented Aug 8, 2018

Test build #94456 has finished for PR 22047 at commit 9503d9e.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94459 has finished for PR 22047 at commit 6288a05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Please give credit to @ptkool for this work.

FWIW, we can now give credit to multiple people per 51bee7a :-)

Member

hm, looks unrelated.

Contributor Author

@HyukjinKwon Will look into this.

Member

Looks unrelated

Contributor Author

@HyukjinKwon Not sure why; when I ran build/sbt doc, I got an error here. That's the reason I had to fix it.

Member

nit: previous indentation was correct.

Contributor Author

@HyukjinKwon Thanks ... will fix.

Member

nit: since version

Contributor Author

Thanks, will fix.

@SparkQA

SparkQA commented Aug 10, 2018

Test build #94586 has finished for PR 22047 at commit af4d901.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 11, 2018

Test build #94588 has finished for PR 22047 at commit 6593cf4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2018

Test build #94602 has finished for PR 22047 at commit 6593cf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 2, 2018

Test build #96872 has finished for PR 22047 at commit 291a13d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Oct 2, 2018

Let me post something I wrote recently. Could you add test cases to ensure that we do not break the "Ignore NULLs" policy?

All the set/aggregate functions ignore NULLs. The typical built-in Set/Aggregate functions are AVG, COUNT, MAX, MIN, SUM, GROUPING.

Note, COUNT(*) is actually equivalent to COUNT(1). Thus, it still includes rows containing null.

Tip, because of the "Ignore NULLs" policy, Sum(a) + Sum(b) is not the same as Sum(a+b).
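
For example (a minimal sketch using a hypothetical inline table):

select sum(a) + sum(b), sum(a + b) from values (1, null), (null, 2) as t(a, b);
-- sum(a) + sum(b) = 1 + 2 = 3, but sum(a + b) is NULL because a + b is NULL on every row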

Note, although the set functions follow the "Ignore NULLs" policy, MIN, MAX, SUM, AVG, EVERY, ANY and SOME return NULL if 1) every value is NULL or 2) the SELECT returns no rows at all. COUNT never returns NULL.
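
A quick sketch of this behavior (hypothetical data):

select min(x), sum(x), count(x) from values (cast(null as int)) as t(x);
-- returns NULL, NULL, 0: min and sum yield NULL when every value is NULL, count never does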

TODO: When a set function eliminates NULLs, Spark SQL does not issue the warning message SQLSTATE 01003 "null value eliminated in set function" that other systems do.

TODO: Check whether all the expressions that extend AggregateFunction follow the "Ignore NULLs" policy. If not, we need more investigation to see whether we should correct them.

TODO: When Spark SQL supports ALL, ANY, and SOME, they should follow the same "Ignore NULLs" policy.

@dilipbiswal
Contributor Author

@gatorsmile Thanks, I will check.

@dilipbiswal
Contributor Author

@gatorsmile First of all, thank you very much. The added aggregates actually weren't filtering NULLs. I have fixed the issue and added additional test cases. Thank you.

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96968 has finished for PR 22047 at commit b378fff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96978 has finished for PR 22047 at commit b378fff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Is it possible to rewrite these 3 new functions with existing expressions? e.g.
every(col) -> count(if (col) null else 1) == 0
any(col) -> count(if (col) 1 else null) > 0
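
For illustration, the suggested rewrite applied to a simple query (table and column names are hypothetical):

select key, every(flag) from src group by key
-- could be rewritten as
select key, count(if(flag, null, 1)) = 0 from src group by key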

@dilipbiswal
Contributor Author

@cloud-fan Please see my comment (link). I had tried the rewrite using max and min as suggested by Herman and Reynold in the original PR. I was unable to do it when the aggregate is part of a window.
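
For reference, that max/min rewrite looks roughly like this on a plain (non-window) aggregate; the window case is where it breaks down (column names are hypothetical):

select key, every(flag), some(flag) from src group by key
-- roughly equivalent to (booleans ordered false < true)
select key, min(flag), max(flag) from src group by key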

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96995 has finished for PR 22047 at commit e1764df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

return Column(jc)


def every(col):
Member

Please keep the SQL functions and remove the function APIs. Thanks!

Contributor Author

@gatorsmile Hi Sean, I have prepared two branches: one in which these new aggregate functions extend the base Max and Min classes, basically reusing code, and one in which we replace these aggregate expressions in the optimizer. Below are the links.

  1. branch-extend

  2. branch-rewrite

I would prefer option 1 for the following reasons.

  1. Code changes are simpler.
  2. It supports these aggregates as window expressions naturally; in the other option I have to block it.
  3. It seems to me that for these simple mappings we probably don't need a rewrite framework. We could add one in the future if we need a more complex transformation.

Please let me know how we want to move forward with this. Thanks!

Contributor

+1 for option 1

Contributor Author

@cloud-fan Thank you very much for your response. I will create a new PR based on option-1 today and close this one.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97731 has finished for PR 22047 at commit e1764df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97710 has finished for PR 22047 at commit e1764df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97798 has finished for PR 22047 at commit e1764df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.types._

@ExpressionDescription(
usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is true.")
Member

BTW, don't forget to add since.
