
Conversation

@dilipbiswal
Contributor

What changes were proposed in this pull request?

This PR is a rebased version of the original work (link) by @ptkool.

Please give credit to @ptkool for this work.

Description from original PR:
This pull request implements the EVERY and ANY aggregates.

How was this patch tested?

Testing was performed using unit tests, integration tests, and manual tests.

@dilipbiswal
Contributor Author

@gatorsmile I tried to implement the rewrites suggested in the original PR. It does not seem very straightforward to me. The basic issue is that we are unable to replace the aggregate expression with a scalar expression over aggregates. We only support a limited number of true aggregate expressions under a window.

For example, we are unable to rewrite:

select key, value, some(value) over(partition by key order by value) from src group by key, value

to

select key, value, coalesce(max(c1) == true, false) over(partition by key order by value) from src group by key, value

I tried a framework similar to ReplaceExpressions to replace the aggregate expressions. Please let me know what you think.

@SparkQA

SparkQA commented Aug 8, 2018

Test build #94456 has finished for PR 22047 at commit 9503d9e.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 9, 2018

Test build #94459 has finished for PR 22047 at commit 6288a05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Please give credit to @ptkool for this work.

FWIW, we can now give credit to multiple people per 51bee7a :-)

Member

hm, looks unrelated.

Contributor Author

@HyukjinKwon Will look into this.

Member

Looks unrelated

Contributor Author

@HyukjinKwon Not sure why; when I ran build/sbt doc, I got an error here. That's the reason I had to fix it.

Member

nit: previous indentation was correct.

Contributor Author

@HyukjinKwon Thanks ... will fix.

Member

nit: since version

Contributor Author

Thanks, will fix.

@SparkQA

SparkQA commented Aug 10, 2018

Test build #94586 has finished for PR 22047 at commit af4d901.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 11, 2018

Test build #94588 has finished for PR 22047 at commit 6593cf4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 11, 2018

Test build #94602 has finished for PR 22047 at commit 6593cf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 2, 2018

Test build #96872 has finished for PR 22047 at commit 291a13d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Oct 2, 2018

Let me post something I wrote recently. Could you add test cases to ensure that we do not break the "Ignore NULLs" policy?

All the set/aggregate functions ignore NULLs. The typical built-in Set/Aggregate functions are AVG, COUNT, MAX, MIN, SUM, GROUPING.

Note, COUNT(*) is actually equivalent to COUNT(1). Thus, it still includes rows containing null.

Tip, because of the "Ignore NULLs" policy, Sum(a) + Sum(b) is not the same as Sum(a+b).
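
For example (a minimal sketch using a hypothetical inline table):

select sum(a) + sum(b), sum(a + b) from values (1, null), (null, 2) as t(a, b);
-- sum(a) + sum(b) = 1 + 2 = 3, but sum(a + b) is NULL because a + b is NULL on every row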

Note, although the set functions follow the "Ignore NULLs" policy, MIN, MAX, SUM, AVG, EVERY, ANY and SOME return NULL if 1) every value is NULL or 2) the SELECT returns no rows at all. COUNT never returns NULL.
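
A quick sketch of this behavior (hypothetical data):

select min(x), sum(x), count(x) from values (cast(null as int)) as t(x);
-- returns NULL, NULL, 0: min and sum yield NULL when every value is NULL, count never does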

TODO: When a set function eliminates NULLs, Spark SQL does not issue the warning message SQLSTATE 01003 "null value eliminated in set function" that other systems do.

TODO: Check whether all the expressions that extend AggregateFunction follow the "Ignore NULLs" policy. If not, we need more investigation to see whether we should correct them.

TODO: When Spark SQL supports ALL, ANY, and SOME, they should follow the same "Ignore NULLs" policy.

@dilipbiswal
Contributor Author

@gatorsmile Thanks, I will check.

@dilipbiswal
Contributor Author

@gatorsmile First of all, thank you very much. The added aggregates actually weren't filtering NULLs. I have fixed the issue and added additional test cases. Thank you.

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96968 has finished for PR 22047 at commit b378fff.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96978 has finished for PR 22047 at commit b378fff.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Is it possible to rewrite these 3 new functions with existing expressions? e.g.
every(col) -> count(if (col) null else 1) == 0
any(col) -> count(if (col) 1 else null) > 0
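
For illustration, the suggested rewrite applied to a simple query (table and column names are hypothetical):

select key, every(flag) from src group by key
-- could be rewritten as
select key, count(if(flag, null, 1)) = 0 from src group by key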

@dilipbiswal
Contributor Author

@cloud-fan Please see my comment (link). I had tried the rewrite using max and min as suggested by Herman and Reynold in the original PR. I was unable to do it when the aggregate is part of a window.
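
For reference, that max/min rewrite looks roughly like this on a plain (non-window) aggregate; the window case is where it breaks down (column names are hypothetical):

select key, every(flag), some(flag) from src group by key
-- roughly equivalent to (booleans ordered false < true)
select key, min(flag), max(flag) from src group by key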

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96995 has finished for PR 22047 at commit e1764df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

return Column(jc)


def every(col):
Member

Please keep the SQL functions and remove the function APIs. Thanks!

Contributor Author

@gatorsmile Hi Sean, I have prepared two branches: one in which these new aggregate functions extend the base Max and Min classes, basically reusing code, and one in which we replace these aggregate expressions in the optimizer. Below are the links.

  1. branch-extend

  2. branch-rewrite

I would prefer option 1 for the following reasons.

  1. Code changes are simpler.
  2. It supports these aggregates as window expressions naturally; in the other option I have to block it.
  3. It seems to me that for these simple mappings we probably don't need a rewrite framework. We could add one in the future if we need a more complex transformation.

Please let me know how we want to move forward with this. Thanks!

Contributor

+1 for option 1

Contributor Author

@cloud-fan Thank you very much for your response. I will create a new PR based on option-1 today and close this one.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97731 has finished for PR 22047 at commit e1764df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97710 has finished for PR 22047 at commit e1764df.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2018

Test build #97798 has finished for PR 22047 at commit e1764df.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.types._

@ExpressionDescription(
usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is true.")
Member

BTW, don't forget to add since.
