[SPARK-19851] Add support for EVERY and ANY (SOME) aggregates #22047
@gatorsmile I tried to implement the rewrites suggested in the original PR, but it does not seem very straightforward to me. The basic issue is that we are unable to replace an aggregate expression with a scalar expression over aggregates, because we only support a limited number of true aggregate expressions under a window. For example, we are unable to rewrite … to …. I tried a similar framework to replace aggregate expressions like …
Test build #94456 has finished for PR 22047 at commit
Test build #94459 has finished for PR 22047 at commit
python/pyspark/sql/functions.py
Outdated
hm, looks unrelated.
@HyukjinKwon Will look into this.
Looks unrelated
@HyukjinKwon Not sure why, but when I ran build/sbt doc I got an error here. That's the reason I had to fix it.
nit: previous indentation was correct.
@HyukjinKwon Thanks ... will fix.
nit: since version
Thanks. will fix
Test build #94586 has finished for PR 22047 at commit
Test build #94588 has finished for PR 22047 at commit
retest this please
Test build #94602 has finished for PR 22047 at commit
6593cf4 to
291a13d
Test build #96872 has finished for PR 22047 at commit
Let me post something I wrote recently. Could you add test cases to ensure that we do not break the "Ignore NULLs" policy?
@gatorsmile Thanks, I will check.
@gatorsmile First of all, thank you very much. The added aggregates were not actually filtering out nulls. I have fixed the issue and added additional test cases. Thank you.
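For readers following the "Ignore NULLs" point, here is a minimal plain-Python sketch of what null-filtering means for these aggregates. It is not the code from this PR: `every_agg` and `any_agg` are hypothetical names, `None` stands in for SQL NULL, and returning `None` when there is no non-null input is an assumption chosen to match how MIN and MAX usually behave.

```python
from typing import Iterable, Optional


def every_agg(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """True if every non-null input is true. Nulls (None) are skipped."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Assumption: no non-null input yields null, as MIN/MAX would.
        return None
    return all(non_null)


def any_agg(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """True if at least one non-null input is true. Nulls (None) are skipped."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Assumption: no non-null input yields null, as MIN/MAX would.
        return None
    return any(non_null)


if __name__ == "__main__":
    # A null in the group must not change the answer.
    print(every_agg([True, None, True]))
    print(any_agg([False, None, False]))
```

A test for the policy would compare each group with and without nulls mixed in and expect the same result.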
Test build #96968 has finished for PR 22047 at commit
retest this please
Test build #96978 has finished for PR 22047 at commit
Is it possible to rewrite these 3 new functions with existing expressions? e.g. …
@cloud-fan Please see my comment link. I had tried to rewrite using max and min, as suggested by Herman and Reynold in the original PR. I was unable to do it when the aggregate is part of the window.
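The max/min idea mentioned above relies on booleans being ordered with false below true, so EVERY behaves like MIN and ANY like MAX once nulls are dropped. The sketch below checks that equivalence exhaustively in plain Python for short inputs. All function names are hypothetical, `None` stands in for SQL NULL, and the sketch says nothing about the window-expression limitation, which is specific to the Spark analyzer.

```python
from itertools import product


def _non_null(values):
    # Drop nulls (None) before aggregating.
    return [v for v in values if v is not None]


def every_ref(values):
    vs = _non_null(values)
    return all(vs) if vs else None


def any_ref(values):
    vs = _non_null(values)
    return any(vs) if vs else None


def every_via_min(values):
    # False < True, so the minimum is True only if all values are True.
    vs = _non_null(values)
    return min(vs) if vs else None


def any_via_max(values):
    # The maximum is True as soon as one value is True.
    vs = _non_null(values)
    return max(vs) if vs else None


def check_equivalence(max_len=4):
    """Compare both formulations on every input of up to max_len values."""
    for n in range(max_len + 1):
        for row in product([True, False, None], repeat=n):
            if every_via_min(row) != every_ref(row):
                return False
            if any_via_max(row) != any_ref(row):
                return False
    return True


if __name__ == "__main__":
    print(check_equivalence())
```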
Test build #96995 has finished for PR 22047 at commit
    return Column(jc)


def every(col):
Please keep the SQL functions and remove the function APIs. Thanks!
@gatorsmile OK.
@gatorsmile Hi Sean, I have prepared two branches. In one, the new aggregate functions extend the base Max and Min classes, basically reusing their code. In the other, we replace these aggregate expressions in the optimizer. Below are the links.
I would prefer option 1 for the following reasons:
- Code changes are simpler.
- It supports these aggregates as window expressions naturally. In the other option I have to block it.
- It seems to me that for these simple mappings we probably don't need a rewrite framework. We could add one in the future if we need a more complex transformation.
Please let me know how we want to move forward with this. Thanks!
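To illustrate the shape of option 1, here is a toy Python model of reuse by inheritance. It is only a sketch: `MinAgg`, `MaxAgg`, `EveryAgg`, `AnyAgg` and `run` are hypothetical stand-ins, not Spark's Catalyst classes, and the only behavior the subclasses add is a boolean input check.

```python
class MinAgg:
    """Toy null-skipping MIN aggregate (a stand-in, not Spark's Min)."""

    def __init__(self):
        self.buffer = None  # no non-null value seen yet

    def check_input(self, value):
        # The base aggregate accepts any comparable value.
        pass

    def update(self, value):
        if value is None:  # nulls are ignored
            return
        self.check_input(value)
        if self.buffer is None or value < self.buffer:
            self.buffer = value

    def eval(self):
        return self.buffer


class MaxAgg(MinAgg):
    """Toy null-skipping MAX aggregate: same buffer, opposite comparison."""

    def update(self, value):
        if value is None:
            return
        self.check_input(value)
        if self.buffer is None or value > self.buffer:
            self.buffer = value


class _BooleanOnly:
    """Mixin that restricts the input type to booleans."""

    def check_input(self, value):
        if not isinstance(value, bool):
            raise TypeError("expected a boolean input, got %r" % (value,))


class EveryAgg(_BooleanOnly, MinAgg):
    """EVERY reuses the MIN logic unchanged."""


class AnyAgg(_BooleanOnly, MaxAgg):
    """ANY (SOME) reuses the MAX logic unchanged."""


def run(agg, values):
    # Feed all rows to the aggregate and return its final value.
    for v in values:
        agg.update(v)
    return agg.eval()


if __name__ == "__main__":
    print(run(EveryAgg(), [True, None, False]))
    print(run(AnyAgg(), [False, None, True]))
```

Because the subclasses inherit the whole update and eval path, anything that already works for the parent aggregate needs no extra handling in this toy model, which mirrors the "simpler code changes" argument above.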
+1 for option 1
@cloud-fan Thank you very much for your response. I will create a new PR based on option-1 today and close this one.
Test build #97731 has finished for PR 22047 at commit
Test build #97710 has finished for PR 22047 at commit
Test build #97798 has finished for PR 22047 at commit
import org.apache.spark.sql.types._

@ExpressionDescription(
  usage = "_FUNC_(expr) - Returns true if at least one value of `expr` is true.")
BTW, don't forget to add since.
What changes were proposed in this pull request?
This PR is a rebased version of original work link by @ptkool.
Please give credit to @ptkool for this work.
Description from original PR:
This pull request implements the EVERY and ANY aggregates.
How was this patch tested?
Testing was performed using unit tests, integration tests, and manual tests.