[SPARK-8641][SQL] Native Spark Window functions #9819
Conversation
test this please
This is a small trick that allows us to add the ImperativeAggregate to the evaluation projection. The advantage is that we avoid the relatively expensive generic update method, and that we don't need a separate indices array to keep track of where to store the evaluation result.
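A minimal plain-Scala sketch of the idea (the class names are illustrative stand-ins, not Spark's actual interfaces): once the imperative aggregate is bound to an offset in the shared buffer, its eval can sit in the same projection as ordinary bound references.

```scala
// Hypothetical stand-ins to illustrate the trick, not Spark's real classes.
trait BufferedExpression { def eval(buffer: Array[Any]): Any }

// A declarative expression bound to a buffer slot.
case class BoundReference(ordinal: Int) extends BufferedExpression {
  def eval(buffer: Array[Any]): Any = buffer(ordinal)
}

// An imperative aggregate whose eval reads its own buffer slot; because it
// is already bound to an offset, it can join the same projection as the
// declarative expressions, with no separate indices array needed.
case class ImperativeSum(bufferOffset: Int) extends BufferedExpression {
  def update(buffer: Array[Any], input: Int): Unit =
    buffer(bufferOffset) = buffer(bufferOffset).asInstanceOf[Int] + input
  def eval(buffer: Array[Any]): Any = buffer(bufferOffset)
}

// One projection evaluates declarative and imperative results together.
def project(exprs: Seq[BufferedExpression], buffer: Array[Any]): Array[Any] =
  exprs.map(_.eval(buffer)).toArray

val buffer = Array[Any](0, "partition-key")
val sum = ImperativeSum(bufferOffset = 0)
Seq(1, 2, 3).foreach(sum.update(buffer, _))
val result = project(Seq(sum, BoundReference(1)), buffer)
// result(0) == 6, result(1) == "partition-key"
```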
Recently I also tried this trick, but it failed, because eval() usually only uses attributes in the buffer, while BoundReference will try to look up attributes of the AggregateFunction's child, which may not exist.
Could you add a test case for it (using an AggregateFunction as a window function)?
We should be fine as long as we only add already-bound ImperativeAggregates to the projection to be code generated. Unbound ImperativeAggregates will cause a lot of trouble.
I use HyperLogLogPlusPlus in the last test in the DataFrameWindowFunctionSuite: https://github.com/hvanhovell/spark/blob/SPARK-8641-2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowSuite.scala#L222 Is this enough?
Test build #46251 has finished for PR 9819 at commit
Test build #46331 has finished for PR 9819 at commit
Test build #46338 has finished for PR 9819 at commit
Test build #46346 has finished for PR 9819 at commit
withNewChildren does not work with AggregateExpression; I am working around that here.
Test build #46820 has finished for PR 9819 at commit
@hvanhovell Thank you for the PR! Just a quick heads-up: we will allocate time to review during next week (and the week after, if we need more time to work on it).
One quick question: with this PR, is it possible to use any of Spark SQL's aggregate functions as a window function?
Yes. You can use any Spark aggregate function as a window function. Most Hive UDAFs should also work, except for the pivoted ones...
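To illustrate what this means, here is a plain-Scala sketch of the semantics of an aggregate over a window partition; the commented DataFrame call is the assumed API shape, not code from this PR:

```scala
// In the DataFrame API this corresponds to something like (assumed syntax):
//   df.select($"k", avg($"v").over(Window.partitionBy($"k")))
// A plain-Scala model of what that computes: every row receives the
// aggregate of all rows in its partition.
case class Row(k: String, v: Double)

def avgOverPartition(rows: Seq[Row]): Seq[(Row, Double)] = {
  // Compute the aggregate once per partition key...
  val avgByKey = rows.groupBy(_.k).map { case (k, rs) =>
    k -> rs.map(_.v).sum / rs.size
  }
  // ...then attach it to every row, preserving the input cardinality.
  rows.map(r => (r, avgByKey(r.k)))
}

val rows = Seq(Row("a", 1.0), Row("a", 3.0), Row("b", 5.0))
val out = avgOverPartition(rows)
// out.map(_._2) == Seq(2.0, 2.0, 5.0)
```

Unlike a GROUP BY, the window version keeps one output row per input row, which is the key difference the PR exploits to reuse the existing aggregate functions.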
Test build #46998 has finished for PR 9819 at commit
Do we need to check if buckets is a foldable expression?
Good point. Yes and no. The buckets value only has to be constant within a partition; it would also work if the value is part of the partitioning clause. That is, however, quite a bit of work to get in. For now I'd rather enforce a globally constant number of buckets. What do you think?
Oh, I somehow missed case x => throw new AnalysisException(... Sorry.
It makes sense. Let's keep it as is.
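A toy sketch of the check under discussion (the ADT below is illustrative, not Spark's actual Expression hierarchy): a literal bucket count is accepted, anything non-constant falls into the rejecting catch-all branch.

```scala
// Toy expression ADT; names mirror Spark's but this is not Spark code.
sealed trait Expression { def foldable: Boolean }
case class Literal(value: Int) extends Expression { val foldable = true }
case class AttributeReference(name: String) extends Expression { val foldable = false }

// ntile's bucket count must reduce to a constant; anything else is
// rejected, analogous to the `case x => throw new AnalysisException(...)`
// branch mentioned above.
def checkBuckets(buckets: Expression): Int = buckets match {
  case Literal(n) if n > 0 => n
  case x => throw new IllegalArgumentException(
    s"Number of buckets must be a positive constant, got: $x")
}

val n = checkBuckets(Literal(4))          // accepted: 4
// checkBuckets(AttributeReference("b"))  // would throw
```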
Do we have a test case that uses a UDAF as a window function?
I guess we need to say it is also used as the default frame?
Actually, is it used as the default frame?
It is a bit stricter than that: it is the only frame in which a WindowFunction is supposed to be evaluated.
Ah, I see.
Can you add Scala docs to explain how we evaluate a regular aggregate function when it is used as a window function? (Maybe I missed it.)
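For intuition, the evaluation of a regular aggregate over a window frame can be sketched in plain Scala (this is a simplified model under the assumption of a running frame, UNBOUNDED PRECEDING to CURRENT ROW, not the actual Spark implementation): for each row, initialize a buffer, replay the frame's rows through update, then call eval.

```scala
// Simplified aggregate interface: initialize, fold inputs, finalize.
trait Agg[B, I, O] {
  def init: B
  def update(buffer: B, input: I): B
  def eval(buffer: B): O
}

object SumAgg extends Agg[Long, Long, Long] {
  def init = 0L
  def update(b: Long, i: Long) = b + i
  def eval(b: Long) = b
}

// Evaluate the aggregate once per row over a running frame:
// row i sees rows 0..i of its partition.
def runningFrame[B, I, O](agg: Agg[B, I, O], partition: Seq[I]): Seq[O] =
  partition.indices.map { i =>
    val frame = partition.slice(0, i + 1)          // frame rows for row i
    agg.eval(frame.foldLeft(agg.init)(agg.update)) // init, update*, eval
  }

val sums = runningFrame(SumAgg, Seq(1L, 2L, 3L))
// sums == Seq(1, 3, 6): a running sum, one result per input row
```

The real implementation avoids recomputing the frame from scratch where it can, but the init/update/eval contract is the same one regular aggregation uses, which is why any aggregate function can be reused here.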
format
4 spaces? Or the colon without a result?
@hvanhovell This is very cool! I have finished my review.
LGTM
Test build #47676 has finished for PR 9819 at commit
Build failed due to an R versioning problem. I'll try again when this is sorted out.
@yhuai I fixed/addressed/improved most of the things you raised. Two things worth pointing out: you can find the test for UDAFs here: https://github.com/hvanhovell/spark/blob/SPARK-8641-2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowSuite.scala#L240-L294 You can find the documentation on how we evaluate a regular
retest this please
Test build #47685 has finished for PR 9819 at commit
Test build #47726 has finished for PR 9819 at commit
Seems we can add some comments to explain how it works in a follow-up PR?
I'll add some documentation on all the window functions. The inner workings of ntile in particular need some documentation.
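As a starting point for that documentation, the arithmetic behind ntile can be sketched as follows (an illustrative model, not Spark's actual code): n rows are split into b buckets, and the first n % b buckets each receive one extra row.

```scala
// Assign 1-based ntile bucket numbers to n rows split into b buckets.
// The first (n % b) buckets get ceil(n / b) rows, the rest get floor(n / b).
def ntile(n: Int, b: Int): Seq[Int] = {
  val base  = n / b   // minimum rows per bucket
  val extra = n % b   // number of buckets holding one extra row
  (0 until b).flatMap { bucket =>
    val size = base + (if (bucket < extra) 1 else 0)
    Seq.fill(size)(bucket + 1) // buckets are 1-based
  }
}

val tiles = ntile(7, 3)
// tiles == Seq(1, 1, 1, 2, 2, 3, 3): bucket 1 absorbs the remainder row
```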
Thank you @hvanhovell! I am going to merge it. Let's have a follow-up PR to add more docs to those newly added functions. Also, can we add tests like the following? Basically, we test cases using
…-up (docs & tests) This PR is a follow-up for PR #9819. It adds documentation for the window functions and a couple of NULL tests. The documentation was largely based on the documentation in (the source of) Hive and Presto:
* https://prestodb.io/docs/current/functions/window.html
* https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
I am not sure if we need to add the licenses of these two projects to the licenses directory. They are both under the ASL. srowen any thoughts? cc yhuai
Author: Herman van Hovell <[email protected]> Closes #10402 from hvanhovell/SPARK-8641-docs.
This PR removes Hive window functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features. This has the following advantages:
* Better memory management.
* The ability to use Spark UDAFs in window functions.
cc rxin / yhuai
Author: Herman van Hovell <[email protected]> Closes apache#9819 from hvanhovell/SPARK-8641-2.