[SPARK-8641][SQL] Native Spark Window functions #9819
Conversation
test this please
This is a small trick that allows us to add the ImperativeAggregate to the evaluation projection. The advantage is that we avoid the relatively expensive generic update method, and that we don't need a separate indices array to keep track of where to store the evaluation result.
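A minimal plain-Scala sketch of the idea (the class names are illustrative stand-ins, not Spark's actual interfaces): once the imperative aggregate is bound to an offset in the shared buffer, its eval can sit in the same projection as ordinary bound references.

```scala
// Hypothetical stand-ins to illustrate the trick, not Spark's real classes.
trait BufferedExpression { def eval(buffer: Array[Any]): Any }

// A declarative expression bound to a buffer slot.
case class BoundReference(ordinal: Int) extends BufferedExpression {
  def eval(buffer: Array[Any]): Any = buffer(ordinal)
}

// An imperative aggregate whose eval reads its own buffer slot; because it
// is already bound to an offset, it can join the same projection as the
// declarative expressions, with no separate indices array needed.
case class ImperativeSum(bufferOffset: Int) extends BufferedExpression {
  def update(buffer: Array[Any], input: Int): Unit =
    buffer(bufferOffset) = buffer(bufferOffset).asInstanceOf[Int] + input
  def eval(buffer: Array[Any]): Any = buffer(bufferOffset)
}

// One projection evaluates declarative and imperative results together.
def project(exprs: Seq[BufferedExpression], buffer: Array[Any]): Array[Any] =
  exprs.map(_.eval(buffer)).toArray

val buffer = Array[Any](0, "partition-key")
val sum = ImperativeSum(bufferOffset = 0)
Seq(1, 2, 3).foreach(sum.update(buffer, _))
val result = project(Seq(sum, BoundReference(1)), buffer)
// result(0) == 6, result(1) == "partition-key"
```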
Recently I also tried this trick, but it failed, because eval() usually only uses attributes in the buffer, while BoundReference will try to look up attributes of the AggregateFunction's child, which may not exist.
Could you add a test case for it (using an AggregateFunction as a window function)?
We should be fine as long as we only add already-bound ImperativeAggregates to the projection to be code generated. Unbound ImperativeAggregates will cause a lot of trouble.
I use HyperLogLogPlusPlus in the last test in the DataFrameWindowFunctionSuite: https://github.com/hvanhovell/spark/blob/SPARK-8641-2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowSuite.scala#L222 Is this enough?
Test build #46251 has finished for PR 9819 at commit
Test build #46331 has finished for PR 9819 at commit
Test build #46338 has finished for PR 9819 at commit
Test build #46346 has finished for PR 9819 at commit
withNewChildren does not work with AggregateExpression; I am working around that here.
Test build #46820 has finished for PR 9819 at commit
@hvanhovell Thank you for the PR! Just a quick heads-up: we will allocate time to review during next week (and the week after, if we need more time to work on it).
One quick question: with this PR, is it possible to use any of Spark SQL's aggregate functions as a window function?
Yes. You can use any Spark aggregate function as a window function. Most Hive UDAFs should also work, except for the pivoted ones...
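To illustrate what this means, here is a plain-Scala sketch of the semantics of an aggregate over a window partition; the commented DataFrame call is the assumed API shape, not code from this PR:

```scala
// In the DataFrame API this corresponds to something like (assumed syntax):
//   df.select($"k", avg($"v").over(Window.partitionBy($"k")))
// A plain-Scala model of what that computes: every row receives the
// aggregate of all rows in its partition.
case class Row(k: String, v: Double)

def avgOverPartition(rows: Seq[Row]): Seq[(Row, Double)] = {
  // Compute the aggregate once per partition key...
  val avgByKey = rows.groupBy(_.k).map { case (k, rs) =>
    k -> rs.map(_.v).sum / rs.size
  }
  // ...then attach it to every row, preserving the input cardinality.
  rows.map(r => (r, avgByKey(r.k)))
}

val rows = Seq(Row("a", 1.0), Row("a", 3.0), Row("b", 5.0))
val out = avgOverPartition(rows)
// out.map(_._2) == Seq(2.0, 2.0, 5.0)
```

Unlike a GROUP BY, the window version keeps one output row per input row, which is the key difference the PR exploits to reuse the existing aggregate functions.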
Test build #46998 has finished for PR 9819 at commit
Do we need to check if buckets is a foldable expression?
Good point. Yes and no. The buckets value only has to be constant within a partition; it would also work if the value is part of the partitioning clause. That is, however, quite a bit of work to get in. For now I'd rather enforce a globally constant number of buckets. What do you think?
Oh, I somehow missed case x => throw new AnalysisException(... Sorry.
It makes sense. Let's keep it as is.
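A toy sketch of the check under discussion (the ADT below is illustrative, not Spark's actual Expression hierarchy): a literal bucket count is accepted, anything non-constant falls into the rejecting catch-all branch.

```scala
// Toy expression ADT; names mirror Spark's but this is not Spark code.
sealed trait Expression { def foldable: Boolean }
case class Literal(value: Int) extends Expression { val foldable = true }
case class AttributeReference(name: String) extends Expression { val foldable = false }

// ntile's bucket count must reduce to a constant; anything else is
// rejected, analogous to the `case x => throw new AnalysisException(...)`
// branch mentioned above.
def checkBuckets(buckets: Expression): Int = buckets match {
  case Literal(n) if n > 0 => n
  case x => throw new IllegalArgumentException(
    s"Number of buckets must be a positive constant, got: $x")
}

val n = checkBuckets(Literal(4))          // accepted: 4
// checkBuckets(AttributeReference("b"))  // would throw
```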
Do we have a test case that uses a UDAF as a window function?
I guess we need to say it is also used as the default frame?
Actually, is it used as the default frame?
It is a bit stricter than that: it is the only frame in which a WindowFunction is supposed to be evaluated.
Ah, I see.
Can you add Scala docs to explain how we evaluate a regular aggregate function when it is used as a window function? (Maybe I missed it.)
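For intuition, the evaluation of a regular aggregate over a window frame can be sketched in plain Scala (this is a simplified model under the assumption of a running frame, UNBOUNDED PRECEDING to CURRENT ROW, not the actual Spark implementation): for each row, initialize a buffer, replay the frame's rows through update, then call eval.

```scala
// Simplified aggregate interface: initialize, fold inputs, finalize.
trait Agg[B, I, O] {
  def init: B
  def update(buffer: B, input: I): B
  def eval(buffer: B): O
}

object SumAgg extends Agg[Long, Long, Long] {
  def init = 0L
  def update(b: Long, i: Long) = b + i
  def eval(b: Long) = b
}

// Evaluate the aggregate once per row over a running frame:
// row i sees rows 0..i of its partition.
def runningFrame[B, I, O](agg: Agg[B, I, O], partition: Seq[I]): Seq[O] =
  partition.indices.map { i =>
    val frame = partition.slice(0, i + 1)          // frame rows for row i
    agg.eval(frame.foldLeft(agg.init)(agg.update)) // init, update*, eval
  }

val sums = runningFrame(SumAgg, Seq(1L, 2L, 3L))
// sums == Seq(1, 3, 6): a running sum, one result per input row
```

The real implementation avoids recomputing the frame from scratch where it can, but the init/update/eval contract is the same one regular aggregation uses, which is why any aggregate function can be reused here.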
format
4 spaces? Or the colon without a result?
@hvanhovell This is very cool! I have finished my review.
LGTM
Test build #47676 has finished for PR 9819 at commit
Build failed due to an R versioning problem. I'll try again when this is sorted out.
@yhuai I fixed/addressed/improved most of the things you raised. Two things worth pointing out: you can find the test for UDAFs here: https://github.com/hvanhovell/spark/blob/SPARK-8641-2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowSuite.scala#L240-L294 You can find the documentation on how we evaluate a regular
retest this please
Test build #47685 has finished for PR 9819 at commit
Test build #47726 has finished for PR 9819 at commit
Seems we can add some comments to explain how it works in a follow-up PR?
I'll add some documentation on all the window functions. The inner workings of ntile in particular need some documentation.
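As a starting point for that documentation, the arithmetic behind ntile can be sketched as follows (an illustrative model, not Spark's actual code): n rows are split into b buckets, and the first n % b buckets each receive one extra row.

```scala
// Assign 1-based ntile bucket numbers to n rows split into b buckets.
// The first (n % b) buckets get ceil(n / b) rows, the rest get floor(n / b).
def ntile(n: Int, b: Int): Seq[Int] = {
  val base  = n / b   // minimum rows per bucket
  val extra = n % b   // number of buckets holding one extra row
  (0 until b).flatMap { bucket =>
    val size = base + (if (bucket < extra) 1 else 0)
    Seq.fill(size)(bucket + 1) // buckets are 1-based
  }
}

val tiles = ntile(7, 3)
// tiles == Seq(1, 1, 1, 2, 2, 3, 3): bucket 1 absorbs the remainder row
```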
Thank you @hvanhovell! I am going to merge it. Let's have a follow-up PR to add more docs to those newly added functions. Also, can we add tests like the following? Basically, we test cases using
…-up (docs & tests) This PR is a follow-up for PR #9819. It adds documentation for the window functions and a couple of NULL tests. The documentation was largely based on the documentation in (the source of) Hive and Presto:
* https://prestodb.io/docs/current/functions/window.html
* https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
I am not sure if we need to add the licenses of these two projects to the licenses directory. They are both under the ASL. srowen any thoughts? cc yhuai
Author: Herman van Hovell <[email protected]> Closes #10402 from hvanhovell/SPARK-8641-docs.
This PR removes Hive window functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features. This has the following advantages:
* Better memory management.
* The ability to use Spark UDAFs in window functions.
cc rxin / yhuai
Author: Herman van Hovell <[email protected]> Closes apache#9819 from hvanhovell/SPARK-8641-2.