Skip to content

Conversation

@liancheng
Copy link
Contributor

This WIP PR is refactored from PR #2953. Please refer to the original PR description for features implemented and not implemented in this PR.

The original PR was a huge one, commenting on each issue could be very time consuming. After offline discussions with @guowei2, I decided to work on a refactoring branch to fix most minor issues first and then start discussion based on this refactored version.

Major issues left in this PR are:

  1. Window spec is added to aggregation functions with a var, which breaks query plan immutability.
  2. When used with window specs, common aggregation functions like COUNT, SUM, AVG etc are not translated into Hive aggregation functions rather than Spark SQL builtin implementations.
  3. Execution code (execution.WindowFunction) can be further simplified.

Review on Reviewable

@SparkQA
Copy link

SparkQA commented Dec 15, 2014

Test build #24462 has started for PR 3703 at commit 922a8b9.

  • This patch merges cleanly.

@liancheng
Copy link
Contributor Author

Comments from the review on Reviewable.io


Note that instead of whitelisting window function test cases in HiveCompatibilitySuite, a new HiveWindowFunctionSuite was added. This is because the current Spark SQL HiveQl parser doesn't handle comments, and window function test input files come with Hive contains comment lines.


@liancheng
Copy link
Contributor Author

Comments from the review on Reviewable.io


sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala, line 30 [r1] (raw file):
This can be problematic. Ideally every aggregation function that can be used with window should have a windowSpec: Option[WindowSpec] field which defaults to None, and a withWindowSpec method that returns a new instance of the aggregation function object itself with a window spec.


sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala, line 874 [r1] (raw file):
The thread-local windowDefs map is used to store window definitions (w1, w2 and w3) in queries like this:

SELECT
    p_mfgr, p_name, p_size,
    SUM(p_size) OVER w1 AS s1,
    SUM(p_size) OVER w2 AS s2,
    SUM(p_size) OVER (w3 ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)  AS s3
FROM
    part
WINDOW
    w1 AS (DISTRIBUTE BY p_mfgr SORT BY p_size RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING),
    w2 AS w3,
    w3 AS (DISTRIBUTE BY p_mfgr SORT BY p_size RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

This map is cleaned and refilled in collectWindowDefs below, so it doesn't grow indefinitely.


sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala, line 1060 [r1] (raw file):
All builtin aggregation functions need a similar case clause to handle their windowed version. Otherwise they all fallback to Hive UDAF implementations.

COUNT is picked here because its Hive version GenericUDAFCount implements GenericUDAFResolver2 rather than AbstractGenericUDAFResolver, and is not handled by HiveFunctionRegistry.lookupFunction.


@SparkQA
Copy link

SparkQA commented Dec 15, 2014

Test build #24462 has finished for PR 3703 at commit 922a8b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class WindowSpec(windowPartition: WindowPartition, windowFrame: Option[WindowFrame])
    • case class WindowPartition(partitionBy: Seq[Expression], sortBy: Seq[SortOrder])
    • case class WindowFrame(frameType: FrameType, preceding: Int, following: Int)
    • abstract class AggregateExpression extends Expression with Serializable
    • case class WindowFunction(
    • case class WindowFunction(

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24462/
Test PASSed.

@marmbrus
Copy link
Contributor

Should we close this issue in favor of #3703 ?

@asfgit asfgit closed this in ca12608 Dec 17, 2014
@nchammas
Copy link
Contributor

Should we close this issue in favor of #3703 ?

@marmbrus This is #3703. Did you mean another PR?

@mallman
Copy link
Contributor

mallman commented Mar 19, 2015

I'm confused. Why was this PR abruptly closed? Was there another active PR for window functions?

@srowen
Copy link
Member

srowen commented Mar 19, 2015

I think that might have been a mistake and #2953 was supposed to be closed. Since we are unable to close PRs, there is (long story) a process that eventually closes PRs that have a comment like "mind closing this pr". That's why it got auto-closed then.

That said I don't otherwise know whether this was going to proceed anyway. I don't see other PRs for this JIRA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants