[SQL] Improve SparkSQL Aggregates #683
Conversation
marmbrus commented on May 7, 2014:
- Add native min/max (these previously fell back to Hive).
- Handle nulls correctly in Avg and Sum.
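The null-handling fix follows standard SQL semantics: SUM and AVG skip NULL inputs rather than treating them as zero, and an all-NULL input yields NULL. A minimal, self-contained Scala sketch of that behavior (illustrative names, not the Catalyst implementation; `Option` plays the role of SQL NULL):

```scala
// Sketch of SQL null semantics for SUM/AVG, as plain Scala.
// None stands in for SQL NULL; names here are illustrative only.
object NullSafeAgg {
  // SUM ignores NULL inputs; if every input is NULL, the result is NULL.
  def sqlSum(values: Seq[Option[Int]]): Option[Long] = {
    val present = values.flatten.map(_.toLong)
    if (present.isEmpty) None else Some(present.sum)
  }

  // AVG divides by the count of non-NULL rows only.
  def sqlAvg(values: Seq[Option[Int]]): Option[Double] = {
    val present = values.flatten
    if (present.isEmpty) None
    else Some(present.map(_.toDouble).sum / present.size)
  }

  def main(args: Array[String]): Unit = {
    val col = Seq(Some(1), None, Some(3))
    println(sqlSum(col))             // Some(4): NULLs are skipped, not counted as 0
    println(sqlAvg(col))             // Some(2.0): divisor is 2, not 3
    println(sqlSum(Seq(None, None))) // None: all-NULL input yields NULL
  }
}
```

Note the subtle AVG point: treating NULL as 0 would give 4/3 here instead of 4/2, which is the kind of bug this PR guards against.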
rxin (inline comment on aggregates.scala): This is unrelated to this PR, but I just realized that the way we store the aggregation buffer in Spark SQL uses much more memory than needed, because each buffer carries two extra pointers to expr/base, which are identical for every tuple.
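The overhead being described can be sketched as follows; every name here is an illustrative stand-in, not an actual Catalyst class. Storing `expr`/`base` inside each per-group buffer repeats the same two references once per group, while hoisting them into a single shared context leaves only the mutable per-group state in each buffer:

```scala
// Illustrative sketch of the buffer-layout issue (hypothetical classes,
// not the real Catalyst ones).
final class Expr
final class AggExpr

// Layout as described: every per-group buffer carries its own expr/base
// references, even though they are the same objects for every group.
final case class FatMinBuffer(expr: Expr, base: AggExpr, var currentMin: Int)

// Leaner alternative: expr/base are stored once, outside the buffers;
// each buffer holds only the per-group running state.
final case class SharedContext(expr: Expr, base: AggExpr)
final case class LeanMinBuffer(var currentMin: Int)

object BufferDemo {
  def main(args: Array[String]): Unit = {
    val (expr, base) = (new Expr, new AggExpr)
    val fat = Array.fill(1000)(FatMinBuffer(expr, base, Int.MaxValue))
    // All 1000 buffers point at the very same two objects, so those
    // references are pure per-group overhead.
    assert(fat.forall(b => (b.expr eq expr) && (b.base eq base)))
    val ctx  = SharedContext(expr, base)      // stored exactly once
    val lean = Array.fill(1000)(LeanMinBuffer(Int.MaxValue))
    println("fat buffer fields per group: 3; lean buffer fields per group: 1")
  }
}
```

At a million groups, two saved references per group adds up, which is presumably why this caught rxin's eye even though it was out of scope for the PR.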
Good point, though this is not an issue in the code gen version.
On May 7, 2014 2:28 PM, "Reynold Xin" [email protected] wrote:
In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala (@@ -86,6 +86,67 @@):

```scala
abstract class AggregateFunction {
  override def newInstance() = makeCopy(productIterator.map { case a: AnyRef => a }.toArray)
}

case class Min(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression] {
  override def references = child.references
  override def nullable = child.nullable
  override def dataType = child.dataType
  override def toString = s"MIN($child)"

  override def asPartial: SplitEvaluation = {
    val partialMin = Alias(Min(child), "PartialMin")()
    SplitEvaluation(Min(partialMin.toAttribute), partialMin :: Nil)
  }

  override def newInstance() = new MinFunction(child, this)
}

case class MinFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction {
```
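The quoted diff cuts off before the body of `MinFunction`. As a hedged, standalone sketch of what a null-safe running minimum has to do (a plain-Scala stand-in with illustrative names, not the actual Catalyst code): NULL inputs must never change the state, and the result is NULL only if no non-NULL value was ever seen.

```scala
// Illustrative stand-in for a null-safe running-minimum aggregate
// function; None plays the role of SQL NULL.
final class RunningMin {
  private var min: Option[Int] = None        // NULL until a real value arrives

  def update(input: Option[Int]): Unit = input.foreach { v =>
    // NULL inputs are skipped entirely; a non-NULL input replaces the
    // state when it is the first value seen or is strictly smaller.
    if (min.forall(v < _)) min = Some(v)
  }

  def eval: Option[Int] = min                // None iff every input was NULL
}

object RunningMinDemo {
  def main(args: Array[String]): Unit = {
    val m = new RunningMin
    Seq(Some(3), None, Some(1), Some(7)).foreach(m.update)
    println(m.eval)                          // Some(1): the None was ignored
  }
}
```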
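The `asPartial`/`SplitEvaluation` machinery in the diff exploits the fact that MIN is algebraic: it can be computed as a partial MIN per partition followed by a final MIN over the partial results. A minimal Scala sketch of that two-phase split (illustrative names, not the Catalyst planner):

```scala
// Two-phase MIN, mimicking the PartialMin / final Min split in the diff.
// Names are illustrative stand-ins for the distributed evaluation.
object PartialMinDemo {
  // "Map side": one partial MIN per non-empty partition.
  def partialMins(partitions: Seq[Seq[Int]]): Seq[Int] =
    partitions.filter(_.nonEmpty).map(_.min)

  // "Reduce side": MIN over the partial results (assumes at least one
  // non-empty partition; a real implementation must handle the NULL case).
  def finalMin(partials: Seq[Int]): Int = partials.min

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(5, 2), Seq(9), Seq(4, 7))
    // Same answer as computing min over all rows at once.
    println(finalMin(partialMins(parts)))  // 2
  }
}
```

Doing the partial pass before the shuffle is what makes the aggregate cheap: only one value per partition crosses the network instead of every row.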
LGTM
Merged build finished. All automated tests passed.
@pwendell, this should probably go in 1.0.
Merged.
Merge commit message:

[SQL] Improve SparkSQL Aggregates

* Add native min/max (was using Hive before).
* Handle nulls correctly in Avg and Sum.

Author: Michael Armbrust <[email protected]>

Closes #683 from marmbrus/aggFixes and squashes the following commits:

64fe30b [Michael Armbrust] Improve SparkSQL Aggregates

(cherry picked from commit 19c8fb0)
Signed-off-by: Reynold Xin <[email protected]>