[SPARK-14275][SQL] Reimplement TypedAggregateExpression to DeclarativeAggregate #12067
Conversation
|
cc @yhuai |
Here we have a problem: bufferSerializer is a list of expressions, and the reduced expression will appear many times. However, it should be executed only once, and it may not be good to depend on subexpression elimination to optimize this case.
cc @marmbrus
Is this fundamentally the same problem we are hitting when trying to codegen Map using Project, or am I not understanding the issue?
I wonder if we could have a special expression that takes an object and turns it into a struct in one shot, instead of exploding it into multiple columns? Just a random thought.
Yes, it's the same problem.
Turning an object into a struct in one shot doesn't solve the problem, as we need to flatten it into a Seq[Expression] here, which will be a list of GetStructField, and the "one shot" expression still appears many times.
Actually I think we don't need to do anything here. Although the reduced expression appears many times, the occurrences are the same reference, so subexpression elimination should always work. Or we can introduce a Holder and Reference to make it clearer.
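For illustration, here is a plain-Scala analogy of the recomputation concern discussed above. This is not Catalyst code; `Buffer` and `reduce` are made-up stand-ins for the Aggregator's buffer object and reduce function.

```scala
object BufferSketch {
  // Made-up buffer type and reduce function, standing in for the typed
  // Aggregator's buffer object and reduce call.
  case class Buffer(sum: Long, count: Long)

  def reduce(buf: Buffer, value: Long): Buffer =
    Buffer(buf.sum + value, buf.count + 1)

  // Flattening the serialized object into one expression per field repeats
  // the reduce call if every buffer-serializer expression re-evaluates the
  // reduced object.
  def updateNaive(buf: Buffer, value: Long): (Long, Long) =
    (reduce(buf, value).sum, reduce(buf, value).count)

  // "One shot": evaluate the reduced object once and read all fields from
  // it, analogous to serializing the buffer into a single struct field.
  def updateOnce(buf: Buffer, value: Long): (Long, Long) = {
    val reduced = reduce(buf, value)
    (reduced.sum, reduced.count)
  }
}
```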
|
Test build #54533 has finished for PR 12067 at commit
|
|
Test build #54537 has finished for PR 12067 at commit
|
|
This is a great idea if we can get it to work. |
|
This reimplementation works, but is inefficient, as subexpression elimination is not supported in |
|
Test build #54621 has finished for PR 12067 at commit
|
|
cc @davies again - can you take a look at wenchen's question? |
Maybe IN, BUF, and OUT.
We have mostly used upper-case type parameters in Spark.
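For reference, a minimal Aggregator written with the suggested upper-case type parameters (IN = input type, BUF = intermediate buffer, OUT = result). This sketch follows the shape of the Spark 2.x `org.apache.spark.sql.expressions.Aggregator` API, which ended up adopting these names.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sums the squares of the input values; here IN = Long, BUF = Long, OUT = Long.
object SumOfSquares extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buf: Long, in: Long): Long = buf + in * in
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(buf: Long): Long = buf
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
```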
|
It would be awesome to run Spark SQL perf and see what the speedup is here after the elimination is fixed. You might even be able to do it directly from the Spark repo. |
|
@cloud-fan @marmbrus I think we could do a similar trick to MapElements in TungstenAggregate: evaluate the functions first, then replace them with the generated variables in the update/merge expressions. |
…bjectOperator

## What changes were proposed in this pull request?

This PR decouples deserializer expression resolution from `ObjectOperator`, so that we can use deserializer expressions in normal operators. This is needed by #12061 and #12067; I abstracted the logic out and put it in this PR to reduce future code changes.

## How was this patch tested?

Existing tests.

Author: Wenchen Fan <[email protected]>

Closes #12131 from cloud-fan/separate.
|
Generated code snippet in whole-stage codegen for |
|
Generated code snippet in mutable projection codegen for a UDAF with a complex buffer type: |
|
Test build #55209 has finished for PR 12067 at commit
|
|
The benchmark result of the master branch is extremely slow; not sure why... |
|
retest this please. |
|
cc @davies |
@marmbrus Yay!
|
@davies Can you review this? |
|
If we can reuse a single object and mutate it in place, would it be the same speed? |
|
The part I don't get is that even in the RDD case, we'd need to create an object per row. This is equivalent to the "deserialization" in aggregator, since they both just call the ctor of the class. Why is RDD faster? |
|
In the benchmark, for RDD, we first apply a function to turn a long into a |
|
And I think "reuse a single object" should help, as then we only need to create one object per partition. But it's like cheating, because the RDD doesn't reuse the object; it's unfair to compare Dataset with RDD when we internally reuse the same object. |
|
Well it's not cheating if the user doesn't need to explicitly reuse. |
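To make the allocation trade-off above concrete, here is a plain-Scala sketch (not Spark code): one path allocates a wrapper object per element, the other mutates a single reusable instance per iterator (i.e. per partition). `LongBox` is made up for illustration.

```scala
object ReuseSketch {
  final class LongBox(var value: Long)

  // One allocation per row, as a straightforward map would do.
  def sumPerRowAllocation(values: Iterator[Long]): Long =
    values.map(v => new LongBox(v)).foldLeft(0L)((acc, box) => acc + box.value)

  // One allocation per iterator, mutated in place for every element.
  def sumWithReusedObject(values: Iterator[Long]): Long = {
    val box = new LongBox(0L)
    values.foldLeft(0L) { (acc, v) =>
      box.value = v
      acc + box.value
    }
  }
}
```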
|
Test build #55298 has finished for PR 12067 at commit
|
|
Test build #55499 has finished for PR 12067 at commit
|
|
Test build #55506 has finished for PR 12067 at commit
|
|
Test build #55508 has finished for PR 12067 at commit
|
|
Test build #55509 has finished for PR 12067 at commit
|
|
@davies Can you review? |
             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
RDD                 216 /  237         46.3          21.6       4.2X
RDD                1935 / 2105         51.7          19.3       1.0X
DataFrame           756 /  799        132.3           7.6       2.6X
Dataset            7359 / 7506         13.6          73.6       0.3X
We did not see much performance improvement from this PR, right?
See #12067 (comment).
It's hundreds of times faster...
What was the old value of this? Was it wrong?
Ah sorry, I updated the benchmark code earlier; let me run it again on master.
Oh wait, this line is for the back-to-back map benchmark, but this PR aims to improve the aggregator case.
I updated these old benchmark cases according to rxin's comment: #12067 (comment)
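As a rough guide to what these numbers compare, here is a hedged sketch of the kind of workload involved (RDD vs DataFrame vs typed Dataset over a range of longs); the actual benchmark code in the PR may differ in the exact operations and row counts.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object AggBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._
    val N = 100L << 20

    // RDD: functional map + reduce over primitive longs.
    spark.sparkContext.range(0, N).map(_ + 1).reduce(_ + _)

    // DataFrame: untyped aggregation, fully code-generated.
    spark.range(N).select(sum(col("id") + 1)).collect()

    // Dataset: typed map + reduce, which goes through the object path
    // this PR is trying to speed up.
    spark.range(N).as[Long].map(_ + 1).reduce(_ + _)

    spark.stop()
  }
}
```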
override def toString: String = {
  s"""${aggregator.getClass.getSimpleName}(${children.mkString(",")})"""
  val input = inputDeserializer match {
    case Some(UnresolvedDeserializer(deserializer, _)) => deserializer.dataType.simpleString
Is the deserializer always resolved?
Oh sorry, I missed this one; this case should be removed.
Could you remove it when merging?
Just realized this line is needed. The input deserializer is set by TypedColumn.withInputType and is unresolved at first.
|
LGTM, will merge this once it passes the tests. |
|
Test build #2788 has finished for PR 12067 at commit
|
|
@davies thanks for your review! merging to master! |
… type

## What changes were proposed in this pull request?

After #12067, we now use expressions to do the aggregation in `TypedAggregateExpression`. To implement buffer merge, we produce a new buffer deserializer expression by replacing `AttributeReference` with the right-side buffer attribute, like other `DeclarativeAggregate`s do, and finally combine the left and right buffer deserializers with `Invoke`. However, after #12338, we add the loop variable to the class members when we codegen `MapObjects`. If the `Aggregator` buffer type is `Seq`, which is implemented by the `MapObjects` expression, we add the same loop variable to the class members twice (once from the left and once from the right buffer deserializer), which causes a `ClassFormatError`. This PR fixes the issue by calling `distinct` before declaring the class members.

## How was this patch tested?

New regression test in `DatasetAggregatorSuite`.

Author: Wenchen Fan <[email protected]>

Closes #12468 from cloud-fan/bug.
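A plain-Scala illustration of the `distinct` fix described above, not the actual codegen code: if the same mutable-state declaration is collected from both the left and the right buffer deserializer, emitting it twice produces an invalid class, so the declarations are deduplicated before being emitted. The strings here are illustrative, not the real generated members.

```scala
object DistinctDeclarations {
  val collected = Seq(
    "private scala.collection.Seq value_1;", // from the left buffer deserializer
    "private scala.collection.Seq value_1;"  // same loop variable, right side
  )

  // Only one "private scala.collection.Seq value_1;" line ends up in the
  // generated class body.
  val declarations: String = collected.distinct.mkString("\n")
}
```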
## What changes were proposed in this pull request?

`ExpressionEncoder` is just a container for serialization and deserialization expressions; we can use these expressions to build `TypedAggregateExpression` directly, so that it fits in `DeclarativeAggregate`, which is more efficient.

One trick: each buffer serializer expression references the result object of the serialization function call. To avoid re-calculating this result object, we can serialize the buffer object to a single struct field, so that we can use a special `Expression` to evaluate the result object only once.

## How was this patch tested?

Existing tests.
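To show the overall shape of the idea, here is a plain-Scala sketch (not Catalyst, and not the actual `TypedAggregateExpression` code) of how an encoder's serializer/deserializer pair lets a typed Aggregator run against a flat buffer representation, mirroring the initialValues/update/merge/evaluate lifecycle of a `DeclarativeAggregate`. All names are illustrative.

```scala
object TypedAggSketch {
  // Stand-in for an ExpressionEncoder: a pair of conversion functions
  // between the user-visible buffer type and a flat row representation.
  case class BufferCodec[BUF](
      serialize: BUF => Array[Any],
      deserialize: Array[Any] => BUF)

  class TypedAgg[IN, BUF, OUT](
      zero: BUF,
      reduce: (BUF, IN) => BUF,
      merge: (BUF, BUF) => BUF,
      finish: BUF => OUT,
      codec: BufferCodec[BUF]) {

    // Analogue of initialValues: the serialized zero buffer.
    def initial: Array[Any] = codec.serialize(zero)

    // Analogue of update: deserialize once, reduce once, then serialize the
    // single result object back into the buffer. This mirrors the "evaluate
    // the result object only once" trick instead of re-running reduce per
    // buffer field.
    def update(buffer: Array[Any], input: IN): Array[Any] = {
      val reduced = reduce(codec.deserialize(buffer), input)
      codec.serialize(reduced)
    }

    // Analogue of merge: combine the left and right deserialized buffers.
    def mergeBuffers(left: Array[Any], right: Array[Any]): Array[Any] =
      codec.serialize(merge(codec.deserialize(left), codec.deserialize(right)))

    // Analogue of evaluateExpression: finish on the deserialized buffer.
    def evaluate(buffer: Array[Any]): OUT = finish(codec.deserialize(buffer))
  }
}
```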