Stop using round-tripped PipelineAggregators #53423
Conversation
This begins to clean up how `PipelineAggregator`s are executed. Previously, we would create the `PipelineAggregator`s on the data nodes and embed them in the aggregation tree. When it came time to execute the pipeline aggregation we'd use the `PipelineAggregator`s that were on the first shard's results. This is inefficient because:

1. The data node needs to make the `PipelineAggregator` only to serialize it and then throw it away.
2. The coordinating node needs to deserialize all of the `PipelineAggregator`s even though it only needs one of them.
3. You end up with many `PipelineAggregator` instances when you only really *need* one per pipeline.
4. `PipelineAggregator` needs to implement serialization.

This begins to undo that by building the `PipelineAggregator`s directly on the coordinating node and using those instead of the `PipelineAggregator`s in the aggregation tree. In a follow-up change we'll stop serializing the `PipelineAggregator`s to node versions that support this behavior. And, one day, we'll be able to remove `PipelineAggregator` from the aggregation result tree entirely.

Importantly, this doesn't change how pipeline aggregations are declared, parsed, or requested. They are still part of the `AggregationBuilder` tree because *that* makes sense.
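To make the intent concrete, here is a minimal, hedged sketch of the coordinator-side shape this change is aiming for. The class and method names below are illustrative stand-ins, not the actual Elasticsearch API: the only point is that pipeline aggregators are built once from the request's builder tree and applied during the final reduction, instead of being deserialized out of each shard's results.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch only; all names here are hypothetical stand-ins for the real classes.
final class CoordinatorPipelineSketch {

    /** Stand-in for a pipeline aggregator that rewrites the reduced aggregation tree. */
    interface PipelineAggregator {
        Object reduce(Object reducedAggs);
    }

    /**
     * The coordinating node builds the pipelines once, from the original request's
     * AggregationBuilder tree, and runs them over the already-reduced shard results.
     * Nothing here had to come over the wire from the data nodes.
     */
    static Object finalReduce(Object reducedShardAggs, Supplier<List<PipelineAggregator>> pipelinesFromRequest) {
        Object current = reducedShardAggs;
        for (PipelineAggregator pipeline : pipelinesFromRequest.get()) {
            current = pipeline.reduce(current);
        }
        return current;
    }
}
```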
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
Wow! The docs tests failed. Interesting that we didn't have anything else covering this case. I'll dig.
That was surprisingly tricky to track down. And it leads to another thing that I'll need to do - shift validation from being based on …
In a follow-up. Preserving the old "build the pipelines in the tree too" behavior will allow me to work around this.
Haven't had a chance to look at the PR yet (hopefully soon!) but yeah, I think this makes sense for a future PR. Aggs can't get away with this because they need to resolve fields, but pipelines only care about an agg being at the right location (and sometimes the right kind of agg), so the Builder tree should be sufficient 👍
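As a rough illustration of why the builder tree alone can be enough for pipeline validation: the check only needs names and positions in the tree, never a concrete `PipelineAggregator` instance. This is a hypothetical sketch, not the actual `PipelineAggregationBuilder` validation contract.

```java
import java.util.Set;

// Hypothetical sketch: validate a sibling pipeline's buckets_path using only builder names.
final class BuilderTreeValidationSketch {

    static void validateSiblingPath(Set<String> siblingAggNames, String bucketsPath) {
        // Only the first path element has to name a sibling aggregation; no
        // PipelineAggregator instance is needed for this check.
        String first = bucketsPath.split(">")[0];
        if (siblingAggNames.contains(first) == false) {
            throw new IllegalArgumentException(
                "buckets_path [" + bucketsPath + "] must reference a sibling aggregation");
        }
    }
}
```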
```java
/**
 * Returns a builder for {@link InternalAggregation.ReduceContext}. This
 * builder retains a reference to the provided {@link SearchRequest}.
 */
```
@jimczi, I remember us trying not to hold on to references to the SearchRequest because it could be big. Or something like that. Is that still a thing? It looks like we keep the SearchRequest around for a while during the search right now.
That's totally ok since you refer to the original request which is unique per search. You also build the pipeline tree lazily which seems like a nice win to me.
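For readers wondering what "build the pipeline tree lazily" can look like in plain Java, here is a generic memoizing supplier; it is not the helper used in this PR, just the general build-once-on-first-use pattern the comment is referring to.

```java
import java.util.function.Supplier;

// Generic lazy/memoized supplier; illustrative, not the PR's actual helper.
final class Lazy<T> implements Supplier<T> {
    private final Supplier<T> builder;
    private volatile T value;

    Lazy(Supplier<T> builder) {
        this.builder = builder;
    }

    @Override
    public T get() {
        T result = value;
        if (result == null) {
            synchronized (this) {
                result = value;
                if (result == null) {
                    // Built at most once, and only if something actually asks for it,
                    // e.g. the final reduction needing the pipeline tree.
                    value = result = builder.get();
                }
            }
        }
        return result;
    }
}
```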
jimczi left a comment
This is amazing :). I left minor comments but I think this will tremendously help the usage and extensibility of pipeline aggregators. They should be used and known by coordinating nodes only, and this PR is a giant step in that direction.
```java
InternalAggregation.ReduceContextBuilder aggReduceContextBuilder = new InternalAggregation.ReduceContextBuilder() {
    @Override
    public ReduceContext forPartialReduction() {
        throw new UnsupportedOperationException("Scroll requests don't have aggs");
```
❤️
```java
    new InternalAggregation.ReduceContext(reduceContext.bigArrays(), reduceContext.scriptService(), true));
// TODO it looks like this passes the "final" reduce context more than once.
// Once here and once in the for above. That is bound to cause trouble.
currentTree = InternalAggregations.reduce(Arrays.asList(currentTree, liveAggs), finalReduceContext);
```
Good catch, let's open an issue since it seems easy to fix rather than a TODO?
++
👍
👍 indeed. Rollup doesn't work with pipelines anyhow (mostly due to the serialization issue: with different aggs being sent to rollup vs live indices, it messes up how pipelines operate)... but I could see multiple final reductions potentially hurting accuracy on certain aggs that care, like terms.
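To see why running the "final" reduction more than once can hurt, here is a small, self-contained sketch; it is deliberately simplified and not the real terms reduction. The final pass applies a size cutoff, so applying it to an intermediate merge and then again to the combined result can discard buckets that would have survived a single final pass over all inputs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model: buckets are just counts; the "final" pass keeps only the top `size`.
final class ReduceOnceSketch {

    static List<Integer> reduce(List<List<Integer>> partialResults, boolean finalReduce, int size) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> buckets : partialResults) {
            merged.addAll(buckets);
        }
        merged.sort(Comparator.reverseOrder());
        if (finalReduce && merged.size() > size) {
            // The cutoff belongs to the final reduction; doing it on an intermediate
            // merge throws buckets away before all inputs have been seen.
            merged = new ArrayList<>(merged.subList(0, size));
        }
        return merged;
    }
}
```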
polyfractal left a comment
I like this a lot :) Definitely a good step towards untangling pipelines!
```java
    return EMPTY;
}
List<PipelineAggregationBuilder> orderedpipelineAggregators = null;
if (skipResolveOrder) {
```
While we're here, is it possible to nuke `skipResolveOrder` too? I believe it's only used by `BasePipelineAggregationTestCase`, and that doesn't even invoke `build()`, so this is basically "dead" testing code. I think.
Not a problem to leave it if there's a complication... I just particularly dislike this little tidbit and wouldn't mind seeing it go if we're already touching this :)