
Conversation

@nik9000 nik9000 commented Mar 11, 2020


This begins to clean up how `PipelineAggregator`s are executed.
Previously, we would create the `PipelineAggregator`s on the data nodes
and embed them in the aggregation tree. When it came time to execute the
pipeline aggregation we'd use the `PipelineAggregator`s that were on the
first shard's results. This is inefficient because:
1. The data node needs to make the `PipelineAggregator` only to
   serialize it and then throw it away.
2. The coordinating node needs to deserialize all of the
   `PipelineAggregator`s even though it only needs one of them.
3. You end up with many `PipelineAggregator` instances when you only
   really *need* one per pipeline.
4. `PipelineAggregator` needs to implement serialization.

This begins to undo these by building the `PipelineAggregator`s directly
on the coordinating node and using those instead of the
`PipelineAggregator`s in the aggregation tree. In a follow-up change
we'll stop serializing the `PipelineAggregator`s to node versions that
support this behavior. And, one day, we'll be able to remove
`PipelineAggregator` from the aggregation result tree entirely.

Importantly, this doesn't change how pipeline aggregations are declared
or parsed or requested. They are still part of the `AggregationBuilder`
tree because *that* makes sense.
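The shape of the change can be sketched with a toy model (all names below are illustrative, not the real Elasticsearch classes):

```java
import java.util.List;

// Toy model of the change; hypothetical names, not Elasticsearch's API.
class PipelineWiring {
    interface PipelineAggregator {}

    record ToyPipeline(String name) implements PipelineAggregator {}

    record ToyPipelineBuilder(String name) {
        PipelineAggregator build() {
            return new ToyPipeline(name);
        }
    }

    // Before: each of N shard responses carried its own serialized copies of
    // every pipeline, so the coordinator deserialized shards * pipelines
    // instances and then used only the first shard's.
    static int instancesDeserializedBefore(int shards, int pipelines) {
        return shards * pipelines;
    }

    // After: the coordinating node builds each pipeline exactly once from
    // the request's builder tree, independent of how many shards responded.
    static List<PipelineAggregator> buildOnCoordinator(List<ToyPipelineBuilder> builders) {
        return builders.stream().map(ToyPipelineBuilder::build).toList();
    }
}
```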
@elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)


nik9000 commented Mar 11, 2020

Wow! The docs tests failed. Interesting that we didn't have any other thing covering this case. I'll dig.


nik9000 commented Mar 11, 2020

> Wow! The docs tests failed. Interesting that we didn't have any other thing covering this case. I'll dig.

That was surprisingly tricky to track down. And it leads to another thing that I'll need to do - shift validation from being based on Aggregators to based on AggregatorBuilders.


nik9000 commented Mar 11, 2020

> That was surprisingly tricky to track down. And it leads to another thing that I'll need to do - shift validation from being based on Aggregators to based on AggregatorBuilders.

In a follow up, that is. Preserving the old "build the pipelines in the tree too" behavior will allow me to work around this.

@polyfractal (Contributor)

> And it leads to another thing that I'll need to do - shift validation from being based on Aggregators to based on AggregatorBuilders.

Haven't had a chance to look at the PR yet (hopefully soon!) but yeah, I think this makes sense for a future PR. Aggs can't get away with this because they need to resolve fields, but pipelines only care about an agg being at the right location (and sometimes the right kind of agg) so the Builder tree should be sufficient 👍
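The suggestion here (validate pipelines against the builder tree rather than built `Aggregator`s) boils down to a path walk over builders. A minimal sketch, assuming a `buckets_path`-style `"a>b"` syntax; all names are made up for illustration:

```java
import java.util.List;

// Hedged sketch: checking that a pipeline's buckets_path resolves to a
// builder in the request's builder tree. Illustrative names only; not
// Elasticsearch's actual validation code.
class BuilderTreeValidation {
    record Builder(String name, List<Builder> subs) {}

    // true if a path like "histo>the_max" resolves, level by level,
    // to a builder in the tree
    static boolean resolves(List<Builder> roots, String path) {
        List<Builder> level = roots;
        for (String part : path.split(">")) {
            Builder found = level.stream()
                .filter(b -> b.name().equals(part))
                .findFirst()
                .orElse(null);
            if (found == null) {
                return false; // no builder at this level with that name
            }
            level = found.subs();
        }
        return true;
    }
}
```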

/**
* Returns a builder for {@link InternalAggregation.ReduceContext}. This
* builder retains a reference to the provided {@link SearchRequest}.
*/
nik9000 (Member Author):

@jimczi, I remember us trying not to hold on to references to the SearchRequest because it could be big. Or something like that. Is that still a thing? It looks like we keep the SearchRequest around for a while during the search right now.

jimczi (Contributor):

That's totally ok since you refer to the original request which is unique per search. You also build the pipeline tree lazily which seems like a nice win to me.

@jimczi (Contributor) left a comment

This is amazing :). I left minor comments but I think this will tremendously help the usage and extensibility of pipeline aggregators. They should be used and known by coordinating nodes only, and this PR is a giant step in that direction.

InternalAggregation.ReduceContextBuilder aggReduceContextBuilder = new InternalAggregation.ReduceContextBuilder() {
@Override
public ReduceContext forPartialReduction() {
throw new UnsupportedOperationException("Scroll requests don't have aggs");

❤️


new InternalAggregation.ReduceContext(reduceContext.bigArrays(), reduceContext.scriptService(), true));
// TODO it looks like this passes the "final" reduce context more than once.
// Once here and once in the for above. That is bound to cause trouble.
currentTree = InternalAggregations.reduce(Arrays.asList(currentTree, liveAggs), finalReduceContext);

Good catch, let's open an issue since it seems easy to fix rather than a TODO ?


++


👍


👍 indeed. Rollup doesn't work with pipelines anyhow (mostly due to the serialization issue: with different aggs being sent to rollup vs live indices, it messes up how pipelines operate)... but I could see multiple final reductions potentially hurting accuracy on certain aggs that care, like terms.
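The concern about repeated final reductions can be made concrete with a toy terms-style aggregation (purely illustrative; not the real reduce code). Partial reductions keep every bucket, while the final reduction prunes to the top `size`; pruning twice can undercount or drop buckets that a single final reduction would have kept:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy terms-style reduce, to show why applying the "final" pruning step
// twice is dangerous. Illustrative only.
class DoubleFinalReduce {
    // partial-style reduce: merge bucket counts from several shard maps,
    // keeping every bucket
    static Map<String, Integer> merge(List<Map<String, Integer>> shards) {
        Map<String, Integer> out = new HashMap<>();
        for (Map<String, Integer> shard : shards) {
            shard.forEach((term, count) -> out.merge(term, count, Integer::sum));
        }
        return out;
    }

    // final-style reduce: prune to the top `size` buckets by count
    static Map<String, Integer> finalReduce(Map<String, Integer> merged, int size) {
        return merged.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(size)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

With shard buckets `{a:5, b:4, c:3}` and `{c:5, a:1}` and `size = 2`, a single final reduce yields `{c:8, a:6}`; pruning the first shard early instead yields `{a:6, c:5}`, undercounting `c` because part of its count was discarded before the second reduce.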

@polyfractal (Contributor) left a comment

I like this a lot :) Definitely a good step towards untangling pipelines!

return EMPTY;
}
List<PipelineAggregationBuilder> orderedpipelineAggregators = null;
if (skipResolveOrder) {

While we're here, is it possible to nuke skipResolveOrder too? I believe it's only used by BasePipelineAggregationTestCase, and that doesn't even invoke build() so this is basically "dead" testing code. I think.

Not a problem to leave if there's a complication... I just particularly dislike this little tidbit and wouldn't mind seeing it go if we're already touching this :)


@nik9000 nik9000 merged commit 4d81edb into elastic:master Mar 16, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 16, 2020
nik9000 added a commit that referenced this pull request Mar 16, 2020
…53629)
