Save memory when auto_date_histogram is not on top #57304
Conversation
This rebuilds `auto_date_histogram`'s aggregator to function without `asMultiBucketAggregator`, which should save a significant amount of memory when it is not the top-most aggregator.

It isn't possible to "just port the aggregator" without taking a pretty significant performance hit because we used to rewrite all of the buckets every time we switched to a coarser and coarser rounding configuration. Without some major surgery to how we delay sub-aggs we'd end up rewriting the delay list zillions of times if there are many buckets.

This change replaces the constant rewrites with a "budget" of "wasted" buckets and only rewrites all of the buckets when we exceed that budget. Now that we don't rebucket every time we increase the rounding we can no longer get an accurate count of the number of buckets! So instead the aggregator uses an estimate of the number of buckets to trigger switching to a coarser rounding. This estimate is likely to be *terrible* when buckets are far apart compared to the rounding. So it also uses the difference between the first and last bucket to trigger switching to a coarser rounding, which covers for the shortcomings of the bucket estimation technique pretty well. It also causes the aggregator to emit fewer buckets in cases where they'd be reduced together on the coordinating node. This is wonderful! But probably fairly rare.

After all that, it amounts to about the same performance in the benchmarks that I've run. But the memory savings are totally still a thing!

Relates to elastic#56487
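To make the two promotion heuristics concrete, here is a minimal, self-contained sketch modeled on the snippet reviewed later in this thread. `RoundingInfo` here is a stand-in for the real class, and `roughEstimateDurationMillis` is an assumed name rather than something quoted from the PR:

```java
class RoundingPromotionSketch {
    static class RoundingInfo {
        final long roughEstimateDurationMillis;       // typical bucket width at this rounding
        final int maximumInnerInterval;               // how many finer buckets one bucket may cover
        final long maximumRoughEstimateDurationMillis;

        RoundingInfo(long rough, int inner, long maxRough) {
            this.roughEstimateDurationMillis = rough;
            this.maximumInnerInterval = inner;
            this.maximumRoughEstimateDurationMillis = maxRough;
        }
    }

    final RoundingInfo[] roundingInfos;
    final int targetBuckets;

    RoundingPromotionSketch(RoundingInfo[] roundingInfos, int targetBuckets) {
        this.roundingInfos = roundingInfos;
        this.targetBuckets = targetBuckets;
    }

    /** Pick a coarser rounding; assumes a coarser one exists (the caller checks this). */
    int pickRounding(int oldRounding, long oldEstimatedBucketCount, long min, long max) {
        int newRounding = oldRounding;
        long newEstimatedBucketCount = oldEstimatedBucketCount;
        do {
            newRounding++;
            // Scale the (under)estimated bucket count by how much coarser the next rounding is.
            double ratio = (double) roundingInfos[newRounding - 1].roughEstimateDurationMillis
                / roundingInfos[newRounding].roughEstimateDurationMillis;
            newEstimatedBucketCount = (long) Math.ceil(newEstimatedBucketCount * ratio);
        } while (newRounding < roundingInfos.length - 1 && (
            // Heuristic 1: the estimated number of live buckets is still too high.
            newEstimatedBucketCount > targetBuckets * roundingInfos[newRounding].maximumInnerInterval
            // Heuristic 2: the min-to-max key span is wider than this rounding should cover.
            || max - min > targetBuckets * roundingInfos[newRounding].maximumRoughEstimateDurationMillis));
        return newRounding;
    }
}
```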
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

Hey Nik, do you mind sharing some of the results from the benchmarks here for added context?
talevy
left a comment
Overall looks reasonable to me. Left some comments.
* <p>
* Initially it uses the most fine grained rounding configuration possible but
* as more data arrives it uses two heuristics to shift to coarser and coarser
* rounding. The first heuristic is the number of buckets, specifically, it
remove the ending "it"?
* <strong>that</strong> as the initial value for the bucket count that it
* increments in order to trigger another promotion to another coarser
* rounding. This works fairly well at containing the number of buckets, but
* it the estimate of the number of buckets will be wrong if the buckets are
remove the leading "it"?
* <p>
* The second heuristic it uses to trigger promotion to a coarser rounding is
* the distance between the min and max bucket. When that distance is greater
* than what the current rounding supports it promotes. This is heuristic
drop the "is"?
private IntArray liveBucketCountUnderestimate;
/**
 * An over estimate of the number of wasted buckets. When this gets
 * too high we {@link #rebucket()} which sets it to 0.
should this be AutoDateHistogramAggregator#rebucket? Intellij no-likey
newEstimatedBucketCount = (int) Math.ceil(oldEstimatedBucketCount * ratio);
} while (newRounding < roundingInfos.length - 1 && (
    newEstimatedBucketCount > targetBuckets * roundingInfos[newRounding].getMaximumInnerInterval()
    || max - min > targetBuckets * roundingInfos[newRounding].getMaximumRoughEstimateDurationMillis()));
should there be unit tests for when:
(max - min > targetBuckets * roundingInfos[newRounding].getMaximumRoughEstimateDurationMillis()) == false
&& (newEstimatedBucketCount > targetBuckets * roundingInfos[newRounding].getMaximumInnerInterval()))
?
We do cover this case in the test case, but there isn't a unit test for this exactly. Do you think it is worth pulling this out somehow?
Some of that seems a bit fuzzy, but I think the performance hit is real. When the agg is below a fairly low cardinality agg, these amount to pretty much the same performance, though I have a suspicion that if I had something with a higher cardinality above the agg you'd see more benefit. The thing is, our high cardinality aggs are almost all date-based.

Having written it all down and formatted it well, I'm less and less comfortable with the top level performance difference. Those tests just take so long to run! They use the NYC taxis data set, which takes about five hours to run. The non-top-level tests use the NOAA data set, which is comparatively smaller and faster to work with. The memory savings when the agg isn't at the top is super real, but I wonder if I can claw back some of that top level performance. The time_zone case is already going to be much faster because of #56384. But the non-time_zone case is totally a thing.
polyfractal
left a comment
I left some questions and a few comments :)
}
bucketOrds = newBucketOrds;
}
long min = Math.min(bounds.get(owningBucketOrd * 2), newKey);
owningBucketOrd * 2 and owningBucketOrd * 2 + 1 are used a lot here and make it a bit difficult to read... could we just set some lowerBound/upperBound (or whatever name makes sense) variables to make it easier to read?
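Something like this hypothetical tweak is what's being suggested; `minSlot` and `maxSlot` are illustrative names, not from the PR:

```java
long minSlot = owningBucketOrd * 2;      // even slot in `bounds` holds the min key
long maxSlot = owningBucketOrd * 2 + 1;  // odd slot holds the max key
bounds.set(minSlot, Math.min(bounds.get(minSlot), newKey));
bounds.set(maxSlot, Math.max(bounds.get(maxSlot), newKey));
```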
long oldSize = bounds.size();
bounds = context.bigArrays().grow(bounds, owningBucketOrd * 2 + 2);
for (long b = oldSize; b < bounds.size(); b++) {
    bounds.set(b, (b & 1L) == 0L ? Long.MAX_VALUE : Long.MIN_VALUE);
I'm not really sure I understand what this is doing?
Setting the min and max properly. It'd all go away if I make two arrays like you asked above. Which I'll do.
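A rough sketch of that "two arrays" alternative; the field names `mins` and `maxes` are hypothetical, and the point is just that separate arrays make the sentinel initialization obvious and remove the even/odd slot trick:

```java
long oldSize = mins.size();
mins = context.bigArrays().grow(mins, owningBucketOrd + 1);
maxes = context.bigArrays().grow(maxes, owningBucketOrd + 1);
for (long b = oldSize; b < mins.size(); b++) {
    mins.set(b, Long.MAX_VALUE);  // any real key is smaller, so the first hit replaces this
    maxes.set(b, Long.MIN_VALUE); // any real key is larger, so the first hit replaces this
}
```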
bucketOrds = newBucketOrds;
}
long min = Math.min(bounds.get(owningBucketOrd * 2), newKey);
bounds.set(owningBucketOrd * 2, min);
I think we can move these two bounds.set() inside the if() { ... } down below, to avoid unnecessary setting when we have to fall through and increase the rounding. Not that it's terribly expensive, but if we can avoid it, always good :)
rebucket();
// Bump the threshold for the next rebucketing
wastedBucketsOverestimate = 0;
nextRebucketAt *= 2;
What's the rationale behind doubling the rebucketing threshold? Shouldn't it stay constant since the goal of the agg is to produce a relatively constant number of buckets? As we rebucket and move to coarser intervals, the number of buckets will reduce (and wasted buckets should get cleaned up over time)?
My idea was to always produce good buckets in the end but to allow more and more "wasted" buckets as we end up cleaning up more and more. With the default as it stands when the agg is at the top level we don't clean up any of the "wasted" buckets until right before we return the aggs.
This bumping is more to defend against having large numbers of owningBucketOrds that cause us to have to rebucket many many times. To sort of blunt the O(n^2) nature of it all. We'd still do the O(n) rebucketing many times, just fewer times as you accumulate more and more buckets.
I did think about making it a constant budget. And that works pretty well too because we naturally rebucket less and less as we promote rounding because the buckets are bigger. But if we have a ton of owningBucketOrds I'm kind of worried.
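Roughly, the policy described above looks like this; `rebucketIfNeeded` is an illustrative name, while the fields mirror the snippet quoted at the top of this thread:

```java
private void rebucketIfNeeded() {
    if (wastedBucketsOverestimate <= nextRebucketAt) {
        return; // still within budget, keep collecting
    }
    rebucket();                    // the O(n) rewrite of all live buckets
    wastedBucketsOverestimate = 0; // budget spent
    nextRebucketAt *= 2;           // double the budget to blunt the O(n^2) worst case
}
```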
long count;
long min = Long.MAX_VALUE;
long max = Long.MIN_VALUE;
do {
Don't we have the min/max in bounds at this point? Could we use that to help skip to the correct rounding, so we don't have to potentially re-build the ords several times if max-min > the threshold?
We do. Let me think about that. My instinct is that this check here is largely symbolic because if the min and max would have pushed us up then we'd, well, already be on a bigger rounding anyway.
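For reference, the reviewer's idea might look something like this hypothetical helper that jumps straight to a rounding wide enough for the tracked span instead of stepping up one rounding at a time; none of these names come from the PR:

```java
int roundingForSpan(long min, long max, int currentRounding) {
    int idx = currentRounding;
    while (idx < roundingInfos.length - 1
        && max - min > targetBuckets * roundingInfos[idx].getMaximumRoughEstimateDurationMillis()) {
        idx++;
    }
    return idx;
}
```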
After I dropped … Without metrics or time zone: 25% faster
The performance when the agg is at the top level improved some as well after I dropped …
@talevy and @polyfractal, as you can see from my commit history I had to fight this one quite a bit. I managed to keep the performance in the single-bucket case, but it was hard fought. Please have a look.
run elasticsearch-ci/1
👀
polyfractal
left a comment
Left a few comments but nothing major, I think this looks good. Big changes to existing behavior are always scary to merge, but I don't see anything obviously incorrect, so LGTM from my perspective! :)
* inner loop of {@link LeafBucketCollector#collect(int, long)}. You'd
* think that it wouldn't matter, but its seriously 7%-15% performance
* difference for the aggregation. Yikes.
*/
Huh, TIL.
* but it does have the advantage of only touching the buckets that we
* want to collect.
*/
rebucket();
Hmm, we might need to guard here in the situation where the agg doesn't actually generate any buckets (none of the collected docs end up having the field, no collected docs, etc), and return an empty agg instead.
Or alternatively, put the guard in rebucket().
I think BucketsAggregator#mergeBuckets() would be fine since doc counts would also be empty... but I'm less confident about MergingBucketsDeferringCollector#mergeBuckets(). Might be ok though, only skimmed it :) We've run into this kind of issue in the past though where buildAgg broke when the agg runs and is effectively "unmapped" after running, so thought I'd mention :)
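The kind of guard being suggested might look roughly like this, hypothetical and simplified, with the exact placement left to the PR:

```java
if (bucketOrds.size() > 0) {
    rebucket(); // only pay for the final rebucket when something was actually collected
}
```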
* The number of times the aggregator had to {@link #rebucket()} the
* results. We keep this just to report to the profiler.
*/
private int rebucketCount = 0;
Exposing rebucketing info (count, rebucketAt, wasted, etc) via collectDebugInfo would be super interesting from a profiling perspective. Especially if a user comes to us with performance issues, we'd have a better idea if they are hitting some kind of antagonistic scenario or something
Not super important, just a thought :)
I already exposed it all! I made this in the first place because I wanted to see how the algorithm did with certain bits of data with the profiler.
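Roughly, that exposure goes through the standard profiler hook; a sketch along these lines, where the field names mirror the snippets above but the exact debug keys in the PR may differ:

```java
// requires: import java.util.function.BiConsumer;
@Override
public void collectDebugInfo(BiConsumer<String, Object> add) {
    super.collectDebugInfo(add);
    add.accept("surviving_buckets", bucketOrds.size());
    add.accept("wasted_buckets_overestimate", wastedBucketsOverestimate);
    add.accept("next_rebucket_at", nextRebucketAt);
    add.accept("rebucket_count", rebucketCount);
}
```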
Wow I totally overlooked those. Ignore me! :)
… (#58190)

This builds an `auto_date_histogram` aggregator that natively aggregates from many buckets and uses it when the `auto_date_histogram` used to use `asMultiBucketAggregator`, which should save a significant amount of memory in those cases. In particular, this happens when `auto_date_histogram` is a sub-aggregator of a multi-bucketing aggregator like `terms` or `histogram` or `filters`. For the most part we preserve the original implementation when `auto_date_histogram` only collects from a single bucket.

It isn't possible to "just port the aggregator" without taking a pretty significant performance hit because we used to rewrite all of the buckets every time we switched to a coarser and coarser rounding configuration. Without some major surgery to how we delay sub-aggs we'd end up rewriting the delay list zillions of times if there are many buckets.

The multi-bucket version of the aggregator has a "budget" of "wasted" buckets and only rewrites all of the buckets when we exceed that budget. Now that we don't rebucket every time we increase the rounding we can no longer get an accurate count of the number of buckets! So instead the aggregator uses an estimate of the number of buckets to trigger switching to a coarser rounding. This estimate is likely to be *terrible* when buckets are far apart compared to the rounding. So it also uses the difference between the first and last bucket to trigger switching to a coarser rounding, which covers for the shortcomings of the bucket estimation technique pretty well. It also causes the aggregator to emit fewer buckets in cases where they'd be reduced together on the coordinating node. This is wonderful! But probably fairly rare.

All of that does buy us some speed improvements when the aggregator is a child of a multi-bucket aggregator:

Without metrics or time zone: 25% faster
With metrics: 15% faster
With time zone: 22% faster

Relates to #56487
I've always been confused by the strange behavior that I saw when working on #57304. Specifically, I saw switching from a bimorphic invocation to a monomorphic invocation give us a 7%-15% performance bump. This felt *bonkers* to me. And, it also made me wonder whether it'd be worth looking into doing it everywhere.

It turns out that, no, it isn't needed everywhere. This benchmark shows that a bimorphic invocation like:

```
LongKeyedBucketOrds ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

is 19% slower than a monomorphic invocation like:

```
LongKeyedBucketOrds.ForSingle ords = new LongKeyedBucketOrds.ForSingle();
ords.add(0, 0); <------ this line
```

But *only* when the reference is mutable. In the example above, if `ords` is never changed then both perform the same. But if the `ords` reference is assigned twice then we start to see the difference:

```
immutable bimorphic    avgt   10   6.468 ± 0.045  ns/op
immutable monomorphic  avgt   10   6.756 ± 0.026  ns/op
mutable   bimorphic    avgt   10   9.741 ± 0.073  ns/op
mutable   monomorphic  avgt   10   8.190 ± 0.016  ns/op
```

So the conclusion from all this is that we've done the right thing: `auto_date_histogram` is the only aggregation in which `ords` isn't final and it is the only aggregation that forces monomorphic invocations. All other aggregations use an immutable bimorphic invocation. Which is fine.

Relates to #56487
This builds an `auto_date_histogram` aggregator that natively aggregates from many buckets and uses it when the `auto_date_histogram` used to use `asMultiBucketAggregator`, which should save a significant amount of memory in those cases. In particular, this happens when `auto_date_histogram` is a sub-aggregator of a multi-bucketing aggregator like `terms` or `histogram` or `filters`. For the most part we preserve the original implementation when `auto_date_histogram` only collects from a single bucket.
It isn't possible to "just port the aggregator" without taking a pretty
significant performance hit because we used to rewrite all of the
buckets every time we switched to a coarser and coarser rounding
configuration. Without some major surgery to how we delay sub-aggs
we'd end up rewriting the delay list zillions of times if there are many
buckets.
The multi-bucket version of the aggregator has a "budget" of "wasted"
buckets and only rewrites all of the buckets when we exceed that budget.
Now that we don't rebucket every time we increase the rounding we can no
longer get an accurate count of the number of buckets! So instead the
aggregator uses an estimate of the number of buckets to trigger switching
to a coarser rounding. This estimate is likely to be terrible when
buckets are far apart compared to the rounding. So it also uses the
difference between the first and last bucket to trigger switching to a
coarser rounding. Which covers for the shortcomings of the bucket
estimation technique pretty well. It also causes the aggregator to emit
fewer buckets in cases where they'd be reduced together on the
coordinating node. This is wonderful! But probably fairly rare.
All of that does buy us some speed improvements when the aggregator is
a child of a multi-bucket aggregator:
Without metrics or time zone: 25% faster
With metrics: 15% faster
With time zone: 22% faster
Relates to #56487
EDIT: This used to say:
This rebuilds `auto_date_histogram`'s aggregator to function without `asMultiBucketAggregator`, which should save a significant amount of memory when it is not the top-most aggregator.
It isn't possible to "just port the aggregator" without taking a pretty
significant performance hit because we used to rewrite all of the
buckets every time we switched to a coarser and coarser rounding
configuration. Without some major surgery to how we delay sub-aggs
we'd end up rewriting the delay list zillions of times if there are many
buckets.
This change replaces the constant rewrites with a "budget" of "wasted"
buckets and only rewrites all of the buckets when we exceed that budget.
Now that we don't rebucket every time we increase the rounding we can no
longer get an accurate count of the number of buckets! So instead the
aggregator uses an estimate of the number of buckets to trigger switching
to a coarser rounding. This estimate is likely to be terrible when
buckets are far apart compared to the rounding. So it also uses the
difference between the first and last bucket to trigger switching to a
coarser rounding. Which covers for the shortcomings of the bucket
estimation technique pretty well. It also causes the aggregator to emit
fewer buckets in cases where they'd be reduced together on the
coordinating node. This is wonderful! But probably fairly rare.
After all that, it amounts to about the same performance in the
benchmarks that I've run. But the memory savings are totally still a
thing!
Relates to #56487