-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Improve vwh's distant bucket handling #59094
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve vwh's distant bucket handling #59094
Conversation
This modifies the `variable_width_histogram`'s distant bucket handling to: 1. Properly handle integer overflows 2. Recalculate the average distance when new buckets are added on the ends. This should slow down the rate at which we build extra buckets as we build more of them.
|
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
|
@jamesdorfman, you may also want to have a look at this one. |
|
I've labeled this |
jamesdorfman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool PR! I love the idea of updating the average distance dynamically. This is really elegant :)
| } | ||
|
|
||
| private void updateAvgBucketDistance() { | ||
| // Centroids are sorted so the average distance is the difference between the first and last. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really interesting observation! It took me a bit of thinking to convince myself that this is equivalent to the previous equation, but I agree that this makes sense.
The next question I had was whether this is the right metric to use after all, since this means that if a bucket is placed between the first and last bucket, its actual location doesn't affect the avgBucketDistance.
But upon further consideration I think it definitely makes sense, since this is still just measuring the distance between buckets.
| } | ||
|
|
||
| private void updateAvgBucketDistanceIfModified(int modifiedBucketOrd) { | ||
| if (modifiedBucketOrd == 0 || modifiedBucketOrd == numClusters - 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct for when a centroid is modified. But numClusters is the denominator of the avgBucketDistance equation. So when a new bucket is added I think the distance should be updated regardless of the bucket's position. In that case maybe you can just call updateAvgBucketDistance() directly?
This makes sense to me intuitively. If a bucket is added within the existing range of buckets, this should decrease the average bucket distance, since there are more buckets in the same range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
|
Was it a bug that |
|
I wouldn't really call it a bug. Initially it was simpler to only calculate But it definitely makes sense to do it now, since this PR really simplifies updating |
|
I think the |
|
Thanks for reviewing @jpountz and @jamesdorfman! |
This modifies the `variable_width_histogram`'s distant bucket handling to: 1. Properly handle integer overflows 2. Recalculate the average distance when new buckets are added on the ends. This should slow down the rate at which we build extra buckets as we build more of them.
This modifies the `variable_width_histogram`'s distant bucket handling to: 1. Properly handle integer overflows 2. Recalculate the average distance when new buckets are added on the ends. This should slow down the rate at which we build extra buckets as we build more of them. Co-authored-by: Elastic Machine <[email protected]>
This modifies the
variable_width_histogram's distant bucket handlingto:
ends. This should slow down the rate at which we build extra buckets
as we build more of them.