Member

@nik9000 nik9000 commented Jul 6, 2020

This modifies the `variable_width_histogram`'s distant bucket handling
to:

1. Properly handle integer overflows
2. Recalculate the average distance when new buckets are added on the
   ends. This should slow down the rate at which we build extra buckets
   as we build more of them.
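
To illustrate the first point, here is a minimal sketch of one way an `int` accumulator can go wrong when summing distances between centroids. The class and method names are hypothetical and are not taken from the aggregator; the sketch only assumes that centroids are `double` values kept in sorted order.

```java
// Hypothetical sketch, not the aggregator's actual code.
final class DistanceSumSketch {
    // Accumulates the gaps between sorted centroids in an int.
    // The compound assignment hides a narrowing cast: sum = (int) (sum + gap),
    // so large totals saturate at Integer.MAX_VALUE instead of being kept.
    static int sumGapsInt(double[] sortedCentroids) {
        int sum = 0;
        for (int i = 1; i < sortedCentroids.length; i++) {
            sum += sortedCentroids[i] - sortedCentroids[i - 1];
        }
        return sum;
    }

    // Same loop with a double accumulator, which keeps the full magnitude.
    static double sumGapsDouble(double[] sortedCentroids) {
        double sum = 0;
        for (int i = 1; i < sortedCentroids.length; i++) {
            sum += sortedCentroids[i] - sortedCentroids[i - 1];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] centroids = {0, 3e9, 6e9};
        System.out.println(sumGapsInt(centroids));    // stuck at Integer.MAX_VALUE
        System.out.println(sumGapsDouble(centroids)); // the true total of the gaps
    }
}
```

With widely spaced centroids the `int` version silently reports a total that is far too small, which would in turn make any average derived from it too small.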

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics label Jul 6, 2020
Member Author

nik9000 commented Jul 6, 2020

@jamesdorfman, you may also want to have a look at this one.

Member Author

nik9000 commented Jul 6, 2020

I've labeled this >non-issue because it is a bug in an unreleased feature.

Contributor

@jamesdorfman jamesdorfman left a comment

Cool PR! I love the idea of updating the average distance dynamically. This is really elegant :)

    }

    private void updateAvgBucketDistance() {
        // Centroids are sorted so the average distance is the difference between the first and last.
Contributor

This is a really interesting observation! It took me a bit of thinking to convince myself that this is equivalent to the previous equation, but I agree that this makes sense.

The next question I had was whether this is the right metric to use after all, since this means that if a bucket is placed between the first and last bucket, its actual location doesn't affect the avgBucketDistance.

But upon further consideration I think it definitely makes sense, since this is still just measuring the distance between buckets.
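
The equivalence discussed in this comment can be checked with a small sketch. It assumes the average is taken over the `n - 1` gaps between `n` sorted centroids; the class and method names are hypothetical and only illustrate the arithmetic, not the PR's actual code. Because consecutive gaps telescope, their sum is always `last - first`, so interior centroids cancel out.

```java
// Hypothetical sketch of the two ways to compute the average gap.
final class AvgBucketDistanceSketch {
    // Loop form: sum every gap between neighbouring centroids, then divide.
    static double loopAverage(double[] sortedCentroids) {
        double sum = 0;
        for (int i = 1; i < sortedCentroids.length; i++) {
            sum += sortedCentroids[i] - sortedCentroids[i - 1];
        }
        return sum / (sortedCentroids.length - 1);
    }

    // Closed form: the gaps telescope, so only the endpoints matter.
    static double endpointAverage(double[] sortedCentroids) {
        int n = sortedCentroids.length;
        return (sortedCentroids[n - 1] - sortedCentroids[0]) / (n - 1);
    }

    public static void main(String[] args) {
        double[] spread = {1, 2, 4, 10};
        double[] bunched = {1, 9, 9.5, 10}; // same endpoints, different interior
        System.out.println(loopAverage(spread) + " " + endpointAverage(spread));
        System.out.println(loopAverage(bunched) + " " + endpointAverage(bunched));
    }
}
```

Both arrays share the same first and last centroid and the same count, so all four calls agree, which is the point raised above: where an interior bucket sits does not change the average.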

    }

    private void updateAvgBucketDistanceIfModified(int modifiedBucketOrd) {
        if (modifiedBucketOrd == 0 || modifiedBucketOrd == numClusters - 1) {
Contributor

@jamesdorfman jamesdorfman Jul 7, 2020

This looks correct for when a centroid is modified. But numClusters is the denominator of the avgBucketDistance equation. So when a new bucket is added I think the distance should be updated regardless of the bucket's position. In that case maybe you can just call updateAvgBucketDistance() directly?

This makes sense to me intuitively. If a bucket is added within the existing range of buckets, this should decrease the average bucket distance, since there are more buckets in the same range.
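
A minimal sketch of this suggestion, assuming centroids are kept sorted and the average is recomputed on every insertion because the bucket count has changed. `CentroidTrackerSketch` and its methods are hypothetical names for illustration and do not reflect the aggregator's real data structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: recompute the average gap whenever a bucket is added.
final class CentroidTrackerSketch {
    private final List<Double> centroids = new ArrayList<>();
    private double avgBucketDistance;

    // Inserts the centroid at its sorted position, then always recomputes,
    // since the number of buckets changed even if the endpoints did not.
    void addBucket(double centroid) {
        int pos = Collections.binarySearch(centroids, centroid);
        if (pos < 0) {
            pos = -pos - 1; // convert "not found" result to insertion point
        }
        centroids.add(pos, centroid);
        updateAvgBucketDistance();
    }

    private void updateAvgBucketDistance() {
        int n = centroids.size();
        // Assumes the average is over the n - 1 gaps; undefined for fewer than 2 buckets.
        avgBucketDistance = n < 2 ? 0 : (centroids.get(n - 1) - centroids.get(0)) / (n - 1);
    }

    double getAvgBucketDistance() {
        return avgBucketDistance;
    }
}
```

Adding buckets at 0 and 10 gives an average of 10; adding a third at 5 leaves the endpoints untouched but halves the average to 5, matching the intuition that more buckets in the same range means a smaller average distance.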

Member Author

++

Contributor

jpountz commented Jul 7, 2020

Was it a bug that avgBucketDistance would never get updated before?

@jamesdorfman
Contributor

I wouldn't really call it a bug. Initially it was simpler to only calculate avgBucketDistance after bucketing the first initialBuffer documents. As long as initialBuffer was somewhat large, this would give us a reasonable sample of the data spread.

But it definitely makes sense to do it now, since this PR really simplifies updating avgBucketDistance (the older formula required a loop over the buckets).

Member Author

nik9000 commented Jul 7, 2020

I think the int for the sum was a bug. Updating the average is probably good, but not really a bug.

@nik9000 nik9000 merged commit 28ca127 into elastic:master Jul 8, 2020
Member Author

nik9000 commented Jul 8, 2020

Thanks for reviewing @jpountz and @jamesdorfman!

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jul 8, 2020
nik9000 added a commit that referenced this pull request Jul 9, 2020

Co-authored-by: Elastic Machine <[email protected]>
Labels

:Analytics/Aggregations, >non-issue, Team:Analytics, v7.9.0, v8.0.0-alpha1

5 participants