Skip to content

Conversation

@jamesdorfman
Copy link
Contributor

@jamesdorfman jamesdorfman commented Jul 10, 2020

When a document which is distant from existing buckets gets collected, the variable_width_histogram will create a new bucket and then insert it into the ordered list of buckets.

Currently, a new merge map array is created to move this bucket. This is very expensive as there might be thousands of buckets.

This PR creates mergeBuckets(UnaryOperator<Long> mergeMap) methods in BucketsAggregator and MergingBucketsDefferingCollector, and updates the variable_width_histogram to use them. This eliminates the need to create an entire merge map array for each new bucket and reduces the memory overhead of the algorithm.

@jamesdorfman
Copy link
Contributor Author

@nik9000 I recall you left a comment on #59094 mentioning a solution like this. Let me know if you had something different in mind!

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jul 12, 2020
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is what I was thinking of!

I left a few comments, the one that is really important:

  1. Could you replace UnaryOperator<Long> with LongUnaryOperator? That'll prevent auto-boxing of the parameter.

Another comment: would you be up for writing a test for these two new methods? We don't have any tests for any of the methods around them which is a sad thing. The changes that you made look right, but I'm worried that one day we'll accidentally break things in sneaky ways if we don't have a test. I mostly write tests because I don't trust future-@nik9000 to remember things now-@nik9000 was thinking.

}
};

mergeBuckets(mergeMapOperator, newNumBuckets);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could write this:

mergeBuckets(newNumbBuckets, buck -> mergeMap[Math.toIntExact(bucket)]);

* This only tidies up doc counts. Call {@link MergingBucketsDeferringCollector#mergeBuckets(UnaryOperator)} to
* merge the actual ordinals and doc ID deltas.
*/
public final void mergeBuckets(UnaryOperator<Long> mergeMap, long newNumBuckets){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things, only one is important though:

  1. Could you replace UnaryOperator<Long> with LongUnaryOperator? That'll prevent auto-boxing of the parameter.
  2. Could you switch the order of the arguments? I think the call sites are a little prettier when the "function" argument is last, if possible.

try (IntArray oldDocCounts = docCounts) {
docCounts = bigArrays.newIntArray(newNumBuckets, true);
docCounts.fill(0, newNumBuckets, 0);
for (int i = 0; i < oldDocCounts.size(); i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you swap to long i?

}

/**
* This only tidies up doc counts. Call {@link MergingBucketsDeferringCollector#mergeBuckets(UnaryOperator)} to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth saying what the unary operator does here, that -1 means throw away and otherwise it is the destination index.

return i + 1;
}
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have tried to write this:

mergeBuckets(numClusters, bucket -> {
  if (i < index) {
    // The clusters in range {0 ... idx - 1} don't move
    return 1;
  }
  if (i == numClusters - 1) {
    // The new cluster moves to index
    return i;
  }
  // The clusters in range {index ... numClusters - 1} shift forward
  return i = 1;
});

I like the "inline function declaration" form of this because it makes it super obvious that it doesn't escape.

I also like early return instead of else if, but that is totally up to you. Its a matter of style and we don't have a standard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this would work, since this same function is used in the calls to bothBucketsAggregator::mergeBuckets and MergingBucketsDeferringCollector::mergeBuckets.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Well, what you have is just fine too.

@jamesdorfman
Copy link
Contributor Author

Thanks for the feedback @nik9000 :)

I made all the changes you requested - mainly I updated the UnaryOperator to a LongUnaryOperator and I marked the methods that take a merge map as an array as deprecated.

I also added some test for both of the mergeBuckets methods. Sorry for the delay in addressing your comments. It took me a long time to figure out how to test MergingBucketsDeferringCollector::mergeBuckets. I'm not sure if I did this in the proper way, please let me know!

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me! I'll have a closer look at the test tomorrow.

Elasticmachine, ok to test.

return i + 1;
}
}
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Well, what you have is just fine too.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment about one of the tests, but I think everything else is great! That test is just using Query in a rather strange way, and I think it'd be clearer to wrap the collector.

* Usually all documents get collected into ordinal 0 unless they are part of a sub aggregation
* @return a query that collects the i'th document into bucket ordinal i
*/
private Query getQueryToCollectIntoDifferentOrdinals() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be a little cleaner to do this by wrapping the collector that you pass to indexSearcher.search and just us MatchAllDocsQuery instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! I overrode one of the methods in MergingBucketsDeferringCollector and now it is a lot cleaner. Is this what you had in mind?

I'm not sure if wrapping the bucket collector directly would work, since the deferring collector stores the bucket ordinal before it calls the bucket collector's collect method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That isn't quite what I was thinking. I think part of the problem is that I'm sort of stuck in a "do it like aggs do" mindset. In that mindset there are two collectors and the MergingBucketsDeferringCollector. One collector emulates the outer aggregator and calls to the MergingBucketsDeferringCollector's collect method, calling merge in some funny shape. And the other bucket emulates the inner aggregation and is just called by the aggregator.

I've got half of a patch on my laptop that does this but I'm probably going to stop for the day. I'll try and post it this weekend.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking something like this. There is a collector that just counts and a collector the distributes bucket ords and merges. And the MergingBucketsDeferringCollector sits between them.

Two neat things:

  1. I think I found a bug! I don't think it actually comes up in production because you have to throw away buckets using the merge method while collecting buckets and I think we only do that in the rare_terms aggregators and they only merge after all the collections are done.
  2. My test only really covers the variable-width histogram style of merging, not the rare_terms style. But I think that is pretty ok. Its is the harder style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my delayed response, I had a very hectic start of week!

Very cool, that is definitely a lot cleaner, thanks for the patch! I've added it and filed a corresponding bug.
That's a strange bug by the way...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my delayed response, I had a very hectic start of week!

Its cool! Thanks for getting back to me!

That's a strange bug by the way...

It's sneaky! I'm hoping we really don't actually hit it. But it is the kind of thing that happens without unit tests, I think.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@nik9000
Copy link
Member

nik9000 commented Jul 23, 2020

@elasticmachine, ok to test

@nik9000
Copy link
Member

nik9000 commented Jul 23, 2020

run elasticsearch-ci/bwc

@nik9000
Copy link
Member

nik9000 commented Jul 24, 2020

@elasticmachine update branch

@nik9000
Copy link
Member

nik9000 commented Jul 24, 2020

Let's see if that gets it unwedged.

@nik9000 nik9000 merged commit 8b7c556 into elastic:master Jul 24, 2020
@nik9000
Copy link
Member

nik9000 commented Jul 24, 2020

That did it! I've merged and will backport to 7.x.

nik9000 pushed a commit to nik9000/elasticsearch that referenced this pull request Jul 24, 2020
…egation (elastic#59366)

When a document which is distant from existing buckets gets collected, the
`variable_width_histogram` will create a new bucket and then insert it
into the ordered list of buckets.

Currently, a new merge map array is created to move this bucket. This is
very expensive as there might be thousands of buckets.

This PR creates `mergeBuckets(UnaryOperator<Long> mergeMap)` methods in
`BucketsAggregator` and `MergingBucketsDefferingCollector`, and updates
the `variable_width_histogram`  to use them. This eliminates the need to
create an entire merge map array for each new bucket and reduces the
memory overhead of the algorithm.
@jamesdorfman
Copy link
Contributor Author

Great, sounds good :) Thanks for reviewing!!!

nik9000 added a commit that referenced this pull request Jul 27, 2020
…egation (#59366) (#60171)

When a document which is distant from existing buckets gets collected, the
`variable_width_histogram` will create a new bucket and then insert it
into the ordered list of buckets.

Currently, a new merge map array is created to move this bucket. This is
very expensive as there might be thousands of buckets.

This PR creates `mergeBuckets(UnaryOperator<Long> mergeMap)` methods in
`BucketsAggregator` and `MergingBucketsDefferingCollector`, and updates
the `variable_width_histogram`  to use them. This eliminates the need to
create an entire merge map array for each new bucket and reduces the
memory overhead of the algorithm.

Co-authored-by: James Dorfman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations >refactoring Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.9.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants