[ML] add a frequent items aggregation #83055
Conversation
@elasticmachine run elasticsearch-ci/part-1
Please don't force-push during the review process, as it makes it much harder to follow the actual changes. Could you merge instead? That would make it easier to review incrementally. Our development process calls this out specifically.
```java
protected void doClose() {
    // disconnect the aggregation context circuit breaker, so big arrays used in results can be passed
    if (breakerService != null) {
        breakerService.disconnect();
    }
}
```
Okay. I understand why we're not returning the bytes from the map-reduce context to the circuit breaker here, but I don't see where they are being returned. InternalMapReduceAggregation describes how we manage the memory around deserialization, but I don't see where the instances created on the data node side release the memory they're holding.
On the aggregator side the AggContext takes care of circuit breaking: the bytes get reported to the AggContext, and the AggContext returns the allocation to the breaker in AggContext::close(). That means the bytes are returned before the actual object is destroyed. Between the AggContext being closed and the internal agg being sent off, the object is allocated without proper accounting by a circuit breaker. That's the first problem described in #88128.
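For readers following along, here is a minimal, hypothetical sketch of that report-then-release lifecycle. The class, field, and method names are illustrative and not the PR's actual code; only `CircuitBreaker::addEstimateBytesAndMaybeBreak` and the negative-delta `addWithoutBreaking` release mirror the real Elasticsearch breaker API.

```java
import org.elasticsearch.common.breaker.CircuitBreaker;

// Hypothetical sketch of the report-then-release lifecycle described above;
// names are illustrative, not taken from the PR.
class TrackedMapReduceContext implements AutoCloseable {
    private final CircuitBreaker breaker;
    private long reservedBytes = 0;

    TrackedMapReduceContext(CircuitBreaker breaker) {
        this.breaker = breaker;
    }

    void reportAllocation(long bytes) {
        // reserve against the breaker; throws CircuitBreakingException if over budget
        breaker.addEstimateBytesAndMaybeBreak(bytes, "frequent_items map-reduce");
        reservedBytes += bytes;
    }

    @Override
    public void close() {
        // hand the reservation back; the buffers may still be alive at this point,
        // which is the unaccounted window described above
        breaker.addWithoutBreaking(-reservedBytes);
        reservedBytes = 0;
    }
}
```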
On the coordinating side the memory gets accounted for in QueryPhaseResultConsumer::estimateRamBytesUsedForReduce; that's the infamous 1.5 magic. That means on the reduce side we don't do any circuit-breaker accounting of our own and simply assume the result fits in that budget. QueryPhaseResultConsumer allocates the budget and releases it for us. Most probably it over-allocates; problems 2-5 listed in #88128 cover that part.
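To make the "1.5 magic" concrete, this is a hedged sketch of the idea rather than the verbatim Elasticsearch code: the reduce budget is a fixed multiple of the serialized shard-result size, not a measurement of what the reduce actually allocates.

```java
// Hedged sketch, not the actual QueryPhaseResultConsumer implementation:
// budget the reduce as a fixed multiple of the serialized shard-result size.
static long estimateReduceBudget(long serializedShardResultBytes) {
    return (long) (1.5 * serializedShardResultBytes);
}
```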
Last but not least, the big arrays aren't using recycling. I don't think I can guarantee returning the buffers properly in all the error scenarios I can think of.
Have we done any performance testing on this? Do we have a sense of the performance requirements and scale we expect it to run at? Will we be adding it to any of the nightly benchmarks?
We have more than 30 test data sets and challenges which we use(d) for testing. We compare the agg with an open-source implementation of eclat. In the future we will use this test suite as a baseline for enhancements. I will share the data offline as it's non-public; I also showed it briefly in the meeting we had.
We have further plans for this agg. For 8.4 we want to go API-first. A UI prototype runs multiple asynchronous searches before it renders the result. We are satisfied with the runtime today, although faster is of course better. As discussed in #88128, this is not only about this agg but also about other concurrent searches. This agg supports random sampling, which we will apply at least on bigger data sets. With sampling we can trade speed for some loss of precision.
We will run bigger testing as part of our QA framework. In addition, I am looking into a Rally challenge, which might result in a nightly benchmark.
@elasticmachine update branch
I changed the naming as discussed, @not-napoleon.
not-napoleon
left a comment
Thanks for making the map reduce logic less visible. With that done, I think we can merge this. I still have some concerns about the performance at scale, but in the spirit of "Progress, simple, perfection", I think we can move ahead with that and adjust as we see usage.
With regard to the circuit breaker and memory tracking, I think we're all on the same page that we have work to do, but we don't want to block this PR on that work. The aggs team is going to start work on some infrastructure to address these concerns, with the understanding that once we have it we need to revisit this aggregation and use those new tools.
droberts195
left a comment
LGTM2
Thanks for pushing this to completion @hendrikmuhs and thanks for all the time you spent reviewing @not-napoleon.
@hendrikmuhs according to this PR's labels, I need to update the changelog YAML, but I can't because the PR is closed. Please either update the changelog yourself on the appropriate branch, or adjust the labels.
Since this aggregation is experimental and primarily designed for our own UIs to use, we've decided not to call it out in the release highlights for 8.4. Instead it will just get a single bullet-point release note.
The PR adds an aggregation called `frequent_items`, a bucket aggregation which finds frequent item sets. It is a form of association rules mining that identifies items that often occur together. It also helps you discover relationships between different data points (items). For more information about usage have a look at #86037.
This implements frequent items using an algorithm called eclat.
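Since the release note only names the algorithm, a minimal, illustrative sketch of the eclat idea may help. This is plain Java and deliberately not the PR's implementation, which adds transaction deduplication, map-reduce distribution, circuit breaking, and pruning on top. Eclat uses a vertical layout: each item maps to the set of transaction ids (its TID set), and the support of an item set is the size of the intersection of its members' TID sets.

```java
import java.util.*;

// Minimal eclat sketch: depth-first search over item sets, where each item set
// carries the set of transaction ids (TID set) it occurs in. Intersecting TID
// sets gives the support of the combined item set. Illustrative only.
public class EclatSketch {

    public static Map<List<String>, Integer> eclat(List<Set<String>> transactions, int minSupport) {
        // build the vertical layout: item -> TID set
        Map<String, Set<Integer>> tidSets = new HashMap<>();
        for (int tid = 0; tid < transactions.size(); tid++) {
            for (String item : transactions.get(tid)) {
                tidSets.computeIfAbsent(item, k -> new HashSet<>()).add(tid);
            }
        }
        // keep only frequent single items, sorted for a deterministic search order
        List<Map.Entry<String, Set<Integer>>> frequent = tidSets.entrySet().stream()
            .filter(e -> e.getValue().size() >= minSupport)
            .sorted(Map.Entry.comparingByKey())
            .toList();

        Map<List<String>, Integer> result = new LinkedHashMap<>();
        search(List.of(), null, frequent, minSupport, result);
        return result;
    }

    private static void search(List<String> prefix, Set<Integer> prefixTids,
                               List<Map.Entry<String, Set<Integer>>> candidates,
                               int minSupport, Map<List<String>, Integer> result) {
        for (int i = 0; i < candidates.size(); i++) {
            var candidate = candidates.get(i);
            // TID set of prefix + candidate item: intersect (or take as-is at the root)
            Set<Integer> tids = new HashSet<>(candidate.getValue());
            if (prefixTids != null) {
                tids.retainAll(prefixTids);
            }
            if (tids.size() < minSupport) {
                continue; // anti-monotonicity: no superset can be frequent either
            }
            List<String> itemSet = new ArrayList<>(prefix);
            itemSet.add(candidate.getKey());
            result.put(itemSet, tids.size());
            // extend depth-first with the remaining candidates only
            search(itemSet, tids, candidates.subList(i + 1, candidates.size()), minSupport, result);
        }
    }
}
```

For example, calling `eclat(List.of(Set.of("beer", "diapers"), Set.of("beer", "diapers", "chips"), Set.of("chips")), 2)` would report [beer], [chips], [diapers], and [beer, diapers], each with a support of 2.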