[ML] add a frequent items aggregation #83055
Conversation
@elasticmachine run elasticsearch-ci/part-1
Please don't force-push during the review process, as it makes it much harder to follow the actual changes. Could you merge instead? That would make it easier to review incrementally. Our development process calls this out specifically.
```java
protected void doClose() {
    // disconnect the aggregation context circuit breaker, so big arrays used in results can be passed
    if (breakerService != null) {
        breakerService.disconnect();
    }
}
```
Okay. I understand why we're not returning the bytes from the map-reduce context to the circuit breaker here, but I don't see where they are being returned. InternalMapReduceAggregation describes how we manage the memory around deserialization, but I don't see where the instances created on the data node side release the memory they're holding.
On the aggregator side the AggContext takes care of circuit breaking: the bytes get reported to the AggContext, and the AggContext returns the allocation to the breaker in AggContext::close(). That means the bytes are returned before the actual object is destroyed. Between the AggContext being closed and the internal agg being sent off, the object is allocated without proper accounting by a circuit breaker. That's the first problem described in #88128.
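For readers following along, here is a minimal, hypothetical sketch of that report-then-release lifecycle. The class, field, and method names are illustrative and not the PR's actual code; only `CircuitBreaker::addEstimateBytesAndMaybeBreak` and the negative-delta `addWithoutBreaking` release mirror the real Elasticsearch breaker API.

```java
import org.elasticsearch.common.breaker.CircuitBreaker;

// Hypothetical sketch of the report-then-release lifecycle described above;
// names are illustrative, not taken from the PR.
class TrackedMapReduceContext implements AutoCloseable {
    private final CircuitBreaker breaker;
    private long reservedBytes = 0;

    TrackedMapReduceContext(CircuitBreaker breaker) {
        this.breaker = breaker;
    }

    void reportAllocation(long bytes) {
        // reserve against the breaker; throws CircuitBreakingException if over budget
        breaker.addEstimateBytesAndMaybeBreak(bytes, "frequent_items map-reduce");
        reservedBytes += bytes;
    }

    @Override
    public void close() {
        // hand the reservation back; the buffers may still be alive at this point,
        // which is the unaccounted window described above
        breaker.addWithoutBreaking(-reservedBytes);
        reservedBytes = 0;
    }
}
```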
On the coordinating side the memory gets accounted for in QueryPhaseResultConsumer::estimateRamBytesUsedForReduce; that's the infamous 1.5 magic. That means on the reduce side we don't do any circuit-breaker accounting of our own and simply assume the result fits in that budget. QueryPhaseResultConsumer allocates the budget and releases it for us. Most probably it over-allocates; problems 2-5 listed in #88128 cover that part.
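To make the "1.5 magic" concrete, this is a hedged sketch of the idea rather than the verbatim Elasticsearch code: the reduce budget is a fixed multiple of the serialized shard-result size, not a measurement of what the reduce actually allocates.

```java
// Hedged sketch, not the actual QueryPhaseResultConsumer implementation:
// budget the reduce as a fixed multiple of the serialized shard-result size.
static long estimateReduceBudget(long serializedShardResultBytes) {
    return (long) (1.5 * serializedShardResultBytes);
}
```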
Last but not least, the big arrays aren't using recycling. I don't think I can guarantee returning the buffers properly in all the error scenarios I can think of.
Have we done any performance testing on this? Do we have a sense of the performance requirements and scale we expect it to run at? Will we be adding it to any of the nightly benchmarks?
We have more than 30 test data sets and challenges which we use(d) for testing. We compare the agg with an open-source implementation of eclat. In the future we will use this test suite as a baseline for enhancements. I will share the data offline as it's non-public; I also showed it briefly in the meeting we had.
We have further plans for this agg. For 8.4 we want to go API-first. A UI prototype runs multiple asynchronous searches before it renders the result. We are satisfied with the runtime today, although faster is of course better. As discussed in #88128, this is not only about this agg but also about other concurrent searches. This agg supports random sampling, which we will apply at least on bigger data sets. With sampling we can trade speed for some loss of precision.
We will run bigger testing as part of our QA framework. In addition, I am looking into a Rally challenge, which might result in a nightly benchmark.
@elasticmachine update branch
I changed the naming as discussed, @not-napoleon.
not-napoleon
left a comment
Thanks for making the map reduce logic less visible. With that done, I think we can merge this. I still have some concerns about the performance at scale, but in the spirit of "Progress, simple, perfection", I think we can move ahead with that and adjust as we see usage.
With regard to the circuit breaker and memory tracking, I think we're all on the same page that we have work to do, but we don't want to block this PR on that work. The aggs team is going to start work on some infrastructure to address these concerns, with the understanding that once we have it we need to revisit this aggregation and use those new tools.
droberts195
left a comment
LGTM2
Thanks for pushing this to completion @hendrikmuhs and thanks for all the time you spent reviewing @not-napoleon.
@hendrikmuhs according to this PR's labels, I need to update the changelog YAML, but I can't because the PR is closed. Please either update the changelog yourself on the appropriate branch, or adjust the labels.
Since this aggregation is experimental and primarily designed for our own UIs to use, we've decided not to call it out in the release highlights for 8.4. Instead it will just get a single bullet-point release note.
The PR adds an aggregation called `frequent_items`, a bucket aggregation which finds frequent item sets. It is a form of association rules mining that identifies items that often occur together. It also helps you discover relationships between different data points (items). For more information about usage have a look at #86037.
This implements frequent items using an algorithm called eclat.
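Since the release note only names the algorithm, a minimal, illustrative sketch of the eclat idea may help. This is plain Java and deliberately not the PR's implementation, which adds transaction deduplication, map-reduce distribution, circuit breaking, and pruning on top. Eclat uses a vertical layout: each item maps to the set of transaction ids (its TID set), and the support of an item set is the size of the intersection of its members' TID sets.

```java
import java.util.*;

// Minimal eclat sketch: depth-first search over item sets, where each item set
// carries the set of transaction ids (TID set) it occurs in. Intersecting TID
// sets gives the support of the combined item set. Illustrative only.
public class EclatSketch {

    public static Map<List<String>, Integer> eclat(List<Set<String>> transactions, int minSupport) {
        // build the vertical layout: item -> TID set
        Map<String, Set<Integer>> tidSets = new HashMap<>();
        for (int tid = 0; tid < transactions.size(); tid++) {
            for (String item : transactions.get(tid)) {
                tidSets.computeIfAbsent(item, k -> new HashSet<>()).add(tid);
            }
        }
        // keep only frequent single items, sorted for a deterministic search order
        List<Map.Entry<String, Set<Integer>>> frequent = tidSets.entrySet().stream()
            .filter(e -> e.getValue().size() >= minSupport)
            .sorted(Map.Entry.comparingByKey())
            .toList();

        Map<List<String>, Integer> result = new LinkedHashMap<>();
        search(List.of(), null, frequent, minSupport, result);
        return result;
    }

    private static void search(List<String> prefix, Set<Integer> prefixTids,
                               List<Map.Entry<String, Set<Integer>>> candidates,
                               int minSupport, Map<List<String>, Integer> result) {
        for (int i = 0; i < candidates.size(); i++) {
            var candidate = candidates.get(i);
            // TID set of prefix + candidate item: intersect (or take as-is at the root)
            Set<Integer> tids = new HashSet<>(candidate.getValue());
            if (prefixTids != null) {
                tids.retainAll(prefixTids);
            }
            if (tids.size() < minSupport) {
                continue; // anti-monotonicity: no superset can be frequent either
            }
            List<String> itemSet = new ArrayList<>(prefix);
            itemSet.add(candidate.getKey());
            result.put(itemSet, tids.size());
            // extend depth-first with the remaining candidates only
            search(itemSet, tids, candidates.subList(i + 1, candidates.size()), minSupport, result);
        }
    }
}
```

For example, calling `eclat(List.of(Set.of("beer", "diapers"), Set.of("beer", "diapers", "chips"), Set.of("chips")), 2)` would report [beer], [chips], [diapers], and [beer, diapers], each with a support of 2.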