Skip to content

Conversation

@hendrikmuhs
Copy link

@hendrikmuhs hendrikmuhs commented Jan 25, 2022

The PR adds an aggregation called frequent_items, a bucket aggregation which finds frequent item sets. It is a form of association rules mining that identifies items that often occur together. It also helps you to discover relationships between different data points (items).

For more information about usage have a look at #86037.

This implements frequent items using an algorithm called eclat.

@hendrikmuhs hendrikmuhs added WIP cloud-deploy Publish cloud docker image for Cloud-First-Testing labels Jan 25, 2022
@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@hendrikmuhs hendrikmuhs force-pushed the frequent-items branch 2 times, most recently from 95f1012 to 310ba93 Compare March 8, 2022 11:36
@hendrikmuhs hendrikmuhs force-pushed the frequent-items branch 2 times, most recently from 0cd37e4 to 882b95f Compare March 15, 2022 11:54
@hendrikmuhs hendrikmuhs force-pushed the frequent-items branch 2 times, most recently from ccce27c to 88aef2e Compare March 24, 2022 11:28
@hendrikmuhs hendrikmuhs force-pushed the frequent-items branch 2 times, most recently from 73af00f to d720bd8 Compare April 13, 2022 16:32
@hendrikmuhs
Copy link
Author

@elasticmachine run elasticsearch-ci/part-1

@hendrikmuhs hendrikmuhs force-pushed the frequent-items branch 4 times, most recently from 471c138 to 796bf52 Compare April 27, 2022 15:24
@hendrikmuhs hendrikmuhs added >enhancement :ml Machine learning and removed WIP labels Apr 29, 2022
@hendrikmuhs hendrikmuhs marked this pull request as ready for review April 29, 2022 06:56
@not-napoleon
Copy link
Member

I re-based the branch, a lot of tests fail otherwise.

Please don't do that during the review process, as it makes it much harder to follow the actual changes. Could you merge instead? That would make it easier to review incrementally. Our development process actually calls that out specifically.

protected void doClose() {
// disconnect the aggregation context circuit breaker, so big arrays used in results can be passed
if (breakerService != null) {
breakerService.disconnect();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I understand why we're not returning the bytes from the map reduce context to the CB here, but I don't see where they are being returned. InternalMapReduceAggregation talks about how we manage the memory around deserialization, but I don't see where the ones created data node side release the memory they're holding from here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Aggregator side the AggContext takes care of circuit breaking, the bytes get reported to the AggContext and the AggContext returns the allocation to the breaker in AggContext::close(). That means the bytes are returned before the actual object is destroyed. Between the AggContext getting closed and the internal agg getting send off, the object is allocated without proper accounting by a circuit breaker. That's the first problem described in #88128.

On the coordinating side the memory gets accounted in QueryPhaseResultConsumer::estimateRamBytesUsedForReduce, that's the infamous 1.5 magic. That means on the reduce side we simply don't do any CB magic and assume it fits in the budget. QueryPhaseResultConsumer allocates the budget and releases it for us. Most probably it over-allocates, problems 2-5 listed in #88128 talk about that part.

LBNL the big arrays aren't using recycling. I don't think I can guarantee returning the buffers properly in all error scenarios I can think of.

@not-napoleon
Copy link
Member

Have we done any performance testing on this? Do we have a sense of the performance requirements and scale we expect it to run at? Will we be adding it to any of the nightly benchmarks?

@hendrikmuhs
Copy link
Author

hendrikmuhs commented Jul 12, 2022

Have we done any performance testing on this?

We have more than 30 test data sets and challenges which we use(d) for testing. We compare the agg with an opensource implementation of eclat. In future we will use this test suite as baseline for enhancements.

I will share the data offline as it's non-public. I've also showed it briefly in the meeting we had.

Do we have a sense of the performance requirements and scale we expect it to run at?

We have further plans for this agg. For 8.4 we want to go API 1st. An UI prototype runs multiple asynchronous searches before it renders the result. We are satisfied with the runtime today, however faster is of course better. As discussed in #88128 this is not only about this agg, but also regarding other concurrent searches.

This agg supports random sampling which we will apply at least on bigger data sets. With sampling we can trade speed for some loss of precision.

Will we be adding it to any of the nightly benchmarks?

We will run bigger testing as part of our QA framework. In addition I am looking into a rally challenge, which might result in a nightly benchmark.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:09
@hendrikmuhs
Copy link
Author

@elasticmachine update branch

@hendrikmuhs
Copy link
Author

hendrikmuhs commented Jul 25, 2022

I changed naming as discussed @not-napoleon.

Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the map reduce logic less visible. With that done, I think we can merge this. I still have some concerns about the performance at scale, but in the spirit of "Progress, simple, perfection", I think we can move ahead with that and adjust as we see usage.

With regards to the circuit breaker & memory tracking, I think we're all on the same page that we have work to do, but we don't want to block this PR on that work. The aggs team is going to start work on some infrastructure to address these concerns, with the understanding that once we have that we need to revisit this aggregation and use those new tools.

Copy link

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2

Thanks for pushing this to completion @hendrikmuhs and thanks for all the time you spent reviewing @not-napoleon.

@hendrikmuhs hendrikmuhs merged commit a0ba59d into elastic:main Jul 26, 2022
@elasticsearchmachine
Copy link
Collaborator

@hendrikmuhs according to this PR's labels, I need to update the changelog YAML, but I can't because the PR is closed. Please either update the changelog yourself on the appropriate branch, or adjust the labels. Specifically:

  • The PR is labelled release highlight but the changelog has no highlight section

@droberts195
Copy link

Since this aggregation is experimental and primarily designed for our own UIs to use we've decided not to call it out in the release highlights for 8.4. Instead it will just get the single bullet point release note.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants