Revisit defaults for the cardinality aggregation? #13985

Description

@jpountz

The precision_threshold parameter of the cardinality aggregation has an impact not only on accuracy but also on memory usage. This is why, by default, we decide how much memory a cardinality aggregation may use based on how deep it sits in the aggregation tree. For instance, a top-level cardinality aggregation would use 16KB of memory, a cardinality aggregation under a terms aggregation would use 512 bytes per bucket, and a cardinality aggregation under two (or more) levels of terms aggregations would use 16 bytes per bucket.
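For context, the memory/accuracy trade-off can also be pinned explicitly per request via precision_threshold rather than relying on these depth-based defaults. A minimal sketch of what that looks like (the index and field names are hypothetical, and the threshold value is only an example):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_authors": {
      "cardinality": {
        "field": "author",
        "precision_threshold": 3000
      }
    }
  }
}
```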

Unfortunately, it's not easy to get precise counts with only 16 bytes of memory, which can make the out-of-the-box experience a bit disappointing. I think we have several (non-exclusive) options here:

  • increase default memory usage, but I'm nervous about making it even easier to trigger circuit-breaker errors or, worse, out-of-memory errors. Maybe "Define good heuristics to use collect_mode: breadth_first" (#9825) could help here: we could decide to always run terms aggs in breadth-first mode if there is a cardinality agg under them, so that the cardinality aggregation is computed on fewer buckets (see the sketch after this list)
  • better document these defaults
  • move parts of the aggs computation to disk so that we can increase our defaults more safely
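To illustrate the collect_mode: breadth_first idea from the first option, here is a rough sketch of a terms aggregation run in breadth-first mode with a cardinality sub-aggregation, so the cardinality is only computed on the buckets that survive pruning (index and field names are hypothetical):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "collect_mode": "breadth_first"
      },
      "aggs": {
        "distinct_users": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}
```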
