Skip to content

Conversation

@nik9000
Copy link
Member

@nik9000 nik9000 commented May 27, 2020

When the terms enum operates on non-numeric data it can collect it via
global ordinals. It actually has two separate collection strategies for this,
one "dense" and one "remapping". Each of those strategies has two
"iteration" strategies that it uses to build buckets, depending on
whether or not we need buckets with 0 docs in them. Previously this
was done with several null checks and never really explained. This
change replaces those checks with two CollectionStrategy classes which
have good stuff like Javadocs.

It also adds the name of the strategy used to the debugging information.

Related to #56487

When the `terms` enum operates on non-numeric data it can collect it via
global ordinals. It actually has two separate collection strategies for,
one "dense" and one "remapping". Each of *those* strategies has two
"iteration" strategies that it uses to build buckets, depending on
whether or not we need buckets with `0` docs in them. Previously this
was done with several `null` checks and never really explained. This
change replaces those checks with two `CollectionStrategy` classes which
have good stuff like documentation.
@nik9000
Copy link
Member Author

nik9000 commented May 27, 2020

Ah! I should add a profiler test for the collection_strategy debug information.

@nik9000 nik9000 marked this pull request as ready for review May 28, 2020 13:19
@nik9000 nik9000 requested a review from not-napoleon May 28, 2020 13:19
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 28, 2020
Copy link
Member

@not-napoleon not-napoleon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, definitely more clear than what we had before.

*/
abstract void globalOrdsReady(SortedSetDocValues globalOrds);
/**
* Collect a global ordinal.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I presume this is what gets called by the collector, but I think it'd help to make this javadoc a little more explicit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

boolean remapGlobalOrds() {
return bucketOrds != null;
this.acceptedGlobalOrdinals = includeExclude == null ? l -> true : includeExclude.acceptedGlobalOrdinals(values)::get;
this.collectionStrategy = remapGlobalOrds ? new RemapGlobalOrds() : new DenseGlobalOrds();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to just pass the collection strategy in directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm - I did think about it but we'd have to do that same "pass a builder" kind of thing because it is a non-static inner class.

}

/**
* Strategy for collecting global ordinals.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your comment from the PR description that we have two collection strategies and each has two iteration strategies would fit well in the javadoc for this class. Or at the very least, a note on forEach that it should account for both iteration cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nik9000
Copy link
Member Author

nik9000 commented May 28, 2020

This looks good, definitely more clear than what we had before.

It took quite a bit of staring for me to figure out what we had before!

@nik9000 nik9000 merged commit 974d236 into elastic:master May 28, 2020
@nik9000
Copy link
Member Author

nik9000 commented May 28, 2020

Thanks @not-napoleon! I've added some javadoc and merged. If you can think of a way to pass the strategy in that you'd prefer please ping me!

nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 28, 2020
When the `terms` enum operates on non-numeric data it can collect it via
global ordinals. It actually has two separate collection strategies for,
one "dense" and one "remapping". Each of *those* strategies has two
"iteration" strategies that it uses to build buckets, depending on
whether or not we need buckets with `0` docs in them. Previously this
was done with several `null` checks and never really explained. This
change replaces those checks with two `CollectionStrategy` classes which
have good stuff like documentation.
nik9000 added a commit that referenced this pull request May 28, 2020
I accidentally didn't put the customary "skip the last version" on
 #57241 and the PR tests didn't catch it. This adds it.
nik9000 added a commit that referenced this pull request May 28, 2020
…7311)

When the `terms` enum operates on non-numeric data it can collect it via
global ordinals. It actually has two separate collection strategies for,
one "dense" and one "remapping". Each of *those* strategies has two
"iteration" strategies that it uses to build buckets, depending on
whether or not we need buckets with `0` docs in them. Previously this
was done with several `null` checks and never really explained. This
change replaces those checks with two `CollectionStrategy` classes which
have good stuff like documentation.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 28, 2020
nik9000 added a commit that referenced this pull request May 29, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jun 1, 2020
When the `terms` agg runs against strings and uses global ordinals it
has an optimization when it collects segments that only ever have a
single value for the particular string. This is *very* common. But I
broke it in elastic#57241. This fixes that optimization and adds `debug`
information that you can use to see how often we collect segments of
each type. And adds a test to make sure that I don't break the
optimization again.

We also had a specialiation for when there isn't a filter on the terms
to aggregate. I had removed that specialization in elastic#57241 which resulted
in some slow down as well. This adds it back but in a more clear way.
And, hopefully, a way that is marginally faster when there *is* a
filter.

Closes elastic#57407
nik9000 added a commit that referenced this pull request Jun 2, 2020
When the `terms` agg runs against strings and uses global ordinals it
has an optimization when it collects segments that only ever have a
single value for the particular string. This is *very* common. But I
broke it in #57241. This fixes that optimization and adds `debug`
information that you can use to see how often we collect segments of
each type. And adds a test to make sure that I don't break the
optimization again.

We also had a specialiation for when there isn't a filter on the terms
to aggregate. I had removed that specialization in #57241 which resulted
in some slow down as well. This adds it back but in a more clear way.
And, hopefully, a way that is marginally faster when there *is* a
filter.

Closes #57407
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jun 2, 2020
When the `terms` agg runs against strings and uses global ordinals it
has an optimization when it collects segments that only ever have a
single value for the particular string. This is *very* common. But I
broke it in elastic#57241. This fixes that optimization and adds `debug`
information that you can use to see how often we collect segments of
each type. And adds a test to make sure that I don't break the
optimization again.

We also had a specialiation for when there isn't a filter on the terms
to aggregate. I had removed that specialization in elastic#57241 which resulted
in some slow down as well. This adds it back but in a more clear way.
And, hopefully, a way that is marginally faster when there *is* a
filter.

Closes elastic#57407
nik9000 added a commit that referenced this pull request Jun 2, 2020
When the `terms` agg runs against strings and uses global ordinals it
has an optimization when it collects segments that only ever have a
single value for the particular string. This is *very* common. But I
broke it in #57241. This fixes that optimization and adds `debug`
information that you can use to see how often we collect segments of
each type. And adds a test to make sure that I don't break the
optimization again.

We also had a specialiation for when there isn't a filter on the terms
to aggregate. I had removed that specialization in #57241 which resulted
in some slow down as well. This adds it back but in a more clear way.
And, hopefully, a way that is marginally faster when there *is* a
filter.

Closes #57407
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations >refactoring Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.9.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants