Commit 326fe75

New Histogram field mapper that supports percentiles aggregations. (#48580) (#49683)
This commit adds a new histogram field mapper that consists in a pre-aggregated format of numerical data to be used in percentiles aggregations.
1 parent 04e9cbd commit 326fe75

32 files changed: 2127 additions and 72 deletions

docs/reference/aggregations/metrics/percentile-aggregation.asciidoc

Lines changed: 3 additions & 3 deletions
@@ -2,9 +2,9 @@
 === Percentiles Aggregation
 
 A `multi-value` metrics aggregation that calculates one or more percentiles
-over numeric values extracted from the aggregated documents. These values
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or
+<<histogram,histogram fields>> in the documents.
 
 Percentiles show the point at which a certain percentage of observed values
 occur. For example, the 95th percentile is the value which is greater than 95%

docs/reference/aggregations/metrics/percentile-rank-aggregation.asciidoc

Lines changed: 3 additions & 3 deletions
@@ -2,9 +2,9 @@
 === Percentile Ranks Aggregation
 
 A `multi-value` metrics aggregation that calculates one or more percentile ranks
-over numeric values extracted from the aggregated documents. These values
-can be extracted either from specific numeric fields in the documents, or
-be generated by a provided script.
+over numeric values extracted from the aggregated documents. These values can be
+generated by a provided script or extracted from specific numeric or
+<<histogram,histogram fields>> in the documents.
 
 [NOTE]
 ==================================================

docs/reference/mapping/types.asciidoc

Lines changed: 5 additions & 0 deletions
@@ -32,6 +32,7 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>
 <<ip>>:: `ip` for IPv4 and IPv6 addresses
 <<completion-suggester,Completion datatype>>::
 `completion` to provide auto-complete suggestions
+
 <<token-count>>:: `token_count` to count the number of tokens in a string
 {plugins}/mapper-murmur3.html[`mapper-murmur3`]:: `murmur3` to compute hashes of values at index-time and store them in the index
 {plugins}/mapper-annotated-text.html[`mapper-annotated-text`]:: `annotated-text` to index text containing special markup (typically used for identifying named entities)
@@ -56,6 +57,8 @@ string:: <<text,`text`>> and <<keyword,`keyword`>>
 
 <<shape>>:: `shape` for arbitrary cartesian geometries.
 
+<<histogram>>:: `histogram` for pre-aggregated numerical values for percentiles aggregations.
+
 [float]
 [[types-array-handling]]
 === Arrays
@@ -91,6 +94,8 @@ include::types/date_nanos.asciidoc[]
 
 include::types/dense-vector.asciidoc[]
 
+include::types/histogram.asciidoc[]
+
 include::types/flattened.asciidoc[]
 
 include::types/geo-point.asciidoc[]
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
+[role="xpack"]
+[testenv="basic"]
+[[histogram]]
+=== Histogram datatype
+++++
+<titleabbrev>Histogram</titleabbrev>
+++++
+
+A field to store pre-aggregated numerical data representing a histogram.
+This data is defined using two paired arrays:
+
+* A `values` array of <<number, `double`>> numbers, representing the buckets for
+the histogram. These values must be provided in ascending order.
+* A corresponding `counts` array of <<number, `integer`>> numbers, representing how
+many values fall into each bucket. These numbers must be positive or zero.
+
+Because the elements in the `values` array correspond to the elements in the
+same position of the `counts` array, these two arrays must have the same length.
+
+[IMPORTANT]
+========
+* A `histogram` field can only store a single pair of `values` and `counts` arrays
+per document. Nested arrays are not supported.
+* `histogram` fields do not support sorting.
+========
+
+[[histogram-uses]]
+==== Uses
+
+`histogram` fields are primarily intended for use with aggregations. To make it
+more readily accessible for aggregations, `histogram` field data is stored as
+binary <<doc-values,doc values>> and not indexed. Its size in bytes is at most
+`13 * numValues`, where `numValues` is the length of the provided arrays.
+
+Because the data is not indexed, you can only use `histogram` fields for the
+following aggregations and queries:
+
+* <<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation
+* <<search-aggregations-metrics-percentile-rank-aggregation,percentile ranks>> aggregation
+* <<query-dsl-exists-query,exists>> query
+
+[[mapping-types-histogram-building-histogram]]
+==== Building a histogram
+
+When using a histogram as part of an aggregation, the accuracy of the results will depend on how the
+histogram was constructed. It is important to consider the percentiles aggregation mode that will be used
+to build it. Some possibilities include:
+
+- For the <<search-aggregations-metrics-percentile-aggregation, T-Digest>> mode, the `values` array represents
+the mean centroid positions and the `counts` array represents the number of values that are attributed to each
+centroid. If the algorithm has already started to approximate the percentiles, this inaccuracy is
+carried over in the histogram.
+
+- For the <<_hdr_histogram,High Dynamic Range (HDR)>> histogram mode, the `values` array represents fixed upper
+limits of each bucket interval, and the `counts` array represents the number of values that are attributed to each
+interval. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits),
+therefore the value used when generating the histogram would be the maximum accuracy you can achieve at aggregation time.
+
+The histogram field is "algorithm agnostic" and does not store data specific to either T-Digest or HDRHistogram. While this
+means the field can technically be aggregated with either algorithm, in practice the user should choose one algorithm and
+index data in that manner (e.g. centroids for T-Digest or intervals for HDRHistogram) to ensure best accuracy.
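As a rough illustration of the interval-based layout described above, the following Java sketch pre-aggregates raw samples into buckets keyed by fixed upper limits and produces `values` and `counts` arrays of equal length. The class name `FixedBucketHistogram`, the bucket limits, and the decision to clamp out-of-range samples into the last bucket are assumptions made for this example. A real HDRHistogram derives its bucket layout from the configured number of significant digits, which this sketch does not attempt.

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not part of this commit): counts samples into
 * buckets identified by ascending, inclusive upper limits.
 */
final class FixedBucketHistogram {
    /** Upper limit of each bucket, in ascending order. */
    final double[] values;
    /** Number of samples recorded in the bucket at the same position. */
    final int[] counts;

    FixedBucketHistogram(double[] ascendingUpperLimits) {
        this.values = ascendingUpperLimits.clone();
        this.counts = new int[ascendingUpperLimits.length];
    }

    void record(double sample) {
        int idx = Arrays.binarySearch(values, sample);
        if (idx < 0) {
            // Not an exact limit: -idx - 1 is the first limit greater than the sample.
            idx = -idx - 1;
        }
        if (idx == values.length) {
            // Assumption of this sketch: samples above the last limit go into the last bucket.
            idx = values.length - 1;
        }
        counts[idx]++;
    }
}
```

The resulting `values` and `counts` fields could then be serialized as the two arrays of a `histogram` field document.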
+
+[[histogram-ex]]
+==== Examples
+
+The following <<indices-create-index, create index>> API request creates a new index with two field mappings:
+
+* `my_histogram`, a `histogram` field used to store percentile data
+* `my_text`, a `keyword` field used to store a title for the histogram
+
+[ INSERT CREATE INDEX SNIPPET ]
+[source,console]
+--------------------------------------------------
+PUT my_index
+{
+  "mappings": {
+    "properties": {
+      "my_histogram": {
+        "type" : "histogram"
+      },
+      "my_text" : {
+        "type" : "keyword"
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+The following <<docs-index_,index>> API requests store pre-aggregated data for
+two histograms: `histogram_1` and `histogram_2`.
+
+[source,console]
+--------------------------------------------------
+PUT my_index/_doc/1
+{
+  "my_text" : "histogram_1",
+  "my_histogram" : {
+      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], <1>
+      "counts" : [3, 7, 23, 12, 6] <2>
+   }
+}
+
+PUT my_index/_doc/2
+{
+  "my_text" : "histogram_2",
+  "my_histogram" : {
+      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5], <1>
+      "counts" : [8, 17, 8, 7, 6, 2] <2>
+   }
+}
+--------------------------------------------------
+<1> Values for each bucket. Values in the array are treated as doubles and must be given in
+increasing order. For <<search-aggregations-metrics-percentile-aggregation-approximation, T-Digest>>
+histograms this value represents the mean value. In case of HDR histograms this represents the value iterated to.
+<2> Count for each bucket. Values in the arrays are treated as integers and must be positive or zero.
+Negative values will be rejected. The relation between a bucket and a count is given by the position in the array.
+
+
+
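The constraints stated in this file (equal array lengths, values in ascending order, counts that are positive or zero) can be checked on the client side before a document is sent. The Java sketch below is one way to do that. `HistogramInputCheck` is a hypothetical helper and not the mapper's own validation code. It rejects only a value that is smaller than its predecessor, because the text does not say whether equal neighbouring values are accepted.

```java
/**
 * Hypothetical client-side check (not part of this commit) for the
 * documented rules on the `values` and `counts` arrays.
 */
final class HistogramInputCheck {
    private HistogramInputCheck() {}

    static void validate(double[] values, int[] counts) {
        if (values.length != counts.length) {
            throw new IllegalArgumentException("expected arrays of the same length, got "
                + values.length + " values and " + counts.length + " counts");
        }
        for (int i = 0; i < values.length; i++) {
            if (counts[i] < 0) {
                throw new IllegalArgumentException("negative count at position " + i);
            }
            // Assumption: equal neighbours are tolerated, only a decrease is rejected.
            if (i > 0 && values[i] < values[i - 1]) {
                throw new IllegalArgumentException("values not in ascending order at position " + i);
            }
        }
    }

    /** Convenience wrapper that reports the outcome as a boolean. */
    static boolean isValid(double[] values, int[] counts) {
        try {
            validate(values, counts);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

Both example documents above pass this check. A document whose arrays differ in length or that contains a negative count does not.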
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.elasticsearch.index.fielddata;
+
+
+import java.io.IOException;
+
+/**
+ * {@link AtomicFieldData} specialization for histogram data.
+ */
+public interface AtomicHistogramFieldData extends AtomicFieldData {
+
+    /**
+     * Return Histogram values.
+     */
+    HistogramValues getHistogramValues() throws IOException;
+
+}
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.fielddata;
+
+import java.io.IOException;
+
+/**
+ * Per-document histogram value. Every value of the histogram consists of
+ * a value and a count.
+ */
+public abstract class HistogramValue {
+
+    /**
+     * Advance this instance to the next value of the histogram
+     * @return true if there is a next value
+     */
+    public abstract boolean next() throws IOException;
+
+    /**
+     * the current value of the histogram
+     * @return the current value of the histogram
+     */
+    public abstract double value();
+
+    /**
+     * The current count of the histogram
+     * @return the current count of the histogram
+     */
+    public abstract int count();
+
+}
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.fielddata;
+
+import java.io.IOException;
+
+/**
+ * Per-segment histogram values.
+ */
+public abstract class HistogramValues {
+
+    /**
+     * Advance this instance to the given document id
+     * @return true if there is a value for this document
+     */
+    public abstract boolean advanceExact(int doc) throws IOException;
+
+    /**
+     * Get the {@link HistogramValue} associated with the current document.
+     * The returned {@link HistogramValue} might be reused across calls.
+     */
+    public abstract HistogramValue histogram() throws IOException;
+
+}
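The two abstract classes above imply a simple access pattern: position on a document with `advanceExact`, fetch its `HistogramValue`, then step through the buckets with `next()`. The Java sketch below walks through that pattern under stated assumptions. To stay self-contained it redeclares package-private stand-ins with the same method signatures as `HistogramValue` and `HistogramValues`. `InMemoryHistogramValues` and `HistogramSum` are hypothetical names invented for the example and do not appear in this commit.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-ins mirroring the signatures shown in the diff (assumption: simplified visibility).
abstract class HistogramValue {
    abstract boolean next() throws IOException;
    abstract double value();
    abstract int count();
}

abstract class HistogramValues {
    abstract boolean advanceExact(int doc) throws IOException;
    abstract HistogramValue histogram() throws IOException;
}

/** Hypothetical array-backed implementation; a null row means the document has no histogram. */
final class InMemoryHistogramValues extends HistogramValues {
    private final double[][] values;
    private final int[][] counts;
    private int doc = -1;

    InMemoryHistogramValues(double[][] values, int[][] counts) {
        this.values = values;
        this.counts = counts;
    }

    @Override
    boolean advanceExact(int doc) {
        this.doc = doc;
        return doc >= 0 && doc < values.length && values[doc] != null;
    }

    @Override
    HistogramValue histogram() {
        final double[] v = values[doc];
        final int[] c = counts[doc];
        return new HistogramValue() {
            private int i = -1;
            @Override boolean next() { i++; return i < v.length; }
            @Override double value() { return v[i]; }
            @Override int count() { return c[i]; }
        };
    }
}

/** Hypothetical consumer: sums the counts of one document's histogram. */
final class HistogramSum {
    static long totalCount(HistogramValues values, int doc) {
        try {
            if (!values.advanceExact(doc)) {
                return 0L; // no histogram stored for this document
            }
            HistogramValue histogram = values.histogram();
            long total = 0;
            while (histogram.next()) {
                total += histogram.count();
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the Javadoc notes that the returned `HistogramValue` might be reused across calls, the sketch consumes it fully before advancing to another document rather than holding on to it.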
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+/*
+ * Licensed to Elasticsearch under one or more contributor
+ * license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright
+ * ownership. Elasticsearch licenses this file to you under
+ * the Apache License, Version 2.0 (the "License"); you may
+ * not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.elasticsearch.index.fielddata;
+
+
+import org.elasticsearch.index.Index;
+import org.elasticsearch.index.fielddata.plain.DocValuesIndexFieldData;
+
+/**
+ * Specialization of {@link IndexFieldData} for histograms.
+ */
+public abstract class IndexHistogramFieldData extends DocValuesIndexFieldData implements IndexFieldData<AtomicHistogramFieldData> {
+
+    public IndexHistogramFieldData(Index index, String fieldName) {
+        super(index, fieldName);
+    }
+}
