Commit 4dc833f

Add doc_count field mapper (#64503)
Bucket aggregations compute bucket doc_count values by incrementing the doc_count by 1 for every document collected in the bucket. When using summary fields (such as aggregate_metric_double) one field may represent more than one document. To provide this functionality we have implemented a new field mapper (named doc_count field mapper). This field is a positive integer representing the number of documents aggregated in a single summary field. Bucket aggregations will check if a field of type doc_count exists in a document and will take this value into consideration when computing doc counts.
1 parent 4add5cb commit 4dc833f
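
The counting rule in the commit message can be sketched in a few lines. This is a minimal Python illustration, not the Elasticsearch implementation: `bucket_doc_counts` is a hypothetical helper, documents are modeled as plain dicts, and the sample data is made up.

```python
from collections import defaultdict


def bucket_doc_counts(docs, field):
    """Return {bucket key: doc_count}, honoring an optional `_doc_count` per document."""
    counts = defaultdict(int)
    for doc in docs:
        if field not in doc:
            continue  # a document without the field lands in no bucket
        # A plain document counts as 1; a summary document counts as `_doc_count`.
        counts[doc[field]] += doc.get("_doc_count", 1)
    return dict(counts)


# Made-up documents for illustration only.
docs = [
    {"host": "a"},                    # ordinary document
    {"host": "a", "_doc_count": 10},  # summary of 10 documents
    {"host": "b", "_doc_count": 3},   # summary of 3 documents
]

print(bucket_doc_counts(docs, "host"))  # {'a': 11, 'b': 3}
```

Bucket `a` ends up with 11 rather than 2 because the summary document contributes its stored count instead of 1.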

22 files changed: 786 additions and 63 deletions

docs/reference/mapping/fields.asciidoc

Lines changed: 8 additions & 1 deletion
@@ -29,6 +29,13 @@ fields can be customized when a mapping is created.
 The size of the `_source` field in bytes, provided by the
 {plugins}/mapper-size.html[`mapper-size` plugin].
 
+[discrete]
+=== Doc count metadata field
+
+<<mapping-doc-count-field,`_doc_count`>>::
+
+A custom field used for storing doc counts when a document represents pre-aggregated data.
+
 [discrete]
 === Indexing metadata fields
 
@@ -55,6 +62,7 @@ fields can be customized when a mapping is created.
 
 Application specific metadata.
 
+include::fields/doc-count-field.asciidoc[]
 
 include::fields/field-names-field.asciidoc[]
 
@@ -69,4 +77,3 @@ include::fields/meta-field.asciidoc[]
 include::fields/routing-field.asciidoc[]
 
 include::fields/source-field.asciidoc[]
-
Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
+[[mapping-doc-count-field]]
+=== `_doc_count` data type
+++++
+<titleabbrev>_doc_count</titleabbrev>
+++++
+
+Bucket aggregations always return a field named `doc_count` showing the number of documents that were aggregated and partitioned
+in each bucket. Computation of the value of `doc_count` is very simple. `doc_count` is incremented by 1 for every document collected
+in each bucket.
+
+While this simple approach is effective when computing aggregations over individual documents, it fails to accurately represent
+documents that store pre-aggregated data (such as `histogram` or `aggregate_metric_double` fields), because one summary field may
+represent multiple documents.
+
+To allow for correct computation of the number of documents when working with pre-aggregated data, we have introduced a
+metadata field type named `_doc_count`. `_doc_count` must always be a positive integer representing the number of documents
+aggregated in a single summary field.
+
+When field `_doc_count` is added to a document, all bucket aggregations will respect its value and increment the bucket `doc_count`
+by the value of the field. If a document does not contain any `_doc_count` field, `_doc_count = 1` is implied by default.
+
+[IMPORTANT]
+========
+* A `_doc_count` field can only store a single positive integer per document. Nested arrays are not allowed.
+* If a document contains no `_doc_count` fields, aggregators will increment by 1, which is the default behavior.
+========
+
+[[mapping-doc-count-field-example]]
+==== Example
+
+The following <<indices-create-index, create index>> API request creates a new index with the following field mappings:
+
+* `my_histogram`, a `histogram` field used to store percentile data
+* `my_text`, a `keyword` field used to store a title for the histogram
+
+[source,console]
+--------------------------------------------------
+PUT my_index
+{
+  "mappings" : {
+    "properties" : {
+      "my_histogram" : {
+        "type" : "histogram"
+      },
+      "my_text" : {
+        "type" : "keyword"
+      }
+    }
+  }
+}
+--------------------------------------------------
+
+The following <<docs-index_,index>> API requests store pre-aggregated data for
+two histograms: `histogram_1` and `histogram_2`.
+
+[source,console]
+--------------------------------------------------
+PUT my_index/_doc/1
+{
+  "my_text" : "histogram_1",
+  "my_histogram" : {
+    "values" : [0.1, 0.2, 0.3, 0.4, 0.5],
+    "counts" : [3, 7, 23, 12, 6]
+  },
+  "_doc_count": 45 <1>
+}
+
+PUT my_index/_doc/2
+{
+  "my_text" : "histogram_2",
+  "my_histogram" : {
+    "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5],
+    "counts" : [8, 17, 8, 7, 6, 2]
+  },
+  "_doc_count": 62 <1>
+}
+--------------------------------------------------
+<1> Field `_doc_count` must be a positive integer storing the number of documents aggregated to produce each histogram.
+
+If we run the following <<search-aggregations-bucket-terms-aggregation, terms aggregation>> on `my_index`:
+
+[source,console]
+--------------------------------------------------
+GET /_search
+{
+  "aggs" : {
+    "histogram_titles" : {
+      "terms" : { "field" : "my_text" }
+    }
+  }
+}
+--------------------------------------------------
+
+We will get the following response:
+
+[source,console-result]
+--------------------------------------------------
+{
+  ...
+  "aggregations" : {
+    "histogram_titles" : {
+      "doc_count_error_upper_bound": 0,
+      "sum_other_doc_count": 0,
+      "buckets" : [
+        {
+          "key" : "histogram_2",
+          "doc_count" : 62
+        },
+        {
+          "key" : "histogram_1",
+          "doc_count" : 45
+        }
+      ]
+    }
+  }
+}
+--------------------------------------------------
+// TESTRESPONSE[skip:test not setup]
Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
+setup:
+  - do:
+      indices.create:
+        index: test_1
+        body:
+          settings:
+            number_of_replicas: 0
+          mappings:
+            properties:
+              str:
+                type: keyword
+              number:
+                type: integer
+
+  - do:
+      bulk:
+        index: test_1
+        refresh: true
+        body:
+          - '{"index": {}}'
+          - '{"_doc_count": 10, "str": "abc", "number" : 500, "unmapped": "abc" }'
+          - '{"index": {}}'
+          - '{"_doc_count": 5, "str": "xyz", "number" : 100, "unmapped": "xyz" }'
+          - '{"index": {}}'
+          - '{"_doc_count": 7, "str": "foo", "number" : 100, "unmapped": "foo" }'
+          - '{"index": {}}'
+          - '{"_doc_count": 1, "str": "foo", "number" : 200, "unmapped": "foo" }'
+          - '{"index": {}}'
+          - '{"str": "abc", "number" : 500, "unmapped": "abc" }'
+
---
32+
"Test numeric terms agg with doc_count":
33+
- skip:
34+
version: " - 7.99.99"
35+
reason: "Doc count fields are only implemented in 8.0"
36+
37+
- do:
38+
search:
39+
rest_total_hits_as_int: true
40+
body: { "size" : 0, "aggs" : { "num_terms" : { "terms" : { "field" : "number" } } } }
41+
42+
- match: { hits.total: 5 }
43+
- length: { aggregations.num_terms.buckets: 3 }
44+
- match: { aggregations.num_terms.buckets.0.key: 100 }
45+
- match: { aggregations.num_terms.buckets.0.doc_count: 12 }
46+
- match: { aggregations.num_terms.buckets.1.key: 500 }
47+
- match: { aggregations.num_terms.buckets.1.doc_count: 11 }
48+
- match: { aggregations.num_terms.buckets.2.key: 200 }
49+
- match: { aggregations.num_terms.buckets.2.doc_count: 1 }
50+
51+
52+
---
53+
"Test keyword terms agg with doc_count":
54+
- skip:
55+
version: " - 7.99.99"
56+
reason: "Doc count fields are only implemented in 8.0"
57+
- do:
58+
search:
59+
rest_total_hits_as_int: true
60+
body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str" } } } }
61+
62+
- match: { hits.total: 5 }
63+
- length: { aggregations.str_terms.buckets: 3 }
64+
- match: { aggregations.str_terms.buckets.0.key: "abc" }
65+
- match: { aggregations.str_terms.buckets.0.doc_count: 11 }
66+
- match: { aggregations.str_terms.buckets.1.key: "foo" }
67+
- match: { aggregations.str_terms.buckets.1.doc_count: 8 }
68+
- match: { aggregations.str_terms.buckets.2.key: "xyz" }
69+
- match: { aggregations.str_terms.buckets.2.doc_count: 5 }
70+
71+
---
72+
73+
"Test unmapped string terms agg with doc_count":
74+
- skip:
75+
version: " - 7.99.99"
76+
reason: "Doc count fields are only implemented in 8.0"
77+
- do:
78+
bulk:
79+
index: test_2
80+
refresh: true
81+
body:
82+
- '{"index": {}}'
83+
- '{"_doc_count": 10, "str": "abc" }'
84+
- '{"index": {}}'
85+
- '{"str": "abc" }'
86+
- do:
87+
search:
88+
index: test_2
89+
rest_total_hits_as_int: true
90+
body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str.keyword" } } } }
91+
92+
- match: { hits.total: 2 }
93+
- length: { aggregations.str_terms.buckets: 1 }
94+
- match: { aggregations.str_terms.buckets.0.key: "abc" }
95+
- match: { aggregations.str_terms.buckets.0.doc_count: 11 }
96+
+---
+"Test composite str_terms agg with doc_count":
+  - skip:
+      version: " - 7.99.99"
+      reason: "Doc count fields are only implemented in 8.0"
+  - do:
+      search:
+        rest_total_hits_as_int: true
+        body: { "size" : 0, "aggs" :
+            { "composite_agg" : { "composite" :
+              {
+                "sources": ["str_terms": { "terms": { "field": "str" } }]
+              }
+            }
+            }
+          }
+
+  - match: { hits.total: 5 }
+  - length: { aggregations.composite_agg.buckets: 3 }
+  - match: { aggregations.composite_agg.buckets.0.key.str_terms: "abc" }
+  - match: { aggregations.composite_agg.buckets.0.doc_count: 11 }
+  - match: { aggregations.composite_agg.buckets.1.key.str_terms: "foo" }
+  - match: { aggregations.composite_agg.buckets.1.doc_count: 8 }
+  - match: { aggregations.composite_agg.buckets.2.key.str_terms: "xyz" }
+  - match: { aggregations.composite_agg.buckets.2.doc_count: 5 }
+
+
+---
+"Test composite num_terms agg with doc_count":
+  - skip:
+      version: " - 7.99.99"
+      reason: "Doc count fields are only implemented in 8.0"
+  - do:
+      search:
+        rest_total_hits_as_int: true
+        body: { "size" : 0, "aggs" :
+            { "composite_agg" :
+              { "composite" :
+                {
+                  "sources": ["num_terms" : { "terms" : { "field" : "number" } }]
+                }
+              }
+            }
+          }
+
+  - match: { hits.total: 5 }
+  - length: { aggregations.composite_agg.buckets: 3 }
+  - match: { aggregations.composite_agg.buckets.0.key.num_terms: 100 }
+  - match: { aggregations.composite_agg.buckets.0.doc_count: 12 }
+  - match: { aggregations.composite_agg.buckets.1.key.num_terms: 200 }
+  - match: { aggregations.composite_agg.buckets.1.doc_count: 1 }
+  - match: { aggregations.composite_agg.buckets.2.key.num_terms: 500 }
+  - match: { aggregations.composite_agg.buckets.2.doc_count: 11 }
+

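The expected values in the `test_1` assertions above can be re-derived by hand from the five fixture documents. The following Python sketch does that arithmetic; it is an illustration under stated assumptions, not how Elasticsearch evaluates the tests. The helper names are hypothetical, and the two orderings are assumptions about default behavior: terms buckets sorted by `doc_count` descending with ties broken by key, and composite buckets sorted by key ascending.

```python
from collections import Counter

# Fixture documents copied from the bulk request in the setup section
# (the "unmapped" field is omitted because no test_1 assertion uses it).
DOCS = [
    {"_doc_count": 10, "str": "abc", "number": 500},
    {"_doc_count": 5, "str": "xyz", "number": 100},
    {"_doc_count": 7, "str": "foo", "number": 100},
    {"_doc_count": 1, "str": "foo", "number": 200},
    {"str": "abc", "number": 500},  # no _doc_count, so it counts as 1
]


def bucket_counts(docs, field):
    """Sum `_doc_count` (default 1) per distinct value of `field`."""
    counts = Counter()
    for doc in docs:
        counts[doc[field]] += doc.get("_doc_count", 1)
    return counts


def terms_order(counts):
    # Assumed terms ordering: doc_count descending, then key ascending.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def composite_order(counts):
    # Assumed composite ordering: key ascending.
    return sorted(counts.items())


print(terms_order(bucket_counts(DOCS, "number")))      # [(100, 12), (500, 11), (200, 1)]
print(composite_order(bucket_counts(DOCS, "number")))  # [(100, 12), (200, 1), (500, 11)]
print(terms_order(bucket_counts(DOCS, "str")))         # [('abc', 11), ('foo', 8), ('xyz', 5)]
```

Note that `hits.total` stays at 5 in every test: `_doc_count` changes bucket counts only, not the number of matching documents. The differing position of key `200` between the terms and composite tests follows from the two sort orders, not from different counts.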