Commit 09bf93d

Add intervals query (#36135)
* Add IntervalQueryBuilder with support for match and combine intervals
* Add relative intervals
* feedback
* YAML test - broken
* yaml test; begin to add block source
* Add block; make disjunction its own source
* WIP
* Extract IntervalBuilder and add tests for it
* Fix eq/hashcode in Disjunction
* New yaml test
* checkstyle
* license headers
* test fix
* YAML format
* YAML formatting again
* yaml tests; javadoc
* Add OR test -> requires fix from LUCENE-8586
* Add docs
* Re-do API
* Clint's API
* Delete bash script
* doc fixes
* imports
* docs
* test fix
* feedback
* comma
* docs fixes
* Tidy up doc references to old rule
1 parent 278cc4c commit 09bf93d

14 files changed: +2047 -45 lines changed

docs/reference/query-dsl/full-text-queries.asciidoc

Lines changed: 7 additions & 0 deletions
@@ -40,6 +40,11 @@ The queries in this group are:
 A simpler, more robust version of the `query_string` syntax suitable
 for exposing directly to users.

+<<query-dsl-intervals-query,`intervals` query>>::
+
+A full text query that allows fine-grained control of the ordering and
+proximity of matching terms
+
 include::match-query.asciidoc[]

 include::match-phrase-query.asciidoc[]
@@ -53,3 +58,5 @@ include::common-terms-query.asciidoc[]
 include::query-string-query.asciidoc[]

 include::simple-query-string-query.asciidoc[]
+
+include::intervals-query.asciidoc[]
Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
+[[query-dsl-intervals-query]]
+=== Intervals query
+
+An `intervals` query allows fine-grained control over the order and proximity of
+matching terms. Matching rules are constructed from a small set of definitions,
+and the rules are then applied to terms from a particular `field`.
+
+The definitions produce sequences of minimal intervals that span terms in a
+body of text. These intervals can be further combined and filtered by
+parent sources.
+
+The example below searches for the phrase `my favourite food` appearing
+before either the terms `hot` and `water` or the terms `cold` and `porridge`,
+each pair in any order, in the field `my_text`:
+
+[source,js]
+--------------------------------------------------
+POST _search
+{
+  "query": {
+    "intervals" : {
+      "my_text" : {
+        "all_of" : {
+          "ordered" : true,
+          "intervals" : [
+            {
+              "match" : {
+                "query" : "my favourite food",
+                "max_gaps" : 0,
+                "ordered" : true
+              }
+            },
+            {
+              "any_of" : {
+                "intervals" : [
+                  { "match" : { "query" : "hot water" } },
+                  { "match" : { "query" : "cold porridge" } }
+                ]
+              }
+            }
+          ]
+        },
+        "boost" : 2.0,
+        "_name" : "favourite_food"
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+
+In the above example, the text `my favourite food is cold porridge` would
+match because the two intervals matching `my favourite food` and `cold
+porridge` appear in the correct order, but the text `when it's cold my
+favourite food is porridge` would not match, because the interval matching
+`cold porridge` starts before the interval matching `my favourite food`.
+
+[[intervals-match]]
+==== `match`
+
+The `match` rule matches analyzed text, and takes the following parameters:
+
+[horizontal]
+`query`::
+The text to match.
+`max_gaps`::
+Specify a maximum number of gaps between the terms in the text. Terms that
+appear further apart than this will not match. If unspecified, or set to -1,
+then there is no width restriction on the match. If set to 0 then the terms
+must appear next to each other.
+`ordered`::
+Whether or not the terms must appear in their specified order. Defaults to
+`false`.
+`analyzer`::
+Which analyzer should be used to analyze terms in the `query`. By
+default, the search analyzer of the top-level field will be used.
+`filter`::
+An optional <<interval_filter,interval filter>>.
+
+[[intervals-all_of]]
+==== `all_of`
+
+The `all_of` rule returns matches that span a combination of other rules.
+
+[horizontal]
+`intervals`::
+An array of rules to combine. All rules must produce a match in a
+document for the overall source to match.
+`max_gaps`::
+Specify a maximum number of gaps between the rules. Combinations that match
+across a distance greater than this will not match. If set to -1 or
+unspecified, there is no restriction on this distance. If set to 0, then the
+matches produced by the rules must all appear immediately next to each other.
+`ordered`::
+Whether the intervals produced by the rules should appear in the order in
+which they are specified. Defaults to `false`.
+`filter`::
+An optional <<interval_filter,interval filter>>.
+
+[[intervals-any_of]]
+==== `any_of`
+
+The `any_of` rule emits intervals produced by any of its sub-rules.
+
+[horizontal]
+`intervals`::
+An array of rules to match.
+`filter`::
+An optional <<interval_filter,interval filter>>.
+
+[[interval_filter]]
+==== filters
+
+You can filter intervals produced by any rule by their relation to the
+intervals produced by another rule. The following example will return
+documents that have the words `hot` and `porridge` within 10 positions
+of each other, without the word `salty` in between:
+
+[source,js]
+--------------------------------------------------
+POST _search
+{
+  "query": {
+    "intervals" : {
+      "my_text" : {
+        "match" : {
+          "query" : "hot porridge",
+          "max_gaps" : 10,
+          "filter" : {
+            "not_containing" : {
+              "match" : {
+                "query" : "salty"
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+
+The following filters are available:
+[horizontal]
+`containing`::
+Produces intervals that contain an interval from the filter rule.
+`contained_by`::
+Produces intervals that are contained by an interval from the filter rule.
+`not_containing`::
+Produces intervals that do not contain an interval from the filter rule.
+`not_contained_by`::
+Produces intervals that are not contained by an interval from the filter rule.
+`not_overlapping`::
+Produces intervals that do not overlap with an interval from the filter rule.
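
To make the five relations listed above concrete, here is a minimal Python sketch that models intervals as inclusive `(start, end)` token positions. It is an illustration only and does not reflect how Elasticsearch or Lucene implement filtering. The names `apply_filter`, `contains` and `overlaps` are invented for this sketch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) token positions (assumed model)


def contains(outer: Interval, inner: Interval) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def apply_filter(kind: str, source: List[Interval],
                 filt: List[Interval]) -> List[Interval]:
    """Keep the source intervals that satisfy `kind` against the filter intervals."""
    tests = {
        "containing": lambda s: any(contains(s, f) for f in filt),
        "contained_by": lambda s: any(contains(f, s) for f in filt),
        "not_containing": lambda s: not any(contains(s, f) for f in filt),
        "not_contained_by": lambda s: not any(contains(f, s) for f in filt),
        "not_overlapping": lambda s: not any(overlaps(s, f) for f in filt),
    }
    return [s for s in source if tests[kind](s)]
```

Assuming whitespace tokenization, the text `hot and salty porridge` gives the `hot porridge` match the interval `(0, 3)` and `salty` the interval `(2, 2)`. In this model `apply_filter("not_containing", [(0, 3)], [(2, 2)])` keeps nothing, which mirrors the `not_containing` example above rejecting that text.
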
+
+[[interval-minimization]]
+==== Minimization
+
+The intervals query always minimizes intervals, to ensure that queries can
+run in linear time. This can sometimes cause surprising results, particularly
+when using `max_gaps` restrictions or filters. For example, take the
+following query, searching for `salty` contained within the phrase `hot
+porridge`:
+
+[source,js]
+--------------------------------------------------
+POST _search
+{
+  "query": {
+    "intervals" : {
+      "my_text" : {
+        "match" : {
+          "query" : "salty",
+          "filter" : {
+            "contained_by" : {
+              "match" : {
+                "query" : "hot porridge"
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+
+This query will *not* match a document containing the phrase `hot porridge is
+salty porridge`, because the intervals returned by the match query for `hot
+porridge` only cover the initial two terms in this document, and these do not
+overlap the intervals covering `salty`.
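
The effect described above can be reproduced with a small brute-force sketch. It assumes whitespace tokenization and treats minimization as "discard any interval that contains another interval", a simplification of what the query does internally. The helper names `candidate_intervals` and `minimize` are hypothetical.

```python
from itertools import product
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) token positions (assumed model)


def candidate_intervals(tokens: List[str], terms: List[str]) -> List[Interval]:
    """Every span that covers one occurrence of each term, in any order."""
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    return sorted({(min(combo), max(combo)) for combo in product(*positions)})


def minimize(intervals: List[Interval]) -> List[Interval]:
    """Drop any interval that contains another, different interval."""
    return [a for a in intervals
            if not any(b != a and a[0] <= b[0] and b[1] <= a[1] for b in intervals)]


tokens = "hot porridge is salty porridge".split()
candidates = candidate_intervals(tokens, ["hot", "porridge"])
minimal = minimize(candidates)

# `salty` occupies a single position; check whether a minimal interval contains it.
salty = (tokens.index("salty"), tokens.index("salty"))
salty_is_contained = any(m[0] <= salty[0] and salty[1] <= m[1] for m in minimal)
```

Under these assumptions `candidates` holds both `(0, 1)` and `(0, 4)`, but `minimal` keeps only `(0, 1)`. The wider span that would have enclosed `salty` is gone, so `salty_is_contained` is `False`.
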
+
+Another restriction to be aware of is the case of `any_of` rules that contain
+sub-rules which overlap. In particular, if one of the rules is a strict
+prefix of the other, then the longer rule will never be matched, which can
+cause surprises when used in combination with `max_gaps`. Consider the
+following query, searching for `the` immediately followed by `big` or `big bad`,
+immediately followed by `wolf`:
+
+[source,js]
+--------------------------------------------------
+POST _search
+{
+  "query": {
+    "intervals" : {
+      "my_text" : {
+        "all_of" : {
+          "intervals" : [
+            { "match" : { "query" : "the" } },
+            { "any_of" : {
+                "intervals" : [
+                    { "match" : { "query" : "big" } },
+                    { "match" : { "query" : "big bad" } }
+                ] } },
+            { "match" : { "query" : "wolf" } }
+          ],
+          "max_gaps" : 0,
+          "ordered" : true
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
+
+Counter-intuitively, this query *will not* match the document `the big bad
+wolf`, because the `any_of` rule in the middle will only produce intervals
+for `big`. Intervals for `big bad` are longer than those for `big` while
+starting at the same position, so they are minimized away. In these cases,
+it's better to rewrite the query so that all of the options are explicitly
+laid out at the top level:
+
+[source,js]
+--------------------------------------------------
+POST _search
+{
+  "query": {
+    "intervals" : {
+      "my_text" : {
+        "any_of" : {
+          "intervals" : [
+            { "match" : {
+                "query" : "the big bad wolf",
+                "ordered" : true,
+                "max_gaps" : 0 } },
+            { "match" : {
+                "query" : "the big wolf",
+                "ordered" : true,
+                "max_gaps" : 0 } }
+          ]
+        }
+      }
+    }
+  }
+}
+--------------------------------------------------
+// CONSOLE
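
The prefix case discussed in this section can also be sketched in Python. This is a simplified model, not the Lucene implementation: intervals are inclusive `(start, end)` token positions, minimization drops any interval that contains another, and an ordered `all_of` with `max_gaps` of 0 is approximated by requiring each interval to start immediately after the previous one ends. The names `minimize` and `adjacent_chain_exists` are invented for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) token positions (assumed model)


def minimize(intervals: List[Interval]) -> List[Interval]:
    """Drop any interval that contains another, different interval."""
    return [a for a in intervals
            if not any(b != a and a[0] <= b[0] and b[1] <= a[1] for b in intervals)]


def adjacent_chain_exists(steps: List[List[Interval]]) -> bool:
    """True if one interval can be picked per step so that each one starts
    immediately after the previous one ends (ordered, no gaps)."""
    ends = {iv[1] for iv in steps[0]}
    for step in steps[1:]:
        ends = {iv[1] for iv in step if iv[0] - 1 in ends}
    return bool(ends)


# Token positions in "the big bad wolf": the=0, big=1, bad=2, wolf=3.
the, wolf = [(0, 0)], [(3, 3)]
big_or_big_bad = [(1, 1), (1, 2)]  # intervals for `big` and for `big bad`

with_minimization = adjacent_chain_exists([the, minimize(big_or_big_bad), wolf])
without_minimization = adjacent_chain_exists([the, big_or_big_bad, wolf])
```

In this model `minimize(big_or_big_bad)` keeps only `(1, 1)`, so `with_minimization` is `False`: nothing bridges position 2 before `wolf`. With both intervals retained, `without_minimization` is `True`, which is the behaviour the rewritten query recovers by spelling out each full phrase as its own `match` rule.
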
