Hint what clauses are important in a conjunction query based on fields #26081

martijnvg · 2017-08-07T13:02:07Z

The percolator field mapper doesn't need to extract all terms and ranges from a bool query with must or filter clauses. To help the default extraction behavior, boost fields can be configured, so that fields that are known to be insufficiently selective can be ignored in favor of other fields, or clauses on specific fields can forcefully take precedence over other clauses. This helps select clauses on fields that don't match a lot of percolator queries over other clauses, and thus improves the performance of the percolate query.

For example, a status-like field is something that should be configured as an ignored field. Queries on such a field tend to match more documents, so if a clause on this field gets selected as the best clause, it doesn't help the candidate query that the percolate query generates to filter out percolator queries that are unlikely to match.

jpountz · 2017-08-07T15:20:01Z

I think exposing ways to customize extraction is useful, but I'm wondering whether this ignore_fields option is the right way. For instance we could alternatively allow users to provide a script that gives a score to a (field, term) tuple. Then ignoring fields could be implemented by returning 0 for certain fields?

martijnvg · 2017-08-08T08:07:45Z

I just chatted with @jpountz and we agreed on exposing this ignore_fields as boost_fields instead, but not as a script and just as mapping configuration:

{
   "boost_fields" : {
      "fieldA": 5,
      "fieldB": 0,
      "fieldC": 10,
      ...
   }
}

A boost of zero will ignore a field. Otherwise the clause with highest boost gets selected. By default all clauses have a boost of 1.
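A minimal sketch of the rule just described, assuming each clause targets a single field. The class and method names are hypothetical and are not the PR's code. Returning an empty result when every clause is ignored is also an assumption of this sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration; not the percolator's actual classes.
final class BoostFieldSelection {
    // Fields without an entry in boost_fields get the default boost of 1.
    static final float DEFAULT_BOOST = 1f;

    /** Returns the field of the clause to extract, or empty if every clause is ignored. */
    static Optional<String> selectField(List<String> clauseFields, Map<String, Float> boostFields) {
        String best = null;
        float bestBoost = 0f;
        for (String field : clauseFields) {
            float boost = boostFields.getOrDefault(field, DEFAULT_BOOST);
            if (boost == 0f) {
                continue; // a boost of zero means the field is ignored
            }
            if (best == null || boost > bestBoost) {
                best = field; // ties keep the first clause seen
                bestBoost = boost;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With the mapping above, a conjunction over fieldA, fieldB and fieldC would select the fieldC clause in this sketch.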

martijnvg · 2017-08-08T12:00:38Z

I've changed the PR to use boost_fields instead.

jpountz

I left some comments but it looks better to me!

jpountz · 2017-08-10T11:55:20Z

docs/reference/mapping/types/percolator.asciidoc

maybe we need some kind of list that summarizes the decision process: first look at the boost, then prefer terms over ranges, and finally prefer long terms / narrow ranges over short terms / wide ranges?
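The suggested decision order could be written as a comparator. This is only a sketch under assumptions: the `Candidate` type, its factory methods and the single `size` field are hypothetical and are not taken from QueryAnalyzer.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical types for illustration; not the classes used in QueryAnalyzer.
final class ExtractionPreference {

    /** One extractable clause: either a term (with a length) or a range (with a width). */
    static final class Candidate {
        final String name;
        final float boost;
        final boolean isRange;
        final long size; // term length for terms, range width for ranges

        private Candidate(String name, float boost, boolean isRange, long size) {
            this.name = name;
            this.boost = boost;
            this.isRange = isRange;
            this.size = size;
        }

        static Candidate term(String name, float boost, int termLength) {
            return new Candidate(name, boost, false, termLength);
        }

        static Candidate range(String name, float boost, long rangeWidth) {
            return new Candidate(name, boost, true, rangeWidth);
        }
    }

    /** Negative when a should be preferred over b. */
    static int compare(Candidate a, Candidate b) {
        int byBoost = Float.compare(b.boost, a.boost); // 1. higher boost first
        if (byBoost != 0) {
            return byBoost;
        }
        if (a.isRange != b.isRange) {
            return a.isRange ? 1 : -1; // 2. terms before ranges
        }
        return a.isRange
            ? Long.compare(a.size, b.size)  // 3a. narrower range first
            : Long.compare(b.size, a.size); // 3b. longer term first
    }

    /** Picks the preferred candidate; the list must not be empty. */
    static Candidate best(List<Candidate> candidates) {
        return Collections.min(candidates, ExtractionPreference::compare);
    }
}
```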

jpountz · 2017-08-10T11:56:14Z

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolatorFieldMapper.java

maybe add propNode to the error message so that it is easier to understand what the issue is if a user ever complains of getting this message

jpountz · 2017-08-10T11:57:20Z

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolatorFieldMapper.java

we usually take floats or doubles. Even though it is not necessary here, I'd parse the value as a float or double to be more consistent with other parts of our code base

I initially used floats, but then realized that int sufficed here. But I agree with your consistency argument.

jpountz · 2017-08-10T11:59:59Z

modules/percolator/src/main/java/org/elasticsearch/percolator/QueryAnalyzer.java

should we really return 1 if nothing was extracted? Let's return Integer.MIN_VALUE (or NEGATIVE_INFINITY once we are on floats or doubles) to make sure it won't be selected if another query could be extracted
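A small sketch of the sentinel idea, assuming float boosts and a hypothetical helper name: a clause from which nothing was extracted gets `Float.NEGATIVE_INFINITY`, so any clause with at least one extraction compares higher.

```java
import java.util.List;

// Hypothetical helper; the real method in QueryAnalyzer may look different.
final class EmptyExtractionSentinel {
    /** Highest boost among extracted terms, or NEGATIVE_INFINITY when nothing was extracted. */
    static float highestBoost(List<Float> extractedBoosts) {
        float highest = Float.NEGATIVE_INFINITY;
        for (float boost : extractedBoosts) {
            highest = Math.max(highest, boost);
        }
        return highest;
    }
}
```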

jpountz · 2017-08-10T12:01:41Z

modules/percolator/src/main/java/org/elasticsearch/percolator/QueryAnalyzer.java

I'd rather like a BiFunction than a wrapper class with a vague name?

martijnvg · 2017-08-10T13:55:00Z

@jpountz Thanks for looking! I've updated the PR.

jpountz

LGTM

jpountz · 2017-08-10T14:55:37Z

docs/reference/mapping/types/percolator.asciidoc

s/with a must/with must/ ?

jpountz · 2017-08-10T14:56:44Z

docs/reference/mapping/types/percolator.asciidoc

s/in/In/ to be consistent with previous lines

jpountz · 2017-08-10T14:58:16Z

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolatorFieldMapper.java

looks like it can be final

jpountz · 2017-08-10T14:58:41Z

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolatorFieldMapper.java

s/contains/equals/ ?

jpountz · 2017-08-10T15:00:59Z

modules/percolator/src/main/java/org/elasticsearch/percolator/QueryAnalyzer.java

this should be covered by the below check about highest boosts?

jpountz · 2017-08-10T15:07:37Z

modules/percolator/src/main/java/org/elasticsearch/percolator/QueryAnalyzer.java

I'm wondering whether we should compare the highest or the lowest boosts?
For instance imagine the query is (a OR b) AND c. a, b and c have boosts of 2, 0.5 and 1 respectively. I'd much rather pick c as an extracted query than a OR b, since the boost of b might indicate it is a low-cardinality field, i.e. each term matches many documents?

👍 I agree, we are then likely to extract term(s) that are rarer.

That is, if we extract the clause with the highest lowest boost.

Also I think we want to apply this logic to selecting ranges too (select the lowest highest range)?
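The "highest lowest boost" rule from this thread can be sketched as follows. The names are hypothetical, each clause is represented only by the list of fields it touches, and treating an empty clause as never selectable is an assumption of this sketch.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the rule discussed above; not the PR's code.
final class HighestLowestBoost {

    /** Lowest boost among a clause's fields; unconfigured fields default to 1. */
    static float lowestBoost(List<String> fields, Map<String, Float> boosts) {
        if (fields.isEmpty()) {
            return Float.NEGATIVE_INFINITY; // nothing extracted: never preferred
        }
        float lowest = Float.POSITIVE_INFINITY;
        for (String field : fields) {
            lowest = Math.min(lowest, boosts.getOrDefault(field, 1f));
        }
        return lowest;
    }

    /** Index of the conjunction clause whose lowest boost is highest, or -1 if none qualifies. */
    static int selectClause(List<List<String>> clauses, Map<String, Float> boosts) {
        int best = -1;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < clauses.size(); i++) {
            float score = lowestBoost(clauses.get(i), boosts);
            if (score > bestScore) {
                best = i;
                bestScore = score;
            }
        }
        return best;
    }
}
```

For (a OR b) AND c with boosts 2, 0.5 and 1, the first clause scores 0.5 and the second scores 1, so c is selected, as preferred in the comment above.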

jpountz · 2017-08-11T12:36:17Z

modules/percolator/src/main/java/org/elasticsearch/percolator/PercolatorFieldMapper.java

s/propName/propValue/

jpountz · 2017-08-11T12:45:13Z

modules/percolator/src/main/java/org/elasticsearch/percolator/QueryAnalyzer.java

could be made more concise/efficient by using Stream#anyMatch or Stream#allMatch

The stream can only be consumed once, so if anyMatch(...) or allMatch(...) is invoked then collect(...) can no longer be invoked. So I'll keep it the way it is now...

oh sorry, I had missed that you used the filtered collection below
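For readers unfamiliar with the constraint mentioned in this exchange, here is a standalone sketch using only `java.util.stream`; the class and method names are made up. A second terminal operation on the same stream throws `IllegalStateException`, which is why collecting once and reusing the collection is the simpler option.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Standalone illustration; unrelated to the actual QueryAnalyzer code.
final class StreamReuse {

    /** Returns true if a second terminal operation on the same stream is rejected. */
    static boolean secondTerminalOperationFails() {
        Stream<Integer> filtered = Stream.of(1, 2, 3).filter(x -> x > 1);
        filtered.anyMatch(x -> x > 2); // first terminal operation consumes the stream
        try {
            filtered.collect(Collectors.toList()); // second terminal operation
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Collect once, then run as many checks as needed on the resulting list. */
    static List<Integer> filterOnce(List<Integer> input) {
        return input.stream().filter(x -> x > 1).collect(Collectors.toList());
    }
}
```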


extract all clauses from a conjunction query.

When clauses from a conjunction are extracted, the number of clauses is also stored in an internal doc values field (the minimum_should_match field). This field is used by the CoveringQuery and allows the percolator to reduce the number of false positives when selecting candidate matches, and in certain cases to be absolutely sure that a conjunction candidate match will match and then skip MemoryIndex validation. This can greatly improve performance.

Before this change only a single clause was extracted from a conjunction query. The percolator tried to extract the clause that was rarest (based on term length) so that fewer candidate queries would be selected in the first place. However, with this method there is still a very high chance that candidate query matches are false positives.

This change also removes the influencing query extraction added via elastic#26081, as it is no longer needed now that all conjunction clauses are extracted. https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extraction

Closes elastic#26307
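The candidate check described in this commit message can be simulated in plain Java. This sketch does not use Lucene's CoveringQuery API. The `StoredQuery` type and method names are hypothetical, and every conjunction clause is assumed to be a single term.

```java
import java.util.Set;

// Plain-Java simulation of the idea; it does not use Lucene's CoveringQuery API.
final class CoveringSketch {

    /** Hypothetical stored percolator query: extracted terms plus the required match count. */
    static final class StoredQuery {
        final Set<String> extractedTerms;
        final int minimumShouldMatch;

        StoredQuery(Set<String> extractedTerms, int minimumShouldMatch) {
            this.extractedTerms = extractedTerms;
            this.minimumShouldMatch = minimumShouldMatch;
        }
    }

    /** For a conjunction, every extracted clause must match, so the count equals the clause count. */
    static StoredQuery conjunction(Set<String> terms) {
        return new StoredQuery(terms, terms.size());
    }

    /** A stored query is a candidate only if enough of its extracted terms occur in the document. */
    static boolean isCandidate(StoredQuery query, Set<String> documentTerms) {
        int matched = 0;
        for (String term : query.extractedTerms) {
            if (documentTerms.contains(term)) {
                matched++;
            }
        }
        return matched >= query.minimumShouldMatch;
    }
}
```

With a single extracted clause, a document containing only that one term would still be selected as a candidate. Requiring all extracted clauses to match filters such false positives out earlier.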


martijnvg added :Search Relevance/Percolator Reverse search: find queries that match a document >enhancement review v6.1.0 v7.0.0 labels Aug 7, 2017

martijnvg mentioned this pull request Aug 7, 2017

Improve percolator performance #25445

Closed

9 tasks

martijnvg changed the title ~~Hint what clauses should be ignored in a conjunction query based on ignore fields~~ Hint what clauses are important in a conjunction query based on fields Aug 8, 2017

jpountz suggested changes Aug 10, 2017


martijnvg force-pushed the percolator_ignore_fields branch from c7a22d7 to 5e4124a August 10, 2017 13:54

jpountz approved these changes Aug 10, 2017


jpountz approved these changes Aug 11, 2017


martijnvg force-pushed the percolator_ignore_fields branch from a4906e3 to 636e85e August 11, 2017 13:32

martijnvg merged commit 636e85e into elastic:master Aug 11, 2017

martijnvg mentioned this pull request Nov 6, 2017

Use Lucene's CoveringQuery to select percolate candidate matches #27271

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019


3 participants
