@martijnvg
Member

@martijnvg martijnvg commented Aug 7, 2017

The percolator field mapper doesn't need to extract all terms and ranges from a bool query with must or filter clauses. To steer the default extraction behavior, boost fields can be configured, so that fields known not to be selective enough are ignored in favor of other fields, or so that clauses on specific fields forcefully take precedence over other clauses. This helps select clauses on fields that match few percolator queries over other clauses, which improves the performance of the percolate query.

For example, a status-like field is something that should be configured as an ignore field. Queries on such a field tend to match more documents, so if a clause on this field gets selected as the best clause, the candidate query that the percolate query generates does little to filter out percolator queries that are unlikely to match.

@martijnvg martijnvg added :Search Relevance/Percolator Reverse search: find queries that match a document >enhancement review v6.1.0 v7.0.0 labels Aug 7, 2017
@martijnvg martijnvg mentioned this pull request Aug 7, 2017
9 tasks
@jpountz
Contributor

jpountz commented Aug 7, 2017

I think exposing ways to customize extraction is useful, but I'm wondering whether this ignore_fields option is the right way. For instance we could alternatively allow users to provide a script that gives a score to a (field, term) tuple. Then ignoring fields could be implemented by returning 0 for certain fields?

@martijnvg
Member Author

I just chatted with @jpountz and we agreed on exposing this ignore_fields as boost_fields instead, but not as a script and just as mapping configuration:

{
   "boost_fields" : {
      "fieldA": 5,
      "fieldB": 0,
      "fieldC": 10,
      ...
   }
}

A boost of zero will ignore a field. Otherwise the clause with highest boost gets selected. By default all clauses have a boost of 1.
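
The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `BoostFieldsSketch`, `Clause`, `boostOf` and `selectBestClause` are hypothetical names, and keeping the earlier clause on a tie is an assumption made only to keep the sketch short.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: class and method names are hypothetical,
// not the percolator's real extraction code.
class BoostFieldsSketch {

    /** A must/filter clause, reduced to the field it queries and its term. */
    static final class Clause {
        final String field;
        final String term;

        Clause(String field, String term) {
            this.field = field;
            this.term = term;
        }
    }

    /** Fields without an explicit entry fall back to the default boost of 1. */
    static float boostOf(Clause clause, Map<String, Float> boostFields) {
        return boostFields.getOrDefault(clause.field, 1f);
    }

    /**
     * Returns the clause with the highest boost, or null when the list is
     * empty or every clause has a boost of 0 (all fields ignored).
     * Ties keep the earlier clause; that tie-break is a simplification.
     */
    static Clause selectBestClause(List<Clause> clauses, Map<String, Float> boostFields) {
        Clause best = null;
        float bestBoost = 0f;
        for (Clause clause : clauses) {
            float boost = boostOf(clause, boostFields);
            if (boost > bestBoost) { // strict comparison: a boost of 0 never wins
                best = clause;
                bestBoost = boost;
            }
        }
        return best;
    }
}
```

With the mapping above, a conjunction over fieldA, fieldB and fieldC would yield the fieldC clause, and a conjunction containing only fieldB clauses would yield nothing.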

@martijnvg martijnvg changed the title from "Hint what clauses should be ignored in a conjunction query based on ignore fields" to "Hint what clauses are important in a conjunction query based on fields" Aug 8, 2017
@martijnvg
Member Author

I've changed the PR to use boost_fields instead.

Contributor

@jpountz jpountz left a comment

I left some comments but it looks better to me!

Contributor

maybe we need some kind of list that summarizes the decision process: first look at the boost, then prefer terms over ranges, and finally prefer long terms / narrow ranges over short terms / wide ranges?
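
One way to read the decision process listed in this comment is as a three-level comparison. The sketch below is an assumption-laden illustration: `Candidate`, its fields, and the use of term length and range width as the final tie-breakers are hypothetical stand-ins for whatever the extraction code actually tracks.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; Candidate and its fields are hypothetical names.
class ExtractionPreferenceSketch {

    static final class Candidate {
        final float boost;
        final boolean isRange;
        final int termLength;  // used for terms only
        final long rangeWidth; // used for ranges only

        private Candidate(float boost, boolean isRange, int termLength, long rangeWidth) {
            this.boost = boost;
            this.isRange = isRange;
            this.termLength = termLength;
            this.rangeWidth = rangeWidth;
        }

        static Candidate term(float boost, String term) {
            return new Candidate(boost, false, term.length(), 0L);
        }

        static Candidate range(float boost, long from, long to) {
            return new Candidate(boost, true, 0, to - from);
        }
    }

    /** Returns a positive number when a is preferred over b. */
    static int compare(Candidate a, Candidate b) {
        // 1. Look at the boost first.
        int byBoost = Float.compare(a.boost, b.boost);
        if (byBoost != 0) {
            return byBoost;
        }
        // 2. Prefer terms over ranges.
        if (a.isRange != b.isRange) {
            return a.isRange ? -1 : 1;
        }
        // 3. Prefer narrow ranges over wide ones, and long terms over short ones.
        if (a.isRange) {
            return Long.compare(b.rangeWidth, a.rangeWidth);
        }
        return Integer.compare(a.termLength, b.termLength);
    }

    /** Most preferred candidate; throws NoSuchElementException on an empty list. */
    static Candidate best(List<Candidate> candidates) {
        return Collections.max(candidates, ExtractionPreferenceSketch::compare);
    }
}
```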

Contributor

maybe add propNode to the error message so that it is easier to understand what the issue is if a user ever complains of getting this message

Contributor

we usually take floats for doubles. Even though it is not necessary here, I'd parse the value as a float or double to be more consistent with other parts of our code base

Member Author

I initially used floats, but then realized that int sufficed here. But I agree with your consistency argument.

Contributor

should we really return 1 if nothing was extracted? Let's return Integer.MIN_VALUE (or NEGATIVE_INFINITY once we are on floats or doubles) to make sure it won't be selected if another query could be extracted
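
To illustrate why a sentinel matters here, the following sketch compares clauses by an effective boost in which "nothing extracted" maps to `Float.NEGATIVE_INFINITY`. The class and method names are hypothetical, and representing an unextractable clause as `null` is an assumption made for the example.

```java
import java.util.List;

// Illustrative sketch only; names are hypothetical.
class UnextractedClauseSketch {

    /** Boost of a clause; null means nothing could be extracted from it. */
    static float effectiveBoost(Float extractedBoost) {
        return extractedBoost == null ? Float.NEGATIVE_INFINITY : extractedBoost;
    }

    /** Index of the clause with the highest effective boost, or -1 for an empty list. */
    static int selectIndex(List<Float> extractedBoosts) {
        int bestIndex = -1;
        float best = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < extractedBoosts.size(); i++) {
            float boost = effectiveBoost(extractedBoosts.get(i));
            if (bestIndex == -1 || boost > best) {
                bestIndex = i;
                best = boost;
            }
        }
        return bestIndex;
    }
}
```

If the unextractable clause reported the default boost of 1 instead, it would beat an extractable clause with a boost of 0.5, which is the situation the comment wants to avoid.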

Contributor

I'd rather like a BiFunction than a wrapper class with a vague name?

@martijnvg martijnvg force-pushed the percolator_ignore_fields branch from c7a22d7 to 5e4124a Compare August 10, 2017 13:54
@martijnvg
Member Author

@jpountz Thanks for looking! I've updated the PR.

Contributor

@jpountz jpountz left a comment

LGTM

Contributor

s/with a must/with must/ ?

Contributor

s/in/In/ to be consistent with previous lines

Contributor

looks like it can be final

Contributor

s/contains/equals/ ?

Contributor

this should be covered by the below check about highest boosts?

Contributor

I'm wondering whether we should compare the highest or the lowest boosts?
For instance, imagine the query is (a OR b) AND c, where a, b and c have boosts of 2, 0.5 and 1 respectively. I'd much rather pick c as the extracted query than a OR b, since the low boost of b might indicate it is a low-cardinality field, i.e. each term matches many documents?
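
The example in this comment can be worked through in a few lines. The sketch assumes a disjunction is scored by the lowest boost among its members and a conjunction picks the clause with the highest resulting boost; `DisjunctionBoostSketch` and its methods are hypothetical names, not code from this PR.

```java
// Illustrative sketch only; names are hypothetical.
class DisjunctionBoostSketch {

    /** A disjunction is only as selective as its weakest member: take the lowest boost. */
    static float disjunctionBoost(float... boosts) {
        float lowest = Float.POSITIVE_INFINITY;
        for (float boost : boosts) {
            lowest = Math.min(lowest, boost);
        }
        return lowest;
    }

    /** A conjunction picks the clause with the highest boost; returns its index, or -1 if empty. */
    static int selectConjunctionClause(float... clauseBoosts) {
        int bestIndex = -1;
        for (int i = 0; i < clauseBoosts.length; i++) {
            if (bestIndex == -1 || clauseBoosts[i] > clauseBoosts[bestIndex]) {
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

For (a OR b) AND c with boosts 2, 0.5 and 1, `disjunctionBoost(2f, 0.5f)` is 0.5, so `selectConjunctionClause(0.5f, 1f)` returns index 1 and c is extracted. Scoring the disjunction by its highest boost would instead give 2 and select a OR b.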

Member Author

👍 I agree, we are then likely to extract term(s) that are rarer.

Member Author

If we extract the clause with the highest lowest boost.

Member Author

Also, I think we want to apply this logic to selecting ranges too (select the lowest highest range)?

Contributor

s/propName/propValue/

Member Author

oops

Contributor

could be made more concise/efficient by using Stream#anyMatch or Stream#allMatch

Member Author

The stream can only be consumed once, so if anyMatch(...) or allMatch(...) is invoked then collect(...) can no longer be invoked. So I'll keep it the way it is now...
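
The single-use constraint mentioned here is easy to reproduce in isolation. The sketch below is not the code under review; the class name, method names and the filter predicates are made up for illustration. It shows a second terminal operation failing with `IllegalStateException`, and the alternative of collecting first and checking the collected list afterwards.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative sketch only; names and predicates are hypothetical.
class StreamReuseSketch {

    /** Returns true when calling collect(...) after anyMatch(...) on the same stream fails. */
    static boolean reuseFails(List<String> fields) {
        Stream<String> stream = fields.stream().filter(f -> !f.isEmpty());
        stream.anyMatch(f -> f.startsWith("s")); // terminal operation: consumes the stream
        try {
            stream.collect(Collectors.toList()); // second terminal operation on the same stream
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    /** Collect once, then ask the follow-up question on the collected list. */
    static boolean collectThenCheck(List<String> fields) {
        List<String> nonEmpty = fields.stream()
            .filter(f -> !f.isEmpty())
            .collect(Collectors.toList());
        return nonEmpty.stream().anyMatch(f -> f.startsWith("s")); // a fresh stream each time
    }
}
```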

Contributor

oh sorry, I had missed that you used the filtered collection below

…sed on fields

The percolator field mapper doesn't need to extract all terms and ranges from a bool query with must or filter clauses.
To steer the default extraction behavior, boost fields can be configured, so that fields known not to be
selective enough are ignored in favor of other fields, or so that clauses on specific fields forcefully take precedence over other clauses.
This helps select clauses on fields that match few percolator queries over other clauses, which improves the performance of the percolate query.

For example, a status-like field is something that should be configured as an ignore field.
Queries on such a field tend to match more documents, so if a clause on this field
gets selected as the best clause, the candidate query that the percolate query generates
does little to filter out percolator queries that are unlikely to match.
@martijnvg martijnvg force-pushed the percolator_ignore_fields branch from a4906e3 to 636e85e Compare August 11, 2017 13:32
@martijnvg martijnvg merged commit 636e85e into elastic:master Aug 11, 2017
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Nov 10, 2017
extract all clauses from a conjunction query.

When clauses from a conjunction are extracted the number of clauses is
also stored in an internal doc values field (minimum_should_match field).
This field is used by the CoveringQuery and allows the percolator to
reduce the number of false positives when selecting candidate matches and
in certain cases be absolutely sure that a conjunction candidate match
will match and then skip MemoryIndex validation. This can greatly improve
performance.

Before this change only a single clause was extracted from a conjunction
query. The percolator tried to extract the clause that was rarest (based
on term length) so that fewer candidate queries were selected in the
first place. However, with this method there is still a very high
chance that candidate query matches are false positives.

This change also removes the influencing query extraction added via elastic#26081
as this is no longer needed because now all conjunction clauses are extracted.

https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extraction

Closes elastic#26307
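
The effect of storing the clause count can be illustrated with plain sets. This sketch does not use Lucene's CoveringQuery or the percolator's real field layout; `CoveringSketch` and `isCandidate` are hypothetical names, and treating each extracted clause as a single term is a simplifying assumption.

```java
import java.util.Set;

// Illustrative sketch only; it models the idea with plain sets, not Lucene classes.
class CoveringSketch {

    /**
     * A stored percolator query stays a candidate only when at least
     * minimumShouldMatch of its extracted terms occur in the document.
     */
    static boolean isCandidate(Set<String> extractedTerms, int minimumShouldMatch, Set<String> documentTerms) {
        long matching = extractedTerms.stream().filter(documentTerms::contains).count();
        return matching >= minimumShouldMatch;
    }
}
```

For a conjunction a AND b AND c, extracting all three terms with a stored count of 3 rejects a document that contains only a. Extracting the single term a with a count of 1 would keep that document as a candidate, leaving it to be ruled out later as a false positive.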
martijnvg added a commit that referenced this pull request Nov 10, 2017

Labels

>enhancement :Search Relevance/Percolator Reverse search: find queries that match a document v6.1.0 v7.0.0-beta1
