-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Hint what clauses are important in a conjunction query based on fields #26081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I think exposing ways to customize extraction is useful, but I'm wondering whether this |
|
I just chatted with @jpountz and we agreed on exposing this A boost of zero will ignore a field. Otherwise the clause with highest boost gets selected. By default all clauses have a boost of 1. |
|
I've changed the PR to use |
jpountz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments but it looks better to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we need some kind of list that summarizes the decision process: first look at the boost, then prefer terms over ranges, and finally prefer long terms / narrow ranges over short terms / wide ranges?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add propNode to the error message so that it is easier to understand what the issue is if a user ever complains of getting this message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we usually take floats for doubles. Even though it is not necessary here, I'd parse the value as a float or double to be more consistent with other parts of our code base
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially used floats, but then realized that int sufficed here. But I agree with your consistency argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we really return 1 if nothing was extracted? Let's return Integer.MIN_VALUE (or NEGATIVE_INFINITY once we are on floats or doubles) to make sure it won't be selected if another query could be extracted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather like a BiFunction than a wrapper class with a vague name?
c7a22d7 to
5e4124a
Compare
|
@jpountz Thanks for looking! I've updated the PR. |
jpountz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/with a must/with must/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/in/In/ to be consistent with previous lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like it can be final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/contains/equals/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be covered by the below check about highest boosts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering whether we should compare the highest, or lower boosts?
For instance imagine the query is (a OR b) AND c. a, b and c have boosts of 2, 0.5 and 1 respectively. I'd much rather pick c as an extracted query than a OR b since the cost of b might indicate it is a low-cardinality field, ie. each term has many values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I agree, we are then likely to extract term(s) that are rarer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we extract the higest lowest boost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think want the apply this logic to selecting ranges too (select lowest highest range)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/propName/propValue/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be made more concise/efficient by using Stream#anyMatch or Stream#allMatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stream can only be consumed once, so if anyMatch(...) or allMatch(...) is invoked then collect(...) can no longer be invoked. So I'll keep it the way it is now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh sorry, I had missed that you used the filtered collection below
…sed on fields The percolator field mapper doesn't need to extract all terms and ranges from a bool query with must or filter clauses. In order to help to default extraction behavior, boost fields can be configured, so that fields that are known for not being selective enough can be ignored in favor for other fields or clauses with specific fields can forcefully take precedence over other clauses. This can help selecting clauses for fields that don't match with a lot of percolator queries over other clauses and thus improving performance of the percolate query. For example a status like field is something that should configured as an ignore field. Queries on this field tend to match with more documents and so if clauses for this fields get selected as best clause then that isn't very helpful for the candidate query that the percolate query generates to filter out percolator queries that are likely not going to match.
a4906e3 to
636e85e
Compare
extract all clauses from a conjunction query. When clauses from a conjunction are extracted the number of clauses is also stored in an internal doc values field (minimum_should_match field). This field is used by the CoveringQuery and allows the percolator to reduce the number of false positives when selecting candidate matches and in certain cases be absolutely sure that a conjunction candidate match will match and then skip MemoryIndex validation. This can greatly improve performance. Before this change only a single clause was extracted from a conjunction query. The percolator tried to extract the clauses that was rarest in order (based on term length) to attempt less candidate queries to be selected in the first place. However this still method there is still a very high chance that candidate query matches are false positives. This change also removes the influencing query extraction added via elastic#26081 as this is no longer needed because now all conjunction clauses are extracted. https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extraction Closes elastic#26307
extract all clauses from a conjunction query. When clauses from a conjunction are extracted the number of clauses is also stored in an internal doc values field (minimum_should_match field). This field is used by the CoveringQuery and allows the percolator to reduce the number of false positives when selecting candidate matches and in certain cases be absolutely sure that a conjunction candidate match will match and then skip MemoryIndex validation. This can greatly improve performance. Before this change only a single clause was extracted from a conjunction query. The percolator tried to extract the clauses that was rarest in order (based on term length) to attempt less candidate queries to be selected in the first place. However this still method there is still a very high chance that candidate query matches are false positives. This change also removes the influencing query extraction added via #26081 as this is no longer needed because now all conjunction clauses are extracted. https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extraction Closes #26307
The percolator field mapper doesn't need to extract all terms and ranges from a bool query with must or filter clauses. In order to help to default extraction behavior, boost fields can be configured, so that fields that are known for not being selective enough can be ignored in favor for other fields or clauses with specific fields can forcefully take precedence over other clauses. This can help selecting clauses for fields that don't match with a lot of percolator queries over other clauses and thus improving performance of the
percolatequery.For example a status like field is something that should configured as an ignore field. Queries on this field tend to match with more documents and so if clauses for this fields get selected as best clause then that isn't very helpful for the candidate query that the
percolatequery generates to filter out percolator queries that are likely not going to match.