@martijnvg martijnvg commented Jun 22, 2017

At index time the percolator tries to extract the longest string that doesn't contain a `?` or `*` from the wildcard expression. At search time each query term is expanded into all possible suffixes, and each suffix is then turned into all possible prefixes, so that it can match any possible extracted wildcard expression.

This can speed up the evaluation of percolator queries containing wildcard queries. Without this change, these percolator queries often have to be evaluated every time, regardless of whether they have any chance of matching.
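
A minimal Python sketch of the two steps described above may help make the idea concrete. It is not the code from this PR (which is Java inside the percolator); the function names `extract_longest_literal` and `expand_term` are made up for illustration, and tie-breaking between equally long literals is an assumption.

```python
import re


def extract_longest_literal(wildcard):
    """Index time: longest run of characters that contains neither ? nor *.

    Assumption: when several runs have the same length, the first one wins.
    """
    parts = re.split(r"[?*]", wildcard)
    return max(parts, key=len)


def expand_term(term):
    """Search time: every suffix of the term, then every prefix of each suffix.

    The result is the set of all non-empty substrings of the term.
    """
    expanded = set()
    for start in range(len(term)):
        suffix = term[start:]
        for end in range(1, len(suffix) + 1):
            expanded.add(suffix[:end])
    return expanded
```

Under these assumptions, the literal extracted from `el*stic?search` is `search`, and it appears in the expansion of the document term `elasticsearch`, so the query would be selected as a candidate. Note that a term of length n can expand into up to n(n+1)/2 distinct strings.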

@martijnvg martijnvg added :Search Relevance/Percolator Reverse search: find queries that match a document WIP >enhancement review v6.0.0 and removed WIP labels Jun 22, 2017
@martijnvg martijnvg force-pushed the percolator_wildcard_query_support branch from 8c32178 to e48de2c Compare June 26, 2017 18:19
@martijnvg martijnvg mentioned this pull request Jun 28, 2017
9 tasks
…hes containing wildcard and prefix queries.

@martijnvg martijnvg force-pushed the percolator_wildcard_query_support branch from e48de2c to 252fdc2 Compare July 6, 2017 10:14
@jpountz jpountz self-requested a review July 6, 2017 10:15
jpountz commented Jul 6, 2017

I'm worried this could lead to very large candidate queries if the input document is not tiny?

@martijnvg
Member Author

@jpountz Good point. Perhaps we can build in a limit? We could insert a special token in the wildcard query terms field to identify all percolator queries that contain prefix/wildcard queries. At query time, if we detect that we would create too many suffix terms (50?) or that the suffix terms are too long (25?), then we just use the special token instead.
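
A rough Python sketch of that fallback, to show the shape of the idea rather than an actual implementation. The marker value `__wildcard__`, the function name `candidate_terms`, and the reading of the limits (50 suffixes in total across the document, 25 characters per term) are all assumptions; the numbers are only the tentative ones floated above.

```python
MAX_SUFFIX_TERMS = 50    # tentative limit from the comment above
MAX_SUFFIX_LENGTH = 25   # tentative limit from the comment above
WILDCARD_MARKER = "__wildcard__"  # hypothetical special token


def candidate_terms(doc_terms):
    """Expand document terms, or fall back to the marker when limits are hit.

    Returning only the marker means: select every percolator query that
    contains a prefix/wildcard query, without any term-based filtering.
    """
    suffixes = []
    for term in doc_terms:
        # The longest suffix of a term is the term itself.
        if len(term) > MAX_SUFFIX_LENGTH:
            return {WILDCARD_MARKER}
        suffixes.extend(term[i:] for i in range(len(term)))
        if len(suffixes) > MAX_SUFFIX_TERMS:
            return {WILDCARD_MARKER}

    expanded = set()
    for suffix in suffixes:
        for end in range(1, len(suffix) + 1):
            expanded.add(suffix[:end])
    return expanded
```

The trade-off is that once a document trips a limit, all wildcard-containing percolator queries have to be evaluated for it, which is the behaviour without this change.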

jpountz commented Jul 6, 2017

I don't know. My gut feeling is that things can degrade pretty quickly. Even if we only extract substrings of length e.g. 4, a token of length 20 in a document would generate 20-4+1 = 17 underlying terms for the candidate query. My gut feeling is that even simple documents already trigger the creation of non-trivial candidate queries. I'm a bit worried about making them even more complex.

Maybe the right thing to do is to leave it up to the users? They would just have to use (edge) ngrams in their index analyzers?
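
To illustrate the growth concern, here is a small Python sketch that enumerates fixed-length substrings of a token and sums them over a document. The helper names `ngrams` and `candidate_term_count` are invented for this example, and counting distinct substrings per token is an assumption about how terms would be deduplicated.

```python
def ngrams(token, n):
    """All contiguous substrings of length n, in order of position."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def candidate_term_count(tokens, n):
    """Number of distinct length-n substrings contributed by each token, summed."""
    return sum(len(set(ngrams(token, n))) for token in tokens)
```

The count per token grows linearly with token length, and it is added on top of the terms the candidate query already contains for the document.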

@martijnvg
Member Author

> Maybe the right thing to do is to leave it up to the users? They would just have to use (edge) ngrams in their index analyzers?

Right, maybe that is better. It would also be clearer to users why percolation is slower than if the percolator silently did what this PR does. I'll add some documentation around this. It does mean that wildcard and prefix queries would need to be replaced with term queries in the percolator queries.
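
As a sketch of why the substitution works: if the analyzer emits edge ngrams for each token, a prefix query such as `ela*` can be rewritten as a plain term query for `ela`. The Python below is a stand-in for the analysis step, not Elasticsearch code; `edge_ngrams` and `term_query_matches` are hypothetical helpers, and the `min_gram`/`max_gram` defaults are arbitrary.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of the token with lengths from min_gram up to max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def term_query_matches(term, doc_tokens, min_gram=1, max_gram=10):
    """Exact term lookup against the edge ngrams produced for a document."""
    indexed = set()
    for token in doc_tokens:
        indexed.update(edge_ngrams(token, min_gram, max_gram))
    return term in indexed
```

With these defaults, a term query for `ela` matches a document containing `elasticsearch`, while `last` does not, because only leading substrings are indexed. Prefixes longer than `max_gram` would not match either, so that limit has to be chosen with the expected queries in mind.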

@martijnvg
Member Author

Closing this PR as it can have a negative performance impact.


Labels

>enhancement :Search Relevance/Percolator Reverse search: find queries that match a document v6.0.0-beta2
