Experiment: Module for computing prec@ on queries #18798
Conversation
…hrough the benchmarking framework.
REST layer is still missing - but first execution test via the Java API Client yields a green unit test result.
After the initial walk-through, refactor what was there to match the interface specification we originally came up with for quality tasks. This is not yet integrated into the (not yet committed) benchmark framework, but at least it contains all major interfaces needed to add further QA metrics.
Seems like writing a generic value does not work if the value is of type enum. Switching to use Strings for internal serialisation, keeping the enum for now.
Thanks to aleph_zero for spotting this.
Add documentation missing in previous versions. Also fix the Google Guava deprecation warning for all methods related to the Objects class. Missing for next steps: 1. Support returning unknown document ids per query intent, not only per whole set. 2. Add support for a REST endpoint as specified in the original design.
And fix template search request handling while at it, otherwise the test does not work.
Before continuing to read: Consider this to be an experiment, WIP, an early draft, including a few questions, as even the tests will fail occasionally. Putting this up as a PR to start a discussion off of some code - so please do feel free to take things apart ;)

The implementation itself has been lying around in my GitHub repo for a while (which you can easily see from the commit history); I took some time to clean it up and adjust it to the current state of things recently. Turns out that thanks to the query refactoring efforts, the introduction of modules, and various other changes we made in past months, the original code could be trimmed quite a bit.

This PR adds a module with facilities to compute ranking quality metrics on a set of pre-annotated query/result sets. Currently only Prec@ is supported as a metric (see the sketch after the caveats for a rough idea of what it computes); adding new metrics is as easy as extending the Evaluator/RankedListQualityMetric interface. At the moment there's no REST endpoint implemented. For illustration purposes on how to use this stuff, look at PrecisionAtRequestTest. Essentially it works as follows:

* Assume you've indexed a bunch of documents and want to understand how well your chosen ranking function performs for a set of queries.
* Create a set of queries; for each query, add a set of documents you deem relevant to it.
* Submit this. The queries are executed against your index; for each query, the returned documents are checked against the set of relevant documents you supplied, Prec@ is computed, and each document that was returned for the query but that you didn't explicitly submit as relevant is returned as well.

Caveats:

* Currently the whole set of annotated queries and documents has to be sent over the wire. I believe it would make sense to store this internally.
* The naming of some classes, methods and fields could use quite a bit of love and consistency.
* There's no REST endpoint yet.
* There's no integration with the task mgmt framework whatsoever.
* There's a dependency between this module and the :lang:mustache module that I'm pretty sure could be handled in a better way.
* The integration test I'm referring to above fails in roughly half of the cases with a NullPointerException in oe.client.transport.support.TransportProxyClient:66 when looking for my RankedEvalAction - I assume I forgot to register that someplace important, but after looking around for some time this morning I couldn't figure out where. Any enlightenment appreciated.
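For a rough idea of what Prec@ computes per query, here is a minimal standalone sketch (class and method names are hypothetical, not the module's actual Evaluator/RankedListQualityMetric implementation):

```java
import java.util.List;
import java.util.Set;

// Minimal sketch of Prec@k as described above: given the top-k document ids
// returned for a query and the ids annotated as relevant, compute the fraction
// of returned documents that are relevant, and collect the returned documents
// that carry no relevance annotation.
class PrecisionAtSketch {
    static double precisionAt(int k, List<String> rankedDocIds, Set<String> relevantIds,
                              List<String> unknownDocs) {
        int considered = Math.min(k, rankedDocIds.size());
        int hits = 0;
        for (int i = 0; i < considered; i++) {
            String docId = rankedDocIds.get(i);
            if (relevantIds.contains(docId)) {
                hits++;
            } else {
                unknownDocs.add(docId); // returned, but not submitted as relevant
            }
        }
        return considered == 0 ? 0.0 : (double) hits / considered;
    }
}
```

With binary ratings, the unknownDocs list corresponds to the documents the description above says are reported back alongside the metric.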
This is a great initiative. What I would be interested in is how you imagine the end-user interface would look. For instance, can you provide example request/response JSON bodies (no need to be final)? This would be a tremendous help!
I think sending them over the wire is fine for now. We might go further in a next step and let it fetch queries via a scan / scroll but that is something that can come way later!
This is implicit already, it's just not cancelable. I wonder if @nik9000 can help here.
I think that is ok to have since it's a module? I would focus on usability - dependencies on modules are just fine in such a case?
```java
@Override
protected Collection<Class<? extends Plugin>> nodePlugins() {
    return pluginList(RankEvalPlugin.class);
}
```
For this to work I think you also have to override protected Collection<Class<? extends Plugin>> transportClientPlugins(), or set transportClientRatio = 0.0 in the @ClusterScope.
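A minimal sketch of the suggested override, mirroring the nodePlugins() snippet above (assuming the same plugin list is wanted for the transport client):

```java
// Mirrors nodePlugins() so the transport client also knows about the plugin.
@Override
protected Collection<Class<? extends Plugin>> transportClientPlugins() {
    return pluginList(RankEvalPlugin.class);
}
```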
Thanks for the input. Works indeed - and revealed an issue with the draft implementation as it stands atm: after overriding the method you suggested, the test failed telling me "[Expected current thread [Thread[elasticsearch[node_sc3][local_transport][T#2],5,TGRP-PrecisionAtRequestTest]] to not be a transport thread. Reason: [Blocking operation]];" - because of this. Looks like a bit more work is needed to get this sorted out, so for now the test via transportClientRatio is disabled and marked as NORELEASE.
modules/rank-eval/build.gradle (outdated):

```gradle
dependencies {
    compile "com.github.spullara.mustache.java:compiler:0.9.1"
    compile project(':modules:lang-mustache')
}
```
We can't have modules depending on other modules. This means we will actually load two copies, because what is here now will copy the lang-mustache jars into this module, and this module and lang-mustache will have their own classloaders.
Why do you need a dependency directly on mustache scripting? Should it just be on script templates? Perhaps #16314 is what we need first?
> We can't have modules depending on other modules.

That pretty much confirms what I already thought.

> Why do you need a dependency directly on mustache scripting? Should it just be on script templates?
Why I think I need mustache templating for this needs a longer explanation of the thinking behind the approach taken in the PR:
Imagine a user having implemented a search box. The way I've seen this done in the past usually followed a pattern similar to the following: some UI with a search box gets the user to enter their query. Behind that is a bit of application logic that turns the actual query string the user submitted into something the search backend understands. Often this piece of application logic also adds more information to the final query: this could be time of day to emphasize recent results, it could be operating system to return pricier results for Apple users, it could be estimated geo location to emphasize results closer to the user's location.
The problem I've seen people run into is how to combine all those factors into a ranking function that provides relevant results.
This PR tries to address that problem by providing a means to compare multiple options for combining these ranking factors (expressed as template queries), based on their performance on a pre-defined set of queries with labeled results.
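To make the "application logic adds context" part concrete, here is a hypothetical sketch of the kind of parameter map such logic might feed into a query template (all names invented for illustration; ip_location mirrors the example JSON further down):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of application logic enriching what the user typed
// with context before handing it to a templated search request.
class QueryContextBuilder {
    static Map<String, Object> buildTemplateParams(String userQueryString,
                                                   String operatingSystem,
                                                   String estimatedGeoLocation,
                                                   int hourOfDay) {
        Map<String, Object> params = new HashMap<>();
        params.put("query_string", userQueryString);     // what the user typed
        params.put("os", operatingSystem);               // e.g. pricier results for Apple users
        params.put("ip_location", estimatedGeoLocation); // e.g. "ams", favour closer results
        params.put("hour_of_day", hourOfDay);            // e.g. emphasize recent results
        return params;
    }
}
```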
Hope this helps a bit, working on a draft for the actual REST endpoint for better clarity.
can we just move this into lang-mustache for now?
Other modules use templating but do not directly depend on mustache, e.g. ingest. Can we use templates like any other part of the system should (through the script engine api)?
I really wonder if we should just expect a full blown query in there instead of templates.
I also support that instead.
Sorry, forgot to add the conclusion Simon and I came to when talking f2f yesterday - the first step is to go for having just the full blown query in there instead of templates.
If we find out that we do need templates further down the road I'll take a look at how ingest is doing things - @rjernst thanks for pointing that out as an example.
Thanks for the encouraging feedback.
Yeah, sure - that was the next step on my list. I know this is kind of working backwards; that's mostly for historical reasons: I first wanted to see if getting the stuff I had to a working state would take more than a couple of days (it didn't), post it for feedback to see if it's completely off the rails (apparently it's not complete and utter nonsense), and only add more code on top after that (looking into it now). Isabel
It runs and tells me that what I'm doing in the transport action is actually forbidden. Good thing I got it to run; punting on fixing the problem it revealed in favour of putting some actual json out there.
Added example request and response JSON, which should roughly capture what is needed as docs, to the RestRankEvalAction class comment. Again listing caveats explicitly:
| "ip_location": "ams" | ||
| }, | ||
| "doc_ratings": { | ||
| "1": 1, |
so these are doc ID to relevant|not-relevant mappings?
I wonder if we need the type as well? Maybe not? Also, are we going to start with binary ratings, or have something more like a scale from 0 to 5? Is this useful?
For precision@N all we need is binary ratings. If we want to use stuff like ERR, having ratings on a scale is definitely useful.
See here for more information: https://www.youtube.com/watch?v=Ltbd9Atc4TY#t=30m33s
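For reference, a minimal sketch of ERR (Expected Reciprocal Rank) showing where graded ratings pay off; the grade-to-probability mapping follows the standard (2^g - 1) / 2^g_max formulation, and all names here are illustrative rather than anything from this PR:

```java
import java.util.List;

// Sketch of ERR: each document's grade maps to a probability that the user
// stops at that rank, and earlier relevant documents discount later ones.
class ErrSketch {
    static double err(List<Integer> gradesInRankOrder, int maxGrade) {
        double err = 0.0;
        double reachProbability = 1.0; // probability the user reaches this rank
        for (int rank = 1; rank <= gradesInRankOrder.size(); rank++) {
            double stopProbability =
                (Math.pow(2, gradesInRankOrder.get(rank - 1)) - 1) / Math.pow(2, maxGrade);
            err += reachProbability * stopProbability / rank;
            reachProbability *= 1 - stopProbability;
        }
        return err;
    }
}
```

With binary ratings (grades 0 and 1, maxGrade 1) this still works, but a 0-5 scale is where it becomes more informative than precision@N.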
I added some comments. I think we should add at least a second, maybe a third eval method to this PR before we move on. There is also some API work that needs to be done where @clintongormley needs to jump in, but I think we can easily start with something like this.
Those comments do make sense. About having more than one eval method: huge +1, this should give us an understanding of what's missing for more sophisticated evaluations. API work: for sure, that's what I expected.
instead of relying directly on templates.
Hi @MaineC - this is looking very interesting. I took a look at the REST API and I think it is almost there. I think the [...]. We could even support full search templates (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-template.html) by accepting [...]. For the [...]. I was also thinking that there are two ways of using this API:
That would be my personal preference (that is, without having run this through a couple of examples, this sounds like a good approach, as the parameter conceptually remains the same for users).
Makes sense.
While this sounds like a good idea, I think the term "query" is overloaded multiple times in this proposal. So let me phrase this in a slightly different way to see how this use case fits into the current API (and to see if what I think I understood you saying is actually what you did say).

One example of the problem we are talking about is the following: we are running a shop system. Users enter what I will call an end_user_query like "soldering station", "led", "battery", "egg bot". There are more matching products in our backend than fit on one single page. Each product comes with product_properties like price, weight, availability, colour, maybe some quality score ("suitable for professional use" vs. "for hobbyists only"). We want to come up with an elasticsearch_query that uses the end_user_query and product_properties such that the products listed on top are highly likely to be perceived as high quality by the end user and end up being purchased. For an example (taken from the code.talks commerce talk, which I attended earlier this year as a speaker) of how this might look: http://project-a.github.io/on-site-search-design-patterns-for-e-commerce/#data-driven-ranking

So in your proposal you would let multiple elasticsearch_queries compete against each other. For each competition run, you would use a set of maybe 100 end_user_queries. This could be the same set of sampled* end_user_queries for each elasticsearch_query variation (leading to less work when annotating search results, assuming there is some overlap in how products are being ranked). With the API as is, I believe this would mean you'd have to restart the process with each elasticsearch_query variation you want to try out (see the sketch below). I'm not sure if we should wrap this use case into the API itself.

Hope this makes sense,
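A rough sketch of the competition loop described above (everything here is hypothetical; the client interface stands in for whatever the final API provides):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: several elasticsearch_query variations compete on the
// same fixed set of annotated end_user_queries, one evaluation run each.
class RankingCompetition {
    interface RankEvalClient {
        // Stand-in for the rank-eval call: runs all annotated end-user queries
        // through one query variation and returns an averaged metric, e.g. Prec@10.
        double evaluate(String elasticsearchQuery, List<String> endUserQueries);
    }

    static Map<String, Double> compete(RankEvalClient client,
                                       List<String> queryVariations,
                                       List<String> endUserQueries) {
        Map<String, Double> scores = new HashMap<>();
        for (String variation : queryVariations) {
            // With the API as discussed, this is a full restart per variation.
            scores.put(variation, client.evaluate(variation, endUserQueries));
        }
        return scores;
    }
}
```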
Agreed - this would be implemented by an application. I just mentioned the two use cases in case it affects how we return the results.
OK.
Makes sense. (And I just wanted to figure out whether or not it should be part of the request API ;))
@MaineC what are the next steps on this PR?
Currently @cbuescher is looking at it; he already suggested a few additional changes. He also suggested moving this PR into a separate branch so we can work on changes in parallel - I think this would make sense. @s1monw proposed adding at least one more evaluation metric before considering moving forward, to see if this fits what we need. On a related note, @nik9000 put both @cbuescher and me in touch with a guy at Wikimedia who had some input on how they are doing ranking QA. I got permission to share his viewpoint publicly, so I guess it makes sense to open a more general issue to track ideas around this PR, no?
This is an initial squashed commit of the work on a new feature for query metrics proposed in #18798.
Closing in favor of collaborating on this code in branch https://github.com/elastic/elasticsearch/tree/feature/rank-eval