
MaineC commented Jun 9, 2016

Before continuing to read: consider this to be an experiment, WIP, an early
draft, including a few questions, as even the tests will fail occasionally. Putting
this up as a PR to start a discussion off of some code - so please do feel free to
take things apart ;)

The implementation itself has been lying around in my GitHub repo for a while, as you can easily see from the commit history. I took some time recently to clean it up and adjust it to the current state of things - it turns out that, thanks to the query refactoring efforts, the introduction of modules and various other changes we made in past months, the original code could be trimmed quite a bit.

This PR adds a module that provides facilities to compute ranking quality metrics on a
set of pre-annotated query/result sets. Currently only Prec@ is supported as a
metric; adding new metrics is as easy as extending the Evaluator/
RankedListQualityMetric interface.

At the moment there's no REST endpoint implemented. For an illustration of how to
use this, look at PrecisionAtRequestTest. Essentially it works as follows:

  • Assume you've indexed a bunch of documents and want to understand how well
    your chosen ranking function performs for a set of queries.
  • Create a set of queries; for each query, add the set of documents you deem
    relevant to it.
  • Submit all of this. The queries are executed against your index; for each
    query the returned documents are checked against the set of relevant
    documents you supplied, Prec@ is computed, and every document that was
    returned for the query but that you didn't explicitly mark as relevant is
    reported back as well (see the sketch below).
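
To give an idea of what the Prec@ part boils down to, here's a rough sketch (mine, not the PR's actual Evaluator code; class and method names are made up for illustration):

import java.util.List;
import java.util.Set;

// Rough sketch only - not this PR's implementation.
public class PrecisionAtSketch {

    // Precision@k: the fraction of the top k returned documents that were annotated as relevant.
    static double precisionAt(int k, List<String> returnedDocIds, Set<String> relevantDocIds) {
        int considered = Math.min(k, returnedDocIds.size());
        int relevantHits = 0;
        for (int i = 0; i < considered; i++) {
            if (relevantDocIds.contains(returnedDocIds.get(i))) {
                relevantHits++;
            }
        }
        return considered == 0 ? 0.0 : (double) relevantHits / considered;
    }
}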

Caveats:

  • Currently the whole set of annotated queries and documents has to be sent over
    the wire. I believe it would make sense to store this internally.
  • The naming of some classes, methods and fields could use quite a bit of love
    and consistency.
  • There's no REST endpoint yet.
  • There's no integration with the task mgmt framework whatsoever.
  • There's a dependency between this module and the :lang:mustache module that
    I'm pretty sure could be handled in a better way.
  • The integration test I'm referring to above fails in roughly half of the cases
    with a NullPointerException in
    oe.client.transport.support.TransportProxyClient:66 when looking for my
    RankedEvalAction - I assume I forgot to register that someplace important, but
    after looking around for some time this morning I couldn't figure out where. Any
    enlightenment appreciated.

Isabel Drost-Fromm added 27 commits September 30, 2014 15:39
* REST layer is still missing - but the first execution test via the Java API client yields a green unit test result.
* After the initial walk-through, refactor what was there to match the interface specification we originally came up with for quality tasks. This is not yet integrated into the (not yet committed) benchmark framework - but at least it contains all major interfaces needed to add further QA metrics.
* Seems like writing a generic value does not work if the value is of type enum. Switching to Strings for internal serialisation, keeping the enum for now. Thanks to aleph_zero for spotting this.
* Add documentation missing in previous versions. Also fix the Google Guava deprecation warning for all methods related to the Objects class.

  Missing for next steps:

  1. Support returning unknown document ids per query intent, not only per whole set.
  2. Add support for the REST endpoint as specified in the original design. And fix template search request handling while at it, otherwise the test does not work.

s1monw commented Jun 9, 2016

this is a great initiative. What I would be interested in is how you imagine the end-user interface would look. For instance, can you provide example request/response JSON bodies (no need to be final)? This would be a tremendous help!

Currently the whole set of annotated queries and documents has to be sent over the wire. I believe it would make sense to store this internally.

I think sending them over the wire is fine for now. We might go further in a next step and let it fetch queries via a scan / scroll but that is something that can come way later!

There's no integration with the task mgmt framework whatsoever.

this is implicit already, it's just not cancelable. I wonder if @nik9000 can help here

There's a dependency between this module and the :lang:mustache module that I'm pretty sure could be handled in a better way.

I think that is ok to have since it's a module? I would focus on usability - dependencies to modules are just fine in such a case?

@Override
protected Collection<Class<? extends Plugin>> nodePlugins() {
    return pluginList(RankEvalPlugin.class);
}

Contributor

for this to work I think you have to also override protected Collection<Class<? extends Plugin>> transportClientPlugins() or set transportClientRatio = 0.0 in the @ClusterScope
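
For what it's worth, a minimal sketch of that override (assuming the test extends ESIntegTestCase and RankEvalPlugin is the plugin class of this module; illustration only, not code from this PR):

// Goes into the same test class, next to the existing nodePlugins() override.
// Needs java.util.Collection and org.elasticsearch.plugins.Plugin imported.
@Override
protected Collection<Class<? extends Plugin>> transportClientPlugins() {
    return pluginList(RankEvalPlugin.class);
}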

Author

Thanks for the input. Works indeed - and it revealed an issue with the draft implementation as it stands atm: after overriding the method you suggested, the test failed telling me "[Expected current thread [Thread[elasticsearch[node_sc3][local_transport][T#2],5,TGRP-PrecisionAtRequestTest]] to not be a transport thread. Reason: [Blocking operation]];" - because of this. It looks like a bit more work is needed to get this sorted out, so for now the test via transportClientRatio is disabled and marked as NORELEASE.

@s1monw s1monw removed the discuss label Jun 10, 2016

dependencies {
    compile "com.github.spullara.mustache.java:compiler:0.9.1"
    compile project(':modules:lang-mustache')

Member

We can't have modules depending on other modules. This means we will actually load two copies, because what is here now will copy the lang-mustache jars into this module, and this module and lang-mustache will have their own classloaders.

Why do you need a dependency directly on mustache scripting? Should it just be on script templates? Perhaps #16314 is what we need first?

Author

We can't have modules depending on other modules.

That pretty much confirms what I already thought.

Why do you need a dependency directly on mustache scripting? Should it just be on
script templates?

Why I think I need mustache templating for this needs a longer explanation of the thinking behind the approach taken in the PR:

Imagine a user having implemented a search box. The way I've seen this done in the past usually followed a pattern similar to the following: some UI with a search box to get the user to enter their query. Behind that is a bit of application logic that turns the actual query string the user submitted into something the search backend understands. Often this piece of application logic also adds more information to the final query: this could be time of day to emphasize recent results, it could be operating system to return pricier results for Apple users, it could be estimated geo location to emphasize results closer to the user's location.

The problem I've seen people run into is how to combine all those factors into a ranking function that provides relevant results.

This PR tries to address this problem by providing a way to compare multiple different options for combining these ranking factors, expressed as a template query, based on their performance on a pre-defined set of queries with labeled results.

Hope this helps a bit. I'm working on a draft for the actual REST endpoint for better clarity.

Contributor

can we just move this into lang-mustache for now?

Member

Other modules use templating, but do not directly depend on mustache, eg ingest. Can we use templates like any other part of the system should (through the script engine api)?

Member

I really wonder if we should just expect a full blown query in there instead of templates.

I also support that instead.

Author

@MaineC MaineC Jun 15, 2016

Sorry, forgot to add the conclusion Simon and I came to when talking f2f yesterday - the first step is to go for having just the full-blown query in there instead of templates.

If we find out that we do need templates further down the road, I'll take a look at how ingest is doing things - @rjernst thanks for pointing that out as an example.


MaineC commented Jun 13, 2016

this is a great initiative.

Thanks for the encouraging feedback.

What I would be interested in is how you imagine the end-user interface would look. For
instance, can you provide example request/response JSON bodies (no need to be final)? This
would be a tremendous help!

Yeah sure - that was the next step on my list. I know this is kind of working backwards; that's mostly for historical reasons: I first wanted to see if getting the stuff I had to a working state would take more than a couple of days (it didn't), post it for feedback to see if it's completely off the rails (apparently it's not complete and utter nonsense), and only add more code on top after that (looking into it now).

Isabel

Isabel Drost-Fromm added 3 commits June 14, 2016 11:37
* It runs and tells me that what I'm doing in the transport action is actually forbidden. Good thing I got it to run; punting on fixing the problem it revealed in favour of putting some actual JSON out there.

MaineC commented Jun 14, 2016

Added example request and response JSON that should roughly capture what is needed, as docs in the RestRankEvalAction class comment.

Again listing caveats explicitly:

  • Those are essentially whiteboarded proposals - other than running them through a JSON linter there's no validation that they actually work and are sufficient.
  • In the current proposal there's just the possibility to compute one metric per request template; arguably it might make sense to compute more than one.

"ip_location": "ams"
},
"doc_ratings": {
"1": 1,

Contributor

so these are doc ID to relevant|not-relevant mappings?

Contributor

@s1monw s1monw Jun 14, 2016

I wonder if we need the type as well? maybe not? also are we going to start with binary ratings or have something more like a scale from 0 to 5? is this useful?

Author

For precision@N all we need is binary ratings. If we want to use metrics like ERR, having ratings on a scale is definitely useful.

See here for more information: https://www.youtube.com/watch?v=Ltbd9Atc4TY#t=30m33s
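
To illustrate why a graded scale matters for something like ERR, here's a rough sketch of the usual Expected Reciprocal Rank computation (my paraphrase of the Chapelle et al. definition, not code from this PR):

// Sketch only: Expected Reciprocal Rank over graded ratings, one rating per rank.
public class ErrSketch {

    static double expectedReciprocalRank(int[] gradesByRank, int maxGrade) {
        double err = 0.0;
        double probabilityUserReachesRank = 1.0;
        for (int rank = 1; rank <= gradesByRank.length; rank++) {
            // map the graded rating at this rank to a probability that the user is satisfied here
            double satisfied = (Math.pow(2, gradesByRank[rank - 1]) - 1) / Math.pow(2, maxGrade);
            err += probabilityUserReachesRank * satisfied / rank;
            probabilityUserReachesRank *= (1 - satisfied);
        }
        return err;
    }
}

With binary ratings this still computes, but a 0 to 5 scale gives the metric much more to work with.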


s1monw commented Jun 14, 2016

I added some comments. I think we should add at least a second, maybe a third, eval method to this PR before we move on. There is also some API work that needs to be done where @clintongormley needs to jump in, but I think we can easily start with something like this.


MaineC commented Jun 14, 2016

Those comments do make sense.

About having more than one eval method: Huge +1 - it should give us an understanding of what's missing for more sophisticated evaluations.

API work: For sure. That's what I expected.

Isabel Drost-Fromm added 2 commits June 15, 2016 10:48
@clintongormley
Contributor

Hi @MaineC - this is looking very interesting.

I took a look at the REST API and I think it is almost there. I think the request should accept anything that a search request would accept (eg index, type, etc). This would support using https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-template-query.html[query templates].

We could even support full https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-template.html[search templates] by accepting request_template instead of request. Later we could support fetching test cases from an index by accepting stored_requests in place of requests (but all of this later).

For the ratings I think we need to use an array with the full index/type/id, because docs can be returned from multiple indices, eg something like this:

{
  "requests": [
    {
      "id": "amsterdam_query",
      "request": {
        "index": [
          "foo",
          "bar"
        ],
        "size": 10,
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "beverage": "coffee"
                }
              },
              {
                "term": {
                  "browser": {
                    "value": "safari"
                  }
                }
              },
              {
                "term": {
                  "time_of_day": {
                    "value": "morning",
                    "boost": 2
                  }
                }
              },
              {
                "term": {
                  "ip_location": {
                    "value": "ams",
                    "boost": 10
                  }
                }
              }
            ]
          }
        }
      },
      "ratings": [
        {
          "_index": "foo",
          "_type": "some_type",
          "_id": "1",
          "rating": 1
        },
        {
          "_index": "foo",
          "_type": "some_type",
          "_id": "2",
          "rating": 0
        },
        {
          "_index": "foo",
          "_type": "some_type",
          "_id": "3",
          "rating": 1
        },
        {
          "_index": "foo",
          "_type": "some_type",
          "_id": "4",
          "rating": 1
        }
      ]
    }
  ]
}

I was also thinking that there are two ways of using this API:

  • You want good results from all queries, ie to make sure that you're making things better not worse
  • You want the queries to compete with each other to find the better query, eg we could automatically try to include/exclude or change the boost/weight of different parts of a single query to improve rankings


MaineC commented Jun 16, 2016

We could even support full https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-
template.html[search templates] by accepting request_template instead of request.

That would be my personal preference (that is, without having run this through a couple of examples, this sounds like a good approach, as the parameter conceptually remains the same for users).

For the ratings I think we need to use an array with the full index/type/id, because docs can be
returned from multiple indices, eg something like this:

Makes sense.

You want the queries to compete with each other to find the better query, eg we could
automatically try to include/exclude or change the boost/weight of different parts of a single query
to improve rankings

While this sounds like a good idea, I think the term "query" is overloaded multiple times in this proposal. So let me phrase this in a slightly different way to see how this use case fits into the current API (and to check that what I think I understood you to be saying is actually what you did say):

So one example of the problem we are talking about is the following: We are running a shop system. Users enter what I will call an end_user_query, like "soldering station", "led", "battery", "egg bot". There are more matching products in our backend than fit on one single page. Each product comes with product_properties like price, weight, availability, colour, maybe some quality score ("suitable for professional use" vs. "for hobbyists only"). We want to come up with an elasticsearch_query that uses the end_user_query and product_properties such that the products listed on top are those highly likely to be perceived as high quality by the end user and to end up being purchased. For an example of how this might look (taken from a code.talks commerce talk, which I attended earlier this year as a speaker): http://project-a.github.io/on-site-search-design-patterns-for-e-commerce/#data-driven-ranking
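
To make that vocabulary a bit more concrete, here's a rough sketch (mine, not from this PR or the linked talk) of what one elasticsearch_query variation combining the end_user_query with product_properties might look like via the Java API; field names, values and boosts are invented:

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class QueryVariationSketch {

    static QueryBuilder variationFor(String endUserQuery) {
        return QueryBuilders.boolQuery()
                // what the user typed into the search box
                .must(QueryBuilders.matchQuery("product_name", endUserQuery))
                // product_properties used as additional ranking signals
                .should(QueryBuilders.termQuery("availability", "in_stock").boost(2.0f))
                .should(QueryBuilders.rangeQuery("quality_score").gte(3).boost(1.5f));
    }
}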

So in your proposal you would let multiple elasticsearch_queries compete against each other. For each competition run, you would use a set of maybe 100 end_user_queries. This could be the same set of sampled* end_user_queries for each elasticsearch_query variation (leading to less work when annotating search results assuming there is some overlap in how products are being ranked).

With the API as is I believe this would mean you'd have to restart the process with each elasticsearch_query variation you want to try out. I'm not sure if we should wrap this use case into the API itself.

Hope this makes sense,
Isabel

  • For downstream users there are all sorts of sampling headaches to take into account: sample over all queries, focus only on "electronics parts" end_user_queries, focus only on queries coming from the US, focus only on queries issued during summer vs. winter, etc.

@clintongormley
Contributor

With the API as is I believe this would mean you'd have to restart the process with each elasticsearch_query variation you want to try out. I'm not sure if we should wrap this use case into the API itself.

Agreed - this would be implemented by an application. I just mentioned the two use cases in case it affects how we return the results.


MaineC commented Jun 20, 2016

With the API as is I believe this would mean you'd have to restart the process with each
elasticsearch_query variation you want to try out. I'm not sure if we should wrap this use case
into the API itself.

Agreed - this would be implemented by an application.

OK.

I just mentioned the two use cases in case it affects how we return the results.

Makes sense. (And I just wanted to figure out whether or not it should be part of the request API ;) )

@clintongormley
Contributor

@MaineC what are the next steps on this PR?


MaineC commented Jun 27, 2016

Currently @cbuescher is looking at it; he has already suggested a few additional changes. He also suggested moving this PR into a separate branch so we can work on changes in parallel, which I think would make sense.

@s1monw proposed adding at least one more evaluation metric before considering moving forward, to see if this fits what we need.

On a related note, @nik9000 put both @cbuescher and me in touch with a guy at Wikimedia who had some input on how they are doing ranking QA. I got permission to share his viewpoint publicly, so I guess it makes sense to open a more general issue to track ideas around this PR, no?

cbuescher pushed a commit that referenced this pull request Jun 30, 2016
This is an initial squashed commit of the work on a new feature for query metrics
proposed in #18798.

MaineC commented Jun 30, 2016

Closing in favor of collaborating on this code in branch https://github.com/elastic/elasticsearch/tree/feature/rank-eval
