
MaineC commented Jun 9, 2016

Before continuing to read: consider this to be an experiment, WIP, an early
draft, including a few questions, as even the tests will fail occasionally. Putting
this up as a PR to start a discussion off of some code - so please do feel free to
take things apart ;)

The implementation itself has been lying around in my GitHub repo for a while, as you can easily see from the commit history. I took some time recently to clean it up and adjust it to the current state of things - it turns out that, thanks to the query refactoring efforts, the introduction of modules and various other changes we made in past months, the original code could be trimmed quite a bit.

This PR adds a module that provides facilities to compute ranking quality metrics on a
set of pre-annotated query/result sets. Currently only Prec@ is supported as a
metric; adding new metrics is as easy as extending the Evaluator/
RankedListQualityMetric interface.

At the moment there's no REST endpoint implemented. For an illustration of how to
use this, look at PrecisionAtRequestTest. Essentially it works as follows:

  • Assume you've indexed a bunch of documents and want to understand how well
    your chosen ranking function performs for a set of queries.
  • Create a set of queries; for each query, add the set of documents you deem
    relevant to it.
  • Submit all of this. The queries are executed against your index; for each
    query the returned documents are checked against the set of relevant
    documents you supplied, Prec@ is computed, and every document that was
    returned for the query but that you didn't explicitly mark as relevant is
    reported back as well (see the sketch below).
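
To give an idea of what the Prec@ part boils down to, here's a rough sketch (mine, not the PR's actual Evaluator code; class and method names are made up for illustration):

import java.util.List;
import java.util.Set;

// Rough sketch only - not this PR's implementation.
public class PrecisionAtSketch {

    // Precision@k: the fraction of the top k returned documents that were annotated as relevant.
    static double precisionAt(int k, List<String> returnedDocIds, Set<String> relevantDocIds) {
        int considered = Math.min(k, returnedDocIds.size());
        int relevantHits = 0;
        for (int i = 0; i < considered; i++) {
            if (relevantDocIds.contains(returnedDocIds.get(i))) {
                relevantHits++;
            }
        }
        return considered == 0 ? 0.0 : (double) relevantHits / considered;
    }
}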

Caveats:

  • Currently the whole set of annotated queries and documents has to be sent over
    the wire. I believe it would make sense to store this internally.
  • The naming of some classes, methods and fields could use quite a bit of love
    and consistency.
  • There's no REST endpoint yet.
  • There's no integration with the task mgmt framework whatsoever.
  • There's a dependency between this module and the :lang:mustache module that
    I'm pretty sure could be handled in a better way.
  • The integration test I'm referring to above fails in roughly half of the cases
    with a NullPointerException in
    oe.client.transport.support.TransportProxyClient:66 when looking for my
    RankedEvalAction - I assume I forgot to register that someplace important, but
    after looking around for some time this morning I couldn't figure out where. Any
    enlightenment appreciated.

Isabel Drost-Fromm added 27 commits September 30, 2014 15:39
* REST layer is still missing - but the first execution test via the Java API client yields a green unit test result.
* After the initial walk-through, refactor what was there to match the interface specification we originally came up with for quality tasks. This is not yet integrated into the (not yet committed) benchmark framework - but at least it contains all major interfaces needed to add further QA metrics.
* Seems like writing a generic value does not work if the value is of type enum. Switching to Strings for internal serialisation, keeping the enum for now. Thanks to aleph_zero for spotting this.
* Add documentation missing in previous versions. Also fix the Google Guava deprecation warning for all methods related to the Objects class.

  Missing for next steps:

  1. Support returning unknown document ids per query intent, not only per whole set.
  2. Add support for the REST endpoint as specified in the original design. And fix template search request handling while at it, otherwise the test does not work.

s1monw commented Jun 9, 2016

this is a great initiative. What I would be interested in is how you imagine the end-user interface would look. For instance, can you provide example request/response JSON bodies (no need to be final)? This would be a tremendous help!

Currently the whole set of annotated queries and documents has to be sent over the wire. I believe it would make sense to store this internally.

I think sending them over the wire is fine for now. We might go further in a next step and let it fetch queries via a scan / scroll but that is something that can come way later!

There's no integration with the task mgmt framework whatsoever.

this is implicit already, it's just not cancelable. I wonder if @nik9000 can help here

There's a dependency between this module and the :lang:mustache module that I'm pretty sure could be handled in a better way.

I think that is ok to have since it's a module? I would focus on usability - dependencies to modules are just fine in such a case?

@Override
protected Collection<Class<? extends Plugin>> nodePlugins() {
    return pluginList(RankEvalPlugin.class);
}

Contributor

for this to work I think you have to also override protected Collection<Class<? extends Plugin>> transportClientPlugins() or set transportClientRatio = 0.0 in the @ClusterScope
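
For what it's worth, a minimal sketch of that override (assuming the test extends ESIntegTestCase and RankEvalPlugin is the plugin class of this module; illustration only, not code from this PR):

// Goes into the same test class, next to the existing nodePlugins() override.
// Needs java.util.Collection and org.elasticsearch.plugins.Plugin imported.
@Override
protected Collection<Class<? extends Plugin>> transportClientPlugins() {
    return pluginList(RankEvalPlugin.class);
}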

Author

Thanks for the input. Works indeed - and it revealed an issue with the draft implementation as it stands atm: after overriding the method you suggested, the test failed telling me "[Expected current thread [Thread[elasticsearch[node_sc3][local_transport][T#2],5,TGRP-PrecisionAtRequestTest]] to not be a transport thread. Reason: [Blocking operation]];" - because of this. It looks like a bit more work is needed to get this sorted out, so for now the test via transportClientRatio is disabled and marked as NORELEASE.

@s1monw s1monw removed the discuss label Jun 10, 2016

dependencies {
    compile "com.github.spullara.mustache.java:compiler:0.9.1"
    compile project(':modules:lang-mustache')

Member

We can't have modules depending on other modules. This means we will actually load two copies, because what is here now will copy the lang-mustache jars into this module, and this module and lang-mustache will have their own classloaders.

Why do you need a dependency directly on mustache scripting? Should it just be on script templates? Perhaps #16314 is what we need first?

Author

We can't have modules depending on other modules.

That pretty much confirms what I already thought.

Why do you need a dependency directly on mustache scripting? Should it just be on
script templates?

Why I think I need mustache templating for this needs a longer explanation of the thinking behind the approach taken in the PR:

Imagine a user having implemented a search box. The way I've seen this done in the past usually followed a pattern similar to the following: some UI with a search box to get the user to enter their query. Behind that is a bit of application logic that turns the actual query string the user submitted into something the search backend understands. Often this piece of application logic also adds more information to the final query: this could be time of day to emphasize recent results, it could be operating system to return pricier results for Apple users, it could be estimated geo location to emphasize results closer to the user's location.

The problem I've seen people run into is how to combine all those factors into a ranking function that provides relevant results.

This PR tries to address this problem by providing a way to compare multiple different options for combining these ranking factors, expressed as a template query, based on their performance on a pre-defined set of queries with labeled results.

Hope this helps a bit. I'm working on a draft for the actual REST endpoint for better clarity.

Contributor

can we just move this into lang-mustache for now?

Member

Other modules use templating, but do not directly depend on mustache, eg ingest. Can we use templates like any other part of the system should (through the script engine api)?

Member

I really wonder if we should just expect a full blown query in there instead of templates.

I also support that instead.

Author

@MaineC MaineC Jun 15, 2016

Sorry, forgot to add the conclusion Simon and I came to when talking f2f yesterday - the first step is to go for having just the full-blown query in there instead of templates.

If we find out that we do need templates further down the road, I'll take a look at how ingest is doing things - @rjernst thanks for pointing that out as an example.


MaineC commented Jun 13, 2016

this is a great initiative.

Thanks for the encouraging feedback.

What I would be interested in is how you imagine the end-user interface would look. For
instance, can you provide example request/response JSON bodies (no need to be final)? This
would be a tremendous help!

Yeah sure - that was the next step on my list. I know this is kind of working backwards; that's mostly for historical reasons: I first wanted to see if getting the stuff I had to a working state would take more than a couple of days (it didn't), post it for feedback to see if it's completely off the rails (apparently it's not complete and utter nonsense), and only add more code on top after that (looking into it now).

Isabel

Isabel Drost-Fromm added 3 commits June 14, 2016 11:37
* It runs and tells me that what I'm doing in the transport action is actually forbidden. Good thing I got it to run; punting on fixing the problem it revealed in favour of putting some actual JSON out there.

MaineC commented Jun 14, 2016

Added example request and response JSON that should roughly capture what is needed, as docs in the RestRankEvalAction class comment.

Again listing caveats explicitly:

  • Those are essentially whiteboarded proposals - other than running them through a JSON linter there's no validation that they actually work and are sufficient.
  • In the current proposal there's just the possibility to compute one metric per request template; arguably it might make sense to compute more than one.

"ip_location": "ams"
},
"doc_ratings": {
"1": 1,

Contributor

so these are doc ID to relevant|not-relevant mappings?

Contributor

@s1monw s1monw Jun 14, 2016

I wonder if we need the type as well? maybe not? also are we going to start with binary ratings or have something more like a scale from 0 to 5? is this useful?

Author

For precision@N all we need is binary ratings. If we want to use metrics like ERR, having ratings on a scale is definitely useful.

See here for more information: https://www.youtube.com/watch?v=Ltbd9Atc4TY#t=30m33s
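
To illustrate why a graded scale matters for something like ERR, here's a rough sketch of the usual Expected Reciprocal Rank computation (my paraphrase of the Chapelle et al. definition, not code from this PR):

// Sketch only: Expected Reciprocal Rank over graded ratings, one rating per rank.
public class ErrSketch {

    static double expectedReciprocalRank(int[] gradesByRank, int maxGrade) {
        double err = 0.0;
        double probabilityUserReachesRank = 1.0;
        for (int rank = 1; rank <= gradesByRank.length; rank++) {
            // map the graded rating at this rank to a probability that the user is satisfied here
            double satisfied = (Math.pow(2, gradesByRank[rank - 1]) - 1) / Math.pow(2, maxGrade);
            err += probabilityUserReachesRank * satisfied / rank;
            probabilityUserReachesRank *= (1 - satisfied);
        }
        return err;
    }
}

With binary ratings this still computes, but a 0 to 5 scale gives the metric much more to work with.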


s1monw commented Jun 14, 2016

I added some comments. I think we should add at least a second, maybe a third, eval method to this PR before we move on. There is also some API work that needs to be done where @clintongormley needs to jump in, but I think we can easily start with something like this.


MaineC commented Jun 14, 2016

Those comments do make sense.

About having more than one eval method: Huge +1 - it should give us an understanding of what's missing for more sophisticated evaluations.

API work: For sure. That's what I expected.

Isabel Drost-Fromm added 2 commits June 15, 2016 10:48
@clintongormley
Contributor

Hi @MaineC - this is looking very interesting.

I took a look at the REST API and I think it is almost there. I think the request should accept anything that a search request would accept (eg index, type, etc). This would support using https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-dsl-template-query.html[query templates].

We could even support full https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-template.html[search templates] by accepting request_template instead of request. Later we could support fetching test cases from an index by accepting stored_requests in place of requests (but all of this later).

For the ratings I think we need to use an array with the full index/type/id, because docs can be returned from multiple indices, eg something like this:

{
  "requests": [
    {
      "id": "amsterdam_query",
      "request": {
        "index": [
          "foo",
          "bar"
        ],
        "size": 10,
        "query": {
          "bool": {
            "must": [
              {
                "match": {
                  "beverage": "coffee"
                }
              },
              {
                "term": {
                  "browser": {
                    "value": "safari"
                  }
                }
              },
              {
                "term": {
                  "time_of_day": {
                    "value": "morning",
                    "boost": 2
                  }
                }
              },
              {
                "term": {
                  "ip_location": {
                    "value": "ams",
                    "boost": 10
                  }
                }
              }
            ]
          }
        }
      },
      "ratings": [
        {
          "_index": "foo",
          "_type": "some_type",
          "_id": "1",
          "rating": 1
        },
        {
          "_index": "foo",
          "_type": "some_type",
          "_id": "2",
          "rating": 0
        },
        {
          "_index": "foo",
          "_type": "some_type",
          "_id": "3",
          "rating": 1
        },
        {
          "_index": "foo",
          "_type": "some_type",
          "_id": "4",
          "rating": 1
        }
      ]
    }
  ]
}

I was also thinking that there are two ways of using this API:

  • You want good results from all queries, ie to make sure that you're making things better not worse
  • You want the queries to compete with each other to find the better query, eg we could automatically try to include/exclude or change the boost/weight of different parts of a single query to improve rankings


MaineC commented Jun 16, 2016

We could even support full https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-
template.html[search templates] by accepting request_template instead of request.

That would be my personal preference (that is, without having run this through a couple of examples, this sounds like a good approach, as the parameter conceptually remains the same for users).

For the ratings I think we need to use an array with the full index/type/id, because docs can be
returned from multiple indices, eg something like this:

Makes sense.

You want the queries to compete with each other to find the better query, eg we could
automatically try to include/exclude or change the boost/weight of different parts of a single query
to improve rankings

While this sounds like a good idea, I think the term "query" is overloaded multiple times in this proposal. So let me phrase this in a slightly different way to see how this use case fits into the current API (and to check that what I think I understood you to be saying is actually what you did say):

So one example of the problem we are talking about is the following: We are running a shop system. Users enter what I will call an end_user_query, like "soldering station", "led", "battery", "egg bot". There are more matching products in our backend than fit on one single page. Each product comes with product_properties like price, weight, availability, colour, maybe some quality score ("suitable for professional use" vs. "for hobbyists only"). We want to come up with an elasticsearch_query that uses the end_user_query and product_properties such that the products listed on top are those highly likely to be perceived as high quality by the end user and to end up being purchased. For an example of how this might look (taken from a code.talks commerce talk, which I attended earlier this year as a speaker): http://project-a.github.io/on-site-search-design-patterns-for-e-commerce/#data-driven-ranking
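
To make that vocabulary a bit more concrete, here's a rough sketch (mine, not from this PR or the linked talk) of what one elasticsearch_query variation combining the end_user_query with product_properties might look like via the Java API; field names, values and boosts are invented:

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class QueryVariationSketch {

    static QueryBuilder variationFor(String endUserQuery) {
        return QueryBuilders.boolQuery()
                // what the user typed into the search box
                .must(QueryBuilders.matchQuery("product_name", endUserQuery))
                // product_properties used as additional ranking signals
                .should(QueryBuilders.termQuery("availability", "in_stock").boost(2.0f))
                .should(QueryBuilders.rangeQuery("quality_score").gte(3).boost(1.5f));
    }
}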

So in your proposal you would let multiple elasticsearch_queries compete against each other. For each competition run, you would use a set of maybe 100 end_user_queries. This could be the same set of sampled* end_user_queries for each elasticsearch_query variation (leading to less work when annotating search results assuming there is some overlap in how products are being ranked).

With the API as is I believe this would mean you'd have to restart the process with each elasticsearch_query variation you want to try out. I'm not sure if we should wrap this use case into the API itself.

Hope this makes sense,
Isabel

  • For downstream users there are all sorts of sampling headaches to take into account: sample over all queries, focus only on "electronics parts" end_user_queries, focus only on queries coming from the US, focus only on queries issued during summer vs. winter, etc.

@clintongormley
Contributor

With the API as is I believe this would mean you'd have to restart the process with each elasticsearch_query variation you want to try out. I'm not sure if we should wrap this use case into the API itself.

Agreed - this would be implemented by an application. I just mentioned the two use cases in case it affects how we return the results.


MaineC commented Jun 20, 2016

With the API as is I believe this would mean you'd have to restart the process with each
elasticsearch_query variation you want to try out. I'm not sure if we should wrap this use case
into the API itself.

Agreed - this would be implemented by an application.

OK.

I just mentioned the two use cases in case it affects how we return the results.

Makes sense. (And I just wanted to figure out whether or not it should be part of the request API ;) )

@clintongormley
Contributor

@MaineC what are the next steps on this PR?


MaineC commented Jun 27, 2016

Currently @cbuescher is looking at it; he has already suggested a few additional changes. He also suggested moving this PR into a separate branch so we can work on changes in parallel, which I think would make sense.

@s1monw proposed adding at least one more evaluation metric before considering moving forward, to see if this fits what we need.

On a related note, @nik9000 put both @cbuescher and me in touch with a guy at Wikimedia who had some input on how they are doing ranking QA. I got permission to share his viewpoint publicly, so I guess it makes sense to open a more general issue to track ideas around this PR, no?

cbuescher pushed a commit that referenced this pull request Jun 30, 2016
This is an initial squashed commit of the work on a new feature for query metrics
proposed in #18798.

MaineC commented Jun 30, 2016

Closing in favor of collaborating on this code in branch https://github.com/elastic/elasticsearch/tree/feature/rank-eval
