This relates to #18798.
This effort tries to help downstream users figure out whether the ranking function they are using produces good enough search result rankings.
To validate that the API does the right thing, we focus on annotation-based quality metrics only (Prec@k, MRR, ERR; see also https://en.wikipedia.org/wiki/Mean_reciprocal_rank and https://en.wikipedia.org/wiki/Learning_to_rank#Evaluation_measures).
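For reference, a minimal sketch of how these metrics can be computed from a ranked list of relevance judgments (plain Python, not part of the proposed API; function names are illustrative only):

```python
# Illustrative implementations of the annotation-based metrics mentioned above.
# `relevance` is a list of graded judgments (0 = irrelevant) in ranked order.

def precision_at_k(relevance, k):
    """Fraction of the top-k results judged relevant (binary: grade > 0)."""
    top_k = relevance[:k]
    return sum(1 for grade in top_k if grade > 0) / k

def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, 0 if none is relevant.
    MRR is the mean of this value over all rated queries."""
    for rank, grade in enumerate(relevance, start=1):
        if grade > 0:
            return 1.0 / rank
    return 0.0

def err(relevance, max_grade=3):
    """Expected Reciprocal Rank for graded judgments."""
    p_not_satisfied = 1.0
    total = 0.0
    for rank, grade in enumerate(relevance, start=1):
        p_rel = (2 ** grade - 1) / (2 ** max_grade)
        total += p_not_satisfied * p_rel / rank
        p_not_satisfied *= (1 - p_rel)
    return total
```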
The API takes a set of queries, each with a set of search results and their relevance with respect to the query. We focus on binary, and maybe graded, relevance for now; it is explicitly not a goal to support very generic per-result ratings.
For simplicity, those evaluation datasets will be supplied at query time instead of being stored in and accessible from a dedicated index. Support for resuming quality computations is something we decided to care about only once users start asking for it.
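To make the shape of that query-time input concrete, here is a hypothetical request body written as a Python dict; the field names ("metric", "requests", "ratings", "rating") are placeholders for illustration, not a committed API:

```python
# Hypothetical evaluation request, supplied entirely at query time.
# All field names are illustrative; the final API may look different.
evaluation_request = {
    "metric": {"precision_at": {"k": 10}},
    "requests": [
        {
            "id": "berlin_query",
            "request": {"query": {"match": {"title": "berlin"}}},
            # Binary (or graded) relevance judgments for known documents.
            "ratings": [
                {"_index": "cities", "_id": "doc1", "rating": 1},
                {"_index": "cities", "_id": "doc2", "rating": 0},
            ],
        }
    ],
}
```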
What we leave to the user:
- Logging queries. We assume that what most users want to do with this API is figure out how well their mapping from the user-supplied query and all sorts of additional information works for ranking search results. As a result, logging plain Elasticsearch queries isn't very helpful. What might be helpful is logging query parameters, e.g. as in template requests. We keep that in mind as future work.
- Creating annotations. This step is left to the downstream user, who needs to figure out how to assign some quality level to search results given a query. It's also left to the downstream user to build a query annotation UI.
As a second step we envision enabling users to automatically learn which weights to assign to query parameters.
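As a very rough sketch only (the grid search and the boost parameters are assumptions, not a planned implementation), learning weights could be as simple as evaluating a templated query under candidate parameter weights and keeping the best-scoring combination:

```python
import itertools

# Rough sketch: brute-force search over boost weights for two hypothetical
# query parameters, keeping the combination with the best evaluation score.
# `evaluate` stands in for a call to the evaluation API described above
# (rated queries in, averaged quality metric out).

def learn_weights(evaluate, candidate_boosts=(0.5, 1.0, 2.0, 5.0)):
    best_score, best_weights = float("-inf"), None
    for title_boost, body_boost in itertools.product(candidate_boosts, repeat=2):
        score = evaluate(title_boost, body_boost)
        if score > best_score:
            best_score, best_weights = score, (title_boost, body_boost)
    return best_weights, best_score

# Example with a toy scoring function standing in for the real evaluation call:
weights, score = learn_weights(lambda t, b: 1.0 - abs(t - 2.0) - abs(b - 0.5))
```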
As a third step we envision downstream users being able to monitor their ranking quality over time. This involves building a QA-specific UI that talks to the API outlined above, maybe integrating Watcher for constant monitoring.