-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ML][Inference] Fix model pagination with models as resources #51573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML][Inference] Fix model pagination with models as resources #51573
Conversation
|
Pinging @elastic/ml-core (:ml) |
x-pack/plugin/src/test/resources/rest-api-spec/test/ml/inference_crud.yml
Show resolved
Hide resolved
x-pack/plugin/src/test/resources/rest-api-spec/test/ml/inference_crud.yml
Show resolved
Hide resolved
| } | ||
|
|
||
| public void testExpandIdsPagination() { | ||
| //NOTE: these test assume that the query pagination results are "buffered" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| //NOTE: these test assume that the query pagination results are "buffered" | |
| // NOTE: these tests assume that the query pagination results are "buffered" |
| public void expandIds(String idExpression, | ||
| boolean allowNoResources, | ||
| @Nullable PageParams pageParams, | ||
| PageParams pageParams, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, are pageParams guaranteed to be non-null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the pageParams parameter really necessary? I'd argue that I would expect expandIds to return all the ids rather than a page as the id small compared to a full document.
The default value of PageParams.getSize() is 100. Without setting a page param object you may not realise that limitation is in place and that you are not getting all of your ids. Is there even a mechanism to say you need to get the second page now?
Compare to JobConfigProvider.expandJobIds which gives you AnomalyDetectorsIndex.CONFIG_INDEX_MAX_RESULTS_WINDOW = 10_000 ids.
It would also simplify this code to remove to the pageParams which is a very good reason to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the
pageParamsparameter really necessary?
Yes, I think it is. The APIs making calls against this function require paging params (default is from: 0, size: 100). I think this is a sane way of handling larger numbers of items.
Compare to
JobConfigProvider.expandJobIdswhich gives youAnomalyDetectorsIndex.CONFIG_INDEX_MAX_RESULTS_WINDOW = 10_000ids.
I think that this is a bug and shows that we don't support good pagination at all with anomaly detection jobs.
Is there even a mechanism to say you need to get the second page now?
Yes, the total count of all the ids found with id pattern is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, are
pageParamsguaranteed to be non-null?
It is not possible to set them as null via the API since you can only set from and size individually.
If it is null from a client, they just don't set those parameters (and thus get the default value).
If an internal caller is making these calls, they should not explicitly set pageParams to null and if they do, they will receive an error pretty quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern was that with default paging we open ourselves up to bugs such as elastic/kibana#38559.
I assumed getTrainedModel would take paging parameters in which case paging in expandIds would be redundant but I see from the way the code is used that is not the case and GetTrainedModelsStatsAction explicitly depends on paging ids.
AD jobs are not paged because when they lived in clusterstate all where returned by the GET API. It would have been a breaking change to require paging on that API when we move to the .ml-config index and we could not make a breaking change in that release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the apis GET _ml/trained_models and GET _ml/trained_models/_stats accept paging parameters and use those when expanding the IDs.
Since they are stored in indices, paging parameters are a must (even for expanding IDs) and might as well have it handled now instead of worrying about adding them later when we reach scaling issues.
| if (bufferedFrom > 0) { | ||
| allFoundIds.remove(allFoundIds.first()); | ||
| bufferedFrom--; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | |
| } else { |
| while (allFoundIds.size() > sizeLimit) { | ||
| // If we are still over limit, and have buffered items, that means the first ids belong on the previous page | ||
| if (bufferedFrom > 0) { | ||
| allFoundIds.remove(allFoundIds.first()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it makes sense to optimize this, but I could imagine having an iterator going through allFoundIds and calling .remove() on the iterator rather than removing the smallest item from the collection directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, popping the iterator, for sure. I actually find it more readable to have a remote(..first()). These sets are usually "small" and right now the highest number that we will "buffer" is the total number of models stored as resources (which is 1).
| client::search); | ||
| } | ||
|
|
||
| static Set<String> collectIds(PageParams pageParams, Set<String> foundFromResources, Set<String> foundFromDocs, long totalMatchedIds) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this method somewhat hard to follow.
Should the logic be similar to merge sort? I.e. we have two sorted collections and take size elements (in total) from whichever collection has the lowest id at the front. from would be per-collection (so we'd have two froms).
LMK if you think it's possible to rewrite it this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be rewritten like a merge sort, but the issue still resides "From which side of the page do we remove items?" And we would not know the ends of the pages until the set is fully merged. So, I am not sure how it would be helpful.
przemekwitek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
davidkyle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's much tricker than I thought.
| if (tags.isEmpty()) { | ||
| foundResourceIds.addAll(matchedResourceIds(tokens)); | ||
| } else { | ||
| for(String resourceId : matchedResourceIds(tokens)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated nit: Can MODELS_STORED_AS_RESOURCE be a Collections.unmodifiableSet then line 520
return new HashSet<>(MODELS_STORED_AS_RESOURCE);
in matchedResourceIds wouldn't have to wrap a set in a set
| requiredMatches.filterMatchedIds(foundResourceIds); | ||
| requiredMatches.filterMatchedIds(allFoundIds); | ||
| if (requiredMatches.hasUnmatchedIds()) { | ||
| idsListener.onFailure(ExceptionsHelper.missingTrainedModel(requiredMatches.unmatchedIdsString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this error in a valid situation?
from = 0, size = 10, idExpression = 'bar*,foo'
I have the models bar-1, bar-2 ... bar-10 and foo.
The first 10 bar-n are returned by the search but foo is unmatched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidkyle i think this was a problem well before I added this logic. Data frame analytics suffers from this as well.
I think it is OK to error in this case.
I may be wrong though (about it being OK) :D.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly it's not part of this change but let's follow up on it. I don't think we should be reporting en error when we don't know if it is an error or not, maybe we can relax the check
| // We "buffer" the from and size to take into account models stored as resources. | ||
| // This is so we handle the edge cases when the model that is stored as a resource is at the start/end of | ||
| // a page. | ||
| .from(Math.max(0, pageParams.getFrom() - foundResourceIds.size())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the strategy is to get extra then figure out where the resource model ids would fit in that sorted list 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidkyle exactly. If we don't have the buffer, AND the resource model ID is at the start of the list, we don't know if it is ACTUALLY at the start of the page, or the END of the previous page.
Hence the need for a buffer.
| int bufferedFrom = Math.min(foundFromResources.size(), from); | ||
|
|
||
| // If size = 10_000 but there aren't that many total IDs, reduce the size here to make following logic simpler | ||
| int sizeLimit = (int)Math.min(pageParams.getSize(), totalMatchedIds - from); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the number of search hits and and resource model ids are < pageParams.getSize() then that is the size limit. from isn't a factor here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think from is a factor here. totalMatchedIds is the total hits, which means the total number of ids that matched the query which could be WAY bigger than the size.
If there are a total of 20 IDs, and the user is doing from: 19, size:200, we should give them the rest of the IDs. Our size limit in this case shouldn't be 200, it should be 1
.../ml/src/main/java/org/elasticsearch/xpack/ml/inference/persistence/TrainedModelProvider.java
Show resolved
Hide resolved
| allFoundIds.remove(allFoundIds.last()); | ||
| } | ||
| } | ||
| return allFoundIds; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My way of looking at it is this: we have a request for 10 ids and 2 models stored as resource. We query for 14 ids so we can figure out where the resource model ids would fit. We have a result like
AAOOOOOOOOOOBB
where the As and Bs are the padding.
If the model resource Ids come before the As then they are outside the ordering and all the Os are returned. Same if the model resource ids come after the Bs.
If they are inserted middle (ids M & N) we have an ordering like
AAOOOOOMOOONOOBB
we we take the first 10 after the As.
Same for
AAMOOOOOONOOOOBB
MNAAOOOOOOOOOOBB
AMAOOOOOONOOOOBB
AAOOOOOOOOOOBBMN
I think we have to track the insertion position of the model resource Ids. Collections.binarySearch() would give the insert position.
There is a further complication when the search does not return size hits and you have
AAOOOOOOOOOO
or
AAOOOO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no BB at the end. moving back from by 2 and increasing size by 2 simply adds two new items to the start because it moves the page view back but it will still include the end of the initially desired page.
If that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me take your example and run with it. Assuming size is 10.
10 ids and 2 models
Since we decrease from by two, we need to increase the size by two to fill out the page. If we didn't we would end up with
AAOOOOOOOO (2 As, 8 Os)
But, we increase the size to capture those two Os and our search from the docs returns
AAOOOOOOOOOO
If the resource models are inserted middle (ids M & N) we have an ordering like:
AAOOOOOMOOONOO
since our size is 14 in this case, we trim the first two As and then the last two O
Returning to the user
OOOOOMOOON
Here are the results for the rest of your example insertion points
- MOOOOOONOO
- AAOOOOOOOO
- AOOOOOONOO
- OOOOOOOOOO
I think 1 and 4 are obvious OK situations.
2 and 3 may seem weird off hand, but they are OK as well, since M or N would have been at the END of the previous page and "pushed" these two buffered IDS onto this page.
I think we have to track the insertion position of the model resource Ids.
I don't think this is possible since we don't know all the page boundries (without the buffers) and knowing the insertion point does not add anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. From those examples it seems once you have taken the 2 sorted sets and merged them together there are 2 conditions to consider.
from == 0in which case we remove from the back of the set untilallFoundIds.size <= pageparams.size
In this case there are no As to remove so we trim the set to size from the back
from > 0, remove the number of resource models from the front of the set - this is the As or another id that displaced the As. Then trim the set to size from the back.
With those observations the code could be simplified to
static Set<String> collectIds(PageParams pageParams, Set<String> foundFromResources, Set<String> foundFromDocs) {
// If there are no matching resource models, there was no buffering and the models from the docs
// are paginated correctly.
if (foundFromResources.isEmpty()) {
return foundFromDocs;
}
TreeSet<String> allFoundIds = new TreeSet<>(foundFromDocs);
allFoundIds.addAll(foundFromResources);
if (pageParams.getFrom() > 0) {
// not the first page so there will be extra results at the front to remove
int numToTrimFromFront = Math.min(foundFromResources.size(), pageParams.getFrom());
for (int i = 0; i < numToTrimFromFront; i++) {
allFoundIds.remove(allFoundIds.first());
}
}
// trim down to size removing from the rear
while (allFoundIds.size() > pageParams.getSize()) {
allFoundIds.remove(allFoundIds.last());
}
return allFoundIds;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidkyle wrote some more test cases, and this solution works great. I modified my code :D
davidkyle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for iterating on this
…c#51573) This adds logic to handle paging problems when the ID pattern + tags reference models stored as resources. Most of the complexity comes from the issue where a model stored as a resource could be at the start, or the end of a page or when we are on the last page.
#51736) This adds logic to handle paging problems when the ID pattern + tags reference models stored as resources. Most of the complexity comes from the issue where a model stored as a resource could be at the start, or the end of a page or when we are on the last page.
This adds logic to handle paging problems when the ID pattern + tags reference models stored as resources.
Most of the complexity comes from the issue where a model stored as a resource could be at the start, or the end of a page or when we are on the last page.
closes #51543