
@davidkyle (Member) commented:

Batch inference calls use more memory, which can lead to OOM errors in extreme cases. This change iterates over the requests in a batch, evaluating them one at a time.
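
For illustration, here is a minimal sketch of the idea in Python/PyTorch rather than the actual pytorch_inference C++ code; the model, function names and tensor shapes are hypothetical stand-ins. It shows why evaluating the items in a request one at a time bounds peak memory while producing the same results as a single batched forward pass.

```python
# Minimal sketch: evaluate a batched request one item at a time instead of
# running the model forward on the whole batch at once. The model and shapes
# below are hypothetical stand-ins, not the actual pytorch_inference code.
import torch


def infer_all_at_once(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """Old behaviour: one forward pass over the whole batch.

    Peak memory grows with batch size because the activations for every
    item are held in memory at the same time.
    """
    with torch.inference_mode():
        return model(batch)


def infer_one_at_a_time(model: torch.nn.Module, batch: torch.Tensor) -> torch.Tensor:
    """New behaviour: iterate over the items in the request.

    Each forward pass only needs the activations for a single item, so
    peak memory stays roughly constant regardless of batch size.
    """
    outputs = []
    with torch.inference_mode():
        for item in batch:
            # item has shape (seq_len,); restore the batch dimension.
            outputs.append(model(item.unsqueeze(0)))
    return torch.cat(outputs, dim=0)


if __name__ == "__main__":
    # Hypothetical stand-in for a transformer: an embedding plus a linear layer.
    model = torch.nn.Sequential(torch.nn.Embedding(1000, 64), torch.nn.Linear(64, 8))
    batch = torch.randint(0, 1000, (50, 512))  # 50 items of 512 tokens each

    batched = infer_all_at_once(model, batch)
    singular = infer_one_at_a_time(model, batch)
    assert torch.allclose(batched, singular)
```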

Comparing batched to un-batched evaluation, benchmarking shows that memory usage is significantly lower when the items are evaluated one at a time, while the total time for inference is similar in both cases. The benchmark data was generated with the ELSER model using batches of different sizes; each item in the batch contained 512 tokens. Inference Time is the time to process the entire batch, whether the items are evaluated singularly or all at once.

| Num items in request | Batched Memory Max RSS (MB) | Un-batched Memory Max RSS (MB) | Batched Inference Time (ms) | Un-batched Inference Time (ms) |
|---|---|---|---|---|
| 0 | 946 | 943 | 0 | 0 |
| 10 | 2605 | 1219 | 5022 | 5309 |
| 20 | 4237 | 1234 | 9717 | 9478 |
| 30 | 5960 | 1239 | 14434 | 14408 |
| 40 | 6032 | 1251 | 19902 | 19396 |
| 50 | 6616 | 1257 | 24853 | 24112 |

Co-authored-by: David Roberts <[email protected]>

@droberts195 left a comment:

LGTM
