Closed
Labels
feature request: New feature or request. This includes new model, dtype, functionality support
triaged: Issue has been triaged by maintainers
Description
Imagine two requests that share the same prefix arriving at the trt-llm generation runtime. It would be great if those requests could share key-value (KV) cache blocks with each other.
During generation with paged attention, each block in fact corresponds to a particular prefix of the input, so all blocks can be made prefix-addressable: it should be possible to keep a map from a prefix to its block, or from the previous block plus the additional tokens to a new block. This way, blocks would be reused automatically between generation requests with shared prefixes, "out of the box".
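To make the idea concrete, here is a minimal sketch of such a prefix-addressable map, keyed by (previous block, tokens in this block). It is only an illustration: `PrefixBlockTable`, `assign`, and `BLOCK_SIZE` are hypothetical names, not part of the trt-llm API, and the sketch tracks block ids only, with no real KV tensors, eviction, or reference counting.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # tokens per KV block (hypothetical value chosen for readability)

# Key: (id of the previous block, or None for the first block; tokens in this block).
# Chaining through the previous block id makes each key identify a whole prefix.
BlockKey = Tuple[Optional[int], Tuple[int, ...]]


@dataclass
class PrefixBlockTable:
    """Toy index mapping (previous block, block tokens) -> block id."""
    blocks: Dict[BlockKey, int] = field(default_factory=dict)
    next_id: int = 0

    def assign(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return block ids for every full block of `tokens` and how many were reused."""
        ids: List[int] = []
        reused = 0
        prev: Optional[int] = None
        full = len(tokens) - len(tokens) % BLOCK_SIZE  # ignore a trailing partial block
        for start in range(0, full, BLOCK_SIZE):
            key = (prev, tuple(tokens[start:start + BLOCK_SIZE]))
            block_id = self.blocks.get(key)
            if block_id is None:
                # Unseen prefix: allocate a fresh block and register it.
                block_id = self.next_id
                self.next_id += 1
                self.blocks[key] = block_id
            else:
                # Same prefix seen before: share the existing block.
                reused += 1
            ids.append(block_id)
            prev = block_id
        return ids, reused


table = PrefixBlockTable()
system_prompt = list(range(8))  # 8 shared tokens = 2 full blocks
ids_a, reused_a = table.assign(system_prompt + [100, 101, 102, 103])
ids_b, reused_b = table.assign(system_prompt + [200, 201, 202, 203])
print(ids_a, reused_a)  # [0, 1, 2] 0
print(ids_b, reused_b)  # [0, 1, 3] 2
```

The second request reuses the two blocks covering the shared prefix and allocates only one new block for its distinct suffix. Keying on the previous block id rather than on the full token prefix keeps each key small while still distinguishing identical token chunks that appear after different prefixes.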
This could potentially benefit speculative execution as well.