Closed
Labels
feature request (New feature or request. This includes new model, dtype, functionality support), triaged (Issue has been triaged by maintainers), waiting for feedback
Description
There's a new cache technique mentioned in the paper https://arxiv.org/abs/2312.17238. (github: https://github.com/dvmazur/mixtral-offloading)
They introduce an LRU cache to keep recently used experts resident, based on activation patterns they observed, and they also speculatively pre-load the experts guessed for the next layer before its computation starts. The results look quite promising. Can we support this for Mixtral? It would help a lot when running on smaller GPUs.
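The LRU-plus-speculative-prefetch idea can be sketched roughly as below. This is a minimal toy, not the mixtral-offloading implementation: `ExpertLRUCache`, `load_fn`, and the expert IDs are all hypothetical names, and the "weights" here are stand-in objects rather than real GPU tensors.

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Toy LRU cache of MoE expert weights with speculative prefetch.

    Illustrative only; the real system would hold GPU tensors and
    copy weights host->device inside load_fn.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # expert_id -> weights (stand-in)

    def get(self, expert_id, load_fn):
        """Return an expert's weights, loading and evicting as needed."""
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark most recently used
            return self.cache[expert_id]
        weights = load_fn(expert_id)  # e.g. copy from CPU offload to GPU
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return weights

    def prefetch(self, predicted_ids, load_fn):
        """Speculatively load experts guessed for the next layer,
        but only into spare capacity (never evict for a guess)."""
        for eid in predicted_ids:
            if eid not in self.cache and len(self.cache) < self.capacity:
                self.cache[eid] = load_fn(eid)
```

A hit on a cached expert then costs nothing, and a correct speculative guess turns what would have been a blocking load during the next layer into work done ahead of time.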