Closed
Labels
feature request (New feature or request. This includes new model, dtype, functionality support), triaged (Issue has been triaged by maintainers), waiting for feedback
Description
There's a new cache technique mentioned in the paper https://arxiv.org/abs/2312.17238. (github: https://github.com/dvmazur/mixtral-offloading)
They introduce an LRU cache to keep recently used experts resident, based on activation patterns they observed, and they also speculatively pre-load the experts guessed for the next layer before its computation starts. The results look quite promising. Can we support this for Mixtral? It would help a lot when running on smaller GPUs.
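The LRU-plus-speculative-prefetch idea can be sketched roughly as below. This is a minimal toy, not the mixtral-offloading implementation: `ExpertLRUCache`, `load_fn`, and the expert IDs are all hypothetical names, and the "weights" here are stand-in objects rather than real GPU tensors.

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Toy LRU cache of MoE expert weights with speculative prefetch.

    Illustrative only; the real system would hold GPU tensors and
    copy weights host->device inside load_fn.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # expert_id -> weights (stand-in)

    def get(self, expert_id, load_fn):
        """Return an expert's weights, loading and evicting as needed."""
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark most recently used
            return self.cache[expert_id]
        weights = load_fn(expert_id)  # e.g. copy from CPU offload to GPU
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return weights

    def prefetch(self, predicted_ids, load_fn):
        """Speculatively load experts guessed for the next layer,
        but only into spare capacity (never evict for a guess)."""
        for eid in predicted_ids:
            if eid not in self.cache and len(self.cache) < self.capacity:
                self.cache[eid] = load_fn(eid)
```

A hit on a cached expert then costs nothing, and a correct speculative guess turns what would have been a blocking load during the next layer into work done ahead of time.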