Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Make a single rpc-server instance able to handle multiple devices (GPUs), so that it can pass data directly from one device to the next when applicable.
Motivation
Goal: fully utilize the interconnect (PCIe) available on the rpc-server machine.
As of now (version b6084), you must launch one rpc-server instance per device (typically a GPU) on your inference server.
If you have two or more devices, this results in sub-optimal usage of the available interconnects (e.g. PCIe).
As far as I can see, even when devices on the rpc-server machine host contiguous model layers, all communication between them still goes through the network, to the client (llama-cli) and back.
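To make the current setup concrete, here is a sketch of what a dual-GPU machine requires today (a CUDA build is assumed; the IP address and ports are illustrative):

```sh
# Inference server: one rpc-server instance per GPU,
# each pinned to a single device via CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0 ./rpc-server --host 0.0.0.0 --port 50052 &
CUDA_VISIBLE_DEVICES=1 ./rpc-server --host 0.0.0.0 --port 50053 &

# Client: both instances are separate network endpoints, so a tensor
# handed off between GPU 0 and GPU 1 makes a round trip through
# llama-cli instead of using the local PCIe link
./llama-cli -m model.gguf -ngl 99 \
    --rpc 192.168.1.10:50052,192.168.1.10:50053
```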
I guess this was done for the sake of simplicity and ease of implementation.
There are therefore performance gains to be made by making a single rpc-server instance able to handle all devices on a machine.
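A single instance could then expose both GPUs and copy tensors between them locally. Purely as a hypothetical sketch (the `--device` argument below does not exist; the flag name and syntax would be up to the implementation):

```sh
# Hypothetical: one rpc-server instance serving both GPUs, free to
# move tensors between them over PCIe without a network round trip
./rpc-server --host 0.0.0.0 --port 50052 --device CUDA0,CUDA1
```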
P.S. Thank you so much to everyone who put in the effort to build the rpc-server. It is such a great feature!
Possible Implementation
No response