Feature Request: Support multiple devices on a single rpc-server #15210

@nguha

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Make a single rpc-server able to handle multiple devices (GPUs). That way it could pass data directly from one device to the next when applicable.

Motivation

Goal: fully utilize the interconnect (PCIe) available on the rpc-server machine.

As of now (version b6084) you must launch one rpc-server instance per device (typically a GPU) on your inference machine.
With two or more devices, this results in sub-optimal use of the available interconnects (e.g. PCIe).
As far as I can see, even when devices on the rpc-server machine host contiguous model layers, all communication between them still goes over the network, to the client (llama-cli) and back.
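For reference, the current per-device workaround looks roughly like this. The flags follow the upstream RPC example; the host addresses, ports, and model path are illustrative, not prescriptive:

```shell
# One rpc-server instance per GPU, each pinned to its device
# via CUDA_VISIBLE_DEVICES and listening on its own port:
CUDA_VISIBLE_DEVICES=0 ./rpc-server --host 0.0.0.0 --port 50052 &
CUDA_VISIBLE_DEVICES=1 ./rpc-server --host 0.0.0.0 --port 50053 &

# The client connects to both endpoints separately, so any
# data moving between the two GPUs crosses the network twice:
./llama-cli -m model.gguf --rpc 192.168.1.10:50052,192.168.1.10:50053
```

With a single multi-device rpc-server, the two endpoints above would collapse into one, and inter-GPU transfers could stay on the local PCIe bus.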

I guess this was done for the sake of simplicity and ease of implementation.
There are therefore performance gains to be made by making a single rpc-server able to handle all devices on a machine.

P.S. Thank you so much to everyone who put in the effort to build the rpc-server. It is such a great feature!

Possible Implementation

No response

Metadata

Labels

enhancement (New feature or request)
