CUDA: faster multi GPU synchronization #2448

JohannesGaessler · 2023-07-29T19:32:08Z

This PR replaces the loop of memcpy calls used for synchronization between GPUs with a single memcpy2D call. Prompt processing (pp) is much faster as a result, while token generation (tg128) is unchanged:

| GPU | Model | Test | t/s 1x P40 | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|---|
| 3x P40 | 7b q4_0 | tg128 | 47.41 | 43.45 | 43.53 | 1 |
| 3x P40 | 13b q4_0 | tg128 | 26.30 | 29.83 | 29.89 | 1 |
| 3x P40 | 33b q4_0 | tg128 | 11.51 | 15.37 | 15.38 | 1 |
| 3x P40 | 7b q4_0 | pp | 454.89 | 122.02 | 346.80 | 2.84 |
| 3x P40 | 13b q4_0 | pp | 258.17 | 85.05 | 206.76 | 2.43 |
| 3x P40 | 33b q4_0 | pp | 104.97 | 46.42 | 99.17 | 2.14 |

For small models, multiple fast GPUs still seem to be slower than a single fast GPU because of the synchronization overhead.
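To illustrate the idea behind the change, here is a minimal host-only C++ sketch, not the code from this PR. It assumes each GPU produces a compact slice of `slice_width` floats per row that has to be merged into a wider destination buffer at column offset `col0`. The names `copy_2d`, `merge_slice_loop` and `merge_slice_2d` are hypothetical. `copy_2d` only mimics the parameter layout of `cudaMemcpy2D` (destination pitch, source pitch, row width in bytes, row count) without the `cudaMemcpyKind` argument. On a real device the gain would come from issuing one strided copy instead of one copy call per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-side stand-in for a 2D copy. It copies `height` rows of
// `width` bytes each; consecutive rows start `spitch` bytes apart in `src`
// and `dpitch` bytes apart in `dst` (same meaning as in cudaMemcpy2D).
void copy_2d(void * dst, size_t dpitch, const void * src, size_t spitch,
             size_t width, size_t height) {
    assert(width <= dpitch && width <= spitch);
    for (size_t row = 0; row < height; ++row) {
        std::memcpy(static_cast<char *>(dst) + row*dpitch,
                    static_cast<const char *>(src) + row*spitch,
                    width);
    }
}

// Old pattern (assumed): the caller issues one copy call per row.
void merge_slice_loop(std::vector<float> & dst, size_t full_width,
                      const std::vector<float> & slice, size_t slice_width,
                      size_t col0, size_t nrows) {
    for (size_t r = 0; r < nrows; ++r) {
        std::memcpy(dst.data() + r*full_width + col0,
                    slice.data() + r*slice_width,
                    slice_width*sizeof(float));
    }
}

// New pattern (assumed): the caller issues a single strided copy.
// The pitches describe how far apart rows are in each buffer.
void merge_slice_2d(std::vector<float> & dst, size_t full_width,
                    const std::vector<float> & slice, size_t slice_width,
                    size_t col0, size_t nrows) {
    copy_2d(dst.data() + col0, full_width*sizeof(float),
            slice.data(),      slice_width*sizeof(float),
            slice_width*sizeof(float), nrows);
}
```

Both functions write the same bytes; the second expresses the whole transfer as one call, which is the property a device-side 2D copy can exploit to avoid per-row call overhead.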

CUDA: faster multi GPU synchronization

d641b80

slaren approved these changes Jul 29, 2023

JohannesGaessler merged commit 9baf9ef into ggml-org:master Jul 29, 2023

JohannesGaessler mentioned this pull request Dec 25, 2023

cuda : fix vmm pool with multi GPU #4620

Merged
