@JohannesGaessler
Collaborator

This PR replaces the loop of memcpy calls used for synchronization between GPUs with a single memcpy2D call. Token generation (tg128) is unchanged, but prompt processing (pp) is 2.1x to 2.8x faster in the tests below:

| GPU | Model | Test | t/s 1x P40 | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|---|
| 3x P40 | 7b q4_0 | tg128 | 47.41 | 43.45 | 43.53 | 1 |
| 3x P40 | 13b q4_0 | tg128 | 26.30 | 29.83 | 29.89 | 1 |
| 3x P40 | 33b q4_0 | tg128 | 11.51 | 15.37 | 15.38 | 1 |
| 3x P40 | 7b q4_0 | pp | 454.89 | 122.02 | 346.80 | 2.84 |
| 3x P40 | 13b q4_0 | pp | 258.17 | 85.05 | 206.76 | 2.43 |
| 3x P40 | 33b q4_0 | pp | 104.97 | 46.42 | 99.17 | 2.14 |

For small models, multiple fast GPUs still seem to be slower than a single fast GPU because of the synchronization overhead.
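
To illustrate the change described at the top of this PR, here is a minimal host-only sketch of the two copy patterns. It is not the code from this PR: `HostCopier`, `sync_with_loop`, and `sync_with_single_call` are hypothetical names, and the copies run on the CPU with `std::memcpy` so the example needs no GPU. The parameter order of `copy_2d` is modeled on the CUDA runtime's `cudaMemcpy2D` (destination, destination pitch, source, source pitch, width in bytes, height in rows). The `calls` counter stands in for the per-call overhead that a loop of small copies pays once per row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a copy API. It counts how many calls are issued,
// which models the fixed overhead paid per copy call.
struct HostCopier {
    std::size_t calls = 0;

    // Contiguous copy of `bytes` bytes (one API call).
    void copy(void* dst, const void* src, std::size_t bytes) {
        ++calls;
        std::memcpy(dst, src, bytes);
    }

    // Pitched 2D copy (one API call for all rows). `dpitch` and `spitch` are
    // the byte distances between the starts of consecutive rows; `width` is
    // the number of bytes copied per row.
    void copy_2d(void* dst, std::size_t dpitch,
                 const void* src, std::size_t spitch,
                 std::size_t width, std::size_t height) {
        assert(width <= dpitch && width <= spitch);
        ++calls;
        auto* d = static_cast<unsigned char*>(dst);
        const auto* s = static_cast<const unsigned char*>(src);
        for (std::size_t row = 0; row < height; ++row) {
            std::memcpy(d + row * dpitch, s + row * spitch, width);
        }
    }
};

// Old pattern: one copy call per row.
void sync_with_loop(HostCopier& api, void* dst, std::size_t dpitch,
                    const void* src, std::size_t spitch,
                    std::size_t width, std::size_t height) {
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        api.copy(d + row * dpitch, s + row * spitch, width);
    }
}

// New pattern: a single pitched copy that covers every row.
void sync_with_single_call(HostCopier& api, void* dst, std::size_t dpitch,
                           const void* src, std::size_t spitch,
                           std::size_t width, std::size_t height) {
    api.copy_2d(dst, dpitch, src, spitch, width, height);
}
```

Both functions write the same bytes to the destination. The difference is that the loop issues `height` calls while the 2D variant issues one, which matters when the fixed cost per call is large relative to the data copied per row.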

@JohannesGaessler merged commit 9baf9ef into ggml-org:master on Jul 29, 2023