Commit bf83bff
authored
metal : matrix-matrix multiplication kernel (#2615)
* metal: matrix-matrix multiplication kernel
This commit removes MPS and uses custom matrix-matrix multiplication
kernels for all quantization types. This commit also adds grouped-query
attention to support llama2 70B.
* metal: fix performance degradation from gqa
Integers are slow on the GPU, and 64-bit divides are extremely slow.
In the context of GQA, we introduce a 64-bit divide that cannot be
optimized out by the compiler, which results in a decrease of ~8% in
inference performance. This commit fixes that issue by calculating a
part of the offset with a 32-bit divide. Naturally, this limits the
size of a single matrix to ~4GB. However, this limitation should
suffice for the near future.
* metal: fix bugs for GQA and perplexity test.
I mixed up ne02 and nb02 in previous commit.1 parent b5ffb28 commit bf83bff
File tree
6 files changed
+528
-636
lines changed6 files changed
+528
-636
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
296 | 296 | | |
297 | 297 | | |
298 | 298 | | |
299 | | - | |
300 | 299 | | |
301 | 300 | | |
302 | 301 | | |
| |||
313 | 312 | | |
314 | 313 | | |
315 | 314 | | |
316 | | - | |
317 | 315 | | |
318 | 316 | | |
319 | 317 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
283 | 283 | | |
284 | 284 | | |
285 | 285 | | |
286 | | - | |
| 286 | + | |
287 | 287 | | |
288 | 288 | | |
289 | 289 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
14 | 14 | | |
15 | 15 | | |
16 | 16 | | |
17 | | - | |
18 | | - | |
19 | 17 | | |
20 | 18 | | |
21 | 19 | | |
| |||
0 commit comments