- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.4k
CUDA: General GEMV fusion #16715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CUDA: General GEMV fusion #16715
Conversation
c0a69df    to
    22ee634      
    Compare
  
    | @ggerganov after #16649 and this PR, tg for gpt-oss models should increase by ~9-10% | 
| Curious but how much does this increase binary size for the cuda backend? | 
| 
 It increases about ~20% (from 30M to 36M on my machine) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do performance testing on either Friday or Saturday when (hopefully) I'll finally be able to get the RTX 5090 that NVIDIA sent me to work.
| Regarding binary size: when I compile the CUDA backend with  In any case, for MMVF we can shave off a bit independently of this PR by only compiling it for cases not covered by MMF. | 
| Since the main-use is ncols=1,I am also okay in just doing fusion for that case. | 
| That would I think also be fine. Matrix multiplications with small batch sizes > 1 are relevant for batched inference throughput and speculative decoding but we can always revisit those cases later. | 
a6e0d34    to
    9b95697      
    Compare
  
    | Simplified the code to just fuse on ncols_dst = 1, now binary size and compilation time should be mostly unaffected with this change | 
| const int blocks_per_row_x = ncols_x / qk; | ||
| constexpr int blocks_per_iter = vdr * nwarps*warp_size / qi; | ||
|  | ||
| // The MUL_MAT_ID code path with ids != nullptr is only implemented for ncols_dst == 1. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why you're removing comments such as this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked codex to do this and it seems to have removed comments, possibly to compress it's context window. I added them back
6614a9b    to
    65a098f      
    Compare
  
    | When I tested performance: 
 On the P40 the fused MMVQ kernel does not seem to be consistently faster so I would suggest enabling fusion of that kernel only for Volta and newer. | 
| Thanks for testing! | 
This is a follow up to #16630. This PR adds ability to fuse the following common GEMV operations:
It uses a template bool to determine if we are in the fusion path, then does runtime checks for which fusion path to take. This PR also splits up
mmvq(by type) andmmvf(byncols-dst) as their compile times were becoming large after this change. This change helps TG (which is IO bound) to almost all class of models. Apart from adding tests totest-backend-opsI also spot-checked perplexity on a couple of models and it is unchanged by this change.Tested on 6x 4090