
Conversation

ggerganov
Member

target #14756

Relax the requirement for contiguously allocated K/V buffers in the quantized case.

I am not 100% sure this is the optimal solution in terms of memory usage, but at least the results are OK now.

github-actions bot added labels on Jul 22, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning)
ggerganov mentioned this pull request on Jul 22, 2025
@JohannesGaessler
Collaborator

I have a WIP change that fixes this by adding non-contiguous support to the dequantization kernels. There is still a bug somewhere; I'll try to open a PR this evening.

> I am not 100% sure this is the optimal solution in terms of memory usage, but at least the results are OK now.

This is more an issue of kernel launch overhead: you're launching one kernel per sequence, and each of those kernels will also have poor hardware utilization.

@ggerganov
Member Author

Replaced by #14822

@ggerganov ggerganov closed this Jul 23, 2025