feat: Enable multi-token drafting for GLM-4.5 MTP #2
For now, this is just a proof of concept I had in mind. I tried to replicate the workflow from SGLang, because even vLLM's proper implementation does not draft more than one token, as stated in this comment on their PR.
The drafting process must be an autoregressive loop to accommodate the CPU-side sampling required between each token generation; in case you are curious, the loop in SGLang follows this same pattern (it is shown in their PR).
With that in mind, I implemented a loop in `mtp_speculative_gen_draft` that runs for however many draft tokens you want at once, which gives us a similar setup. The major problem is a significant limitation in KV cache management: the KV cache is not persisted between draft steps, so draft quality degrades as more tokens are generated.
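To make the shape of that loop concrete, here is a minimal, self-contained sketch in C++. The MTP forward pass is stubbed with dummy logits, and `mtp_forward_one` / `N_VOCAB` are hypothetical names invented for this sketch; only the structure (forward one step, sample on the CPU, feed the token back in) mirrors what the loop in `mtp_speculative_gen_draft` does.

```cpp
// Illustrative sketch only: the forward pass is a stub, not the real model.
#include <cstdio>
#include <vector>
#include <algorithm>

static const int N_VOCAB = 32; // toy vocabulary size

// Stand-in for one MTP forward step: returns logits for the next token.
// In the real code this runs the model's MTP head; it is also the step
// that currently recomputes without a persisted KV cache.
static std::vector<float> mtp_forward_one(int last_token) {
    std::vector<float> logits(N_VOCAB, 0.0f);
    logits[(last_token * 7 + 3) % N_VOCAB] = 1.0f; // deterministic dummy
    return logits;
}

static std::vector<int> mtp_gen_draft(int last_sampled, int n_draft) {
    std::vector<int> draft;
    int cur = last_sampled;
    for (int i = 0; i < n_draft; ++i) {
        std::vector<float> logits = mtp_forward_one(cur);
        // CPU-side sampling between steps (greedy argmax for simplicity).
        int best = (int) (std::max_element(logits.begin(), logits.end()) - logits.begin());
        draft.push_back(best);
        cur = best; // feed the drafted token back in: the loop is autoregressive
    }
    return draft;
}

int main() {
    const int n_mtp_draft_target = 5; // mirrors the hardcoded value in the PR
    for (int t : mtp_gen_draft(/*last_sampled=*/1, n_mtp_draft_target)) {
        printf("%d ", t);
    }
    printf("\n");
    return 0;
}
```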
Here are some preliminary results showing the drop in acceptance rate (it is hardcoded for now to run 5 drafts at once via `const int n_mtp_draft_target = 5`):
If my concept is correct, then we "just" need to fix the KV cache, and here is the nightmare that you have probably walked through before. A proper solution would likely involve creating a persistent KV cache context for the duration of the draft loop.
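To illustrate what "persistent for the duration of the draft loop" could mean mechanically, here is a toy checkpoint/rollback scope over a cache. This is purely illustrative and is not the llama.cpp KV cache API; `toy_kv_cache` and all its methods are invented for this sketch.

```cpp
// Toy model of the proposed fix: draft steps write into a scratch tail of
// the KV cache that persists across the loop, and the tail is truncated
// back once the target model has verified the draft.
#include <cstdio>
#include <vector>

struct toy_kv_cache {
    std::vector<int> cells; // one "cell" per cached token position

    size_t checkpoint() const { return cells.size(); }
    void   append(int token)  { cells.push_back(token); }
    void   rollback(size_t n) { cells.resize(n); } // drop unverified cells
};

int main() {
    toy_kv_cache kv;
    for (int t : {10, 11, 12}) kv.append(t); // committed context so far

    const size_t mark = kv.checkpoint();     // start of the draft scope

    // Autoregressive draft loop: each step can now attend to cells written
    // by the previous draft steps instead of recomputing without them.
    for (int i = 0; i < 5; ++i) {
        kv.append(100 + i); // stand-in for a freshly drafted token
    }

    // Suppose verification accepted the first 2 drafted tokens:
    kv.rollback(mark + 2);                   // keep accepted, drop the rest

    printf("cells after rollback: %zu\n", kv.cells.size()); // prints 5
    return 0;
}
```

In llama.cpp terms this would presumably map onto the sequence-based KV cache operations, but which mechanism fits the MTP head is exactly the open question here.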
I can see two options to follow; one of them is tied to the "server: implement GLM-style MTP" work (once that PR is merged).