Small sampling optimizations #5074
Conversation
Number of sampleable candidates can never be zero in the first place
With default sampler settings, where the Repetition Penalty is set to 1.1 (even with this PR), about 10% of generation time on q8_0 Mistral 7B is spent sampling. On top of this, if I change the default sampler order so that top-k comes second to last, a low top-k value does not help sampling efficiency at all. So the rep-pen implementation plus some sorting that happens later (after top-k, presumably?) appear to be the current bottlenecks.
const int64_t t_start_sample_us = ggml_time_us();

k = std::max(k, (int) min_keep);
k = std::min(k, (int) candidates->size);
I'm realizing now that these sanity checks will not run if we return early. Does that matter at all, or should I revert the changes from the "Standardize top k" commit (I assume they are not actually necessary)?
I think they need to remain - simply move them before the "if sorted" early return at the top
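For illustration, the suggested ordering would look something like this - a toy stand-in for the real llama_token_data_array and llama_sample_top_k, not the actual code in this PR:

#include <algorithm>
#include <functional>
#include <vector>

// Toy stand-in for llama_token_data_array, just to show the ordering.
struct toy_candidates {
    std::vector<float> logits; // kept in descending order once sorted
    bool sorted = false;
};

void toy_top_k(toy_candidates & c, int k, size_t min_keep) {
    // Clamp k first, so the bounds apply even when we take the early return below.
    k = std::max(k, (int) min_keep);
    k = std::min(k, (int) c.logits.size());

    if (c.sorted) {
        c.logits.resize(k); // already sorted: just truncate and return early
        return;
    }

    std::partial_sort(c.logits.begin(), c.logits.begin() + k, c.logits.end(),
                      std::greater<float>());
    c.logits.resize(k);
    c.sorted = true;
}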
Which change exactly results in the improvement? Most of these are no-ops, except the temperature check, and I don't think that alone can account for a difference of this size.
Calculating the exp and the sum in one pass, in the softmax function.
I don't get it - it's the same thing before and after, apart from a temporary variable. It cannot have any effect on performance.
I might have tricked myself into thinking it did something because of a build flag / debug vs. main build distinction (I started using CMake recently). Prompt eval speed is consistent if I abandon the commits and rebuild from scratch, so it's probably that.
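For reference, the one-pass exp-and-sum variant under discussion amounts to roughly this - an illustrative sketch on plain vectors, not the PR's actual diff:

#include <cmath>
#include <vector>

// One-pass softmax over logits assumed to be sorted in descending order
// (so the first element is the max), as discussed above.
void softmax_one_pass(std::vector<float> & v) {
    if (v.empty()) return;
    const float max_logit = v[0];
    float cum_sum = 0.0f;
    for (float & x : v) {
        x = std::exp(x - max_logit); // exponentiate with the max subtracted for stability
        cum_sum += x;                // and accumulate the normalizer in the same loop
    }
    for (float & x : v) {
        x /= cum_sum;                // normalize in a second, unavoidable pass
    }
}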
OK, this is odd. With topk=31999 sampling is 1.6x faster, i.e. when we sort almost, but not exactly, the whole vocabulary (on mainline, not this PR).
const int64_t t_start_sample_us = ggml_time_us();

if (temp == 1.0f) {
    return; // No adjustment needed as dividing by 1 leaves the values unchanged
}
Should make it consistent and return before the ggml_time_us() call, like in the other sampler calls.
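Something along these lines, with the no-op check hoisted above the timing call - a sketch of the suggestion, assuming the temperature sampler keeps its current shape, not the literal diff:

void llama_sample_temp(struct llama_context * ctx, llama_token_data_array * candidates, float temp) {
    if (temp == 1.0f) {
        return; // no-op: dividing by 1 changes nothing, so skip the timing bookkeeping too
    }

    const int64_t t_start_sample_us = ggml_time_us();

    for (size_t i = 0; i < candidates->size; ++i) {
        candidates->data[i].logit /= temp;
    }

    // ... accumulate (ggml_time_us() - t_start_sample_us) into the context's sample time, as before ...
}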
const int64_t t_start_sample_us = ggml_time_us();

llama_sample_softmax(ctx, candidates);
This will count the llama_sample_softmax() time twice. Either change it to llama_sample_softmax(nullptr, candidates); or keep it before the ggml_time_us() call.
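Concretely, the two options mentioned would look like this - a sketch of the alternatives, not the final diff:

// Option 1: pass nullptr so the inner softmax does not add its elapsed time
// to the context again; the outer sampler owns the timing.
const int64_t t_start_sample_us = ggml_time_us();
llama_sample_softmax(nullptr, candidates);
// ... rest of the sampler, accumulate the elapsed time once at the end ...

// Option 2: keep the softmax (with ctx) before the timer starts, so the inner
// call accounts for itself and the outer timer only measures what follows.
llama_sample_softmax(ctx, candidates);
const int64_t t_start_sample_us = ggml_time_us();
// ... rest of the sampler ...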
const int64_t t_start_sample_us = ggml_time_us();

llama_sample_softmax(ctx, candidates);
Same comment about counting the time twice.
const int64_t t_start_sample_us = ggml_time_us();

// Compute the softmax of logits and calculate entropy
llama_sample_softmax(nullptr, candidates);
This one is OK because it passes nullptr instead of the context, so the time is not counted twice.
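(For context, the reason passing nullptr avoids the double count: the softmax's own timing update is skipped when no context is given, roughly as below. This is paraphrased from memory of the mainline code, so treat the details as an assumption.)

// Inside llama_sample_softmax(): the timing update is guarded by the context
// pointer, so a nullptr call still does the work but never touches the
// sample-time counter.
if (ctx) {
    ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
}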
const int64_t t_start_sample_us = ggml_time_us();

k = std::max(k, (int) min_keep);
k = std::min(k, (int) candidates->size);
I think they need to remain - simply move them before the "if sorted" early return at the top
But what's the point of
topk=32000 is equivalent in speed to topk=0 (which falls back to the full vocabulary in the same way), so presumably something is up with how the softmax function sorts, or with the sort checks it does later on? No clue where the slowdown comes from.
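If (as I recall) mainline switches from std::partial_sort to a full std::sort exactly when k equals the vocabulary size, the 31999-vs-32000 gap can be checked outside llama.cpp with a small standalone benchmark like this - illustrative only, not from the PR:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(1234);
    std::normal_distribution<float> dist(0.0f, 4.0f);

    std::vector<float> logits(32000); // vocabulary-sized set of fake logits
    for (float & l : logits) l = dist(rng);

    auto bench = [&](int k) {
        std::vector<float> v = logits;
        const auto t0 = std::chrono::high_resolution_clock::now();
        if (k == (int) v.size()) {
            std::sort(v.begin(), v.end(), std::greater<float>());     // full-sort path
        } else {
            std::partial_sort(v.begin(), v.begin() + k, v.end(),
                              std::greater<float>());                 // partial-sort path
        }
        const auto t1 = std::chrono::high_resolution_clock::now();
        std::printf("k = %5d : %8.1f us\n", k,
                    std::chrono::duration<double, std::micro>(t1 - t0).count());
    };

    bench(32000); // k == vocab size
    bench(31999); // k just below vocab size
    bench(40);    // typical small top-k
    return 0;
}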
I didn't notice this was already in the works. 80-90% of the large-k latency comes from sorting / partial sorting; anything that comes after is almost irrelevant. The real way forward, imho, is a dyn-k function that takes the sampling settings and estimates a proper top-k; the partial sort then gives a 2-3x reduction in latency. In my tests it is down to 150-200 microseconds.
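A sketch of what such a dyn-k estimate could look like - entirely hypothetical names and logic, just to illustrate bounding the partial sort up front; e.g. a min-p setting translates into an exact logit threshold that can be counted in a single O(n) pass:

#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: given the raw logits and the sampler settings, estimate
// how many candidates can actually survive, so the top-k partial sort never
// sorts more elements than the later filters would keep anyway.
int estimate_dyn_k(const std::vector<float> & logits, int top_k, float min_p) {
    int k = (top_k > 0) ? std::min(top_k, (int) logits.size()) : (int) logits.size();

    if (min_p > 0.0f && !logits.empty()) {
        // min-p keeps exactly the tokens with logit >= max_logit + log(min_p),
        // so one counting pass gives a tight bound without any sorting.
        const float max_logit = *std::max_element(logits.begin(), logits.end());
        const float threshold = max_logit + std::log(min_p);
        const int   n_survive = (int) std::count_if(logits.begin(), logits.end(),
                                    [threshold](float l) { return l >= threshold; });
        k = std::min(k, std::max(1, n_survive));
    }

    return k;
}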
I believe this PR is made redundant by the other work done on topk/minp sorting optimizations.
I pointed out in #5073 that there is some non-trivial overhead in the current sampling operations:
With the changes in this PR, I managed to reduce the sampling time from 15 ms per token (when sampling over the whole initial vocabulary) to 11 ms per token:
llama_print_timings: sample time = 11927.89 ms / 1024 runs ( 11.65 ms per token, 85.85 tokens per second)
Efficient top-k sorting is probably next.