
Conversation

@kalomaze (Contributor)

I pointed out in #5073 that there seems to be some non-negligible overhead in the current sampling operations:

[screenshot]

With the change I made in this PR, I managed to reduce the sample time from 15 ms per token (for sampling over the whole initial vocabulary) to about 11 ms per token:

llama_print_timings: sample time = 11927.89 ms / 1024 runs ( 11.65 ms per token, 85.85 tokens per second)

Efficient top k sorting is probably next.

Commit: Number of sampleable candidates can never be zero in the first place
@kalomaze (Contributor Author) commented Jan 22, 2024

[screenshot]

...are my eyes deceiving me, or did the prompt evaluation speed actually improve as a result of this? Shouldn't that phase be entirely disconnected from the sampling softmax function? Or perhaps it's some odd CPU scheduling effect (despite this being a full offload).

@kalomaze (Contributor Author)

On default sampler settings, because the Repetition Penalty is set to 1.1 (even with this PR), about 10% of generation time on q8_0 Mistral 7B is spent on sampling.
With --repeat_penalty 1.0 --temp 1.0 --top_k 1, this can be reduced to 0.2% of the total generation time.

On top of this, if I change the default sampler order so that top-k comes second to last, a low top-k value does not help sampling efficiency at all.

So the repetition penalty implementation plus some form of sorting that's happening (after top-k, presumably?) appear to be the current bottlenecks.

const int64_t t_start_sample_us = ggml_time_us();

k = std::max(k, (int) min_keep);
k = std::min(k, (int) candidates->size);
@kalomaze (Contributor Author) commented Jan 22, 2024

I'm realizing now that these sanity checks will not run if we return early. Does that matter at all? Should I revert the changes from the "Standardize top k" commit (I assume they are not actually necessary)?

Member

I think they need to remain - simply move them before the "if sorted" early return at the top
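
A rough sketch of that ordering (illustrative only; the early-return condition shown here is an assumption about this PR's change, not the actual diff, and it also applies the "return before ggml_time_us()" suggestion made elsewhere in the review):

void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * candidates, int32_t k, size_t min_keep) {
    // sanity checks first, so they always run
    k = std::max(k, (int) min_keep);
    k = std::min(k, (int) candidates->size);

    // hypothetical early return from this PR: nothing to do if the candidates
    // are already sorted and the clamped k keeps all of them anyway
    if (candidates->sorted && (size_t) k == candidates->size) {
        return;
    }

    const int64_t t_start_sample_us = ggml_time_us();

    // ... partial sort / truncate to k, then accumulate ctx->t_sample_us as before ...
}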

@ggerganov (Member)

Which change exactly results in the improvement? Most of these are noops, except the temperature check - but I don't think it can account for that much

@kalomaze (Contributor Author)

Which change exactly results in the improvement? Most of these are noops, except the temperature check - but I don't think it can account for that much

Calculating the exp and sum in one pass, in the softmax function
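
For illustration, the one-pass idea looks roughly like this (a standalone sketch over the llama_token_data fields, not the PR's exact code):

// compute exp() and accumulate the normalization sum in a single loop,
// instead of one pass for exp() and a separate pass for the sum
static void softmax_one_pass(llama_token_data * data, size_t n, float max_logit) {
    float cum_sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float p = expf(data[i].logit - max_logit); // subtract the max logit for numerical stability
        data[i].p = p;   // store the unnormalized probability
        cum_sum += p;    // accumulate the sum in the same pass
    }
    for (size_t i = 0; i < n; ++i) {
        data[i].p /= cum_sum; // the second, cheap pass just normalizes
    }
}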

@ggerganov (Member)

I don't get it - it's the same thing before and after, except for a temporary variable. It cannot have any effect on the performance

@kalomaze (Contributor Author) commented Jan 22, 2024

I don't get it - it's the same thing before and after, except for a temporary variable. It cannot have any effect on the performance

I might have tricked myself into thinking it did something because of a build flag / debug-vs-main-build difference (I started using CMake recently). Prompt eval speed is consistent if I abandon the commits and rebuild from scratch, so it's probably that.
In any case, if these minor readability tweaks are still relevant I can leave the PR up, but they might be frivolous.

@kalomaze (Contributor Author) commented Jan 22, 2024

Ok, this is odd.

If topk=31999:
llama_print_timings: sample time = 5220.53 ms / 512 runs ( 10.20 ms per token, 98.07 tokens per second)
If topk=32000:
llama_print_timings: sample time = 8418.14 ms / 512 runs ( 16.44 ms per token, 60.82 tokens per second)

1.6x faster if we sort almost the whole vocabulary, but not quite all of it (this is on mainline, not this PR).
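
For context, the mainline top-k sort chooses between two code paths depending on k (a sketch from memory, so the details are approximate), which is presumably where the 31999-vs-32000 gap comes from:

// inside llama_sample_top_k: k equal to the candidate count takes the full
// std::sort branch, any smaller k takes the std::partial_sort branch
if (!candidates->sorted) {
    auto comp = [](const llama_token_data & a, const llama_token_data & b) {
        return a.logit > b.logit; // descending by logit
    };
    if (k == (int) candidates->size) {
        std::sort(candidates->data, candidates->data + candidates->size, comp);
    } else {
        std::partial_sort(candidates->data, candidates->data + k,
                          candidates->data + candidates->size, comp);
    }
    candidates->sorted = true;
}
candidates->size = k;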

Comment on lines 8186 to +8190
const int64_t t_start_sample_us = ggml_time_us();

if (temp == 1.0f) {
    return; // No adjustment needed as dividing by 1 leaves the values unchanged
}
Member

Should make it consistent and return before ggml_time_us() like in the other calls
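
In other words, something along these lines (a sketch of the suggested shape using the usual llama.cpp timing pattern, not the actual diff):

void llama_sample_temp(struct llama_context * ctx, llama_token_data_array * candidates, float temp) {
    if (temp == 1.0f) {
        return; // dividing by 1 changes nothing, so skip the timer entirely
    }

    const int64_t t_start_sample_us = ggml_time_us();

    for (size_t i = 0; i < candidates->size; ++i) {
        candidates->data[i].logit /= temp;
    }

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}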

Comment on lines +8039 to 8043

const int64_t t_start_sample_us = ggml_time_us();

llama_sample_softmax(ctx, candidates);

Member

This will count the llama_sample_softmax() time 2 times. Either change to llama_sample_softmax(nullptr, candidates); or keep it before the ggml_time_us() call
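
Spelled out, the two options look roughly like this (sampler_option_a and sampler_option_b are illustrative stand-ins for the affected sampler, not real functions):

// Option A: pass nullptr so the inner softmax does not also add its time to
// ctx; the outer timer then covers it exactly once
static void sampler_option_a(struct llama_context * ctx, llama_token_data_array * candidates) {
    const int64_t t_start_sample_us = ggml_time_us();

    llama_sample_softmax(nullptr, candidates);

    // ... rest of the sampler ...

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}

// Option B: let the softmax account for its own time through ctx, and only
// start this sampler's timer afterwards
static void sampler_option_b(struct llama_context * ctx, llama_token_data_array * candidates) {
    llama_sample_softmax(ctx, candidates);

    const int64_t t_start_sample_us = ggml_time_us();

    // ... rest of the sampler ...

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}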


const int64_t t_start_sample_us = ggml_time_us();

llama_sample_softmax(ctx, candidates);
Member

Same comment about counting the time 2 times

const int64_t t_start_sample_us = ggml_time_us();

// Compute the softmax of logits and calculate entropy
llama_sample_softmax(nullptr, candidates);
Member

This is OK because it does not take a context


@kalomaze (Contributor Author) commented Jan 22, 2024

[screenshot]
With Repetition Penalty on and topk=32000 (so, the worst-case scenario), ~30% of my generation time is spent on sampling for 7B q8_0.
I'm thinking the ideal solution may be to move the sampling operations to the GPU to minimize idle time, though that would be outside my current skill set.

@ggerganov (Member)

But what's the point of topk = 32000 - it's a noop since all candidates remain

@kalomaze (Contributor Author) commented Jan 22, 2024

But what's the point of topk = 32000 - it's a noop since all candidates remain

topk=32000 is equivalent in speed to topk=0 (which falls back to the full vocabulary in the same way), so something is presumably up with how the softmax function sorts / does its sort checks later on? No clue where the slowdown comes from.

[screenshot]

@cmp-nct (Contributor) commented Jan 22, 2024

I didn't notice this was already being worked on.
I found kalomaze's post strange: with a ~100x slowdown, it's the top-k sorting, not the softmax.

80-90% of the large-k latency comes from the sorting/partial sorting; anything that comes after is almost irrelevant.
Check my PR (#5085), it's about 30% faster, maybe more on a slower CPU.

The real way forward imho is a dyn-k function that takes in the sampling settings and estimates a proper top-k; the partial sort then gives a 2-3x reduction in latency. In my tests it's down to 150-200 microseconds.
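
For illustration, the dyn-k idea could look something like this (estimate_dyn_k is a hypothetical helper with made-up heuristics, not the code from #5085); the result would then be fed to the partial-sort path instead of sorting the whole vocabulary:

// hypothetical sketch: derive an effective k from the sampler settings so the
// top-k step only needs a cheap partial sort over that many candidates
static int estimate_dyn_k(int top_k, float top_p, float min_p, int n_vocab) {
    int k = (top_k > 0 && top_k < n_vocab) ? top_k : n_vocab; // an explicit top-k is already a hard cap

    // assumed heuristic: restrictive top-p / min-p settings rarely keep more
    // than a few hundred candidates, so cap the sort work accordingly
    if (top_p < 1.0f || min_p > 0.0f) {
        k = std::min(k, 1000);
    }
    return std::max(k, 1);
}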

@kalomaze (Contributor Author)

I believe this PR is made redundant by the other work done on topk/minp sorting optimizations.

kalomaze closed this on Jan 28, 2024