
Conversation

@kalomaze (Contributor)

I pointed out in #5073 that there seems to be some non-negligible overhead in the current sampling operations:

[screenshot]

With the change I made in this PR, I managed to reduce the sample time from 15 ms per token (for sampling over the whole initial vocabulary) to about 11 ms per token:

llama_print_timings: sample time = 11927.89 ms / 1024 runs ( 11.65 ms per token, 85.85 tokens per second)

Efficient top k sorting is probably next.

Commit: Number of sampleable candidates can never be zero in the first place
@kalomaze (Contributor Author) commented Jan 22, 2024

[screenshot]

...are my eyes deceiving me, or did the prompt evaluation speed actually improve as a result of this? Shouldn't that phase be entirely disconnected from the sampling softmax function? Or perhaps it's some odd CPU scheduling effect (despite this being a full offload).

@kalomaze (Contributor Author)

On default sampler settings, because the Repetition Penalty is set to 1.1 (even with this PR), about 10% of generation time on q8_0 Mistral 7B is spent on sampling.
With --repeat_penalty 1.0 --temp 1.0 --top_k 1, this can be reduced to 0.2% of the total generation time.

On top of this, if I change the default sampler order so that top-k comes second to last, a low top-k value does not help sampling efficiency at all.

So the repetition penalty implementation plus some form of sorting that's happening (after top-k, presumably?) appear to be the current bottlenecks.

const int64_t t_start_sample_us = ggml_time_us();

k = std::max(k, (int) min_keep);
k = std::min(k, (int) candidates->size);
@kalomaze (Contributor Author) commented Jan 22, 2024

I'm realizing now that these sanity checks will not run if we return early. Does that matter at all? Should I revert the changes from the "Standardize top k" commit (I assume they are not actually necessary)?

Member

I think they need to remain - simply move them before the "if sorted" early return at the top
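
A rough sketch of that ordering (illustrative only; the early-return condition shown here is an assumption about this PR's change, not the actual diff, and it also applies the "return before ggml_time_us()" suggestion made elsewhere in the review):

void llama_sample_top_k(struct llama_context * ctx, llama_token_data_array * candidates, int32_t k, size_t min_keep) {
    // sanity checks first, so they always run
    k = std::max(k, (int) min_keep);
    k = std::min(k, (int) candidates->size);

    // hypothetical early return from this PR: nothing to do if the candidates
    // are already sorted and the clamped k keeps all of them anyway
    if (candidates->sorted && (size_t) k == candidates->size) {
        return;
    }

    const int64_t t_start_sample_us = ggml_time_us();

    // ... partial sort / truncate to k, then accumulate ctx->t_sample_us as before ...
}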

@ggerganov (Member)

Which change exactly results in the improvement? Most of these are noops, except the temperature check - but I don't think it can account for that much

@kalomaze (Contributor Author)

Which change exactly results in the improvement? Most of these are noops, except the temperature check - but I don't think it can account for that much

Calculating the exp and sum in one pass, in the softmax function
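
For illustration, the one-pass idea looks roughly like this (a standalone sketch over the llama_token_data fields, not the PR's exact code):

// compute exp() and accumulate the normalization sum in a single loop,
// instead of one pass for exp() and a separate pass for the sum
static void softmax_one_pass(llama_token_data * data, size_t n, float max_logit) {
    float cum_sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float p = expf(data[i].logit - max_logit); // subtract the max logit for numerical stability
        data[i].p = p;   // store the unnormalized probability
        cum_sum += p;    // accumulate the sum in the same pass
    }
    for (size_t i = 0; i < n; ++i) {
        data[i].p /= cum_sum; // the second, cheap pass just normalizes
    }
}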

@ggerganov (Member)

I don't get it - it's the same thing before and after, except for a temporary variable. It cannot have any effect on the performance

@kalomaze (Contributor Author) commented Jan 22, 2024

I don't get it - it's the same thing before and after, except for a temporary variable. It cannot have any effect on the performance

I might have tricked myself into thinking it did something because of a build flag / debug-vs-main-build difference (I started using CMake recently). Prompt eval speed is consistent if I abandon the commits and rebuild from scratch, so it's probably that.
In any case, if these minor readability tweaks are still relevant I can leave the PR up, but they might be frivolous.

@kalomaze (Contributor Author) commented Jan 22, 2024

Ok, this is odd.

If topk=31999:
llama_print_timings: sample time = 5220.53 ms / 512 runs ( 10.20 ms per token, 98.07 tokens per second)
If topk=32000:
llama_print_timings: sample time = 8418.14 ms / 512 runs ( 16.44 ms per token, 60.82 tokens per second)

1.6x faster if we sort almost the whole vocabulary, but not quite all of it (this is on mainline, not this PR).
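
For context, the mainline top-k sort chooses between two code paths depending on k (a sketch from memory, so the details are approximate), which is presumably where the 31999-vs-32000 gap comes from:

// inside llama_sample_top_k: k equal to the candidate count takes the full
// std::sort branch, any smaller k takes the std::partial_sort branch
if (!candidates->sorted) {
    auto comp = [](const llama_token_data & a, const llama_token_data & b) {
        return a.logit > b.logit; // descending by logit
    };
    if (k == (int) candidates->size) {
        std::sort(candidates->data, candidates->data + candidates->size, comp);
    } else {
        std::partial_sort(candidates->data, candidates->data + k,
                          candidates->data + candidates->size, comp);
    }
    candidates->sorted = true;
}
candidates->size = k;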

Comment on lines 8186 to +8190
const int64_t t_start_sample_us = ggml_time_us();

if (temp == 1.0f) {
    return; // No adjustment needed as dividing by 1 leaves the values unchanged
}
Member

Should make it consistent and return before ggml_time_us() like in the other calls
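
In other words, something along these lines (a sketch of the suggested shape using the usual llama.cpp timing pattern, not the actual diff):

void llama_sample_temp(struct llama_context * ctx, llama_token_data_array * candidates, float temp) {
    if (temp == 1.0f) {
        return; // dividing by 1 changes nothing, so skip the timer entirely
    }

    const int64_t t_start_sample_us = ggml_time_us();

    for (size_t i = 0; i < candidates->size; ++i) {
        candidates->data[i].logit /= temp;
    }

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}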

Comment on lines +8039 to 8043

const int64_t t_start_sample_us = ggml_time_us();

llama_sample_softmax(ctx, candidates);

Member

This will count the llama_sample_softmax() time 2 times. Either change to llama_sample_softmax(nullptr, candidates); or keep it before the ggml_time_us() call
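
Spelled out, the two options look roughly like this (sampler_option_a and sampler_option_b are illustrative stand-ins for the affected sampler, not real functions):

// Option A: pass nullptr so the inner softmax does not also add its time to
// ctx; the outer timer then covers it exactly once
static void sampler_option_a(struct llama_context * ctx, llama_token_data_array * candidates) {
    const int64_t t_start_sample_us = ggml_time_us();

    llama_sample_softmax(nullptr, candidates);

    // ... rest of the sampler ...

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}

// Option B: let the softmax account for its own time through ctx, and only
// start this sampler's timer afterwards
static void sampler_option_b(struct llama_context * ctx, llama_token_data_array * candidates) {
    llama_sample_softmax(ctx, candidates);

    const int64_t t_start_sample_us = ggml_time_us();

    // ... rest of the sampler ...

    if (ctx) {
        ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
    }
}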


const int64_t t_start_sample_us = ggml_time_us();

llama_sample_softmax(ctx, candidates);
Member

Same comment about counting the time 2 times

const int64_t t_start_sample_us = ggml_time_us();

// Compute the softmax of logits and calculate entropy
llama_sample_softmax(nullptr, candidates);
Member

This is OK because it does not take a context


@kalomaze (Contributor Author) commented Jan 22, 2024

[screenshot]
With Repetition Penalty on and topk=32000 (so, the worst-case scenario), ~30% of my generation time is spent on sampling for 7B q8_0.
I'm thinking the ideal solution may be to move the sampling operations to the GPU to minimize idle time, though that would be outside my current skill set.

@ggerganov (Member)

But what's the point of topk = 32000 - it's a noop since all candidates remain

@kalomaze (Contributor Author) commented Jan 22, 2024

But what's the point of topk = 32000 - it's a noop since all candidates remain

topk=32000 is equivalent in speed to topk=0 (which falls back to the full vocabulary in the same way), so something is presumably up with how the softmax function sorts / does its sort checks later on? No clue where the slowdown comes from.

[screenshot]

@cmp-nct (Contributor) commented Jan 22, 2024

I didn't notice this was already being worked on.
I found kalomaze's post strange: with a ~100x slowdown, it's the top-k sorting, not the softmax.

80-90% of the large-k latency comes from the sorting/partial sorting; anything that comes after is almost irrelevant.
Check my PR (#5085), it's about 30% faster, maybe more on a slower CPU.

The real way forward imho is a dyn-k function that takes in the sampling settings and estimates a proper top-k; the partial sort then gives a 2-3x reduction in latency. In my tests it's down to 150-200 microseconds.
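
For illustration, the dyn-k idea could look something like this (estimate_dyn_k is a hypothetical helper with made-up heuristics, not the code from #5085); the result would then be fed to the partial-sort path instead of sorting the whole vocabulary:

// hypothetical sketch: derive an effective k from the sampler settings so the
// top-k step only needs a cheap partial sort over that many candidates
static int estimate_dyn_k(int top_k, float top_p, float min_p, int n_vocab) {
    int k = (top_k > 0 && top_k < n_vocab) ? top_k : n_vocab; // an explicit top-k is already a hard cap

    // assumed heuristic: restrictive top-p / min-p settings rarely keep more
    // than a few hundred candidates, so cap the sort work accordingly
    if (top_p < 1.0f || min_p > 0.0f) {
        k = std::min(k, 1000);
    }
    return std::max(k, 1);
}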

@kalomaze (Contributor Author)

I believe this PR is made redundant by the other work done on topk/minp sorting optimizations.

kalomaze closed this on Jan 28, 2024