Bug when prompt stored in --prompt-cache is longer than the new one #1585

Description

@aleksusklim

Edit: @DannyDaemonic here. (Sorry for the edit.) Here's a much simpler example. First build the cache like this:

./main -m /path/to/llama/bin --prompt-cache Z.cache -p "Here's a funny joke. No, wait, here's a bunch of Zs: Z Z Z Z Z Z Z Z Z Z"

Then try this:

./main -m /path/to/llama/bin --prompt-cache Z.cache -p "Here's a funny joke."

The joke will start with a Z every time; it seems the logits are not being re-evaluated for some reason. Changing even one token, including the very last one, works around the bug. The fix is to recalculate the logits.
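A minimal sketch of that fix, with hypothetical helper names rather than the actual patch: count the prefix shared between the saved session and the new prompt, and when the match covers the entire prompt, back off by one token so that re-evaluating it regenerates fresh logits.

```cpp
#include <cstddef>
#include <vector>

// Sketch (hypothetical, not the actual llama.cpp patch): determine how many
// saved-session tokens can be reused for the new prompt. If the match covers
// the ENTIRE prompt, back off by one token; re-evaluating that last token
// recomputes the logits instead of sampling from the stale ones left over
// from the longer cached prompt (the trailing "Z" in the example above).
std::size_t usable_session_tokens(const std::vector<int> & session_tokens,
                                  const std::vector<int> & prompt_tokens) {
    std::size_t n = 0;
    while (n < session_tokens.size() && n < prompt_tokens.size() &&
           session_tokens[n] == prompt_tokens[n]) {
        ++n;
    }
    if (n == prompt_tokens.size() && n > 0) {
        --n; // force re-evaluation of the final prompt token
    }
    return n; // reuse this many tokens; evaluate the rest normally
}
```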


What happens when a prompt stored in --prompt-cache is "longer" than the current one?

I want to make a wrapper around main.exe, but this behavior looks strange and buggy. I don't know whether it is actually a bug, but it is very confusing. I can post it as a separate issue if it really is a bug that cannot be solved in this PR.

So, here are my steps: (on version ee96541 and this model)

main.exe -t 6 -m WizardLM-7B-uncensored.ggml.q5_1.bin -c 512 --temp 0 --repeat_penalty 1.2 --prompt-cache cache1 -f in.txt >out.txt
(note the cache file and zero temperature)

My prompt inside in.txt is this:

### Instruction:

Describe your last dream in a few words.

### Response:

My

The model outputs this text inside out.txt (with an extra space before the first line, but let's assume I stripped it manually):

### Instruction:

Describe your last dream in a few words.

### Response:

My last dream was about being chased by a group of people through a forest, but I eventually found safety inside a church.

It also creates the prompt-cache file cache1 (about 13 MB) and writes to stderr:

main: attempting to load saved session from 'cache1'
main: session file does not exist, will create

If I repeat the same command as-is, it recreates the same text and does not update the cache file; stderr reads:

main: attempting to load saved session from 'cache1'
main: loaded a session with prompt size of 26 tokens
main: session file has exact match for prompt!

Then I copy the cache1 file to cache2. I also put the resulting text back into in.txt, this time cutting it after the last comma, so it becomes:

### Instruction:

Describe your last dream in a few words.

### Response:

My last dream was about being chased by a group of people through a forest,

Then I run again, pointing to cache2 (which is identical to cache1 at this point):
main.exe -t 6 -m WizardLM-7B-uncensored.ggml.q5_1.bin -c 512 --temp 0 --repeat_penalty 1.2 --prompt-cache cache2 -f in.txt >out.txt

It produces the exact same line, continuing with "but I eventually found safety inside a church." and stopping as before.
But this time, cache2 is updated to 21 MB. So far so good!
Its stderr said:

main: attempting to load saved session from 'cache2'
main: loaded a session with prompt size of 26 tokens
main: session file matches 26 / 42 tokens of prompt
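This is the expected fast path: the 26 matched tokens are reused as the already-evaluated context, and only the 16 new tokens are processed. A rough sketch of that flow, where eval_tokens is a hypothetical stand-in for the actual evaluation call (llama_eval in the 2023-era C API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the model evaluation call: processes `toks`
// starting at KV-cache position `n_past`.
void eval_tokens(const std::vector<int> & toks, std::size_t n_past);

// Sketch of the prefix-reuse fast path: the matched prefix (26 tokens here)
// is already in the restored state, so only the unmatched suffix (16 tokens)
// needs evaluation before generation continues.
void continue_from_session(const std::vector<int> & prompt_tokens,
                           std::size_t n_matched) {
    const std::size_t n_past = n_matched; // keep the cached prefix
    const std::vector<int> suffix(prompt_tokens.begin() + n_matched,
                                  prompt_tokens.end());
    eval_tokens(suffix, n_past);
}
```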

Finally, I copy cache2 to cache3 and cut the prompt back to "My", exactly as in my very first in.txt.
I run, pointing to cache3, and get the following:

### Instruction:

Describe your last dream in a few words.

### Response:

My butterfly had wings of fire and flew through the night sky while I chased after it, trying to catch it so that we could be together forever.

File cache3 stays binary-identical to cache2, and among other output the program prints these lines to stderr:

main: attempting to load saved session from 'cache3'
main: loaded a session with prompt size of 42 tokens
main: session file has exact match for prompt!

What just happened? It looks like the program compared only the head of the prompt, disregarding the cached tail; then it assumed the cache was valid (although it is not) and continued generation from the cached state, ending up broken.
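A sketch of why the log still says "exact match", with hypothetical variable names: the check only asks whether the matched prefix covers the whole prompt, never whether the session has a stale tail beyond it.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical reconstruction of the match reporting, not the actual code.
// The first branch is taken whenever the prompt is a prefix of the saved
// session, so a 26-token prompt against a 42-token session still reports an
// "exact match" while 16 stale tokens remain live in the restored state.
void report_match(std::size_t n_matched, std::size_t n_prompt,
                  std::size_t n_session) {
    (void) n_session; // the bug illustrated: n_session is never consulted
    if (n_matched == n_prompt) {
        std::fprintf(stderr, "session file has exact match for prompt!\n");
    } else {
        std::fprintf(stderr, "session file matches %zu / %zu tokens of prompt\n",
                     n_matched, n_prompt);
    }
}
```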

My questions:

  1. Is it possible to "go back" by any number of tokens in the cache? Can we cut the prompt at an arbitrary place and keep the cache valid? Or is it only valid "after the final token", so we can go forward but cannot start earlier?
  2. If we cannot go back, is it possible to "restore the text" from the cache and print it explicitly? Then the program would show what is in the cache and what is causing its weird answers.
  3. If yes, can we store "checkpoints" in the cache with the full intermediate state, so that in my example, after seeing "My" at the end, it would reuse the older state it already had; or, if I stop at "My last dream", it would continue with "was about being chased by a group of people through a forest," (since that state was also recorded) and not break there?
  4. If yes, why can't we "discard all extra states" when seeking: for example, given "My last dream was dark", backtrack to "My" (which is stored), then evaluate just "last dream was dark" and generate from there? Just as if I had put in a copy of an older cache myself.
  5. Can't it at least throw an error on a mismatching cache (or discard it and regenerate from scratch) when it detects that the prompt is different?
  6. Can we get a "do not update the cache" command-line option? Or a "write the new cache to the specified file" option (which would also accept NUL to skip writing) so it will not destroy the previous state? Currently my only option is to copy the cache file each time before invoking the program, which delays startup because of OS filesystem cache eviction when the model barely fits in RAM already. (A rough sketch of this load-from-one-file, save-to-another flow follows below.)
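For question 6, here is a rough sketch, assuming the 2023-era llama.cpp session API (llama_load_session_file / llama_save_session_file; check the signatures in your checkout), of restoring the state from one file and saving the updated state to a different one, which is the behavior the requested option would expose:

```cpp
#include "llama.h"
#include <vector>

// Sketch, assuming the 2023-era session API: restore the state from in_path,
// then (after evaluation/generation) write the updated state to out_path,
// leaving the original cache file untouched.
void roundtrip_session(llama_context * ctx,
                       const char * in_path, const char * out_path) {
    std::vector<llama_token> tokens(llama_n_ctx(ctx));
    size_t n_loaded = 0;
    if (llama_load_session_file(ctx, in_path,
                                tokens.data(), tokens.size(), &n_loaded)) {
        tokens.resize(n_loaded);
    }
    // ... evaluate the new prompt suffix and generate here ...
    llama_save_session_file(ctx, out_path, tokens.data(), tokens.size());
}
```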

And a few minor things:
– Why not allow running the evaluation with --n-predict 0 if I only want to cache the prompt? I have to specify --n-predict 1 just to cache the prompt, but that prevents me from reusing the (probably re-parsed?) output file, since it will contain an extra word at the end; I have to use the initial file instead.
– Why print an extra space at the beginning of stdout? If I later swap the files (feeding out.txt back as in.txt), the prompt will grow each time (two spaces, three spaces…) and most likely destroy the cache. I had to strip that space manually, but this feels odd. (A trivial wrapper-side guard is sketched below.)
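As for that leading space, a trivial wrapper-side guard (sketch) keeps the round-tripped prompt stable:

```cpp
#include <string>

// Wrapper-side sketch: strip the single leading space that main prints, so
// feeding out.txt back as in.txt does not grow the prompt by one space per
// round trip (which would invalidate the cached prefix).
std::string strip_leading_space(const std::string & s) {
    return (!s.empty() && s.front() == ' ') ? s.substr(1) : s;
}
```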
