Prerequisites
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Idea
Typical sampling methods for large language models, such as Top P and Top K (as well as alternative sampler modes that decide the Top K dynamically, like Mirostat), are based on the assumption that a static temperature value (a consistently randomized probability distribution) is the ideal sampler conditioning. Mirostat, most notably, was designed to 'learn' a targeted level of 'entropy' over time; this helped the model settle on the most grammatically coherent selection of tokens for the sampler to consider. Most of these sampling implementations weren't designed to be used together. Some, like TFS, were created when the largest available models were smaller ones like GPT-2. Those models struggled much more when attempting to generalize in different directions, so it makes sense to me that they'd need unique sampler tricks to keep them grammatically coherent.
I've tested and played around with these settings for Llama models, and while Mirostat seemed like a step in the right direction, especially for preventing repetition, I realized that nobody had made a sampler mode that controls temperature directly per token. My implementation of this would be based on a simple metric: take the standard deviation of the probabilities of all tokens being considered by your Top P / Top K before applying the temperature randomization, and, based on the model's 'confidence' (as represented by the variation in choice), apply a temperature adjustment proportional to the variation in probability seen in the sampled set of candidate tokens.
The main idea is to encourage randomizing 'uncertain' probabilities (e.g., open-ended writing, or abstract concepts that can be represented with many words and aren't deterministic by nature) while keeping the temperature low for more deterministic tokens, without having to find the ideal selection of candidates for sampling per token (which I believe is how Mirostat was designed to work).
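To make this concrete, here is a rough sketch of what I have in mind. The names (`dynamic_temperature`, `apply_temperature`) are placeholders rather than anything that exists in llama.cpp today, and it assumes the candidate list has already been truncated by Top K / Top P and softmaxed into probabilities, and that a flatter distribution (lower standard deviation) means the model is less confident and should get a hotter temperature:

```cpp
// Minimal sketch of per-token dynamic temperature selection.
// Assumptions: candidates are already truncated by Top K / Top P and
// converted to probabilities that sum to 1. All names are hypothetical.

#include <cmath>
#include <cstddef>
#include <vector>

// Map the spread of the candidate probabilities to a temperature in
// [base_temp, max_temp]. A flat distribution (low standard deviation)
// means the model is 'uncertain', so we push towards max_temp; a peaked
// distribution (high standard deviation) means it is 'confident', so we
// stay near base_temp.
float dynamic_temperature(const std::vector<float> & probs,
                          float base_temp, float max_temp) {
    if (probs.size() < 2) {
        return base_temp; // only one candidate: nothing to randomize
    }

    // mean and (population) standard deviation of the candidate probabilities
    float mean = 0.0f;
    for (float p : probs) mean += p;
    mean /= probs.size();

    float var = 0.0f;
    for (float p : probs) var += (p - mean) * (p - mean);
    var /= probs.size();
    const float sigma = std::sqrt(var);

    // The largest possible sigma for n probabilities summing to 1 occurs
    // when one candidate holds all the mass; normalize against that so
    // 'confidence' lands in [0, 1] regardless of candidate count.
    const std::size_t n = probs.size();
    const float sigma_max  = std::sqrt((1.0f - 1.0f / n) / n);
    const float confidence = sigma / sigma_max; // 0 = flat, 1 = peaked

    // linear interpolation: uncertain -> max_temp, confident -> base_temp
    return max_temp - (max_temp - base_temp) * confidence;
}

// Apply the chosen temperature by re-scaling the logits (equivalently,
// the probabilities could be re-weighted by p^(1/T) and renormalized).
void apply_temperature(std::vector<float> & logits, float temp) {
    for (float & l : logits) {
        l /= temp;
    }
}
```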
Possible advantages could include:
- Having a definable range between a 'Base Temperature' and 'Maximum Temperature' could generally improve the creative problem-solving ability of the model.
- Certain tokens are more important to the context than others. For example, if the probability is randomized too far for even one token that should be deterministic, such as a specific character in a programming language's syntax, the rest of the generation has a higher failure rate.
- It could help prevent the model's generations from trending towards repetition, because a much broader range of probabilities could be considered without impacting the model's intelligence as broadly (e.g., a max temperature of 1.5 might not affect the model as strongly as it would if every token were sampled at that value). If so, artificially biasing against repeated tokens through the Repetition Penalty would become less necessary.
Possible disadvantages could include:
- A lot of faith is being put in the idea that a strong variation in possibilities correlates with a high number of acceptable / reasonable tokens. If the correlation is mild, the default range values would have to be adjusted to accommodate this, but that could be mitigated by testing different values for the base/max temp range, or by benchmarking them individually.
- The rate at which a model becomes more certain might not be linear; there might be a very short gap between 'low deviation' and 'high deviation' in the unsampled probabilities.
- Reproducibility might be more difficult, but I'm unsure of this. I'm guessing you could just use the same seed regardless of the per-token temperature (see the sketch after this list).
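On that last point, here is a rough sketch of why I think a single seed should still work: as long as the temperature is a pure function of the candidate probabilities (as in the sketch above), the RNG is the only source of randomness, so the same seed should reproduce the same sequence of picks. `sample_token` is again a placeholder name, not an existing llama.cpp function:

```cpp
// Minimal sketch of reproducible sampling with a per-token temperature.
// dynamic_temperature() is the hypothetical helper from the sketch above.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

int sample_token(const std::vector<float> & probs,
                 float base_temp, float max_temp, std::mt19937 & rng) {
    const float temp = dynamic_temperature(probs, base_temp, max_temp);

    // re-weight the truncated probabilities by p^(1/T) (equivalent to
    // dividing the logits by T) and let discrete_distribution renormalize
    std::vector<double> weights(probs.size());
    for (std::size_t i = 0; i < probs.size(); ++i) {
        weights[i] = std::pow(static_cast<double>(probs[i]), 1.0 / temp);
    }
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}

// usage: std::mt19937 rng(42); // same seed -> same sequence of tokens
```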