You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).

The perplexity measurements in the table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with a context length of 512.

The time per token is measured on a MacBook M1 Pro with 32 GB of RAM, using 4 and 8 threads.
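
As a rough sketch, a perplexity run over the wikitext-2 test set might look like the following (the `perplexity` binary name, model path, and dataset path are assumptions; adjust them to your build and download locations):

```bash
# Hypothetical invocation of the perplexity example.
# -m selects the (quantized) model file, -f supplies the text to evaluate.
./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./wikitext-2-raw/wiki.test.raw
```
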
### Interactive mode
If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
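
For illustration, an interactive session might be started like this (the `main` binary name, model path, and reverse prompt are assumptions for a typical setup):

```bash
# Hypothetical interactive run.
# -i enables interactive mode, -r sets a reverse prompt that returns control
# to the user when the model emits it, and --color highlights user input.
./main -m ./models/7B/ggml-model-q4_0.bin -i -r "User:" --color
```
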
If your issue is with model generation quality, then please at least scan the following links and papers:

- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

### Perplexity (measuring model quality)

You can use the `perplexity` example to measure perplexity over a given prompt. For more background, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity). In general, lower perplexity is better for LLMs.

#### Latest measurements

The latest perplexity scores for the various model sizes and quantizations are being tracked in [discussion #406](https://github.com/ggerganov/llama.cpp/discussions/406). `llama.cpp` is measuring very well compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running 13B at q4_0 beats the 7B f16 model by a significant amount.

All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512-token context length).

Note that changing the context length will have a significant impact on perplexity (longer context = better perplexity).