Available environment variables (if specified, they override the corresponding command-line arguments):
- `LLAMA_CACHE`: cache directory, used by `--hf-repo`
- `HF_TOKEN`: Hugging Face access token, used when accessing a gated model with `--hf-repo`
- `LLAMA_ARG_MODEL`: equivalent to `-m`
- `LLAMA_ARG_MODEL_URL`: equivalent to `-mu`
- `LLAMA_ARG_MODEL_ALIAS`: equivalent to `-a`
- `LLAMA_ARG_HF_REPO`: equivalent to `--hf-repo`
- `LLAMA_ARG_HF_FILE`: equivalent to `--hf-file`
- `LLAMA_ARG_THREADS`: equivalent to `-t`
- `LLAMA_ARG_CTX_SIZE`: equivalent to `-c`
- `LLAMA_ARG_N_PARALLEL`: equivalent to `-np`
- `LLAMA_ARG_BATCH`: equivalent to `-b`
- `LLAMA_ARG_UBATCH`: equivalent to `-ub`
- `LLAMA_ARG_N_GPU_LAYERS`: equivalent to `-ngl`
- `LLAMA_ARG_THREADS_HTTP`: equivalent to `--threads-http`
- `LLAMA_ARG_CHAT_TEMPLATE`: equivalent to `--chat-template`
- `LLAMA_ARG_N_PREDICT`: equivalent to `-n`
- `LLAMA_ARG_ENDPOINT_METRICS`: if set to `1`, it will enable the metrics endpoint (equivalent to `--metrics`)
- `LLAMA_ARG_ENDPOINT_SLOTS`: if set to `0`, it will **disable** the slots endpoint (equivalent to `--no-slots`). This feature is enabled by default.
- `LLAMA_ARG_EMBEDDINGS`: if set to `1`, it will enable the embeddings endpoint (equivalent to `--embeddings`)
- `LLAMA_ARG_FLASH_ATTN`: if set to `1`, it will enable flash attention (equivalent to `-fa`)
- `LLAMA_ARG_CONT_BATCHING`: if set to `0`, it will **disable** continuous batching (equivalent to `--no-cont-batching`). This feature is enabled by default.
- `LLAMA_ARG_DEFRAG_THOLD`: equivalent to `-dt`
- `LLAMA_ARG_HOST`: equivalent to `--host`
- `LLAMA_ARG_PORT`: equivalent to `--port`
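As a quick sketch, the same configuration can also be supplied from a plain shell session before launching the server; the model path and server binary name below are placeholders for illustration, not values taken from this README:

```shell
# Each LLAMA_ARG_* variable overrides the corresponding command-line flag.
# The model path and binary name are illustrative placeholders.
export LLAMA_ARG_MODEL=/models/my_model.gguf   # equivalent to -m
export LLAMA_ARG_CTX_SIZE=4096                 # equivalent to -c
export LLAMA_ARG_N_GPU_LAYERS=99               # equivalent to -ngl
export LLAMA_ARG_ENDPOINT_METRICS=1            # equivalent to --metrics

# Then start the server with no flags; it picks up the variables above:
# ./llama-server
```

This is handy in containerized or systemd-managed deployments, where environment blocks are easier to manage than long argument lists.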
Example usage of docker compose with environment variables:

```yml
services:
  llamacpp-server:
    image: ghcr.io/ggerganov/llama.cpp:server
    ports:
      - 8080:8080
    volumes:
      - ./models:/models
    environment:
      # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
      LLAMA_ARG_MODEL: /models/my_model.gguf
      LLAMA_ARG_CTX_SIZE: 4096
      LLAMA_ARG_N_PARALLEL: 2
      LLAMA_ARG_ENDPOINT_METRICS: 1  # to disable, either remove or set to 0
```