From 11612419d24ed30ab91d1c9db512a9d497ec607d Mon Sep 17 00:00:00 2001 From: Xiaotong Jiang Date: Thu, 4 Sep 2025 07:32:30 +0000 Subject: [PATCH 1/5] GITBOOK-19: No subject --- .../model-recipes/gpt-oss/README.md | 2 +- .../model-recipes/gpt-oss/usage-guide.md | 31 +++++++++++++++++-- 2 files changed, 30 insertions(+), 3 deletions(-) diff --git a/sglang-cookbook/model-recipes/gpt-oss/README.md b/sglang-cookbook/model-recipes/gpt-oss/README.md index 6ebeb01..2fe1e5f 100644 --- a/sglang-cookbook/model-recipes/gpt-oss/README.md +++ b/sglang-cookbook/model-recipes/gpt-oss/README.md @@ -2,7 +2,7 @@ gpt-oss-20b -
-| Weight Type | Hardware Configuration | Instruction | Benchmark |
-| --- | --- | --- | --- |
-| MXFP4 (recommended) | 1 x H100/H200 | #serving-with-1-x-h100-h200 |  |
-|  | 1 x B200 | #serving-with-1-x-b200 |  |
-|  | 1 x MI300X |  |  |
-| Full precision FP8/BF16 | 1 x H200 |  |  |
+| Weight Type | Hardware Configuration | Instruction | Benchmark |
+| --- | --- | --- | --- |
+| MXFP4 (recommended) | 1 x H100/H200 | #serving-with-1-x-h100-h200 | #benchmark |
+|  | 1 x B200 | #serving-with-1-x-b200 |  |
+|  | 1 x MI300X |  |  |
+| Full precision FP8/BF16 | 1 x H200 |  |  |
gpt-oss-120b diff --git a/sglang-cookbook/model-recipes/gpt-oss/usage-guide.md b/sglang-cookbook/model-recipes/gpt-oss/usage-guide.md index 7511784..bc13f92 100644 --- a/sglang-cookbook/model-recipes/gpt-oss/usage-guide.md +++ b/sglang-cookbook/model-recipes/gpt-oss/usage-guide.md @@ -2,8 +2,15 @@ ### Serving with 1 x H100/H200 -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) -2. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) +{% endstep %} + +{% step %} +### Serve the model {% code overflow="wrap" %} ```bash @@ -18,6 +25,26 @@ python3 -m sglang.launch_server --model-path openai/gpt-oss-20b python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --mem-fraction-static 0.95 ``` {% endcode %} +{% endstep %} + +{% step %} +### Benchmark + +SGLang version (0.5.1) + +
# gpt-oss-20b
python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path openai/gpt-oss-20b --batch 1 --input-len 1024 --output-len 1024
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| 1/1024/1024 | 0.05 | 3.29 | 22668.19 | 304.59 |
| 1/8192/1024 | 0.15 | 3.39 | 55870.90 | 295.09 |
| 8/1024/1024 | 0.12 | 5.92 | 65760.01 | 1350.83 |
| 8/8192/1024 | 1.05 | 6.62 | 62209.72 | 1209.10 |
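The four rows above only vary `--batch` and `--input-len`, so a small sweep like the sketch below could reproduce the whole table, assuming the gpt-oss-20b server from the previous step is still listening on port 30000.

```bash
# Sweep the four batch/input configurations from the table above
# against the already-running gpt-oss-20b server.
for cfg in "1 1024" "1 8192" "8 1024" "8 8192"; do
  set -- $cfg   # $1 = batch size, $2 = input length
  python -m sglang.bench_one_batch_server \
    --base-url http://127.0.0.1:30000 \
    --model-path openai/gpt-oss-20b \
    --batch "$1" --input-len "$2" --output-len 1024
done
```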
# gpt-oss-120b
python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path openai/gpt-oss-120b --batch 1 --input-len 1024 --output-len 1024
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| 1/1024/1024 | 0.07 | 4.73 | 15803.59 | 211.49 |
| 1/8192/1024 | 0.23 | 4.89 | 35004.05 | 204.75 |
| 8/1024/1024 | 0.21 | 10.17 | 39132.98 | 786.63 |
| 8/8192/1024 | 1.76 | 11.20 | 37178.23 | 714.53 |
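Before timing the 120b model, it is worth confirming the server actually answers. A minimal smoke test, assuming the default OpenAI-compatible endpoint that `sglang.launch_server` exposes on port 30000 (the exact response shape may vary by SGLang version):

```bash
# Quick check against the OpenAI-compatible chat endpoint.
curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 8
      }'
```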
+{% endstep %} +{% endstepper %} ### Serving with 2 x H100 From ec1f9a18d5086df0b91fcdd143a644e6ddee3145 Mon Sep 17 00:00:00 2001 From: Xiaotong Jiang Date: Thu, 4 Sep 2025 00:43:09 -0700 Subject: [PATCH 2/5] . --- .../deepseek-v3.1-v3-r1/usage-guide.md | 201 ++++++++++++++++-- .../model-recipes/gpt-oss/usage-guide.md | 42 +++- 2 files changed, 219 insertions(+), 24 deletions(-) diff --git a/sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md b/sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md index 3973d23..d9b9c86 100644 --- a/sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md +++ b/sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md @@ -2,10 +2,17 @@ ### Serving with 1 x 8 x H200 -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) +{% stepper %} +{% step %} +### Install SGLang - Note if you are using RDMA and are using docker, `--network host` and `--privileged` are required for `docker run` command. -2. Serve the model +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) + +Note if you are using RDMA and are using docker, `--network host` and `--privileged` are required for `docker run` command. +{% endstep %} + +{% step %} +### Serve the model {% code overflow="wrap" %} ```bash @@ -15,11 +22,26 @@ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-r * You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`. * [Optional Optimization Options](./#optional-performance-optimization) +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
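Until numbers land here, the same `bench_one_batch_server` pattern shown in the GPT-OSS recipe should carry over. A sketch, assuming the DeepSeek-V3 server from the previous step is up on the default port 30000:

```bash
# Hypothetical benchmark run mirroring the GPT-OSS recipe;
# vary --batch and --input-len to fill out the table rows.
python3 -m sglang.bench_one_batch_server \
  --base-url http://127.0.0.1:30000 \
  --model-path deepseek-ai/DeepSeek-V3 \
  --batch 1 --input-len 1024 --output-len 1024
```

The multi-node and CPU sections below can reuse the same command with their own `--model-path` and base URL.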
+{% endstep %} +{% endstepper %} ### Serving with 1 x 8 x MI300X -1. Install SGLang following [the instruction](../installation/amd-gpus.md) -2. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](../installation/amd-gpus.md) +{% endstep %} + +{% step %} +### Serve the model {% code overflow="wrap" %} ```bash @@ -28,11 +50,26 @@ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-r {% endcode %} [Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference. +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
+{% endstep %} +{% endstepper %} ### Serving with 2 x 8 x H100/800/20 -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 2 nodes -2. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 2 nodes +{% endstep %} + +{% step %} +### Serve the model If the first node's IP is `10.0.0.1` , launch the server in both node with below commands @@ -49,11 +86,26 @@ python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --d * If the command fails, try setting the `GLOO_SOCKET_IFNAME` parameter. For more information, see [Common Environment Variables](https://pytorch.org/docs/stable/distributed.html#common-environment-variables). * If the multi nodes support NVIDIA InfiniBand and encounter hanging issues during startup, consider adding the parameter `export NCCL_IB_GID_INDEX=3`. For more information, see [this](https://github.com/sgl-project/sglang/issues/3516#issuecomment-2668493307). * [Optional Optimization Options](./#optional-performance-optimization) +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
+{% endstep %} +{% endstepper %} ### Serving with Xeon 6980P CPU -1. Install SGLang following [the instruction](../installation/intel-xeon-cpus.md) -2. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](../installation/intel-xeon-cpus.md) +{% endstep %} + +{% step %} +### Serve the model * For w8a8\_int8 @@ -83,11 +135,26 @@ python -m sglang.launch_server \ --max-total-token 65536 \ --tp 6 ``` +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
+{% endstep %} +{% endstepper %} ### Serving with 2 x 8 x H200 -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 2 nodes -2. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 2 nodes +{% endstep %} + +{% step %} +### Serve the model If the first node's IP is `10.0.0.1` , launch the server in both node with below commands @@ -102,12 +169,32 @@ python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --d {% endcode %} * [Optional Optimization Options](./#optional-performance-optimization) +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
+{% endstep %} +{% endstepper %} ### Serving with 4 x 8 x A100 -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 4 nodes -2. As A100 does not support FP8, we need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with [script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) mentioned [here](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) first -3. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 4 nodes +{% endstep %} + +{% step %} +### Convert Model Checkpoints + +As A100 does not support FP8, we need to convert the [FP8 model checkpoints](https://huggingface.co/deepseek-ai/DeepSeek-V3) to BF16 with [script](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) mentioned [here](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py) first +{% endstep %} + +{% step %} +### Serve the model If the first node's IP is `10.0.0.1` , and the converted model path is `/path/to/DeepSeek-V3-BF16`, launch the server in 4 nodes with below commands @@ -128,11 +215,31 @@ python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 - {% endcode %} * [Optional Optimization Options](./#optional-performance-optimization) +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
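For reference, the FP8-to-BF16 conversion from the earlier step is typically invoked as below. The paths are placeholders, and the flag names follow the `fp8_cast_bf16.py` script in the DeepSeek-V3 repository; double-check against the script you cloned:

```bash
# Convert the FP8 checkpoint to BF16 so A100s can serve it.
# Both paths are placeholders for your local directories.
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
python fp8_cast_bf16.py \
  --input-fp8-hf-path /path/to/DeepSeek-V3 \
  --output-bf16-hf-path /path/to/DeepSeek-V3-BF16
```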
+{% endstep %} +{% endstepper %} ### Serving with 8 x A100 +<<<<<<< HEAD 1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) 2. Serve the model +======= +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) +{% endstep %} + +{% step %} +### Serve the model +>>>>>>> c926237 (.) {% code overflow="wrap" %} ```bash @@ -151,11 +258,26 @@ python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ -- {% endcode %} Note that `awq_marlin` only supports `float16` now, which may lead to some precision loss. +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
+{% endstep %} +{% endstepper %} ### Serving with 2 x 8 x A100/A800 -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 4 nodes -2. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 4 nodes +{% endstep %} + +{% step %} +### Serve the model There are block-wise and per-channel quantization methods, weights have already been quantized in these huggingface checkpoint: @@ -179,11 +301,26 @@ python3 -m sglang.launch_server \ {% endcode %} * [Optional Optimization Options](./#optional-performance-optimization) +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
+{% endstep %} +{% endstepper %} ### Serving with 4 x 8 x L40S nodes -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 4 nodes -2. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) for the 4 nodes +{% endstep %} + +{% step %} +### Serve the model Running with per-channel quantization model: @@ -211,12 +348,30 @@ python3 -m sglang.launch_server --model meituan/DeepSeek-R1-Channel-INT8 --tp 32 --enable-torch-compile --torch-compile-max-bs 32 ``` {% endcode %} +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
+{% endstep %} +{% endstepper %} ### Example: Serving on any cloud or Kubernetes with SkyPilot +{% stepper %} +{% step %} +### Install SkyPilot + SkyPilot helps find cheapest available GPUs across any cloud or existing Kubernetes clusters and launch distributed serving with a single command. See details [here](https://github.com/skypilot-org/skypilot/tree/master/llm/deepseek-r1). -To serve on multiple nodes: +```bash +git clone https://github.com/skypilot-org/skypilot.git +``` +{% endstep %} + +{% step %} +### Serve on multiple nodes {% code overflow="wrap" %} ```bash @@ -227,3 +382,11 @@ sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B.yaml --retry-until-up sky launch -c r1 llm/deepseek-r1/deepseek-r1-671B-A100.yaml --retry-until-up ``` {% endcode %} +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
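Once `sky launch` finishes, the cluster's head IP is what the benchmark needs. A sketch, assuming `sky status --ip` is available in your SkyPilot version and the YAML serves SGLang on port 30000:

```bash
# Look up the head-node IP of the "r1" cluster and smoke-test the server.
IP=$(sky status --ip r1)
curl -s "http://${IP}:30000/health"
```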
+{% endstep %} +{% endstepper %} diff --git a/sglang-cookbook/model-recipes/gpt-oss/usage-guide.md b/sglang-cookbook/model-recipes/gpt-oss/usage-guide.md index bc13f92..a41b9b5 100644 --- a/sglang-cookbook/model-recipes/gpt-oss/usage-guide.md +++ b/sglang-cookbook/model-recipes/gpt-oss/usage-guide.md @@ -48,8 +48,15 @@ SGLang version (0.5.1) ### Serving with 2 x H100 -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) -2. Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFtIWT8LEMaYiYzz0p8P/~/changes/11/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) +{% endstep %} + +{% step %} +### Serve the model {% code overflow="wrap" %} ```bash @@ -57,11 +64,26 @@ SGLang version (0.5.1) python3 -m sglang.launch_server --model-path openai/gpt-oss-120b --tp 2 ``` {% endcode %} +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
+{% endstep %} +{% endstepper %} ### Serving with 1 x B200 -* Install SGLang following [the instruction](../installation/nvidia-blackwell-gpus.md) -* Serve the model +{% stepper %} +{% step %} +### Install SGLang + +Following [the instruction](../installation/nvidia-blackwell-gpus.md) +{% endstep %} + +{% step %} +### Serve the model {% code overflow="wrap" %} ```bash @@ -76,8 +98,10 @@ python3 -m sglang.launch_server --model-path openai/gpt-oss-20b python3 -m sglang.launch_server --model-path openai/gpt-oss-120b ``` {% endcode %} +{% endstep %} -#### With Speculative Decoding +{% step %} +### With Speculative Decoding {% code overflow="wrap" %} ```bash @@ -97,6 +121,14 @@ python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algo E python3 -m sglang.launch_server --model openai/gpt-oss-120b --speculative-algo EAGLE3 --speculative-draft lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --attention-backend triton --tp 4 ``` {% endcode %} +{% endstep %} + +{% step %} +### Benchmark + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| _Benchmark results will be added here_ |  |  |  |  |
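Speculative decoding mostly shows up in ITL, so the useful comparison is the same workload with and without EAGLE3. A sketch using the same tool as the H100/H200 section, assuming the server from the previous step is on port 30000; run it once against a plain server and once against the EAGLE3 server, then compare ITL and output throughput:

```bash
# Same benchmark command for both server configurations.
python3 -m sglang.bench_one_batch_server \
  --base-url http://127.0.0.1:30000 \
  --model-path openai/gpt-oss-120b \
  --batch 1 --input-len 1024 --output-len 1024
```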
+{% endstep %} +{% endstepper %} ### Responses API & Built-in Tools From a600a785abe60b4a403046278dc571420aa40f22 Mon Sep 17 00:00:00 2001 From: Xiaotong Jiang Date: Sun, 7 Sep 2025 09:36:21 -0700 Subject: [PATCH 3/5] . --- .../model-recipes/deepseek-v3.1-v3-r1/usage-guide.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md b/sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md index d9b9c86..72fd784 100644 --- a/sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md +++ b/sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md @@ -226,10 +226,6 @@ python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3-BF16 --tp 32 - ### Serving with 8 x A100 -<<<<<<< HEAD -1. Install SGLang following [the instruction](https://app.gitbook.com/s/FFtIWT8LEMaYiYzz0p8P/sglang-cookbook/installation/nvidia-h-series-a-series-and-rtx-gpus) -2. Serve the model -======= {% stepper %} {% step %} ### Install SGLang @@ -239,7 +235,6 @@ Following [the instruction](https://app.gitbook.com/o/TvLfyTxdRQeudJH7e5QW/s/FFt {% step %} ### Serve the model ->>>>>>> c926237 (.) {% code overflow="wrap" %} ```bash From 5b4e163f6b1ea4ef29bb47ca4f6b17c1a9b12723 Mon Sep 17 00:00:00 2001 From: Admin Date: Mon, 8 Sep 2025 14:16:56 +0000 Subject: [PATCH 4/5] GITBOOK-23: No subject --- SUMMARY.md | 2 ++ .../model-recipes/llama-4/README.md | 6 ++++ .../model-recipes/llama-4/usage-guide.md | 30 +++++++++++++++++++ 3 files changed, 38 insertions(+) create mode 100644 sglang-cookbook/model-recipes/llama-4/README.md create mode 100644 sglang-cookbook/model-recipes/llama-4/usage-guide.md diff --git a/SUMMARY.md b/SUMMARY.md index e0bbea5..71da277 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -15,6 +15,8 @@ * [Usage Guide](sglang-cookbook/model-recipes/deepseek-v3.1-v3-r1/usage-guide.md) * [GPT-OSS](sglang-cookbook/model-recipes/gpt-oss/README.md) * [Usage Guide](sglang-cookbook/model-recipes/gpt-oss/usage-guide.md) + * [Llama 4](sglang-cookbook/model-recipes/llama-4/README.md) + * [Usage Guide](sglang-cookbook/model-recipes/llama-4/usage-guide.md) * [API](sglang-cookbook/api/README.md) * [OpenAI APIs - Completions](sglang-cookbook/api/openai-apis-completions.md) * [OpenAI APIs - Vision](sglang-cookbook/api/openai-apis-vision.md) diff --git a/sglang-cookbook/model-recipes/llama-4/README.md b/sglang-cookbook/model-recipes/llama-4/README.md new file mode 100644 index 0000000..fc1964b --- /dev/null +++ b/sglang-cookbook/model-recipes/llama-4/README.md @@ -0,0 +1,6 @@ +# Llama 4 + +Llama 4 Scout + +
| Weight Type | Hardware Configuration | Instruction | Benchmark |
| --- | --- | --- | --- |
|  | 4 x H100/H200 | Broken link | Broken link |
|  | 8 x H100/H200 | Broken link |  |
|  | 4 x MI300X |  |  |
+ diff --git a/sglang-cookbook/model-recipes/llama-4/usage-guide.md b/sglang-cookbook/model-recipes/llama-4/usage-guide.md new file mode 100644 index 0000000..63d1af9 --- /dev/null +++ b/sglang-cookbook/model-recipes/llama-4/usage-guide.md @@ -0,0 +1,30 @@ +# Usage Guide + +### Serving with 1 x 4 x H200 + +{% stepper %} +{% step %} +#### Install SGLang + + +{% endstep %} + +{% step %} +#### Serve the model (text only) + +{% code overflow="wrap" %} +```bash +python3 -m sglang.launch_server \ + --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \ + --host 0.0.0.0 \ + --port 30000 +``` +{% endcode %} + + +{% endstep %} + +{% step %} +#### Benchmark +{% endstep %} +{% endstepper %} From 215ab38677e7e8a22e01e5550ae826f849d75f7f Mon Sep 17 00:00:00 2001 From: zhenlinc Date: Sun, 5 Oct 2025 11:10:09 +0000 Subject: [PATCH 5/5] GITBOOK-28: No subject --- SUMMARY.md | 4 ++ .../model-recipes/llama-3.1-70b/README.md | 2 + .../llama-3.1-70b/usage-guide.md | 66 ++++++++++++++++++ .../qwen3-next-80b-a3b/README.md | 2 + .../qwen3-next-80b-a3b/usage-guide.md | 68 +++++++++++++++++++ 5 files changed, 142 insertions(+) create mode 100644 sglang-cookbook/model-recipes/llama-3.1-70b/README.md create mode 100644 sglang-cookbook/model-recipes/llama-3.1-70b/usage-guide.md create mode 100644 sglang-cookbook/model-recipes/qwen3-next-80b-a3b/README.md create mode 100644 sglang-cookbook/model-recipes/qwen3-next-80b-a3b/usage-guide.md diff --git a/SUMMARY.md b/SUMMARY.md index 71da277..1277737 100644 --- a/SUMMARY.md +++ b/SUMMARY.md @@ -17,6 +17,10 @@ * [Usage Guide](sglang-cookbook/model-recipes/gpt-oss/usage-guide.md) * [Llama 4](sglang-cookbook/model-recipes/llama-4/README.md) * [Usage Guide](sglang-cookbook/model-recipes/llama-4/usage-guide.md) + * [Llama-3.1-70B](sglang-cookbook/model-recipes/llama-3.1-70b/README.md) + * [Usage Guide](sglang-cookbook/model-recipes/llama-3.1-70b/usage-guide.md) + * [Qwen3-Next-80B-A3B](sglang-cookbook/model-recipes/qwen3-next-80b-a3b/README.md) + * [Usage Guide](sglang-cookbook/model-recipes/qwen3-next-80b-a3b/usage-guide.md) * [API](sglang-cookbook/api/README.md) * [OpenAI APIs - Completions](sglang-cookbook/api/openai-apis-completions.md) * [OpenAI APIs - Vision](sglang-cookbook/api/openai-apis-vision.md) diff --git a/sglang-cookbook/model-recipes/llama-3.1-70b/README.md b/sglang-cookbook/model-recipes/llama-3.1-70b/README.md new file mode 100644 index 0000000..561ff8f --- /dev/null +++ b/sglang-cookbook/model-recipes/llama-3.1-70b/README.md @@ -0,0 +1,2 @@ +# Llama-3.1-70B + diff --git a/sglang-cookbook/model-recipes/llama-3.1-70b/usage-guide.md b/sglang-cookbook/model-recipes/llama-3.1-70b/usage-guide.md new file mode 100644 index 0000000..091c419 --- /dev/null +++ b/sglang-cookbook/model-recipes/llama-3.1-70b/usage-guide.md @@ -0,0 +1,66 @@ +# Usage Guide + +### Serving with 1 x 4 x H200 + +{% stepper %} +{% step %} +#### Install SGLang + +Following [the instruction](../../installation/nvidia-h-series-a-series-and-rtx-gpus.md) +{% endstep %} + +{% step %} +#### Serve the model + +```sh +python3 -m sglang.launch_server \ + --model meta-llama/Llama-3.1-70B-Instruct \ + --tp 4 --trust-remote-code \ + --mem-fraction-static 0.95 \ + --port 30000 \ + --attention-backend triton +``` +{% endstep %} + +{% step %} +#### Benchmark + +```shell +# BS=1/Input=1024/Ouput=1024 +python3 -m sglang.bench_one_batch_server \ + --model meta-llama/Llama-3.1-70B-Instruct \ + --base-url http://localhost:30000 \ + --batch-size 1 \ + --input-len 1024 \ + --output-len 1024 + + +# 1/8192/1024 
+python3 -m sglang.bench_one_batch_server \ + --model meta-llama/Llama-3.1-70B-Instruct \ + --base-url http://localhost:30000 \ + --batch-size 1 \ + --input-len 8192 \ + --output-len 1024 + +# 8/1024/1024 +python3 -m sglang.bench_one_batch_server \ + --model meta-llama/Llama-3.1-70B-Instruct \ + --base-url http://localhost:30000 \ + --batch-size 8 \ + --input-len 1024 \ + --output-len 1024 + +# 8/8192/1024 +python3 -m sglang.bench_one_batch_server \ + --model meta-llama/Llama-3.1-70B-Instruct \ + --base-url http://localhost:30000 \ + --batch-size 8 \ + --input-len 8192 \ + --output-len 1024 +``` + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| 1/1024/1024 | 0.23 | 63 | 4418.12 | 15.73 |
| 1/8192/1024 | 2.19 | 50 | 3737.24 | 19.75 |
| 8/1024/1024 | 0.58 | 2 | 14052.04 | 79.11 |
| 8/8192/1024 | 5.22 | 3 | 12556.62 | 355.16 |
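As a quick consistency check, input throughput should be roughly batch size × input length ÷ TTFT: for the last row, 8 × 8192 / 5.22 ≈ 12555 tok/s, which matches the reported 12556.62.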
+{% endstep %} +{% endstepper %} + diff --git a/sglang-cookbook/model-recipes/qwen3-next-80b-a3b/README.md b/sglang-cookbook/model-recipes/qwen3-next-80b-a3b/README.md new file mode 100644 index 0000000..6cfe5c7 --- /dev/null +++ b/sglang-cookbook/model-recipes/qwen3-next-80b-a3b/README.md @@ -0,0 +1,2 @@ +# Qwen3-Next-80B-A3B + diff --git a/sglang-cookbook/model-recipes/qwen3-next-80b-a3b/usage-guide.md b/sglang-cookbook/model-recipes/qwen3-next-80b-a3b/usage-guide.md new file mode 100644 index 0000000..7b79648 --- /dev/null +++ b/sglang-cookbook/model-recipes/qwen3-next-80b-a3b/usage-guide.md @@ -0,0 +1,68 @@ +# Usage Guide + +### Serving with 1 x 4 x H200 + +{% stepper %} +{% step %} +#### Install SGLang + +Following [the instruction](../../installation/nvidia-h-series-a-series-and-rtx-gpus.md) +{% endstep %} + +{% step %} +#### Serve the model + +```sh +python3 -m sglang.launch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --tp 4 --trust-remote-code \ + --mem-fraction-static 0.95 \ + --port 30000 \ + --attention-backend triton +``` +{% endstep %} + +{% step %} +#### Benchmark + +```sh +# BS=1/Input=1024/Ouput=1024 +python3 -m sglang.bench_one_batch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --base-url http://localhost:30000 \ + --batch-size 1 \ + --input-len 1024 \ + --output-len 1024 + + +# 1/8192/1024 +python3 -m sglang.bench_one_batch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --base-url http://localhost:30000 \ + --batch-size 1 \ + --input-len 8192 \ + --output-len 1024 + +# 8/1024/1024 +python3 -m sglang.bench_one_batch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --base-url http://localhost:30000 \ + --batch-size 8 \ + --input-len 1024 \ + --output-len 1024 + +# 8/8192/1024 +python3 -m sglang.bench_one_batch_server \ + --model Qwen/Qwen3-Next-80B-A3B-Instruct \ + --base-url http://localhost:30000 \ + --batch-size 8 \ + --input-len 8192 \ + --output-len 1024 +``` + +
| BS/Input/Output Length | TTFT (s) | ITL (ms) | Input Throughput (tok/s) | Output Throughput (tok/s) |
| --- | --- | --- | --- | --- |
| 1/1024/1024 | 0.40 | 12 | 2547.99 | 81.41 |
| 1/8192/1024 | 1.27 | 15 | 6459.45 | 67.89 |
| 8/1024/1024 | 0.95 | 3 | 8665.16 | 289.15 |
| 8/8192/1024 | 3.17 | 2 | 20642.36 | 474.05 |
+ +\ + +{% endstep %} +{% endstepper %}