This repository was archived by the owner on Sep 23, 2025. It is now read-only.
Closed
67 commits
d2d1f20  add benchmark run script, visualize script (KepingYan, Apr 17, 2024)
88cc01e  upd (KepingYan, Apr 26, 2024)
083ae60  update multi replicas (KepingYan, May 7, 2024)
4c6fa74  use --result-dir to parse results (KepingYan, May 8, 2024)
1b3b13a  fix ci proxy (KepingYan, May 8, 2024)
184e00e  add test ci (KepingYan, May 9, 2024)
bd85b7d  add license (KepingYan, May 9, 2024)
38c52ed  fix (KepingYan, May 9, 2024)
78dc091  fix (KepingYan, May 9, 2024)
7cc0de0  add autoscaling config (KepingYan, May 10, 2024)
e241b25  fix ci (KepingYan, May 10, 2024)
3eb1c08  fix ci (KepingYan, May 10, 2024)
882ff4d  add package matplotlib (KepingYan, May 10, 2024)
21994cd  verify CI test (KepingYan, May 10, 2024)
d688804  verify CI test (KepingYan, May 11, 2024)
c8eabbc  create assets folder to place pictures (KepingYan, May 13, 2024)
3905082  verify CI test (KepingYan, May 13, 2024)
97ec06a  support openai autoscaling (KepingYan, May 13, 2024)
606f286  remove (KepingYan, May 13, 2024)
55c1dd1  integrate vllm and ns (jiafuzha, May 16, 2024)
e709010  update config file (KepingYan, May 17, 2024)
5b1bd85  integrate vllm and ns (jiafuzha, May 17, 2024)
eb71ace  integrate vllm and ns (jiafuzha, May 17, 2024)
a969f7f  remove .eggs (jiafuzha, May 17, 2024)
1b6aba3  integration adjustment (jiafuzha, May 17, 2024)
ce3ac61  llm on ray deployed (jiafuzha, May 20, 2024)
213ad89  llm on ray deployed (jiafuzha, May 20, 2024)
9b4884f  llm on ray deployed (jiafuzha, May 21, 2024)
3cb6f64  more doc (jiafuzha, May 21, 2024)
3f9ba62  merge with master (jiafuzha, May 21, 2024)
f6d60be  more doc for installing vllm ext (jiafuzha, May 21, 2024)
04cddcf  Merge remote-tracking branch 'keping/test_benchmark_script' into vllm… (jiafuzha, May 21, 2024)
d0d40dd  Merge remote-tracking branch 'keping/autoscaling_config' into vllm-ns… (jiafuzha, May 21, 2024)
24cc480  bug fix (jiafuzha, May 24, 2024)
295186e  save (jiafuzha, May 27, 2024)
875aa89  add vllm-ext/requirements.txt (jiafuzha, May 27, 2024)
2a462ea  add CMakeLists.txt (jiafuzha, May 27, 2024)
a105321  changed benchmarks (jiafuzha, May 27, 2024)
6aa0540  tuned graph build (jiafuzha, May 30, 2024)
7d6d3b4  graph build time reduced (jiafuzha, May 31, 2024)
473671e  graph build time reduced (jiafuzha, May 31, 2024)
1a88edd  configurable perf stats and copy quant config automatically (jiafuzha, Jun 4, 2024)
dfd26b0  save test script (jiafuzha, Jun 5, 2024)
65c816f  add max_batched_tokens parameter (jiafuzha, Jun 6, 2024)
89936d3  adjustment and ray-vllm-examples (jiafuzha, Jun 12, 2024)
4f088e2  perf tuned and improved by disable mmap for multiple instances (jiafuzha, Jun 17, 2024)
597d83d  remove unnecessary thread sync in kernels (jiafuzha, Jun 19, 2024)
b093d3f  change order of loop, batch size first, then iteration (jiafuzha, Jun 25, 2024)
f1d06d9  add more parameters for vllm-ns test (JoshuaL3000, Jun 26, 2024)
34664ed  add more parameters for vllm-ns test (JoshuaL3000, Jun 26, 2024)
4782617  merged with master (jiafuzha, Jun 27, 2024)
04b7582  prevent quantization being messed-up with multiple processes (jiafuzha, Jun 27, 2024)
b791a1d  fix merge error (jiafuzha, Jun 27, 2024)
79e5daf  rename py to sh (jiafuzha, Jun 27, 2024)
2c9b287  fix formatting issue (jiafuzha, Jun 27, 2024)
5ac7907  fix formatting issue (jiafuzha, Jun 27, 2024)
19fc069  fix merge error (JoshuaL3000, Jun 27, 2024)
76fe811  Merge remote-tracking branch 'refs/remotes/origin/vllm-ns-perf-test' … (jiafuzha, Jun 27, 2024)
5760c65  add vllm-ns ci (jiafuzha, Jun 28, 2024)
30efd3f  remove unnecessary logs (jiafuzha, Jun 28, 2024)
1d9b4e3  remove some debug code (jiafuzha, Jun 28, 2024)
a14a146  add '--privileged' to docker run (jiafuzha, Jun 28, 2024)
4f59cb8  set unlimited max lock memory for neural speed engine (jiafuzha, Jun 28, 2024)
c6a9149  extend token length limit to 8192 for mha (jiafuzha, Jul 5, 2024)
d1ca69e  extend token length limit to 8192 for mha (jiafuzha, Jul 5, 2024)
e9ed9af  extend token length limit to 8192 for mha (fix) and support different… (jiafuzha, Jul 5, 2024)
cbcccc0  extend token length limit to 8192 for mha (fix) and support different… (jiafuzha, Jul 5, 2024)
1 change: 1 addition & 0 deletions .github/license/header_exclude_files.txt
@@ -0,0 +1 @@
vllm-ext/vllm/extension/ns/__init__.py
8 changes: 6 additions & 2 deletions .github/workflows/workflow_inference.yml
@@ -34,7 +34,7 @@ jobs:
name: inference
strategy:
matrix:
model: [ gpt-j-6b, gpt2, bloom-560m, opt-125m, mpt-7b, mistral-7b-v0.1, mpt-7b-ipex-llm, neural-chat-7b-v3-1, CodeLlama-7b-hf, falcon-7b, starcoder, llama-2-7b-chat-hf, llama-2-7b-chat-hf-vllm, gemma-2b, deepseek-coder-33b-instruct]
model: [ gpt-j-6b, gpt2, bloom-560m, opt-125m, mpt-7b, mistral-7b-v0.1, mpt-7b-ipex-llm, neural-chat-7b-v3-1, CodeLlama-7b-hf, falcon-7b, starcoder, llama-2-7b-chat-hf, llama-2-7b-chat-hf-vllm, llama-2-7b-chat-hf-vllm-ns, gemma-2b, deepseek-coder-33b-instruct]
isPR:
- ${{inputs.ci_type == 'pr'}}

@@ -97,7 +97,11 @@ jobs:
run: |
TARGET=${{steps.target.outputs.target}}
source dev/scripts/ci-functions.sh
strat_ray ${TARGET}
if [[ "$TARGET" == *ns ]]; then
start_ray ${TARGET} 1
else
start_ray ${TARGET}
fi

- name: Run Inference Test
run: |
2 changes: 1 addition & 1 deletion .github/workflows/workflow_inference_gaudi2.yml
@@ -94,7 +94,7 @@ jobs:
# check and remove exited container
cid=$(docker ps -a -q --filter "name=${TARGET}")
if [[ ! -z "$cid" ]]; then docker rm $cid; fi
docker run -tid --name="${TARGET}" --hostname="${TARGET}-container" --runtime=habana -v /home/yizhong/Model-References:/root/Model-References -v ${{ inputs.code_checkout_path }}:/root/llm-on-ray -v ${{ inputs.model_cache_path }}:/root/.cache/huggingface/hub/ -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --cap-add sys_ptrace --net=host --ipc=host ${TARGET}:habana
docker run -tid --privileged --name="${TARGET}" --hostname="${TARGET}-container" --runtime=habana -v /home/yizhong/Model-References:/root/Model-References -v ${{ inputs.code_checkout_path }}:/root/llm-on-ray -v ${{ inputs.model_cache_path }}:/root/.cache/huggingface/hub/ -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --cap-add sys_ptrace --net=host --ipc=host ${TARGET}:habana
- name: Start Ray Cluster
run: |
TARGET=${{steps.target.outputs.target}}
2 changes: 1 addition & 1 deletion .github/workflows/workflow_test_benchmark.yml
@@ -80,7 +80,7 @@ jobs:
# check and remove exited container
cid=$(docker ps -a -q --filter "name=${TARGET}")
if [[ ! -z "$cid" ]]; then docker rm $cid; fi
docker run -tid -v ${{ inputs.model_cache_path }}:/root/.cache/huggingface/hub -v ${{ inputs.code_checkout_path }}:/root/llm-on-ray -e http_proxy=${{ inputs.http_proxy }} -e https_proxy=${{ inputs.https_proxy }} --name="${TARGET}" --hostname="${TARGET}-container" ${TARGET}:latest
docker run -tid --privileged -v ${{ inputs.model_cache_path }}:/root/.cache/huggingface/hub -v ${{ inputs.code_checkout_path }}:/root/llm-on-ray -e http_proxy=${{ inputs.http_proxy }} -e https_proxy=${{ inputs.https_proxy }} --name="${TARGET}" --hostname="${TARGET}-container" ${TARGET}:latest

- name: Start Ray Cluster
run: |
2 changes: 1 addition & 1 deletion .github/workflows/workflow_tests.yml
@@ -176,7 +176,7 @@ jobs:
run: |
TARGET=${{steps.target.outputs.target}}
source dev/scripts/ci-functions.sh
strat_ray ${TARGET}
start_ray ${TARGET}

- name: Run Tests
run: |
6 changes: 6 additions & 0 deletions .gitignore
@@ -5,3 +5,9 @@ build/lib/
*.json
*.txt
*.egg-info
.eggs
*.log
*.so
*.ninja_log
build/
runtime_outs/
19 changes: 18 additions & 1 deletion .pre-commit-config.yaml
@@ -7,6 +7,12 @@ repos:
hooks:
- id: ruff
args: [ --fix, --exit-non-zero-on-fix, --ignore=E402, --ignore=E501, --ignore=E731, --ignore=F401]
exclude: |
(?x)^(
examples/inference/vllm/ray-vllm-examples/llm.py|
vllm-ext/vllm/extension/ns/__init__.py|
)$


# Black needs to be ran after ruff with --fix
- repo: https://github.com/psf/black
@@ -18,7 +24,18 @@
rev: "v0.981"
hooks:
- id: mypy
exclude: tests
exclude: |
(?x)^(
tests|
vllm-ext/vllm/extension/ns/model/ns_loader.py|
vllm-ext/vllm/extension/ns/kv_cache/ns_cache.py|
vllm-ext/inference_engine/python/inference_engine/|
vllm-ext/setup.py|
examples/inference/vllm/ray-vllm-examples/llm.py|
llm_on_ray/inference/inference_config.py|
vllm-ext/vllm/extension/ns/
)

additional_dependencies:
- mypy-extensions
- pydantic==1.10.0
2 changes: 1 addition & 1 deletion benchmarks/benchmark_serving.py
@@ -284,7 +284,7 @@ async def send_request(

token_latencies_per_request: List[float] = []

timeout = aiohttp.ClientTimeout(total=3 * 3600)
timeout = aiohttp.ClientTimeout(total=5 * 3600)
async with aiohttp.ClientSession(timeout=timeout) as session:
while True:
async with session.post(api_url, headers=headers, json=pload) as response:
48 changes: 39 additions & 9 deletions benchmarks/run_benchmark.sh
@@ -12,11 +12,23 @@
echo "Please pass in the value of parameter RUN_MODE, which can be 'test' or 'benchmark'."
fi
VALUE_INF=2000

MAX_NUM_SEQS=$VALUE_INF
DYNAMIC_BATCH_SIZE=0
if [ "$#" -gt 2 ]
then
MAX_NUM_SEQS=${3}
fi
if [ "$#" -gt 3 ]
then
DYNAMIC_BATCH_SIZE=${4}
fi

MODEL_ENDPOINT="http://localhost:8000/llama-2-7b-chat-hf"
MODEL_NAME="llama-2-7b-chat-hf"
SHELL_FOLDER=$(cd "$(dirname "$0")";pwd)
BENCHMARK_SCRIPT=$SHELL_FOLDER"/benchmark_serving.py"
WITH_VLLM_CONFIG_FILE=$SHELL_FOLDER"/../llm_on_ray/inference/models/vllm/llama-2-7b-chat-hf-vllm.yaml"
WITH_VLLM_CONFIG_FILE=$SHELL_FOLDER"/../llm_on_ray/inference/models/vllm/llama-2-7b-chat-hf-vllm-ns.yaml"
WO_VLLM_CONFIG_FILE=$SHELL_FOLDER"/../llm_on_ray/inference/models/llama-2-7b-chat-hf.yaml"
DATASET_PATH=$SHELL_FOLDER"/../dataset"
DATASET_SHAREGPT_PATH=$SHELL_FOLDER"/../dataset/ShareGPT_V3_unfiltered_cleaned_split.json"
@@ -107,19 +119,37 @@ latency_throughput(){
tokens_dir=$choice_dir"/tokens_"$input_tokens_length"_"$output_tokens_length

# server
$NUMA_SERVER_COMMAND llm_on_ray-serve --config_file $WITH_VLLM_CONFIG_FILE --simple --max_ongoing_requests $VALUE_INF --max_num_seqs $VALUE_INF
#$numa_server_command llm_on_ray-serve --config_file $with_vllm_config_file --simple --max_concurrent_queries $VALUE_INF --vllm_max_num_seqs $VALUE_INF

# client
for i in $(seq 1 $num_iter)
for num_prompts in ${query_num}
do
echo "Run iter $i"
iter_dir=$tokens_dir"/iter_"$i
for num_prompts in ${query_num}
max_con_q=$VALUE_INF
if [ ! "$DYNAMIC_BATCH_SIZE" = "0" ]
then
if [ "$num_prompts" -lt "$NUM_REPLICA" ] || [ "$num_prompts" -eq "$NUM_REPLICA" ]
then
max_con_q=1
else
max_con_q=$((num_prompts/NUM_REPLICA))
fi
fi
echo "Run num_prompts ${num_prompts} ======================="
echo "deploying model with --max_concurrent_queries $max_con_q --vllm_max_num_seqs $MAX_NUM_SEQS ..."
$NUMA_SERVER_COMMAND llm_on_ray-serve --config_file $WITH_VLLM_CONFIG_FILE --simple --max_ongoing_requests $max_con_q --max_num_seqs $MAX_NUM_SEQS
sleep 1
for i in $(seq 0 $num_iter)
do
if [ $i = 0 ]; then
iter_dir="$tokens_dir/warmup"
echo "Run warmup"
else
iter_dir=$tokens_dir"/iter_"$i
echo "Run iter $i"
fi
results_dir=$iter_dir"/num_prompts_"$num_prompts
echo "Run num_prompts ${num_prompts}"
echo "results_dir: ${results_dir}"
$NUMA_CLIENT_COMMAND python $BENCHMARK_SCRIPT --model-endpoint-base $MODEL_ENDPOINT --model-name $MODEL_NAME --dataset $DATASET_IPEX_PATH --num-prompts $num_prompts --dataset-format IPEX --input-tokens $input_tokens_length --max-new-tokens $output_tokens_length --track-token-latency --vllm-engine --simple --results-dir $results_dir
$NUMA_CLIENT_COMMAND python $BENCHMARK_SCRIPT --model-endpoint-base $MODEL_ENDPOINT --model-name $MODEL_NAME --dataset $DATASET_IPEX_PATH --num-prompts $num_prompts --dataset-format IPEX --input-tokens $input_tokens_length --track-token-latency --max-new-tokens $output_tokens_length --vllm-engine --simple --results-dir $results_dir
done
done
echo "CHOICE 3 generation completed"
@@ -229,4 +259,4 @@
fi
output_tokens_length=32
get_best_latency $iter "${input_tokens_length[*]}" $output_tokens_length $benchmark_dir
fi
fi
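For reference, a hedged invocation of the updated benchmark script. The script's existing leading arguments are kept as placeholders, since only the two new trailing arguments are defined in this diff; the values chosen for them below are illustrative:

```bash
# This PR adds two optional trailing arguments:
#   $3 = MAX_NUM_SEQS (defaults to VALUE_INF), $4 = DYNAMIC_BATCH_SIZE (0 = off).
# <existing-arg-1>/<existing-arg-2> stand in for the script's pre-existing positional
# parameters (benchmark choice and RUN_MODE, i.e. 'test' or 'benchmark').
bash benchmarks/run_benchmark.sh <existing-arg-1> <existing-arg-2> 64 1
```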
42 changes: 42 additions & 0 deletions dev/docker/Dockerfile.vllm_ns
@@ -0,0 +1,42 @@
# syntax=docker/dockerfile:1
FROM ubuntu:22.04

ENV LANG C.UTF-8

WORKDIR /root/llm-on-ray

RUN --mount=type=cache,target=/var/cache/apt apt-get update -y \
&& apt-get install -y build-essential cmake wget curl git vim htop ssh net-tools \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

ENV CONDA_DIR /opt/conda
RUN wget --quiet https://github.com/conda-forge/miniforge/releases/download/23.3.1-1/Miniforge3-Linux-x86_64.sh -O ~/miniforge.sh && \
/bin/bash ~/miniforge.sh -b -p /opt/conda
ENV PATH $CONDA_DIR/bin:$PATH

# setup env
SHELL ["/bin/bash", "--login", "-c"]

RUN --mount=type=cache,target=/opt/conda/pkgs conda init bash && \
unset -f conda && \
export PATH=$CONDA_DIR/bin/:${PATH} && \
mamba config --add channels intel && \
mamba install -y -c conda-forge python==3.9 gxx=12.3 gxx_linux-64=12.3 libxcrypt

COPY ./pyproject.toml .
COPY ./MANIFEST.in .


# Install llm_on_ray
# Create llm_on_ray package directory to bypass the following 'pip install -e' command
RUN mkdir ./llm_on_ray
RUN --mount=type=cache,target=/root/.cache/pip pip install -e .[vllm-cpu] --extra-index-url https://download.pytorch.org/whl/cpu \
--extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/

# Install vllm-ext
# We cannot make empty folder here like './llm_on_ray' since vllm-ext has cpp files to be compiled
COPY ./vllm-ext ./vllm-ext
COPY ./dev/scripts/check-vllm-cpu-build-env.sh ./dev/scripts/check-vllm-cpu-build-env.sh
RUN --mount=type=cache,target=/root/.cache/pip \
source /opt/conda/bin/activate base && cd vllm-ext && pip install . && pip install --upgrade protobuf
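A minimal sketch of building and starting a container from this Dockerfile outside of CI; the image tag, container name, and mount paths are assumptions, while `--privileged` mirrors the flag this PR adds to the CI docker run commands:

```bash
# Build from the repository root so the COPY paths in the Dockerfile resolve;
# BuildKit is needed for the --mount=type=cache instructions.
DOCKER_BUILDKIT=1 docker build -f dev/docker/Dockerfile.vllm_ns -t llm-on-ray:vllm-ns .

# Run with --privileged, matching the CI changes elsewhere in this PR; mounts are illustrative.
docker run -tid --privileged \
  -v "$HOME/.cache/huggingface/hub:/root/.cache/huggingface/hub" \
  -v "$PWD:/root/llm-on-ray" \
  --name llm-on-ray-vllm-ns llm-on-ray:vllm-ns
```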
17 changes: 17 additions & 0 deletions dev/scripts/check-vllm-cpu-build-env.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash

# Check tools
[[ -n $(which g++) ]] || { echo "GNU C++ Compiler (g++) is not found!"; exit 1; }
[[ -n $(which pip) ]] || { echo "pip command is not found!"; exit 1; }

# g++ version should be >=12.3. You can run the following to install GCC 12.3 and dependencies on conda:
# conda install -y -c conda-forge gxx=12.3 gxx_linux-64=12.3 libxcrypt
version_greater_equal()
{
printf '%s\n%s\n' "$2" "$1" | sort --check=quiet --version-sort
}
gcc_version=$(g++ --version | grep -o -E '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
echo
echo Current GNU C++ Compiler version: $gcc_version
echo
version_greater_equal "${gcc_version}" 12.3.0 || { echo "GNU C++ Compiler 12.3.0 or above is required!"; exit 1; }
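The same check can also be run by hand before attempting the vllm-ext build; a small sketch, where the fallback conda install simply mirrors the hint in the script's own comment:

```bash
# Exits non-zero if g++ >= 12.3 or pip is missing; the conda fallback follows the script's hint.
bash dev/scripts/check-vllm-cpu-build-env.sh || \
  conda install -y -c conda-forge gxx=12.3 gxx_linux-64=12.3 libxcrypt
```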
17 changes: 14 additions & 3 deletions dev/scripts/ci-functions.sh
@@ -64,7 +64,7 @@ start_docker() {
docker_args+=("-e=https_proxy=${HTTPS_PROXY}")
fi

echo "docker run -tid "${docker_args[@]}" "${TARGET}:latest""
echo "docker run -tid --privileged "${docker_args[@]}" "${TARGET}:latest""
docker run -tid "${docker_args[@]}" "${TARGET}:latest"
}

@@ -75,11 +75,19 @@ install_dependencies(){
docker exec "${TARGET}" bash -c "pip install -r ./tests/requirements.txt"
}

strat_ray(){
start_ray(){
local TARGET=$1
local UNLIMITED_MAXLOCKMEM=0
if [ "$2" == "1" ]; then
UNLIMITED_MAXLOCKMEM=1
fi

# Start Ray Cluster
docker exec "${TARGET}" bash -c "./dev/scripts/start-ray-cluster.sh"
if [ "$UNLIMITED_MAXLOCKMEM" == "1" ]; then
docker exec "${TARGET}" bash -c "ulimit -l unlimited; ./dev/scripts/start-ray-cluster.sh"
else
docker exec "${TARGET}" bash -c "./dev/scripts/start-ray-cluster.sh"
fi
}

stop_ray(){
@@ -111,6 +119,7 @@ declare -A DF_SUFFIX_MAPPER
DF_SUFFIX_MAPPER=(
["mpt-7b-ipex-llm"]=".ipex-llm"
["llama-2-7b-chat-hf-vllm"]=".vllm"
["llama-2-7b-chat-hf-vllm-ns"]=".vllm_ns"
["gpt-j-6b"]=".cpu_and_deepspeed.pip_non_editable"
)

@@ -128,6 +137,7 @@ declare -A TARGET_SUFFIX_MAPPER
TARGET_SUFFIX_MAPPER=(
["mpt-7b-ipex-llm"]="_ipex-llm"
["llama-2-7b-chat-hf-vllm"]="_vllm"
["llama-2-7b-chat-hf-vllm-ns"]="_vllm-ns"
)

get_TARGET_SUFFIX() {
@@ -143,6 +153,7 @@ declare -A INFERENCE_MAPPER
INFERENCE_MAPPER=(
["mpt-7b-ipex-llm"]="llm_on_ray-serve --config_file llm_on_ray/inference/models/ipex-llm/mpt-7b-ipex-llm.yaml --simple"
["llama-2-7b-chat-hf-vllm"]="llm_on_ray-serve --config_file .github/workflows/config/llama-2-7b-chat-hf-vllm-fp32.yaml --simple"
["llama-2-7b-chat-hf-vllm-ns"]="llm_on_ray-serve --config_file llm_on_ray/inference/models/vllm/llama2-7b-chat-hf-vllm-ns.yaml --simple --max_ongoing_requests 1 --max_num_seqs 1"
["default"]="llm_on_ray-serve --simple --models ${model}"
)

Binary file modified docs/assets/choice3_tokens_32_64.png
33 changes: 33 additions & 0 deletions docs/vllm.md
@@ -24,10 +24,30 @@ Then please run the following script to install vLLM for CPU into your LLM-on-Ra
dev/scripts/install-vllm-cpu.sh
```

## Install vLLM Extension for Quantization (Optional)
To further speed up quantized model inference on Intel CPUs, we extend vLLM to run model decoding in our own inference engine, which is based on [neural-speed](https://github.com/intel/neural-speed).
Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization powered by
[Intel Neural Compressor](https://github.com/intel/neural-compressor). The work is inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp) and further optimized for Intel platforms with our
innovations published at [NeurIPS 2023](https://arxiv.org/abs/2311.00502).

You first need to install llm-on-ray with the "vllm-cpu" extra.

```bash
pip install .[vllm-cpu] --extra-index-url https://download.pytorch.org/whl/cpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
```

Then, install the vLLM extension and the inference engine.
```bash
cd vllm-ext
pip install .

```
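If the install succeeds, the extension package added by this PR should be discoverable under `vllm.extension.ns`; a hedged sanity check (assumes a working `vllm` install in the same environment):

```bash
# Locate the installed extension without executing its module body.
python -c "import importlib.util as u; print(u.find_spec('vllm.extension.ns').origin)"
```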

## Run

#### Serving

* Vanilla vLLM
To serve a model with vLLM and the simple protocol, run the following:

```bash
@@ -36,6 +56,19 @@ llm_on_ray-serve --config_file llm_on_ray/inference/models/vllm/llama-2-7b-chat-

In the above example, `vllm` property is set to `true` in the config file for enabling vLLM.

* vLLM Extension
To serve a model with the vLLM extension and the Intel inference engine, run the following (Note: only Llama-2-7b-chat-hf is supported for now):

```bash
# Copy the quantization config file to your model's snapshot dir, for example .../snapshots/f5db02db7.../
# If you don't copy one manually, a default quant_ns_config.json is copied from the llm_on_ray package.
cp llm_on_ray/inference/models/vllm/quantization/quant_ns_config.json <your model snapshot dir>
# Deploy model serving. Note: if the model has not been quantized yet, it is quantized on the fly based on quant_ns_config.json.
llm_on_ray-serve --config_file llm_on_ray/inference/models/vllm/llama-2-7b-chat-hf-vllm-ns.yaml --simple --keep_serve_terminal --max_num_seqs 64
```

For now, only the Llama-2 model is supported.
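Once the endpoint is up, the benchmark client touched by this PR can double as a smoke test. A hedged sketch follows; the endpoint path matches the default used in benchmarks/run_benchmark.sh (it may differ for the -ns config), and the dataset path, token lengths, and prompt count are placeholders:

```bash
# Flags mirror the client invocation in benchmarks/run_benchmark.sh; values are illustrative.
python benchmarks/benchmark_serving.py \
  --model-endpoint-base http://localhost:8000/llama-2-7b-chat-hf \
  --model-name llama-2-7b-chat-hf \
  --dataset ./dataset/prompt.json --dataset-format IPEX \
  --input-tokens 32 --max-new-tokens 32 --num-prompts 4 \
  --track-token-latency --vllm-engine --simple \
  --results-dir ./benchmark_results
```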

#### Querying

To start a non-streaming query, run the following: