Integrate vllm and inference engine (neural speed) #264
Conversation
@KepingYan @carsonwang @xwu99 please help review.
Great work! I will check this week and let you know.
```python
def _verify_quntization(self):
    if self.quantization is not None and self.quantization == "ns":
```
Suggested change:

```diff
- if self.quantization is not None and self.quantization == "ns":
+ if self.quantization == "ns":
```
good catch
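For context, a minimal illustration (not from the PR) of why the explicit `None` guard is redundant: comparing `None` to the string `"ns"` is already `False`, so the shorter condition behaves identically.

```python
# Minimal illustration (not from the PR): the None guard is redundant,
# since None == "ns" already evaluates to False.
for quantization in (None, "ns", "awq"):
    old = quantization is not None and quantization == "ns"
    new = quantization == "ns"
    assert old == new
```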
```cmake
option(IE_AVX "inference_engine: enable AVX" ON)
option(IE_AVX2 "inference_engine: enable AVX2" ON)
option(IE_F16C "inference_engine: enable F16C" ON)
option(IE_AVX512 "inference_engine: enable AVX512" OFF)
```
Do we build AVX512 by default? I am not sure about the purpose of this option here. How can we check whether the AVX512 build is enabled?
It's actually not used for now. Intrinsics are checked dynamically at runtime. I will keep it here for later.
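As an aside, here is a small sketch (an assumption for illustration, not part of this PR) of how intrinsic support could be checked at runtime from Python using the py-cpuinfo package:

```python
# Hypothetical runtime check using py-cpuinfo (not part of this PR):
# inspect the CPU flags to see whether AVX2 / AVX-512 intrinsics are available.
import cpuinfo

flags = set(cpuinfo.get_cpu_info().get("flags", []))
print("AVX2 supported:", "avx2" in flags)
print("AVX-512F supported:", "avx512f" in flags)
```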
```python
model_loader = importlib.import_module("vllm.model_executor.model_loader")
importlib.reload(model_loader)

logger.info("__ns extension: use ns model loader for ns model, %s", NSModelLoaderV2.__name__)
```
Do you mind summarizing, in the PR description, a table of which properties are monkey-patched with the related NS classes, for better understanding?
All monkey-patches are in this `__init__.py`. They are:

```
__init__.py:52]  __ns extension: add ns quantization config, NSQuantConfig
__init__.py:105] __ns extension: use ns model loader for ns model, NSModelLoaderV2
__init__.py:116] __ns extension: replace LlamaModel with ns LLamaModel, NSLLamaModel
__init__.py:136] __ns extension: use ns cache engine for ns, NSCPUCacheEngine
__init__.py:146] __ns extension: replace execute_model in cpu_model_runner, execute_model
__init__.py:171] __ns extension: replace BlockSpaceManager.get_block_space_manager_class in vllm.core.interfaces with get_block_space_manager_class
```
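For readers unfamiliar with the approach, a minimal, hypothetical sketch of the monkey-patching pattern these log lines describe; the module path comes from the snippet above, but the helper function and the attribute name in the usage comment are illustrative assumptions, not the PR's actual code.

```python
# Hypothetical sketch of the monkey-patching pattern (illustrative only).
import importlib


def patch_module_attr(module_path: str, attr_name: str, replacement) -> None:
    # Import (or re-import) the target vLLM module, then swap one of its
    # attributes for the NS replacement class/function.
    module = importlib.import_module(module_path)
    importlib.reload(module)
    setattr(module, attr_name, replacement)


# Example usage with an assumed attribute name:
# patch_module_attr("vllm.model_executor.model_loader", "ModelLoaderV2", NSModelLoaderV2)
```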
```python
if infer_conf.vllm.extension == "ns":
    logger.warn("applying neural speed extension to vllm ...")
    try:
        from vllm.extension import ns as ns
```
There are several places with this same pattern:
Suggested change:

```diff
- from vllm.extension import ns as ns
+ from vllm.extension import ns
```
good catch
```python
physical_cores = psutil.cpu_count(logical=False)
# reserve one core for non-ns tasks
physical_cores = physical_cores if physical_cores <= 1 else physical_cores - 1
threads = int(os.environ.get(_NS_NUM_THREADS, str(physical_cores)))
```
Consider setting the cores for ns according to cpus_per_worker in the inference config; all other predictors use this config to assign CPU resources consistently.
These are NS threads, which are different from Ray CPUs. I want to keep them separate.
It just came to my mind that a bigger --num-cpus value causes Ray to start up more background processes, such as client.poll or server.poll processes, which hurts overall performance. That's why I set num-cpus to 1 to reduce them. We need to find an elegant way to balance them.
As tested, I can set the OMP_NUM_THREADS=1 env variable to reduce these background processes regardless of the --num-cpus value.
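To make this concrete, here is a small sketch of the approach being described; the env var name "NS_NUM_THREADS" and the exact placement are illustrative assumptions, not the PR's code.

```python
# Illustrative sketch (not the PR's code): keep OpenMP at one thread to avoid
# extra Ray background processes, and size NS threads from physical cores,
# reserving one core for non-ns tasks.
import os
import psutil

os.environ["OMP_NUM_THREADS"] = "1"

physical_cores = psutil.cpu_count(logical=False) or 1
default_threads = physical_cores if physical_cores <= 1 else physical_cores - 1
# "NS_NUM_THREADS" is a placeholder for the env var name used in the PR.
ns_threads = int(os.environ.get("NS_NUM_THREADS", str(default_threads)))
print("NS threads:", ns_threads)
```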
```python
# get available cores
try:
    max_prompt_tokens = int(os.environ.get(_NS_MAX_PROMPT_TOKENS, "8192"))
    # cpus_per_work is set to 1 for better ns perf so it's inappropriate to use ray to get available cores
```
You can directly use cpus_per_worker to set the CPU cores for ns; there is no need to set cpus_per_worker to 1 and use another env var for ns. See below.
No, we cannot. The cpus_per_worker value is set as OMP_NUM_THREADS, and I actually need to set OMP_NUM_THREADS to 1 to get better performance.
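A short illustration of the coupling being discussed (hypothetical values; the exact mechanism in the worker is an assumption): if cpus_per_worker is exported as OMP_NUM_THREADS, reusing it to size NS threads would also inflate the OpenMP thread count, which is exactly what hurts performance here.

```python
# Hypothetical illustration of the coupling described above (not the PR's code).
import os

cpus_per_worker = 8  # example value from an inference config (assumed)

# Reusing cpus_per_worker would also raise the OpenMP thread count:
os.environ["OMP_NUM_THREADS"] = str(cpus_per_worker)

# Desired instead: keep OpenMP at 1 and size NS threads independently.
os.environ["OMP_NUM_THREADS"] = "1"
ns_threads = cpus_per_worker
print("OMP_NUM_THREADS:", os.environ["OMP_NUM_THREADS"], "NS threads:", ns_threads)
```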
I am not 100% following the neural_speed structure since we only take it as an inference engine.

This is the code before the merge. I've addressed all review comments in the after-merge branch, so let me close this one and submit a new PR in the after-merge branch.
No description provided.