This repository was archived by the owner on Sep 23, 2025. It is now read-only.

Conversation

@jiafuzha (Contributor):

No description provided.

jiafuzha added 2 commits June 27, 2024 07:26
Signed-off-by: Jiafu Zhang <[email protected]>
Signed-off-by: Jiafu Zhang <[email protected]>
JoshuaL3000 and others added 5 commits June 27, 2024 13:47
Signed-off-by: JoshuaL3000 <[email protected]>
Signed-off-by: Jiafu Zhang <[email protected]>
Signed-off-by: Jiafu Zhang <[email protected]>
Signed-off-by: Jiafu Zhang <[email protected]>
@jiafuzha (Contributor, Author):

@KepingYan @carsonwang @xwu99 please help review.

@xwu-intel:

> @KepingYan @carsonwang @xwu99 please help review.

Great work! I will check this week and let you know.

jiafuzha added 4 commits July 5, 2024 07:00
… threads for prompt decoding and next token decoding

Signed-off-by: Jiafu Zhang <[email protected]>
… threads for prompt decoding and next token decoding

Signed-off-by: Jiafu Zhang <[email protected]>


def _verify_quntization(self):
    if self.quantization is not None and self.quantization == "ns":


Suggested change:
-    if self.quantization is not None and self.quantization == "ns":
+    if self.quantization == "ns":

@jiafuzha (Contributor, Author):

good catch

option(IE_AVX "inference_engine: enable AVX" ON)
option(IE_AVX2 "inference_engine: enable AVX2" ON)
option(IE_F16C "inference_engine: enable F16C" ON)
option(IE_AVX512 "inference_engine: enable AVX512" OFF)
@xwu-intel (Jul 9, 2024):

Do we build AVX512 by default? I am not sure about the purpose of the option here. How can we check whether an AVX512 build is enabled?

@jiafuzha (Contributor, Author):

It's actually not used for now. Intrinsics are checked dynamically at runtime. I will keep it here for later.
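
For context, a minimal sketch of what such a runtime check can look like on Linux (illustrative only, not the engine's actual detection code; it just reads the CPU feature flags from /proc/cpuinfo):

import platform

def host_supports_avx512() -> bool:
    # Illustrative helper: report whether the host CPU advertises the AVX-512
    # foundation instructions. Only meaningful on Linux, where /proc/cpuinfo
    # lists the CPU feature flags.
    if platform.system() != "Linux":
        return False
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return "avx512f" in line.split()
    except OSError:
        pass
    return False

print("AVX-512 available at runtime:", host_supports_avx512())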

model_loader = importlib.import_module("vllm.model_executor.model_loader")
importlib.reload(model_loader)

logger.info("__ns extension: use ns model loader for ns model, %s", NSModelLoaderV2.__name__)
@xwu-intel (Jul 9, 2024):

Would you mind summarizing, in the PR description, a table of which properties are monkey-patched with the related NS classes, for better understanding?

@jiafuzha (Contributor, Author):

All monkey-patches are in this __init__.py. They are:

__init__.py:52] __ns extension: add ns quantization config, NSQuantConfig
__init__.py:105] __ns extension: use ns model loader for ns model, NSModelLoaderV2
__init__.py:116] __ns extension: replace LlamaModel with ns LLamaModel, NSLLamaModel
__init__.py:136] __ns extension: use ns cache engine for ns, NSCPUCacheEngine
__init__.py:146] __ns extension: replace execute_model in cpu_model_runner, execute_model
__init__.py:171] __ns extension: replace BlockSpaceManager.get_block_space_manager_class in vllm.core.interfaces with get_block_space_manager_class
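
For readers unfamiliar with the pattern, the patches above all boil down to replacing attributes on already-imported vllm modules at extension import time. A minimal sketch of that pattern (the helper and the commented-out call are illustrative only; any attribute name not in the log lines above is hypothetical):

import importlib

def patch_attribute(module_name: str, attr_name: str, replacement) -> None:
    # Import (or re-import) the target module, then swap one attribute so that
    # later lookups through the module resolve to the NS implementation.
    module = importlib.import_module(module_name)
    importlib.reload(module)
    setattr(module, attr_name, replacement)

# Hypothetical usage mirroring the second log line above:
# patch_attribute("vllm.model_executor.model_loader", "ModelLoader", NSModelLoaderV2)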

if infer_conf.vllm.extension == "ns":
    logger.warn("applying neural speed extension to vllm ...")
    try:
        from vllm.extension import ns as ns
@xwu-intel (Jul 9, 2024):

There are several places with this pattern.

Suggested change:
-        from vllm.extension import ns as ns
+        from vllm.extension import ns

@jiafuzha (Contributor, Author):

good catch

physical_cores = psutil.cpu_count(logical=False)
# reserve one core for non-ns tasks
physical_cores = physical_cores if physical_cores <= 1 else physical_cores - 1
threads = int(os.environ.get(_NS_NUM_THREADS, str(physical_cores)))
@xwu-intel (Jul 9, 2024):

Consider setting the cores for NS according to cpus_per_worker in the inference config; all other predictors use this config to assign CPU resources consistently.

@jiafuzha (Contributor, Author):

These are NS threads, which are different from Ray CPUs. I want to keep them separate.

@jiafuzha (Contributor, Author):

It just came to my mind that a bigger --num-cpus value causes Ray to start more background processes, like client.poll or server.poll processes, which hurts overall performance. That's the reason I set --num-cpus to 1, to reduce them. We need to find an elegant way to balance them.

@jiafuzha (Contributor, Author):

As tested, I can use the OMP_NUM_THREADS=1 env var to reduce these background processes no matter what the --num-cpus value is.
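
A small usage sketch of that setup (assumptions: Ray is installed and the launcher sets the environment before ray.init; this is not the exact launch code in this PR):

import os
import ray

# Keep OpenMP pools in Ray's helper processes small so they stay cheap,
# independent of how many CPUs Ray itself is told about.
os.environ["OMP_NUM_THREADS"] = "1"

# Mirrors the --num-cpus=1 setting discussed above.
ray.init(num_cpus=1)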

# get available cores
try:
    max_prompt_tokens = int(os.environ.get(_NS_MAX_PROMPT_TOKENS, "8192"))
    # cpus_per_worker is set to 1 for better ns perf so it's inappropriate to use ray to get available cores
@xwu-intel (Jul 9, 2024):

You can directly use cpus_per_worker to set CPU cores for NS; there is no need to set cpus_per_worker to 1 and use another env var for NS. See below.

@jiafuzha (Contributor, Author):

No, we cannot. The cpus_per_worker value is set as OMP_NUM_THREADS, and I actually need to set OMP_NUM_THREADS to 1 to get better performance.

@xwu-intel:

The vllm-ext/inference_engine/cpp/models/application directory should be in vllm-ext/inference_engine/cpp, according to the neural_speed code structure.

@jiafuzha (Contributor, Author):

> The vllm-ext/inference_engine/cpp/models/application directory should be in vllm-ext/inference_engine/cpp, according to the neural_speed code structure.

I don't follow the neural_speed structure 100%, since we only use it as an inference engine.

@jiafuzha (Contributor, Author):

This is the code before the merge. I've addressed all review comments in the after-merge branch, so let me close this one and submit a new PR from the after-merge branch.
