Integrate vllm and inference engine (neural speed) #264
Conversation
@KepingYan @carsonwang @xwu99 please help review.
Great work! I will check this week and let you know.
```python
def _verify_quntization(self):
    if self.quantization is not None and self.quantization == "ns":
```
Suggested change:

```diff
- if self.quantization is not None and self.quantization == "ns":
+ if self.quantization == "ns":
```
good catch
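For context, a minimal illustration (not from the PR) of why the explicit `None` guard is redundant: comparing `None` to the string `"ns"` is already `False`, so the shorter condition behaves identically.

```python
# Minimal illustration (not from the PR): the None guard is redundant,
# since None == "ns" already evaluates to False.
for quantization in (None, "ns", "awq"):
    old = quantization is not None and quantization == "ns"
    new = quantization == "ns"
    assert old == new
```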
```cmake
option(IE_AVX "inference_engine: enable AVX" ON)
option(IE_AVX2 "inference_engine: enable AVX2" ON)
option(IE_F16C "inference_engine: enable F16C" ON)
option(IE_AVX512 "inference_engine: enable AVX512" OFF)
```
Do we build AVX512 by default? I am not sure about the purpose of this option here. How can we check whether the AVX512 build is enabled?
It's actually not used for now. Intrinsics are checked dynamically at runtime. I will keep it here for later.
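As an aside, here is a small sketch (an assumption for illustration, not part of this PR) of how intrinsic support could be checked at runtime from Python using the py-cpuinfo package:

```python
# Hypothetical runtime check using py-cpuinfo (not part of this PR):
# inspect the CPU flags to see whether AVX2 / AVX-512 intrinsics are available.
import cpuinfo

flags = set(cpuinfo.get_cpu_info().get("flags", []))
print("AVX2 supported:", "avx2" in flags)
print("AVX-512F supported:", "avx512f" in flags)
```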
```python
model_loader = importlib.import_module("vllm.model_executor.model_loader")
importlib.reload(model_loader)

logger.info("__ns extension: use ns model loader for ns model, %s", NSModelLoaderV2.__name__)
```
Do you mind summarizing, in the PR description, a table of which properties are monkey-patched with the related NS classes, for better understanding?
All monkey-patches are in this `__init__.py`. They are:

```
__init__.py:52]  __ns extension: add ns quantization config, NSQuantConfig
__init__.py:105] __ns extension: use ns model loader for ns model, NSModelLoaderV2
__init__.py:116] __ns extension: replace LlamaModel with ns LLamaModel, NSLLamaModel
__init__.py:136] __ns extension: use ns cache engine for ns, NSCPUCacheEngine
__init__.py:146] __ns extension: replace execute_model in cpu_model_runner, execute_model
__init__.py:171] __ns extension: replace BlockSpaceManager.get_block_space_manager_class in vllm.core.interfaces with get_block_space_manager_class
```
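For readers unfamiliar with the approach, a minimal, hypothetical sketch of the monkey-patching pattern these log lines describe; the module path comes from the snippet above, but the helper function and the attribute name in the usage comment are illustrative assumptions, not the PR's actual code.

```python
# Hypothetical sketch of the monkey-patching pattern (illustrative only).
import importlib


def patch_module_attr(module_path: str, attr_name: str, replacement) -> None:
    # Import (or re-import) the target vLLM module, then swap one of its
    # attributes for the NS replacement class/function.
    module = importlib.import_module(module_path)
    importlib.reload(module)
    setattr(module, attr_name, replacement)


# Example usage with an assumed attribute name:
# patch_module_attr("vllm.model_executor.model_loader", "ModelLoaderV2", NSModelLoaderV2)
```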
```python
if infer_conf.vllm.extension == "ns":
    logger.warn("applying neural speed extension to vllm ...")
    try:
        from vllm.extension import ns as ns
```
There are several places with this same pattern:
Suggested change:

```diff
- from vllm.extension import ns as ns
+ from vllm.extension import ns
```
good catch
```python
physical_cores = psutil.cpu_count(logical=False)
# reserve one core for non-ns tasks
physical_cores = physical_cores if physical_cores <= 1 else physical_cores - 1
threads = int(os.environ.get(_NS_NUM_THREADS, str(physical_cores)))
```
Consider setting the cores for ns according to cpus_per_worker in the inference config; all other predictors use this config to assign CPU resources consistently.
These are NS threads, which are different from Ray CPUs. I want to keep them separate.
It just came to my mind that a bigger --num-cpus value causes Ray to start up more background processes, such as client.poll or server.poll processes, which hurts overall performance. That's why I set num-cpus to 1 to reduce them. We need to find an elegant way to balance them.
As tested, I can set the OMP_NUM_THREADS=1 env variable to reduce these background processes regardless of the --num-cpus value.
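To make this concrete, here is a small sketch of the approach being described; the env var name "NS_NUM_THREADS" and the exact placement are illustrative assumptions, not the PR's code.

```python
# Illustrative sketch (not the PR's code): keep OpenMP at one thread to avoid
# extra Ray background processes, and size NS threads from physical cores,
# reserving one core for non-ns tasks.
import os
import psutil

os.environ["OMP_NUM_THREADS"] = "1"

physical_cores = psutil.cpu_count(logical=False) or 1
default_threads = physical_cores if physical_cores <= 1 else physical_cores - 1
# "NS_NUM_THREADS" is a placeholder for the env var name used in the PR.
ns_threads = int(os.environ.get("NS_NUM_THREADS", str(default_threads)))
print("NS threads:", ns_threads)
```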
```python
# get available cores
try:
    max_prompt_tokens = int(os.environ.get(_NS_MAX_PROMPT_TOKENS, "8192"))
    # cpus_per_work is set to 1 for better ns perf so it's inappropriate to use ray to get available cores
```
You can directly use cpus_per_worker to set the CPU cores for ns; there is no need to set cpus_per_worker to 1 and use another env var for ns. See below.
No, we cannot. The cpus_per_worker value is set as OMP_NUM_THREADS, and I actually need to set OMP_NUM_THREADS to 1 to get better performance.
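A short illustration of the coupling being discussed (hypothetical values; the exact mechanism in the worker is an assumption): if cpus_per_worker is exported as OMP_NUM_THREADS, reusing it to size NS threads would also inflate the OpenMP thread count, which is exactly what hurts performance here.

```python
# Hypothetical illustration of the coupling described above (not the PR's code).
import os

cpus_per_worker = 8  # example value from an inference config (assumed)

# Reusing cpus_per_worker would also raise the OpenMP thread count:
os.environ["OMP_NUM_THREADS"] = str(cpus_per_worker)

# Desired instead: keep OpenMP at 1 and size NS threads independently.
os.environ["OMP_NUM_THREADS"] = "1"
ns_threads = cpus_per_worker
print("OMP_NUM_THREADS:", os.environ["OMP_NUM_THREADS"], "NS threads:", ns_threads)
```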
I am not 100% following the neural_speed structure since we only take it as an inference engine.

This is the code before the merge. I've addressed all review comments in the after-merge branch, so let me close this one and submit a new PR in the after-merge branch.
No description provided.