This repository was archived by the owner on Sep 23, 2025. It is now read-only.

Conversation

@Deegue (Contributor) commented Jun 20, 2024

Debug mode with logs and some improvements & questions.

    def streaming_generate(self, prompt, streamer, **config):
        self._process_config(config)
        # Q1: Why is this handled here when using both deepspeed and hpu?
        if self.infer_conf.deepspeed:
@Deegue (Contributor Author):

Here in hpu_predictor.py it is a little confusing, since we have another predictor called deepspeed_predictor. The two predictors are for HPU and CPU; maybe we should rename deepspeed_predictor to something like cpu_predictor or base_predictor.

Reply:

There is a TODO comment to consolidate these two predictors.

        # So that some errors like OOM can be prevented, and the server will be more robust.
        if self.infer_conf.deepspeed:
            return ray.get(
                [worker.generate.remote(prompt, **config) for worker in self.deepspeed_workers]
            )
@Deegue (Contributor Author):

Instead of always using a fixed worker, maybe we should spread the load across all cards when deepspeed is enabled.

@xwu-intel commented Jul 2, 2024:

Currently, tensor parallelism is used to process a single request; one process per card is the industry best practice, given that the load is balanced. You might think each request is sent to a different card, but that is not the case.
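For illustration, a minimal Ray sketch of that dispatch model (the TPWorker actor and its generate method are hypothetical placeholders, not the repository's classes): a single request fans out to every tensor-parallel worker, one process per card, rather than each request being routed to a different card.

import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class TPWorker:
    """Hypothetical tensor-parallel rank: one process pinned to one card."""

    def __init__(self, rank: int):
        self.rank = rank

    def generate(self, prompt: str) -> str:
        # In real tensor parallelism every rank runs the same forward pass on
        # its shard of the weights; collectives inside the model keep the
        # shards in sync, so all ranks end up with the same full output.
        return f"output for {prompt!r}"


# One worker process per card, e.g. 4 cards -> 4 ranks.
workers = [TPWorker.remote(rank) for rank in range(4)]

# A single request is sent to *all* workers (tensor parallelism); requests
# are NOT round-robined with one request per card.
results = ray.get([w.generate.remote("hello") for w in workers])

# Every rank holds the same result, so any one of them can be returned.
print(results[0])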

@xwu-intel requested a review from KepingYan on July 1, 2024.
@xwu-intel:

@KepingYan could you help review this and discuss with @Deegue?

Comment on lines +90 to +94
# optimize transformers for gaudi
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()

Reviewer (Contributor):

Why move this function here?

@Deegue (Contributor Author):

Move this function out of this if: it will be executed both with and without deepspeed.
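Roughly, the restructuring being described looks like this (a simplified sketch; use_deepspeed stands in for self.infer_conf.deepspeed and the branch bodies are elided):

# optimize transformers for gaudi -- applied unconditionally, so both the
# deepspeed and the non-deepspeed paths run with the adapted modeling code
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()

use_deepspeed = True  # placeholder for self.infer_conf.deepspeed

if use_deepspeed:
    ...  # initialize the deepspeed workers
else:
    ...  # load the model directly on HPU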

Reply:

As I understand from the PR title, this PR is meant to add a debug mode, so why touch other code? Could you submit a separate PR to address the other issues?

Comment on lines -287 to -290
# optimize transformers for gaudi
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()
Reviewer (Contributor):

If this function is not executed in every worker, will it work as expected?

@Deegue (Contributor Author):

Same as above; this function will be executed earlier.

Comment on lines 187 to 188
        # Q2: Why always use the first worker?
        return ray.get(self.deepspeed_workers[0].get_streamer.remote())
Reviewer (Contributor):

Hi @xwu99, could you please help explain this question? I think it is the same idea as in deepspeed_predictor.py.

@xwu-intel commented Jul 2, 2024:

This is distributed inference, which involves a group of worker processes. Worker 0 is assigned rank 0 (following the standard MPI model), which is the main rank that returns the result; the other ranks only take part in the computation and do not return anything. In fact, all ranks hold the same result in this case. You can think of rank 0 as the head process of the distributed worker group, which is conventionally used to return the result.
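As a hedged illustration of that convention (the Rank actor and its methods are hypothetical, not the real predictor classes): every rank runs the generation, but only rank 0's streamer is read back by the caller.

import ray

ray.init(ignore_reinit_error=True)


@ray.remote
class Rank:
    """Hypothetical worker in the distributed group; rank 0 is the head."""

    def __init__(self, rank: int):
        self.rank = rank
        self.tokens = []

    def streaming_generate(self, prompt: str) -> None:
        # Every rank performs the same generation; in real distributed
        # inference collectives keep the ranks in lockstep, so the token
        # streams are identical across ranks.
        self.tokens = prompt.split()

    def get_streamer(self):
        # The caller only ever reads this from rank 0; the other ranks
        # compute the same tokens but their output is simply not consumed.
        return list(self.tokens)


workers = [Rank.remote(r) for r in range(4)]

# Generation is launched on every rank of the group ...
ray.get([w.streaming_generate.remote("a b c") for w in workers])

# ... but the stream is read back only from rank 0, the head of the group.
print(ray.get(workers[0].get_streamer.remote()))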

Comment on lines 230 to 231
        # Q2: Why always use the first worker?
        self.deepspeed_workers[0].streaming_generate.remote(prompt, streamer, **config)
Reviewer (Contributor):

@xwu99 Same here.

@xwu-intel requested a review from yutianchen666 on July 5, 2024.