some changes to support fine-tuning on Intel GPU #88
Conversation
model = common.model.Model.registory.get("HuggingFaceModelForCausalLM")()(
    config={
        "name": config["General"]["base_model"],
        "dtype": convert_dtype(config["Training"]["mixed_precision"]),
same here.
| "name": config["General"]["base_model"], | ||
| "dtype": convert_dtype(config["Training"]["mixed_precision"]), | ||
| "config": config["General"]["config"], | ||
| "enable_gradient_checkpointing": config["General"]["enable_gradient_checkpointing"], |
same here.
finetune/finetune.py
Outdated
| runtime_env["pip"] = ["transformers==4.26.0"] | ||
|
|
||
| ray.init(runtime_env=runtime_env) | ||
| ray.init(num_cpus=num_cpus, runtime_env=runtime_env) |
Why do we need to set the CPU parameter here?
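For context, a minimal sketch of the difference the diff above introduces (the num_cpus value below is a placeholder for illustration, not the PR's actual setting): by default ray.init() autodetects the host's CPUs, while passing num_cpus caps what the local Ray instance advertises to the scheduler.

import ray

runtime_env = {"pip": ["transformers==4.26.0"]}  # as in the diff above

# Original call: Ray autodetects the machine's CPU count.
# ray.init(runtime_env=runtime_env)

# New call: the advertised CPU count is capped explicitly.
num_cpus = 32  # placeholder value, not from the PR
ray.init(num_cpus=num_cpus, runtime_env=runtime_env)
print(ray.cluster_resources())  # shows the CPU count Ray will schedule against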
common/trainer/default_trainer.py
Outdated
  self.lr_scheduler.step()
  self.optimizer.zero_grad()
- if step % log_step == 0:
+ if step % gradient_accumulation_steps == 0:
Please check if this line is needed, because we've already set the accumulation steps via

llm-on-ray/finetune/finetune.py, lines 82 to 84 in fc06deb:

    accelerator = accelerate.Accelerator(
        gradient_accumulation_steps=gradient_accumulation_steps, fsdp_plugin=fsdp_plugin
    )

and self.accelerator.backward(loss). I'm not sure if this is correct, or whether there's a conflict between them.
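For reference, a minimal sketch (toy model and data, not the repo's trainer) of how accelerate's own accumulation context handles the boundary check; if the trainer relies on this pattern, an extra step % gradient_accumulation_steps guard may be redundant or conflict with it.

import torch
import accelerate

accelerator = accelerate.Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = torch.utils.data.TensorDataset(torch.randn(16, 8), torch.randn(16, 1))
loader = torch.utils.data.DataLoader(dataset, batch_size=2)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (x, y) in enumerate(loader):
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)  # accumulates gradients across micro-batches
        optimizer.step()            # accelerate skips this on non-boundary steps
        optimizer.zero_grad()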
common/trainer/default_trainer.py
Outdated
max_eval_step = self.config.get("max_eval_step")
gradient_accumulation_steps = self.accelerator.gradient_accumulation_steps
output = self.config.get("output", "./output")
writer = torch.utils.tensorboard.SummaryWriter(output)
I don’t think the writer is needed; maybe ray.train.report is enough. After running tensorboard --logdir .../ray_results/TorchTrainer_2024* --bind_all, we can see the parameters set in report(). @carsonwang please help confirm this.
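A minimal, standalone sketch (toy loop, not the repo's trainer) of what reporting through Ray Train instead of a manual SummaryWriter could look like; whether the resulting log directory gives TensorBoard everything the writer did is exactly the point to confirm.

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    for step in range(10):
        loss = 1.0 / (step + 1)  # placeholder metric
        ray.train.report({"train_loss": loss, "step": step})

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1),
)
result = trainer.fit()
# Metrics land under the run's results directory (e.g. ~/ray_results/TorchTrainer_*),
# which is the directory the tensorboard --logdir command above points at.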
jiafuzha left a comment
Approving it first for the later dependent PR.
CI failed when merged. Could you check? @harborn
[10:04 AM] Zhang, Jiafu
The login was fixed, but it failed anyway. I would revert this before it's fixed.
No description provided.