This repository was archived by the owner on Sep 23, 2025. It is now read-only.

Conversation

@harborn (Contributor) commented Jan 26, 2024

No description provided.

model = common.model.Model.registory.get("HuggingFaceModelForCausalLM")()(
config={
"name": config["General"]["base_model"],
"dtype": convert_dtype(config["Training"]["mixed_precision"]),
same here.
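As context for the snippet above: convert_dtype is referenced in the diff but its implementation is not shown. A dependency-free sketch of what such a helper might look like (the body, the string dtype names, and the accepted flag values are all assumptions, not code from this PR):

```python
# Hedged sketch: convert_dtype is not shown in the PR diff. This assumed
# implementation maps the config's "mixed_precision" flag to a dtype name.
# The keys ("bf16", "fp16", "no") follow values Accelerate commonly accepts,
# which is itself an assumption. In the real code this would likely return
# a torch.dtype (e.g. torch.bfloat16); strings keep the sketch torch-free.
_DTYPE_MAP = {
    "bf16": "bfloat16",
    "fp16": "float16",
    "no": "float32",
}

def convert_dtype(mixed_precision: str) -> str:
    """Translate a mixed-precision flag into a dtype name."""
    try:
        return _DTYPE_MAP[mixed_precision]
    except KeyError:
        raise ValueError(f"unsupported mixed_precision: {mixed_precision!r}")
```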

"name": config["General"]["base_model"],
"dtype": convert_dtype(config["Training"]["mixed_precision"]),
"config": config["General"]["config"],
"enable_gradient_checkpointing": config["General"]["enable_gradient_checkpointing"],

same here.

runtime_env["pip"] = ["transformers==4.26.0"]

ray.init(runtime_env=runtime_env)
ray.init(num_cpus=num_cpus, runtime_env=runtime_env)

Why do we need to set cpu parameters here?
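For reference, the runtime_env in the diff is a plain dict, and whether num_cpus should also be pinned is the open question here. A minimal sketch of the setup (the pinned transformers version comes from the diff; the ray.init call is left commented so the snippet does not require a running Ray cluster):

```python
# Build the Ray runtime environment as shown in the diff: every worker
# gets this pip dependency installed in its environment.
runtime_env = {}
runtime_env["pip"] = ["transformers==4.26.0"]

# The diff changes ray.init(runtime_env=runtime_env) to also pass
# num_cpus, which the reviewer questions; by default ray.init()
# autodetects the machine's CPUs. Commented out in this sketch:
# import ray
# ray.init(num_cpus=num_cpus, runtime_env=runtime_env)
```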

self.lr_scheduler.step()
self.optimizer.zero_grad()
if step % log_step == 0:
if step % gradient_accumulation_steps == 0:

Please check if this line is needed, because we've set accumulation steps by

accelerator = accelerate.Accelerator(
gradient_accumulation_steps=gradient_accumulation_steps, fsdp_plugin=fsdp_plugin
)
and also use self.accelerator.backward(loss). I'm not sure if this is correct or if there's a conflict between them.
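The concern is that both the manual `if step % gradient_accumulation_steps == 0` guard and Accelerate's own `gradient_accumulation_steps` argument skip optimizer steps, and applying both could compound. A dependency-free sketch of the accumulation pattern itself (names are illustrative, standing in for the `accelerator.accumulate(model)` pattern, not code from the PR):

```python
def count_optimizer_steps(num_batches: int, gradient_accumulation_steps: int) -> int:
    """Count how often the optimizer steps when gradients are accumulated
    over `gradient_accumulation_steps` micro-batches.

    Illustrative sketch only: in Accelerate the same gating is done either
    by `with accelerator.accumulate(model): ...` or by a manual modulo
    check -- doing both would gate twice.
    """
    optimizer_steps = 0
    for step in range(1, num_batches + 1):
        # backward() runs on every micro-batch to accumulate gradients...
        if step % gradient_accumulation_steps == 0:
            # ...but optimizer.step() / zero_grad() / lr_scheduler.step()
            # only fire on every N-th micro-batch.
            optimizer_steps += 1
    return optimizer_steps
```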

max_eval_step = self.config.get("max_eval_step")
gradient_accumulation_steps = self.accelerator.gradient_accumulation_steps
output = self.config.get("output", "./output")
writer = torch.utils.tensorboard.SummaryWriter(output)

I don’t think the writer is needed; ray.train.report may be enough. After running tensorboard --logdir .../ray_results/TorchTrainer_2024* --bind_all we can see the parameters set in report(). @carsonwang please help confirm this.
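If ray.train.report does replace the SummaryWriter, the trainer would hand Ray a plain metrics dict each log step instead of writing scalars itself. A small sketch of that payload (the metric names are assumptions, not taken from the PR; the report call is commented so the snippet stays runnable without Ray):

```python
# Sketch of the metrics payload ray.train.report(metrics) would receive
# each log step if it replaces the SummaryWriter. Metric names here are
# assumptions, not from the PR.
def build_metrics(step: int, loss: float, lr: float) -> dict:
    return {"step": step, "train_loss": loss, "learning_rate": lr}

metrics = build_metrics(step=10, loss=1.25, lr=3e-5)

# Inside a Ray Train worker this would be reported (and later visible
# under the ray_results log dir that tensorboard is pointed at):
# from ray import train
# train.report(metrics)
```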

@jiafuzha self-requested a review February 2, 2024 01:39
@jiafuzha (Contributor) left a comment

Approving this first to unblock the later dependent PR.

@harborn merged commit a555e0c into intel:main Feb 2, 2024
@xwu-intel

CI failed when merged. could you check? @harborn

@jiafuzha (Contributor) commented Feb 4, 2024

CI failed when merged. could you check? @harborn

[10:04 AM] Zhang, Jiafu
The login probably expired; I'll log in again.
[10:04 AM] Zhang, Jiafu
Right, but build failures used to get reported.
[10:07 AM] Zhang, Jiafu
It should be fine now; I logged in again on 2.18.
[10:08 AM] Zhang, Jiafu
You could ask tianchen to write a relogin script; docker login has no option to keep the login from ever expiring.
[10:08 AM] Zhang, Jiafu
docker login https://amr-cache-registry.caas.intel.com/

@xwu-intel

CI failed when merged. could you check? @harborn


The login was fixed, but the build failed anyway. I'll revert this until it's fixed.

xwu-intel added a commit that referenced this pull request Feb 4, 2024
carsonwang pushed a commit that referenced this pull request Feb 4, 2024
harborn pushed a commit to harborn/llm-on-ray that referenced this pull request Feb 4, 2024
harborn pushed a commit to harborn/llm-on-ray that referenced this pull request Feb 4, 2024
xwu-intel pushed a commit that referenced this pull request Feb 5, 2024


4 participants