This repository was archived by the owner on Sep 23, 2025. It is now read-only.

Conversation

@harborn (Contributor) commented Jan 26, 2024

No description provided.

model = common.model.Model.registory.get("HuggingFaceModelForCausalLM")()(
config={
"name": config["General"]["base_model"],
"dtype": convert_dtype(config["Training"]["mixed_precision"]),
same here.
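As context for the snippet above: convert_dtype is referenced in the diff but its implementation is not shown. A dependency-free sketch of what such a helper might look like (the body, the string dtype names, and the accepted flag values are all assumptions, not code from this PR):

```python
# Hedged sketch: convert_dtype is not shown in the PR diff. This assumed
# implementation maps the config's "mixed_precision" flag to a dtype name.
# The keys ("bf16", "fp16", "no") follow values Accelerate commonly accepts,
# which is itself an assumption. In the real code this would likely return
# a torch.dtype (e.g. torch.bfloat16); strings keep the sketch torch-free.
_DTYPE_MAP = {
    "bf16": "bfloat16",
    "fp16": "float16",
    "no": "float32",
}

def convert_dtype(mixed_precision: str) -> str:
    """Translate a mixed-precision flag into a dtype name."""
    try:
        return _DTYPE_MAP[mixed_precision]
    except KeyError:
        raise ValueError(f"unsupported mixed_precision: {mixed_precision!r}")
```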

"name": config["General"]["base_model"],
"dtype": convert_dtype(config["Training"]["mixed_precision"]),
"config": config["General"]["config"],
"enable_gradient_checkpointing": config["General"]["enable_gradient_checkpointing"],

same here.

runtime_env["pip"] = ["transformers==4.26.0"]

ray.init(runtime_env=runtime_env)
ray.init(num_cpus=num_cpus, runtime_env=runtime_env)

Why do we need to set cpu parameters here?
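For reference, the runtime_env in the diff is a plain dict, and whether num_cpus should also be pinned is the open question here. A minimal sketch of the setup (the pinned transformers version comes from the diff; the ray.init call is left commented so the snippet does not require a running Ray cluster):

```python
# Build the Ray runtime environment as shown in the diff: every worker
# gets this pip dependency installed in its environment.
runtime_env = {}
runtime_env["pip"] = ["transformers==4.26.0"]

# The diff changes ray.init(runtime_env=runtime_env) to also pass
# num_cpus, which the reviewer questions; by default ray.init()
# autodetects the machine's CPUs. Commented out in this sketch:
# import ray
# ray.init(num_cpus=num_cpus, runtime_env=runtime_env)
```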

self.lr_scheduler.step()
self.optimizer.zero_grad()
if step % log_step == 0:
if step % gradient_accumulation_steps == 0:

Please check if this line is needed, because we've set accumulation steps by

accelerator = accelerate.Accelerator(
gradient_accumulation_steps=gradient_accumulation_steps, fsdp_plugin=fsdp_plugin
)
and also use self.accelerator.backward(loss). I'm not sure if this is correct or if there's a conflict between them.
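The concern is that both the manual `if step % gradient_accumulation_steps == 0` guard and Accelerate's own `gradient_accumulation_steps` argument skip optimizer steps, and applying both could compound. A dependency-free sketch of the accumulation pattern itself (names are illustrative, standing in for the `accelerator.accumulate(model)` pattern, not code from the PR):

```python
def count_optimizer_steps(num_batches: int, gradient_accumulation_steps: int) -> int:
    """Count how often the optimizer steps when gradients are accumulated
    over `gradient_accumulation_steps` micro-batches.

    Illustrative sketch only: in Accelerate the same gating is done either
    by `with accelerator.accumulate(model): ...` or by a manual modulo
    check -- doing both would gate twice.
    """
    optimizer_steps = 0
    for step in range(1, num_batches + 1):
        # backward() runs on every micro-batch to accumulate gradients...
        if step % gradient_accumulation_steps == 0:
            # ...but optimizer.step() / zero_grad() / lr_scheduler.step()
            # only fire on every N-th micro-batch.
            optimizer_steps += 1
    return optimizer_steps
```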

max_eval_step = self.config.get("max_eval_step")
gradient_accumulation_steps = self.accelerator.gradient_accumulation_steps
output = self.config.get("output", "./output")
writer = torch.utils.tensorboard.SummaryWriter(output)

I don’t think the writer is needed; ray.train.report may be enough. After running tensorboard --logdir .../ray_results/TorchTrainer_2024* --bind_all we can see the parameters set in report(). @carsonwang please help confirm this.
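If ray.train.report does replace the SummaryWriter, the trainer would hand Ray a plain metrics dict each log step instead of writing scalars itself. A small sketch of that payload (the metric names are assumptions, not taken from the PR; the report call is commented so the snippet stays runnable without Ray):

```python
# Sketch of the metrics payload ray.train.report(metrics) would receive
# each log step if it replaces the SummaryWriter. Metric names here are
# assumptions, not from the PR.
def build_metrics(step: int, loss: float, lr: float) -> dict:
    return {"step": step, "train_loss": loss, "learning_rate": lr}

metrics = build_metrics(step=10, loss=1.25, lr=3e-5)

# Inside a Ray Train worker this would be reported (and later visible
# under the ray_results log dir that tensorboard is pointed at):
# from ray import train
# train.report(metrics)
```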

@jiafuzha self-requested a review February 2, 2024 01:39
@jiafuzha (Contributor) left a comment

Approving this first to unblock the later dependent PR.

@harborn merged commit a555e0c into intel:main Feb 2, 2024
@xwu-intel

CI failed when merged. could you check? @harborn

@jiafuzha (Contributor) commented Feb 4, 2024

CI failed when merged. could you check? @harborn

[10:04 AM] Zhang, Jiafu
The login probably expired; I'll log in again.
[10:04 AM] Zhang, Jiafu
Right, but build failures used to get reported.
[10:07 AM] Zhang, Jiafu
It should be fine now; I logged in again on 2.18.
[10:08 AM] Zhang, Jiafu
You could ask tianchen to write a relogin script; docker login has no option to keep the login from ever expiring.
[10:08 AM] Zhang, Jiafu
docker login https://amr-cache-registry.caas.intel.com/

@xwu-intel

CI failed when merged. could you check? @harborn


The login was fixed, but the build failed anyway. I'll revert this until it's fixed.

xwu-intel added a commit that referenced this pull request Feb 4, 2024
carsonwang pushed a commit that referenced this pull request Feb 4, 2024
harborn pushed a commit to harborn/llm-on-ray that referenced this pull request Feb 4, 2024
harborn pushed a commit to harborn/llm-on-ray that referenced this pull request Feb 4, 2024
xwu-intel pushed a commit that referenced this pull request Feb 5, 2024


4 participants