Revert "Revert "some changes to support fine-tuning on Intel GPU (#88… #99
Conversation
carsonwang left a comment
Sorry for the late review. I've added a few more comments. Can you please check them and submit a follow-up PR?
      ppl = math.exp(loss)
      logger.info(
-         f"train epoch:[{idx}/{num_train_epochs}]\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}\tppl:{math.exp(loss):.6f}\ttime:{time.time()-start:.6f}"
+         f"train epoch:[{idx}/{num_train_epochs}]\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}\tppl:{ppl:.6f}\ttime:{time.time()-start:.6f}"
Instead of just outputting 0, 1, 2, etc., can we output it like 0.1, 0.2, etc., just like other workflows?
OK, will update in another PR.
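For reference, a minimal sketch of what fractional-epoch logging could look like; the variable names (idx, step, total_steps, etc.) follow the training loop in the hunk above, and the dummy values are illustrative only:

```python
import logging
import math
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Dummy stand-ins for the training loop's state (illustrative only).
idx, num_train_epochs = 0, 3
step, total_steps = 10, 100
loss, start = 1.25, time.time()

# Fractional epoch progress: 0.1, 0.2, ... instead of the bare epoch index.
epoch_progress = idx + step / total_steps
ppl = math.exp(loss)
logger.info(
    f"train epoch:[{epoch_progress:.1f}/{num_train_epochs}]"
    f"\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}"
    f"\tppl:{ppl:.6f}\ttime:{time.time()-start:.6f}"
)
```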
        else total_steps,
    }
)
self.accelerator.log(
Do we want to use Ray's report or accelerator.log to log the metrics? Currently the code above logs the metrics twice, right? If Ray's report already meets our requirements, I don't think we need accelerator.log to log them again.
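If Ray's report alone is enough, the duplicate call could be dropped along these lines; a sketch assuming Ray Train's train.report API, with an illustrative helper name:

```python
from ray import train

def report_metrics(loss: float, ppl: float, epoch: int, step: int) -> None:
    # Report the metrics once, through Ray Train only, instead of also
    # sending the same values through self.accelerator.log(...).
    # Must be called from inside a Ray Train worker.
    train.report({"loss": loss, "ppl": ppl, "train_epoch": epoch, "step": step})
```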
|gpt_base_model|True|This parameter is for [Transformers#22482](https://github.com/huggingface/transformers/issues/22482). It needs to be set to True when the pretrained model is related to gpt, otherwise it is False.|
|output_dir|/tmp/llm-ray/output|The output directory to store the finetuned model|
|checkpoint_dir|/tmp/llm-ray/checkpoint|The directory to store checkpoints|
|tracking_dir|/tmp/llm-ray/tracking|The path to a directory for storing logs of locally-compatible loggers|
Can we directly use output_dir + "tracking" as the directory and not add this new parameter?
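That could be as simple as deriving the path at startup; a sketch assuming the output_dir value shown in the table above:

```python
import os

output_dir = "/tmp/llm-ray/output"  # taken from the finetune config

# Derive the tracking directory from output_dir instead of adding a
# separate tracking_dir parameter.
tracking_dir = os.path.join(output_dir, "tracking")
os.makedirs(tracking_dir, exist_ok=True)
```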
def convert_dtype(dtype: str) -> torch.dtype:
    supported_dtypes = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}
You passed mixed_precision as the parameter; its value could be "no", "fp16", "bf16", or "fp8". But "no" and "fp8" are not properly handled here.
Updated.
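One possible shape of the updated handling (a sketch under the assumption that "no" should disable casting and "fp8" should be rejected; not necessarily the actual fix that landed):

```python
from typing import Optional

import torch

def convert_dtype(dtype: str) -> Optional[torch.dtype]:
    supported_dtypes = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}
    if dtype == "no":
        # mixed_precision == "no" means no casting is requested.
        return None
    if dtype in supported_dtypes:
        return supported_dtypes[dtype]
    # Covers "fp8" and anything else without a plain torch.dtype mapping.
    raise ValueError(f"expected 'no' or one of {list(supported_dtypes)}, got '{dtype}'")
```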
    if dtype in supported_dtypes:
        return supported_dtypes[dtype]
    else:
        raise ValueError(f"only supported torch.dtype list [{supported_dtypes.keys()}]")
Can you add the check in finetune_config.py instead of here?
Yes, will update in another PR.
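A minimal sketch of what that check in finetune_config.py could look like, assuming the config classes are pydantic models (v1-style validator; the class name and allowed values here are illustrative):

```python
from pydantic import BaseModel, validator

class Training(BaseModel):
    mixed_precision: str = "no"

    @validator("mixed_precision")
    def check_mixed_precision(cls, value: str) -> str:
        # Reject unsupported values at config-load time, so convert_dtype
        # never sees them.
        allowed = ("no", "fp16", "bf16", "fp32")
        if value not in allowed:
            raise ValueError(f"mixed_precision must be one of {allowed}, got '{value}'")
        return value
```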
num_cpus = (
    resources_per_worker["CPU"] * num_training_workers + 1
)  # additional 1 for head worker
ray.init(num_cpus=num_cpus, runtime_env=runtime_env)
Why do we need this change, and can we avoid it? If we start Ray first and then execute the finetune command, do we still need this change, and does it still work?
This change is a workaround. ray.init blocks when we use the llm-on-ray workflow on Intel GPU with torch/ipex version 2.1+ if no CPU or GPU resources are passed.
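For context, the workaround in the hunk above amounts to the following; the config values here are illustrative:

```python
import ray

# Illustrative values; in the real code these come from the finetune config.
num_training_workers = 2
resources_per_worker = {"CPU": 32}
runtime_env: dict = {}

# Workaround: with torch/ipex 2.1+ on Intel GPU, ray.init hangs unless an
# explicit CPU (or GPU) resource count is passed; the extra 1 CPU is
# reserved for the head worker.
num_cpus = resources_per_worker["CPU"] * num_training_workers + 1
ray.init(num_cpus=num_cpus, runtime_env=runtime_env)
```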
resources_per_worker: RayResourceConfig
accelerate_mode: str
mixed_precision: str = "no"
gradient_accumulation_steps: int
Can you set the default value to 1 here?
Yes, will update in another PR.
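A sketch of the defaulted field, again assuming a pydantic config model (class name illustrative):

```python
from pydantic import BaseModel

class Training(BaseModel):
    # Default to 1 so the field matches the documented default and can be
    # omitted from YAML configs.
    gradient_accumulation_steps: int = 1
```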
resources_per_worker:
  CPU: 32
accelerate_mode: CPU_DDP
gradient_accumulation_steps: 2
The default value is 1 in our document. Can you please set it to 1 here?
Yes, will update in another PR.
…)" (#95)"
This reverts commit 63464ed.