Revert "Revert "some changes to support fine-tuning on Intel GPU (#88… #99
Conversation
carsonwang left a comment
Sorry for the late review. I've added a few more comments. Can you please check them and submit a follow-up PR?
      ppl = math.exp(loss)
      logger.info(
-         f"train epoch:[{idx}/{num_train_epochs}]\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}\tppl:{math.exp(loss):.6f}\ttime:{time.time()-start:.6f}"
+         f"train epoch:[{idx}/{num_train_epochs}]\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}\tppl:{ppl:.6f}\ttime:{time.time()-start:.6f}"
Instead of just outputting 0, 1, 2, etc., can we output it like 0.1, 0.2, etc., just like other workflows?
OK, will update in another PR.
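For reference, a minimal sketch of what fractional-epoch logging could look like; the variable names (idx, step, total_steps, etc.) follow the training loop in the hunk above, and the dummy values are illustrative only:

```python
import logging
import math
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Dummy stand-ins for the training loop's state (illustrative only).
idx, num_train_epochs = 0, 3
step, total_steps = 10, 100
loss, start = 1.25, time.time()

# Fractional epoch progress: 0.1, 0.2, ... instead of the bare epoch index.
epoch_progress = idx + step / total_steps
ppl = math.exp(loss)
logger.info(
    f"train epoch:[{epoch_progress:.1f}/{num_train_epochs}]"
    f"\tstep:[{step}/{total_steps}]\tloss:{loss:.6f}"
    f"\tppl:{ppl:.6f}\ttime:{time.time()-start:.6f}"
)
```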
        else total_steps,
    }
)
self.accelerator.log(
Do we want to use Ray's report or accelerator.log to log the metrics? Currently the code above logs the metrics twice, right? If Ray's report already meets our requirements, I don't think we need accelerator.log to log them again.
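If Ray's report alone is enough, the duplicate call could be dropped along these lines; a sketch assuming Ray Train's train.report API, with an illustrative helper name:

```python
from ray import train

def report_metrics(loss: float, ppl: float, epoch: int, step: int) -> None:
    # Report the metrics once, through Ray Train only, instead of also
    # sending the same values through self.accelerator.log(...).
    # Must be called from inside a Ray Train worker.
    train.report({"loss": loss, "ppl": ppl, "train_epoch": epoch, "step": step})
```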
|gpt_base_model|True|This parameter is for [Transformers#22482](https://github.com/huggingface/transformers/issues/22482). It needs to be set to True when the pretrained model is related to gpt, otherwise it is False.|
|output_dir|/tmp/llm-ray/output|The output directory to store the finetuned model|
|checkpoint_dir|/tmp/llm-ray/checkpoint|The directory to store checkpoints|
|tracking_dir|/tmp/llm-ray/tracking|The path to a directory for storing logs of locally-compatible loggers|
Can we directly use output_dir + "tracking" as the directory and not add this new parameter?
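That could be as simple as deriving the path at startup; a sketch assuming the output_dir value shown in the table above:

```python
import os

output_dir = "/tmp/llm-ray/output"  # taken from the finetune config

# Derive the tracking directory from output_dir instead of adding a
# separate tracking_dir parameter.
tracking_dir = os.path.join(output_dir, "tracking")
os.makedirs(tracking_dir, exist_ok=True)
```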
def convert_dtype(dtype: str) -> torch.dtype:
    supported_dtypes = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}
You passed mixed_precision as the parameter; its value could be "no", "fp16", "bf16", or "fp8". But "no" and "fp8" are not properly handled here.
Updated.
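One possible shape of the updated handling (a sketch under the assumption that "no" should disable casting and "fp8" should be rejected; not necessarily the actual fix that landed):

```python
from typing import Optional

import torch

def convert_dtype(dtype: str) -> Optional[torch.dtype]:
    supported_dtypes = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}
    if dtype == "no":
        # mixed_precision == "no" means no casting is requested.
        return None
    if dtype in supported_dtypes:
        return supported_dtypes[dtype]
    # Covers "fp8" and anything else without a plain torch.dtype mapping.
    raise ValueError(f"expected 'no' or one of {list(supported_dtypes)}, got '{dtype}'")
```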
    if dtype in supported_dtypes:
        return supported_dtypes[dtype]
    else:
        raise ValueError(f"only supported torch.dtype list [{supported_dtypes.keys()}]")
Can you add the check in finetune_config.py instead of here?
Yes, will update in another PR.
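A minimal sketch of what that check in finetune_config.py could look like, assuming the config classes are pydantic models (v1-style validator; the class name and allowed values here are illustrative):

```python
from pydantic import BaseModel, validator

class Training(BaseModel):
    mixed_precision: str = "no"

    @validator("mixed_precision")
    def check_mixed_precision(cls, value: str) -> str:
        # Reject unsupported values at config-load time, so convert_dtype
        # never sees them.
        allowed = ("no", "fp16", "bf16", "fp32")
        if value not in allowed:
            raise ValueError(f"mixed_precision must be one of {allowed}, got '{value}'")
        return value
```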
num_cpus = (
    resources_per_worker["CPU"] * num_training_workers + 1
)  # additional 1 for head worker
ray.init(num_cpus=num_cpus, runtime_env=runtime_env)
Why do we need this change, and can we avoid it? If we start Ray first and then execute the finetune command, do we still need this change, and does it still work?
This change is a workaround. ray.init blocks when we use the llm-on-ray workflow on Intel GPU with torch/ipex version 2.1+ if no CPU or GPU resources are passed.
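For context, the workaround in the hunk above amounts to the following; the config values here are illustrative:

```python
import ray

# Illustrative values; in the real code these come from the finetune config.
num_training_workers = 2
resources_per_worker = {"CPU": 32}
runtime_env: dict = {}

# Workaround: with torch/ipex 2.1+ on Intel GPU, ray.init hangs unless an
# explicit CPU (or GPU) resource count is passed; the extra 1 CPU is
# reserved for the head worker.
num_cpus = resources_per_worker["CPU"] * num_training_workers + 1
ray.init(num_cpus=num_cpus, runtime_env=runtime_env)
```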
resources_per_worker: RayResourceConfig
accelerate_mode: str
mixed_precision: str = "no"
gradient_accumulation_steps: int
Can you set the default value to 1 here?
Yes, will update in another PR.
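A sketch of the defaulted field, again assuming a pydantic config model (class name illustrative):

```python
from pydantic import BaseModel

class Training(BaseModel):
    # Default to 1 so the field matches the documented default and can be
    # omitted from YAML configs.
    gradient_accumulation_steps: int = 1
```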
resources_per_worker:
  CPU: 32
accelerate_mode: CPU_DDP
gradient_accumulation_steps: 2
The default value is 1 in our document. Can you please set it to 1 here?
Yes, will update in another PR.
…)" (#95)"
This reverts commit 63464ed.