add clip_grad_by_value feature #5477
Conversation
Hello @dhkim0225! Thanks for updating this PR.
Comment last updated at 2021-01-29 16:27:27 UTC
Codecov Report
@@            Coverage Diff             @@
##    release/1.2-dev    #5477    +/-  ##
================================================
+ Coverage         89%      93%     +4%
================================================
  Files            153      152      -1
  Lines          10803    10757     -46
================================================
+ Hits            9610     9958    +348
+ Misses          1193      799    -394
priancho
left a comment
If we need to allow a user to use the p-norm with any p value (not just the l2-norm), I think it is much more readable to use two separate flags, "gradient_clip_algorithm" and "gradient_clip_norm_type", than to use one flag for two purposes.
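For concreteness, here is a hypothetical sketch of the proposed two-flag configuration; the flag names follow this suggestion only and are not a confirmed Trainer signature.

    # Hypothetical two-flag configuration as proposed in this comment; the names
    # "gradient_clip_algorithm" and "gradient_clip_norm_type" are the reviewer's
    # suggestion, not a confirmed Trainer API.
    clipping_config = {
        "gradient_clip_val": 0.5,            # threshold used by either algorithm
        "gradient_clip_algorithm": "norm",   # selects the criterion: "norm" or "value"
        "gradient_clip_norm_type": 2.0,      # p of the p-norm, used only when algorithm == "norm"
    }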
docs/source/training_tricks.rst
Outdated
    norm <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_norm_>`_ computed over all model parameters together.
    Gradient clipping may be enabled to avoid exploding gradients. Also, you can choose various criterion by
    `gradient_clip_algorithm` option. For example, if `gradient_clip_algorithm == 'value'`, this will `clip the gradient
    by value <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_value_>`_ computed over all model parameters.
The explanation of its behavior when gradient_clip_algorithm is 'value' is incorrect.
What about the following one?
Gradient clipping may be enabled to avoid exploding gradients. By default, this will clip the gradient `norm <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_norm_>`_ computed over all model parameters together. If the `gradient_clip_algorithm` option (default: 'norm') is set to 'value', this will instead clip the gradient `value <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_value_>`_ for each parameter.
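To make the behavioral difference concrete, here is a small standalone example using the underlying torch utilities (the tensors and thresholds are illustrative only):

    import torch

    p = torch.nn.Parameter(torch.zeros(3))

    # Clip by norm: the whole gradient vector is rescaled so its 2-norm is at most 1.0.
    p.grad = torch.tensor([3.0, -4.0, 0.0])
    torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)     # grad becomes ~[0.6, -0.8, 0.0]

    # Clip by value: each gradient element is clamped independently to [-1.0, 1.0].
    p.grad = torch.tensor([3.0, -4.0, 0.0])
    torch.nn.utils.clip_grad_value_([p], clip_value=1.0)  # grad becomes [1.0, -1.0, 0.0]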
Thank you for the comment! I'll change it.
    optimizer: Optimizer,
    grad_clip_val: Union[float, int],
    gradient_clip_algorithm: str,
    norm_type: Union[float, int]):
Why did you remove the default value for norm_type?
I was hoping that norm_type would follow the trainer's default value, and I thought a default value on the (local) function could confuse users.
    if gradient_clip_algorithm == 'value':
        torch.nn.utils.clip_grad_value_(parameters, clip_value=grad_clip_val)
    elif gradient_clip_algorithm.startswith('norm'):
        max_norm = grad_clip_val
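For orientation, here is a minimal standalone sketch of the value/norm dispatch under review, written against the vanilla torch utilities (the function name and defaults are illustrative, and the PR re-implements the norm branch manually, as discussed below):

    from typing import Iterable, Union

    import torch

    def _clip_gradients(parameters: Iterable[torch.nn.Parameter],
                        grad_clip_val: Union[float, int],
                        gradient_clip_algorithm: str = "norm",
                        norm_type: Union[float, int] = 2.0) -> None:
        # Illustrative sketch only, not the exact PR code.
        if gradient_clip_algorithm == "value":
            # Clamp each gradient element to [-grad_clip_val, grad_clip_val].
            torch.nn.utils.clip_grad_value_(parameters, clip_value=grad_clip_val)
        elif gradient_clip_algorithm == "norm":
            # Rescale all gradients together so their total p-norm is at most grad_clip_val.
            torch.nn.utils.clip_grad_norm_(parameters, max_norm=grad_clip_val, norm_type=norm_type)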
It seems like we can't use the torch.nn.utils.clip_grad_norm_() method because the epsilon value added to the denominator during gradient scaling is hard-coded as "1e-6" in that method :-0
Yup.
However, since the native AMP plugin uses torch.nn.utils.clip_grad_norm_ (epsilon 1e-6), I wonder why.
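For reference, a norm-clipping helper with a configurable epsilon is straightforward to write; the following is only a sketch of that idea, not the PR's actual accelerator code:

    from typing import Iterable

    import torch

    def clip_grad_norm_with_eps(parameters: Iterable[torch.nn.Parameter], max_norm: float,
                                norm_type: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
        # Behaves like torch.nn.utils.clip_grad_norm_, but exposes eps instead of
        # relying on the hard-coded 1e-6 in the built-in utility.
        grads = [p.grad.detach() for p in parameters if p.grad is not None]
        total_norm = torch.norm(torch.stack([torch.norm(g, norm_type) for g in grads]), norm_type)
        clip_coef = max_norm / (total_norm + eps)
        if clip_coef < 1:
            for g in grads:
                g.mul_(clip_coef)  # in-place scaling also updates p.grad
        return total_norm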
    norm_type: Union[float, int]):
    if gradient_clip_algorithm == 'value':
        raise NotImplementedError("Value grad clipping with sharded ddp is not implemented yet")
    elif gradient_clip_algorithm.startswith('norm'):
The following two lines are missing:

    max_norm = grad_clip_val
    norm_type = float(2.0)
I think the max_norm variable doesn't have to be reassigned; grad_clip_val and norm_type can be passed directly as arguments.
Thank you for reviewing my PR @priancho! Since I'll be busy for the next couple of days, I'll resume work after this Saturday.
remove conflict on changelog
tchaton
left a comment
Would you mind adding some tests?
    else:
        model = self.trainer.get_model()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip_val, norm_type=norm_type)
    if clip_algorithm == 'value':
Let's use a LightningEnum there.
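A sketch of what an enum-based check could look like, assuming LightningEnum is importable from pytorch_lightning.utilities; the enum name and members below are illustrative rather than the merged code:

    import torch
    from pytorch_lightning.utilities import LightningEnum  # import path assumed

    class GradClipAlgorithmType(LightningEnum):
        # Illustrative enum; the name and members in the merged code may differ.
        VALUE = "value"
        NORM = "norm"

    def clip_by_enum(parameters, grad_clip_val: float, gradient_clip_algorithm: str) -> None:
        # LightningEnum subclasses str, so the plain string flag compares equal to the member.
        if gradient_clip_algorithm == GradClipAlgorithmType.VALUE:
            torch.nn.utils.clip_grad_value_(parameters, clip_value=grad_clip_val)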
Thank you for the review. I'll change my code.
    p.grad.data.mul_(clip_coef.to(p.grad.data.device))
    if gradient_clip_algorithm == 'value':
        torch.nn.utils.clip_grad_value_(parameters, clip_value=grad_clip_val)
    elif gradient_clip_algorithm.startswith('norm'):
Why are you using .startswith('norm')?
It's totally my mistake; it will be changed.
The previous implementation used 'norm' + str(norm_type) inputs such as 'norm2' and 'norm3'.
    gradient_clip_algorithm: str,
    norm_type: Union[float, int]):
    if gradient_clip_algorithm == 'value':
        raise NotImplementedError("Value grad clipping with sharded ddp is not implemented yet")
Open an issue on FairScale to ask them to support this feature. :)
facebookresearch/fairscale#308
Opened ... !! :)
See my comment above; the vanilla clip_grad_value_ should actually just work.
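Since value clipping clamps each gradient element independently, it needs no cross-shard norm reduction, so each rank clipping only the parameters it owns would be enough; a hedged sketch (not the merged sharded-plugin code):

    import torch

    def sharded_clip_by_value(local_parameters, grad_clip_val: float) -> None:
        # Element-wise clipping needs no global norm, so each rank can simply
        # clamp the gradients of the parameters it owns.
        torch.nn.utils.clip_grad_value_(local_parameters, clip_value=grad_clip_val)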
…/pytorch-lightning into feature/clip_grad_by_value_1.2-dev
Borda
left a comment
At first glance it looks good to me. @SeanNaren, mind reviewing?
priancho
left a comment
There is a line of code that should be deleted.
I love this! Great work on the PR :) We're doing a bit of an accelerator refactor, and it might be better for these changes to end up in the new API. Thoughts @tchaton @awaelchli @justusschock?
@SeanNaren
Hey @dhkim0225, we merged a big refactor of the accelerators. The simplest approach would be to rebase on master.
@tchaton Okay, I'll work on this on Friday.
Dear @dhkim0225, any updates? Do you need help with rebasing? Best,
@tchaton Sorry for the delay. I've just started.
What does this PR do?
Fixes #5460, #5456, #4927
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃