
Conversation

@dhkim0225
Contributor

@dhkim0225 dhkim0225 commented Jan 12, 2021

What does this PR do?

Fixes #5460, #5456, #4927

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone match!

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

pep8speaks commented Jan 12, 2021

Hello @dhkim0225! Thanks for updating this PR.

Line 107:121: E501 line too long (122 > 120 characters)

Comment last updated at 2021-01-29 16:27:27 UTC

@codecov

codecov bot commented Jan 12, 2021

Codecov Report

Merging #5477 (8a18422) into release/1.2-dev (c8f605e) will increase coverage by 4%.
The diff coverage is 43%.

@@               Coverage Diff                @@
##           release/1.2-dev   #5477    +/-   ##
================================================
+ Coverage               89%     93%    +4%     
================================================
  Files                  153     152     -1     
  Lines                10803   10757    -46     
================================================
+ Hits                  9610    9958   +348     
+ Misses                1193     799   -394     


@priancho priancho left a comment


If we need to allow a user to use p-norm with any p value (not just l2-norm), I think that it is much more readable to use two separate flags, "gradient_clip_algorithm" and "gradient_clip_norm_type" than using one flag for two purposes.

norm <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_norm_>`_ computed over all model parameters together.
Gradient clipping may be enabled to avoid exploding gradients. Also, you can choose various criterion by
`gradient_clip_algorithm` option. For example, if `gradient_clip_algorithm == 'value'`, this will `clip the gradient
by value <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_value_>`_ computed over all model parameters.


The explanation of its behavior when gradient_clip_algorithm is 'value' is incorrect.
What about the following one?

Gradient clipping may be enabled to avoid exploding gradients. By default, this will clip the gradient norm <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_norm_>_ computed over all model parameters together. If the gradient_clip_algorithm option (default: 'norm') is set to 'value', this will instead clip the gradient value <https://pytorch.org/docs/stable/nn.html#torch.nn.utils.clip_grad_value_>_ for each parameter.
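
For context, a minimal usage sketch of the flags being discussed (the argument names and defaults come from this discussion and may differ in the final API):

    from pytorch_lightning import Trainer

    # Hypothetical configuration based on the flags proposed in this PR.
    trainer = Trainer(
        gradient_clip_val=0.5,            # clipping threshold
        gradient_clip_algorithm="value",  # "norm" (the default) or "value"
    )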

Contributor Author


Thank you for the comment! I'll change it.

optimizer: Optimizer,
grad_clip_val: Union[float, int],
gradient_clip_algorithm: str,
norm_type: Union[float, int]):


Why did you remove the default value for norm_type?

Contributor Author


I was hoping that the norm_type default would follow the trainer's default value.
I also thought that a default value on the (local) function could confuse users.

if gradient_clip_algorithm == 'value':
torch.nn.utils.clip_grad_value_(parameters, clip_value=grad_clip_val)
elif gradient_clip_algorithm.startswith('norm'):
max_norm = grad_clip_val


It seems like we can't use the torch.nn.utils.clip_grad_norm_() method, because the epsilon value added to the denominator during gradient scaling is hard-coded as 1e-6 in that method :-0
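
For illustration, a minimal sketch of norm clipping with a configurable epsilon (the names below are illustrative, not this PR's actual code; torch.nn.utils.clip_grad_norm_ fixes this epsilon at 1e-6):

    import torch

    def clip_grad_norm_with_eps(parameters, max_norm, norm_type=2.0, eps=1e-6):
        """Scale gradients in place so their total norm does not exceed max_norm.

        Unlike torch.nn.utils.clip_grad_norm_, the epsilon added to the
        denominator is configurable instead of being hard-coded to 1e-6.
        """
        parameters = [p for p in parameters if p.grad is not None]
        if not parameters:
            return torch.tensor(0.0)
        total_norm = torch.norm(
            torch.stack([torch.norm(p.grad.detach(), norm_type) for p in parameters]),
            norm_type,
        )
        clip_coef = max_norm / (total_norm + eps)
        if clip_coef < 1:
            for p in parameters:
                p.grad.detach().mul_(clip_coef.to(p.grad.device))
        return total_norm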

Contributor Author


Yup.
However, the native AMP plugin already uses torch.nn.utils.clip_grad_norm_ (with epsilon 1e-6), so I wonder why.

norm_type: Union[float, int]):
if gradient_clip_algorithm == 'value':
raise NotImplementedError("Value grad clipping with sharded ddp is not implemented yet")
elif gradient_clip_algorithm.startswith('norm'):


The following two lines are missing.

    max_norm = grad_clip_val
    norm_type = float(2.0)

Contributor Author


I think the max_norm variable doesn't have to be reassigned; grad_clip_val and norm_type can be passed directly as arguments.
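
For example, something along these lines (illustrative only; the sharded code path may route this through a different clipping utility):

    import torch
    from torch import nn

    # Illustrative: pass the values straight through instead of reassigning
    # them to intermediate max_norm / norm_type locals.
    model = nn.Linear(4, 4)            # stand-in for the trainer's model
    grad_clip_val, norm_type = 0.5, 2.0
    model(torch.randn(2, 4)).sum().backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip_val, norm_type=norm_type)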

@dhkim0225 dhkim0225 marked this pull request as draft January 13, 2021 02:04
@dhkim0225
Copy link
Contributor Author

Thank you for reviewing my PR, @priancho!
I totally agree with you.

If we need to allow a user to use p-norm with any p value (not just l2-norm), I think that it is much more readable to use two separate flags, "gradient_clip_algorithm" and "gradient_clip_norm_type" than using one flag for two purposes.

Since I'll be busy for the next couple of days, I'll get back to this after Saturday.

@dhkim0225 dhkim0225 marked this pull request as ready for review January 18, 2021 06:19
@dhkim0225
Copy link
Contributor Author

@priancho @tchaton updated based on pre-review.

@dhkim0225
Copy link
Contributor Author

Removed the changelog conflict.

Contributor

@tchaton tchaton left a comment


Would you mind adding some tests?

else:
model = self.trainer.get_model()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip_val, norm_type=norm_type)
if clip_algorithm == 'value':
Contributor


Let's use a LightningEnum there.

Contributor Author


Thank you for the review. I'll change my code.
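
For reference, a minimal sketch of what such an enum could look like (the class name is illustrative, not necessarily the final one; the import path may vary by version):

    from pytorch_lightning.utilities import LightningEnum  # str-based Enum helper


    class GradClipAlgorithmType(LightningEnum):
        """Illustrative enum of the supported clipping algorithms."""

        VALUE = "value"
        NORM = "norm"


    # LightningEnum subclasses str, so comparison with plain strings still works:
    assert GradClipAlgorithmType.VALUE == "value"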

p.grad.data.mul_(clip_coef.to(p.grad.data.device))
if gradient_clip_algorithm == 'value':
torch.nn.utils.clip_grad_value_(parameters, clip_value=grad_clip_val)
elif gradient_clip_algorithm.startswith('norm'):
Contributor


Why are you using .startswith('norm')?

Contributor Author

@dhkim0225 dhkim0225 Jan 20, 2021


It's totally my mistake; it will be changed.
The previous implementation used `'norm' + str(norm_type)` inputs such as norm2 and norm3.
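
Roughly, the difference between the two styles (a sketch, not the PR's actual code):

    # Old style: one string encodes both the algorithm and the p value, e.g. "norm2".
    algo = "norm2"
    if algo.startswith("norm"):
        norm_type = float(algo[len("norm"):] or 2.0)  # -> 2.0

    # Style discussed in this PR: two separate, explicit flags.
    gradient_clip_algorithm = "norm"
    gradient_clip_norm_type = 2.0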

gradient_clip_algorithm: str,
norm_type: Union[float, int]):
if gradient_clip_algorithm == 'value':
raise NotImplementedError("Value grad clipping with sharded ddp is not implemented yet")
Contributor


Open an issue on FairScale asking them to support this feature. :)

Contributor Author


See my comment above; the vanilla clip_grad_value should actually just work.
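
(Value clipping is element-wise per parameter, so it needs no cross-shard norm reduction; a minimal illustration:)

    import torch
    from torch import nn

    # clip_grad_value_ touches each parameter tensor independently, so it does
    # not need to gather a global gradient norm across shards.
    model = nn.Linear(4, 4)  # stand-in for a (possibly sharded) model
    model(torch.randn(2, 4)).sum().backward()
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)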

@dhkim0225 dhkim0225 marked this pull request as draft January 20, 2021 01:49
@mergify mergify bot removed the has conflicts label Jan 26, 2021
@mergify mergify bot removed the has conflicts label Jan 28, 2021
@Borda Borda added the feature Is an improvement or enhancement label Jan 29, 2021
@Borda Borda added this to the 1.2 milestone Jan 29, 2021
Collaborator

@Borda Borda left a comment


At first glance this looks good to me. @SeanNaren, mind reviewing?


@priancho priancho left a comment


There is a line of code that should be deleted.

@SeanNaren
Contributor

I love this! Great work on the PR :) We're doing a bit of an accelerator refactor, and it might be better for these changes to end up in the new API. Thoughts, @tchaton @awaelchli @justusschock?

@mergify mergify bot added the has conflicts label Feb 3, 2021
@edenlightning edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
Base automatically changed from release/1.2-dev to master February 11, 2021 14:32
@lexierule lexierule requested a review from carmocca as a code owner February 11, 2021 14:32
@dhkim0225
Contributor Author

@SeanNaren
So... it seems like there are a lot of changes.
Is there anything I can do for this PR?

@tchaton
Contributor

tchaton commented Feb 15, 2021

Hey @dhkim0225,

We merged a big refactor of the accelerators. The simplest option would be to rebase on master.
Best,
T.C

@dhkim0225
Contributor Author

@tchaton Okay, I'll work on it this Friday.

@tchaton
Contributor

tchaton commented Feb 22, 2021

Dear @dhkim0225 ,

Any updates? Do you need help with rebasing?
Or you could check out from master and cherry-pick your commits.

Best,
T.C

@dhkim0225
Contributor Author

dhkim0225 commented Feb 22, 2021

@tchaton Sorry for the delay. I've just started.
As you said, opening a new PR with cherry-picked commits could be better.
Closing this PR.

@dhkim0225 dhkim0225 closed this Feb 22, 2021
@dhkim0225 dhkim0225 deleted the feature/clip_grad_by_value_1.2-dev branch February 22, 2021 12:28