Fix optimizer load state dict copy tensor by reference #1173
Conversation
Also fixes #1176 by ensuring the conditional state is copied to the right device.
src/TorchSharp/Optimizers/SGD.cs (Outdated)
public override void LoadStateDict(BinaryReader reader)
{
    LoadConditionalStateTensor(reader, ref momentum_buffer);
    momentum_buffer = momentum_buffer.to(_parameter.device);
No need to dispose the old momentum buffer if it's actually moved?
Good catch. Will fix.
Question: Do you want me to write something like:
if (momentum_buffer.device_type != _parameter.device_type || momentum_buffer.device_index != _parameter.device_index) {
    using var copy = result;
    result = copy.to(_parameter.device);
}
(That's not handled in the regular Optimizer.to(Device device) function. See for example here)
I think it's sufficient to check the Handle before and after the call to `to`. I think.
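Concretely, the idea being floated is roughly this (a sketch only; it assumes Tensor.to returns the same instance, with the same Handle, when no move is needed, which is exactly what the replies below call into question):

// Sketch of the handle-check idea (assumes .to() returns the same
// instance when the tensor is already on the target device).
var before = momentum_buffer.Handle;
var moved = momentum_buffer.to(_parameter.device);
if (moved.Handle != before)
    momentum_buffer.Dispose();   // a new tensor was created; release the old one
momentum_buffer = moved;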
Maybe not.
Haha
In that case, should I go with my proposed solution?
Also, does that mean we should update all the Optimizer.to(...) functions?
Maybe we should add a dispose_if_moved parameter to the to function to make it generic?
It's handled in the Module variants of to(). Follow that pattern.
Maybe it's worth writing an internal utility function that has the 'move or not move' predicate, so it can be reused.
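For instance, a reusable helper could look roughly like this (hypothetical name and shape; it just packages the "move only if needed, dispose only if moved" logic discussed above):

// Hypothetical internal helper: move a state tensor to the target device,
// disposing the original only when a new tensor was actually created.
internal static Tensor MoveStateTensor(Tensor tensor, Device device)
{
    if (tensor is null || (tensor.device_type == device.type && tensor.device_index == device.index))
        return tensor;            // already on the target device, nothing to do
    var moved = tensor.to(device);
    tensor.Dispose();             // release the original after the move
    return moved;
}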
Force-pushed from d1b68ad to bbc569e
@NiklasGustafsson What do you think of my solution? I wrote a unit test that moves and casts tensors to the same and to different types/devices, and verified that the behavior was as expected. If this solution is good, I'll wait for this to be merged and then use it in the other PR.
@shaltielshmid -- here's a tip: Once you have created a PR, try to consolidate a number of commits before pushing again. Every push starts a new build on the CI/CD pipeline, which wastes resources.
Of course, I apologize.
step = st_state.step;
square_avg = st_state.square_avg;
acc_delta = st_state.acc_delta;
square_avg = st_state.square_avg.to(_parameter.device, copy: true);
Why do we have to set copy to true here?
If the State Dictionary is actually from another existing optimizer, then the tensors are copied by reference and we don't want two different optimizers sharing a state tensor.
For example:
var lin1 = torch.nn.Linear(10, 10);
var optim1 = torch.optim.Adam(lin1.parameters(), 0.05f);
var lin2 = torch.nn.Linear(10, 10);
var optim2 = torch.optim.Adam(lin2.parameters());
optim2.load_state_dict(optim1.state_dict());
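If the handles were shared rather than copied, a later step on optim2 would also mutate optim1's state, and disposing one optimizer would invalidate the other's tensors; the copy: true in the change above prevents that. A hedged illustration, continuing the example:

// Hypothetical continuation: with reference-copied state, both optimizers
// would hold the same native tensor handles, so this step would also
// advance optim1's moment estimates.
var loss = torch.nn.functional.mse_loss(lin2.forward(torch.rand(4, 10)), torch.rand(4, 10));
loss.backward();
optim2.step();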
In that case, shouldn't the 'copy' argument be passed down from where you know whether it's necessary or not?
I can add it in, if you think it's the right way to do it.
The way I see it, the StateDictionary is an object that is created only by calling Optimizer.state_dict(), and the source Optimizer is the one that manages the tensors in the dictionary. If they are to be disposed, the source Optimizer handles that. Therefore, the new Optimizer should receive a fresh copy of the tensors to manage itself.
Meaning, the StateDictionary is just a wrapper interface for accessing the values of an Optimizer.
I think we don't ever want two Optimizers to share the same tensor handle.
That being said, if you disagree and would like me to add a copy parameter to the load_state_dict function, no problem.
Got it. I think that's probably right. If we later conceive of a case where you do want to share state, we can add an argument and set the default to 'true'.
You have run the unit tests on CUDA, right?
Yup
Merged
Fixes: #1172