Hardware specific parts of Accelerator Refactoring #5719
Conversation
Hello @justusschock! Thanks for updating this PR.
Comment last updated at 2021-02-01 12:37:53 UTC
Co-Authored with @awaelchi
Co-authored-by: Adrian Wälchli <[email protected]>
Codecov Report

```
@@            Coverage Diff             @@
##       release/1.2-dev   #5719   +/- ##
================================================
- Coverage          89%     86%      -4%
================================================
  Files             173     181       +8
  Lines           12495   13110     +615
================================================
+ Hits            11175   11226      +51
- Misses           1320    1884     +564
```
Seems like the last remaining issue:
tchaton left a comment:
@justusschock Really good work there! Not all comments need to be addressed in this PR.
```python
if self.is_global_zero:
    # load weights saved in ddp
    path = os.path.join(original_model.trainer.default_root_dir, "__temp_weight_distributed_end.ckpt")
```
I feel like `__temp_weight_distributed_end.ckpt` should be a GLOBAL property.
What do you mean by a global property?
```python
class ...:  # class name elided in the original suggestion
    TEMPORARY_WEIGHT_PATH = "__temp_weight_distributed_end.ckpt"

    def __init__(self, ...):
        ...

    def ...(self, ...):
        path = os.path.join(original_model.trainer.default_root_dir, self.TEMPORARY_WEIGHT_PATH)
```
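The pattern being suggested, i.e. hoisting the checkpoint filename into a class-level constant so every call site shares one definition, can be sketched as follows. The class and method names here are illustrative placeholders, not the PR's actual classes:

```python
import os

class DistributedPlugin:
    # Filename taken from the diff above; as a class attribute it can be
    # overridden by subclasses and is no longer repeated as a literal.
    TEMPORARY_WEIGHT_PATH = "__temp_weight_distributed_end.ckpt"

    def __init__(self, default_root_dir):
        self.default_root_dir = default_root_dir

    def temp_weight_path(self):
        # Join the run directory with the shared constant.
        return os.path.join(self.default_root_dir, self.TEMPORARY_WEIGHT_PATH)

plugin = DistributedPlugin("/tmp/run")
print(plugin.temp_weight_path())  # /tmp/run/__temp_weight_distributed_end.ckpt
```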
```python
# load weights if not interrupted
# TODO: check for trainer reference
if self.on_colab_kaggle and not model.trainer.testing:
```
Could we have a ColabEnv for TPU and move some of the logic there? In another PR :)
Not sure I completely understand this... You mean a ClusterEnvironment for that? We can certainly look into that.
Not sure Colab is a ClusterEnvironment.
I meant more something like a TPURunningEnvironment -> Colab, Kaggle_colab, etc...
Best,
T.C
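The kind of runtime-environment abstraction being floated here could be sketched roughly as below. The class name follows the comment's proposal, but the environment-variable checks are assumptions about how Colab and Kaggle kernels are commonly detected, not Lightning's actual API:

```python
import os

class TPURunningEnvironment:
    """Hedged sketch of the environment idea from the thread above."""

    @staticmethod
    def detect():
        # COLAB_GPU and KAGGLE_URL_BASE are variables that Colab and
        # Kaggle kernels are commonly reported to set; treat their
        # presence as a best-effort signal only.
        if os.getenv("COLAB_GPU"):
            return "colab"
        if os.getenv("KAGGLE_URL_BASE"):
            return "kaggle"
        return "local"

print(TPURunningEnvironment.detect())
```

Centralizing the detection this way would let the `on_colab_kaggle` branch above ask the environment object instead of re-checking variables inline.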
Co-authored-by: chaton <[email protected]>
What does this PR do?
Adds the hardware-specific parts of the refactoring (#5616) as Accelerators for CPU, GPU and TPU, along with the accelerator connector and the plugins for single-device training, single-TPU training and multi-TPU training (TPUSpawn).
Only files with new classes are added and some imports are changed. No integration into Trainer yet.
Should be merged after #5715, #5718 and #5714.
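The split this PR describes, hardware-specific Accelerators paired with training-type plugins, can be sketched roughly as below. These classes are simplified stand-ins to illustrate the delegation pattern, not the PR's real implementations:

```python
class SingleDevicePlugin:
    """Minimal stand-in for a training-type plugin."""

    def __init__(self, device):
        self.device = device

    def setup(self, model):
        # A real plugin would move the model to its device here.
        return f"{model} on {self.device}"

class Accelerator:
    """Hardware wrapper that delegates device handling to its plugin."""

    def __init__(self, training_type_plugin):
        self.training_type_plugin = training_type_plugin

    def setup(self, model):
        return self.training_type_plugin.setup(model)

class CPUAccelerator(Accelerator):
    # Hardware-specific subclasses customize behavior per device type.
    pass

acc = CPUAccelerator(SingleDevicePlugin("cpu"))
print(acc.setup("my_model"))  # my_model on cpu
```

Keeping device logic in the plugin is what lets one accelerator class serve single-device, single-TPU and TPUSpawn configurations.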