@ananthsub ananthsub commented Feb 7, 2022

What does this PR do?

  • Enables automatic hardware selection without duplicating code across the Trainer and individual accelerator implementations. This can be further enhanced by adding an AcceleratorRegistry: the accelerator connector could iterate through all registered accelerators, call acc_cls.is_available(), and immediately determine which hardware to use.

This aims to simplify the accelerator connector logic and the rewrite effort in #11448.
It also moves the hardcoded, duplicated assertion logic from the Trainer constructor checks to a single runtime check at trainer.fit/validate/test/predict calls.

Given the discussion on this PR, we can decide where the assertion on device availability should happen: in the accelerator's __init__, in setup_environment, or left up to individual accelerators to decide.
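The registry idea above can be sketched as follows. This is an illustrative mock, not Lightning's actual implementation: the class names, the `REGISTRY` list, and `select_accelerator` are assumptions for the sake of the example; the key point is that each accelerator exposes a static `is_available()` that the connector can query.

```python
from abc import ABC, abstractmethod


class Accelerator(ABC):
    """Illustrative base class (not Lightning's real Accelerator)."""

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Return True if this accelerator's hardware is detected."""


class CPUAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        return True  # a CPU is always present


class CUDAAccelerator(Accelerator):
    @staticmethod
    def is_available() -> bool:
        # Hypothetical check; a real implementation would query torch.cuda.
        import torch
        return torch.cuda.is_available()


# Registry-style lookup, as suggested above: iterate the registered
# accelerators and return the first whose hardware is available.
REGISTRY = [CUDAAccelerator, CPUAccelerator]


def select_accelerator():
    for acc_cls in REGISTRY:
        try:
            if acc_cls.is_available():
                return acc_cls
        except ImportError:
            continue  # dependency for this accelerator is not installed
    raise RuntimeError("No available accelerator found")
```

With this shape, the connector no longer needs hardcoded per-backend availability checks; adding a new accelerator only requires implementing `is_available()` and registering the class.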

Fixes #11818

Does your PR introduce any breaking changes? If yes, please list them.

Yes, this now:

  • raises a TypeError if a custom accelerator has not implemented the new abstract method
  • raises a RuntimeError if the configured hardware is not available during Trainer execution
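A minimal sketch of where these two errors come from, assuming the new interface is declared with abstractmethod (class and function names here are illustrative, not Lightning's API): Python's ABC machinery raises the TypeError automatically when a subclass missing the abstract method is instantiated, while the RuntimeError would come from an explicit availability check performed at execution time.

```python
from abc import ABC, abstractmethod


class Accelerator(ABC):
    """Illustrative base class; names are assumptions, not Lightning's API."""

    @staticmethod
    @abstractmethod
    def is_available() -> bool:
        """Return True if this accelerator's hardware can be used."""


class MyAccelerator(Accelerator):
    # A custom accelerator that forgot to implement is_available:
    # instantiating it now fails.
    pass


try:
    MyAccelerator()
except TypeError as err:
    # Raised automatically by Python's ABC machinery.
    print(f"TypeError: {err}")


def check_availability(acc_cls) -> None:
    # Hypothetical runtime check, as would run at fit/validate/test/predict.
    if not acc_cls.is_available():
        raise RuntimeError(f"{acc_cls.__name__} hardware is not available")
```

Because the TypeError fires at instantiation rather than deep inside a run, custom-accelerator authors find out about the missing method immediately.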

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • [n/a] Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, check the following:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@carmocca carmocca added this to the 1.6 milestone Feb 7, 2022
@ananthsub ananthsub changed the title Add assertions to GPU accelerator for CUDA availability Add assertions to GPU accelerator for device availability Feb 7, 2022
@mergify mergify bot added the ready PRs ready to be merged label Feb 7, 2022
@mergify mergify bot added the has conflicts label Feb 7, 2022
@ananthsub ananthsub force-pushed the feat/gpu-validation branch from e2dcbfe to 9cc7c00 Compare February 9, 2022 09:48
@mergify mergify bot removed the has conflicts label Feb 9, 2022
rohitgr7 previously approved these changes Feb 9, 2022

carmocca commented Feb 9, 2022

Does this close #11799 and #11798?

@rohitgr7 rohitgr7 dismissed their stale review February 9, 2022 18:39

just need some clarifications

@ananthsub (Contributor, Author) replied:
Does this close #11799 and #11798?

Yes, I will close them after this is merged

@ananthsub ananthsub merged commit 1b107c5 into Lightning-AI:master Feb 9, 2022
@ananthsub ananthsub deleted the feat/gpu-validation branch February 9, 2022 23:11

Labels

  • accelerator: cuda — Compute Unified Device Architecture GPU
  • breaking change — Includes a breaking change
  • ready — PRs ready to be merged


Development

Successfully merging this pull request may close these issues.

[RFC] Add Accelerator.is_available() interface requirement

7 participants