Conversation

@akihironitta
Contributor

@akihironitta akihironitta commented Jun 3, 2022

What does this PR do?

Temporarily pins the Docker image version to buy some time to fix the following issues.

Failures

  1. We've been seeing hangs recently (see the jobs terminated at around 1h 40min):

[Screenshot: CI jobs hanging and being terminated at around the 1h 40min mark]

  2. and a Horovod version mismatch issue (see the jobs failing within 30-40min, and the sketch below this list):

[Screenshot: CI jobs failing within 30-40min with a Horovod version mismatch]
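For context on the second failure mode, here is a minimal sketch of how a torch/Horovod build mismatch typically surfaces (illustrative only, not code from this PR). Horovod's torch bindings are a compiled extension, so an image rebuilt with a newer torch but a stale Horovod wheel fails when the extension is imported:

```python
# Hedged sanity check: horovod.torch is compiled against a specific
# torch version, so a mismatched pair fails at import time.
import torch

print("torch", torch.__version__)
try:
    import horovod
    import horovod.torch as hvd

    hvd.init()
    print("horovod", horovod.__version__, "imported and initialised OK")
except ImportError as err:
    # A Horovod wheel built against an older torch typically lands here.
    print("suspected horovod/torch build mismatch:", err)
```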

Does your PR introduce any breaking changes? If yes, please list them.

None

Before submitting

  • [n/a] Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [n/a] Did you make sure to update the documentation with your changes? (if necessary)
  • [n/a] Did you write any new necessary tests? (not for typos and docs)
  • [n/a] Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • [n/a] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, check the following:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @tchaton @rohitgr7 @carmocca @akihironitta @Borda

@akihironitta akihironitta added the ci Continuous Integration label Jun 7, 2022
@akihironitta akihironitta marked this pull request as ready for review June 7, 2022 03:49
@akihironitta akihironitta added this to the 1.6.x milestone Jun 7, 2022
@akihironitta akihironitta changed the title from ci/gpu-hang to Temporarily pin docker image used in GPU CI Jun 7, 2022
@akihironitta
Contributor Author

akihironitta commented Jun 7, 2022

Just realised that there seems to be another issue 😞 I'll be AFK for a few hours, so if anyone has an idea, any help is welcome while I'm away...

E           RuntimeError: CUDA error: the launch timed out and was terminated

https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=74048&view=logs&j=5ea502cf-d418-510c-3b5f-c4ba606ae534&t=1700365d-a4bb-5551-f2b3-aeaf0f75aa1c
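For anyone picking this up while I'm away: CUDA reports errors asynchronously, so the traceback above likely points at an unrelated op. One hedged diagnostic (assuming a CUDA-enabled torch build; the matmul is just an illustrative workload) is to set CUDA_LAUNCH_BLOCKING=1 so launches run synchronously and the traceback lands on the kernel that actually timed out:

```python
# Hedged diagnostic sketch: force synchronous kernel launches so
# "the launch timed out and was terminated" is raised at the failing op.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialised

import torch

x = torch.randn(4096, 4096, device="cuda")  # illustrative workload
y = x @ x  # with blocking launches, a timeout would surface right here
torch.cuda.synchronize()
print("completed on", torch.cuda.get_device_name(0))
```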

@carmocca
Contributor

carmocca commented Jun 7, 2022

Might be caused by the fact that @Borda recently switched to K80 GPUs. I don't think this is a problem in our codebase.
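(As a quick sanity check, not part of this PR: the GPU model the runner actually exposes can be confirmed from torch itself; K80s report compute capability (3, 7).)

```python
# Sketch: print the GPU model and CUDA runtime visible to the CI runner.
import torch

print("torch", torch.__version__, "| CUDA runtime", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    cap = torch.cuda.get_device_capability(i)  # Tesla K80 reports (3, 7)
    print(f"cuda:{i}: {name}, compute capability {cap}")
```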

@Borda
Collaborator

Borda commented Jun 7, 2022

> Might be caused by the fact that @Borda recently switched to K80 GPUs. I don't think this is a problem in our codebase.

For TorchMetrics we just migrated to LTS with CUDA 10.2 (see Lightning-AI/torchmetrics#1071), but the error was different.

@akihironitta
Contributor Author

@Borda Thank you for investigating the above error in #13245.

I looked for similar issues on GitHub and Stack Overflow, but all I could find so far is NVIDIA/MinkowskiEngine#235 (K80 and CUDA 10.2), which has no solution...

@akihironitta akihironitta added the priority: 0 High priority task label Jun 8, 2022
@akihironitta akihironitta marked this pull request as draft June 8, 2022 00:21
@Borda
Collaborator

Borda commented Jun 8, 2022

I think we can close this, as it turned out the failing builds were unrelated to the Docker image and were instead caused by the recent switch of resources... the NC12 machines used in the last few days have likely been sharing a physical GPU with other machines... so switching to NC24 seems to resolve the issue 🐰
see: https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series

@Borda Borda deleted the ci/gpu-hang branch June 8, 2022 22:26
@akihironitta akihironitta removed the priority: 0 High priority task label Jun 9, 2022
@akihironitta akihironitta restored the ci/gpu-hang branch June 9, 2022 02:04
@akihironitta akihironitta reopened this Jun 9, 2022
@akihironitta akihironitta deleted the ci/gpu-hang branch June 9, 2022 03:16
@akihironitta
Contributor Author

> I think we can close this ... the NC12 machines used in the last few days have likely been sharing a physical GPU with other machines... so switching to NC24 seems to resolve the issue 🐰

@Borda Does it? It looks like switching from NC12 to NC24 didn't help, as we're still seeing the same error in #13245: https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=74406&view=logs&j=8ef866a1-0184-56dd-f082-b2a5a6f81852&t=4006504a-0df6-59ce-b931-f083a06c2a9c

@Borda
Collaborator

Borda commented Jun 10, 2022

Well, I'm not sure then, but for a while everything was almost fine on master (just one test failing as usual).
