
Conversation

@Borda (Collaborator) commented Jun 7, 2022

What does this PR do?

It seems we had a test in the PL codebase which was never executed:

@RunIf(min_cuda_gpus=3)
def test_batch_size_smaller_than_num_gpus(tmpdir):
    ...

and when running it on a 4-GPU machine, it is failing:
https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=742[…]b5f-c4ba606ae534&t=ad0b8e2f-da1f-5a7c-b4de-96aa939719e3&l=3333
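
For context, a gate like @RunIf(min_cuda_gpus=3) simply skips the test whenever fewer than three CUDA devices are visible, which is presumably why it had never run before. Below is a minimal sketch of the gating idea using plain pytest.mark.skipif; it is not Lightning's actual RunIf implementation.

# Sketch of the gating idea only: the test is collected but skipped
# whenever fewer CUDA devices are visible than requested.
import pytest
import torch


def min_cuda_gpus(n: int):
    """Skip the decorated test when fewer than `n` CUDA devices are available."""
    return pytest.mark.skipif(
        torch.cuda.device_count() < n,
        reason=f"requires at least {n} CUDA devices",
    )


@min_cuda_gpus(3)
def test_batch_size_smaller_than_num_gpus(tmpdir):
    ...  # body elided here, as in the snippet above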

Does your PR introduce any breaking changes? If yes, please list them.

https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @carmocca @akihironitta @Borda @tchaton @rohitgr7

@akihironitta (Contributor) commented:

Both the LTS and stable jobs are failing due to this test case:

tests/models/test_restore.py::test_running_test_pretrained_model_distrib_dp /__w/_temp/44087d0a-0039-482d-9fd9-b225849ec1e6.sh: line 1:   325 Aborted                 (core dumped) python -m coverage run --source pytorch_lightning -m pytest pytorch_lightning tests --ignore tests/benchmarks -v --junitxml=/__w/1/a/test-results.xml --durations=50
The STDIO streams did not close within 10 seconds of the exit event from process '/usr/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.

https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=74104&view=logs&j=8ef866a1-0184-56dd-f082-b2a5a6f81852&t=1ce7fca7-ac37-5f5a-ccbd-e7a8418fcbe4&l=1580

@Borda Borda changed the title from "CI: debugging GPU test for K80" to "tests: debug test_batch_size_smaller_than_num_gpus for 4-GPUs" Jun 9, 2022
@Borda Borda added the tests label Jun 9, 2022
@Borda Borda marked this pull request as ready for review June 9, 2022 15:02
@Borda Borda added this to the 1.6.x milestone Jun 9, 2022
@awaelchli (Contributor) commented Jun 21, 2022

This test is no longer relevant.

  1. When the test was written originally, it subclassed the predecessor of BoringModel, which returned an unreduced loss tensor. Today, Lightning only supports a scalar loss. The test started failing when the refactor to BoringModel was made.
  2. Returning a dict with the key "progress_bar" is no longer the syntax for logging metrics; it just gets passed to the epoch-end hooks, and no automatic reduction is applied. Today we would use self.log with prog_bar=True instead (see the sketch at the end of this comment).
  3. We have a hard requirement for progress bar metrics being logged as scalars: https://github.com/Lightning-AI/lightning/blob/55b0635a48489c5162796666073489d1b127c907/src/pytorch_lightning/trainer/connectors/logger_connector/result.py#L589-L590

Since the various code paths that were tested before no longer exist, I recommend dropping the test entirely.
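
A minimal before/after sketch of points 1 and 2; the Linear layer, the metric name "train_loss", and the loss computation here are illustrative assumptions, not the original test code.

import torch
from pytorch_lightning import LightningModule


class OldStyleModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # Unreduced, per-sample loss tensor, as in the original test's model (point 1).
        loss = self.layer(batch).sum(dim=-1)
        # Old syntax: nowadays this dict is just passed through to the epoch-end hooks;
        # no automatic reduction or progress-bar logging happens anymore (point 2).
        return {"loss": loss, "progress_bar": {"train_loss": loss.mean()}}


class NewStyleModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # The returned loss must be a scalar today.
        loss = self.layer(batch).sum(dim=-1).mean()
        # Current syntax: log to the progress bar explicitly.
        self.log("train_loss", loss, prog_bar=True)
        return loss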

@Borda (Collaborator, Author) commented Jun 22, 2022

Trying to run with CUDA 10.2, and it seems to be failing at the very same place... :(

@Borda Borda added the ci (Continuous Integration) and priority: 1 (Medium priority task) labels Jun 29, 2022
@Borda Borda changed the title from "tests: debug test_batch_size_smaller_than_num_gpus for 4-GPUs" to "CI: debug using K80" Jul 6, 2022
@Borda (Collaborator, Author) commented Jul 6, 2022

The actual failure is:

models/test_restore.py::test_running_test_pretrained_model_distrib_dp /__w/_temp/db606e4e-ba51-44d1-ba23-6e33ea4cbf4f.sh: line 1:   472 Aborted                 (core dumped) python -m coverage run --source pytorch_lightning -m pytest --ignore benchmarks -v --junitxml=/__w/1/a/test-results.xml --durations=50
The STDIO streams did not close within 10 seconds of the exit event from process '/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.

most likely on: models/test_restore.py::test_running_test_pretrained_model_distrib_dp XXXXXX [ 48%]

@Borda Borda marked this pull request as draft July 7, 2022 06:39
@Borda (Collaborator, Author) commented Jul 7, 2022

I think the problem is that the K80s are shared / virtual cards.

@akihironitta (Contributor) commented:

@Borda Thank you very much for investigating this... Other instance types we could try are Standard_NC64as_T4_v3, which has four T4 cards, or the NV-series and NVv3-series, which have M60 cards (but the M60 has two cards inside according to this, so it might lead to the same error as the K80).

@Borda Borda requested a review from carmocca July 12, 2022 12:03
@rohitgr7 rohitgr7 mentioned this pull request Jul 12, 2022
@carmocca carmocca mentioned this pull request Jul 12, 2022
@carmocca (Contributor) commented:

Closing, as we couldn't find the cause and we chose to split the testing anyway.

@carmocca carmocca closed this Jul 28, 2022
@carmocca carmocca deleted the ci/gpu branch July 28, 2022 20:02