
Conversation

@Borda (Collaborator) commented Jun 7, 2022

What does this PR do?

It seems we had a test in the PL codebase which was never executed:

@RunIf(min_cuda_gpus=3)
def test_batch_size_smaller_than_num_gpus(tmpdir):
    ...

and when running it on a 4-GPU machine, it is failing:
https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=742[…]b5f-c4ba606ae534&t=ad0b8e2f-da1f-5a7c-b4de-96aa939719e3&l=3333
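
For context, a gate like @RunIf(min_cuda_gpus=3) simply skips the test whenever fewer than three CUDA devices are visible, which is presumably why it had never run before. Below is a minimal sketch of the gating idea using plain pytest.mark.skipif; it is not Lightning's actual RunIf implementation.

# Sketch of the gating idea only: the test is collected but skipped
# whenever fewer CUDA devices are visible than requested.
import pytest
import torch


def min_cuda_gpus(n: int):
    """Skip the decorated test when fewer than `n` CUDA devices are available."""
    return pytest.mark.skipif(
        torch.cuda.device_count() < n,
        reason=f"requires at least {n} CUDA devices",
    )


@min_cuda_gpus(3)
def test_batch_size_smaller_than_num_gpus(tmpdir):
    ...  # body elided here, as in the snippet above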

Does your PR introduce any breaking changes? If yes, please list them.

https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @carmocca @akihironitta @Borda @tchaton @rohitgr7

@akihironitta (Contributor) commented:

Both the LTS and stable jobs are failing due to this test case:

tests/models/test_restore.py::test_running_test_pretrained_model_distrib_dp /__w/_temp/44087d0a-0039-482d-9fd9-b225849ec1e6.sh: line 1:   325 Aborted                 (core dumped) python -m coverage run --source pytorch_lightning -m pytest pytorch_lightning tests --ignore tests/benchmarks -v --junitxml=/__w/1/a/test-results.xml --durations=50
The STDIO streams did not close within 10 seconds of the exit event from process '/usr/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.

https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=74104&view=logs&j=8ef866a1-0184-56dd-f082-b2a5a6f81852&t=1ce7fca7-ac37-5f5a-ccbd-e7a8418fcbe4&l=1580

@Borda Borda changed the title from "CI: debugging GPU test for K80" to "tests: debug test_batch_size_smaller_than_num_gpus for 4-GPUs" Jun 9, 2022
@Borda Borda added the tests label Jun 9, 2022
@Borda Borda marked this pull request as ready for review June 9, 2022 15:02
@Borda Borda added this to the 1.6.x milestone Jun 9, 2022
@awaelchli (Contributor) commented Jun 21, 2022

This test is no longer relevant.

  1. When the test was written originally, it subclassed the predecessor of BoringModel, which returned an unreduced loss tensor. Today, Lightning only supports a scalar loss. The test started failing when the refactor to BoringModel was made.
  2. Returning a dict with the key "progress_bar" is no longer the syntax for logging metrics; it just gets passed to the epoch-end hooks, and no automatic reduction is applied. Today we would use self.log with prog_bar=True instead (see the sketch at the end of this comment).
  3. We have a hard requirement for progress bar metrics being logged as scalars: https://github.com/Lightning-AI/lightning/blob/55b0635a48489c5162796666073489d1b127c907/src/pytorch_lightning/trainer/connectors/logger_connector/result.py#L589-L590

Since the various code paths that were tested before no longer exist, I recommend dropping the test entirely.
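
A minimal before/after sketch of points 1 and 2; the Linear layer, the metric name "train_loss", and the loss computation here are illustrative assumptions, not the original test code.

import torch
from pytorch_lightning import LightningModule


class OldStyleModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # Unreduced, per-sample loss tensor, as in the original test's model (point 1).
        loss = self.layer(batch).sum(dim=-1)
        # Old syntax: nowadays this dict is just passed through to the epoch-end hooks;
        # no automatic reduction or progress-bar logging happens anymore (point 2).
        return {"loss": loss, "progress_bar": {"train_loss": loss.mean()}}


class NewStyleModule(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # The returned loss must be a scalar today.
        loss = self.layer(batch).sum(dim=-1).mean()
        # Current syntax: log to the progress bar explicitly.
        self.log("train_loss", loss, prog_bar=True)
        return loss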

@Borda (Collaborator, Author) commented Jun 22, 2022

Trying to run with CUDA 10.2, and it seems to be failing at the very same place... :(

@Borda Borda added the ci (Continuous Integration) and priority: 1 (Medium priority task) labels Jun 29, 2022
@Borda Borda changed the title from "tests: debug test_batch_size_smaller_than_num_gpus for 4-GPUs" to "CI: debug using K80" Jul 6, 2022
@Borda (Collaborator, Author) commented Jul 6, 2022

The actual failure is:

models/test_restore.py::test_running_test_pretrained_model_distrib_dp /__w/_temp/db606e4e-ba51-44d1-ba23-6e33ea4cbf4f.sh: line 1:   472 Aborted                 (core dumped) python -m coverage run --source pytorch_lightning -m pytest --ignore benchmarks -v --junitxml=/__w/1/a/test-results.xml --durations=50
The STDIO streams did not close within 10 seconds of the exit event from process '/bin/bash'. This may indicate a child process inherited the STDIO streams and has not yet exited.

most likely on: models/test_restore.py::test_running_test_pretrained_model_distrib_dp XXXXXX [ 48%]

@Borda Borda marked this pull request as draft July 7, 2022 06:39
@Borda (Collaborator, Author) commented Jul 7, 2022

I think the problem is that the K80s are shared / virtual cards.

@akihironitta (Contributor) commented:

@Borda Thank you very much for investigating this... Other instance types we could try are Standard_NC64as_T4_v3, which has four T4 cards, or the NV-series and NVv3-series, which have M60 cards (but the M60 has two cards inside according to this, so it might lead to the same error as the K80).

@Borda Borda requested a review from carmocca July 12, 2022 12:03
@rohitgr7 rohitgr7 mentioned this pull request Jul 12, 2022
@carmocca carmocca mentioned this pull request Jul 12, 2022
@carmocca (Contributor) commented:

Closing, as we couldn't find the cause and we chose to split the testing anyway.

@carmocca carmocca closed this Jul 28, 2022
@carmocca carmocca deleted the ci/gpu branch July 28, 2022 20:02