Conversation

@awaelchli (Contributor) commented Oct 22, 2021

What does this PR do?

We have the conda 1.10 workflow crashing randomly: https://github.com/PyTorchLightning/pytorch-lightning/runs/3971666315?check_suite_focus=true

This is the error:

tests/helpers/test_models.py::test_models[None-ParityModuleRNN] PASSED   [ 31%]
/__w/_temp/2a0a0b32-902c-4e18-bf17-5c09efcb74cf.sh: line 2:    90 Killed                  coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --durations=50 --junitxml=junit/test-results-Linux-torch1.10.xml
tests/helpers/test_models.py::test_models[None-ParityModuleMNIST] 
Error: Process completed with exit code 137.

RIP
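
For context: exit code 137 is 128 + SIGKILL (signal 9), which in a containerized CI job usually means the process was killed by the kernel OOM killer or hit a container memory/shared-memory limit. A rough way to inspect the situation from inside the job container (just a sketch; cgroup paths differ between v1 and v2):

df -h /dev/shm                                   # shared memory available to the container
free -h                                          # overall memory pressure
cat /sys/fs/cgroup/memory/memory.limit_in_bytes  # memory limit on cgroup v1 (use /sys/fs/cgroup/memory.max on v2)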

This is the list of slowest tests in the 1.10 workflow:
135.69s call     tests/helpers/test_models.py::test_models[None-ParityModuleMNIST]
54.44s call     tests/utilities/test_auto_restart.py::test_fast_forward_sampler_with_distributed_sampler_and_iterative_dataset
38.72s call     tests/callbacks/test_quantization.py::test_quantization[False-False-average]
29.42s call     tests/deprecated_api/test_remove_1-7.py::test_v1_7_0_deprecate_add_get_queue
29.23s call     tests/callbacks/test_quantization.py::test_quantization[False-True-average]
28.83s call     tests/callbacks/test_quantization.py::test_quantization[True-True-average]
28.43s call     tests/checkpointing/test_torch_saving.py::test_model_torch_save_ddp_cpu
26.92s call     tests/callbacks/test_quantization.py::test_quantization[False-True-histogram]
26.00s call     tests/callbacks/test_quantization.py::test_quantization[True-False-average]
25.44s call     tests/checkpointing/test_model_checkpoint.py::test_model_checkpoint_no_extraneous_invocations
24.25s call     tests/trainer/test_trainer.py::test_model_in_correct_mode_during_stages[ddp_cpu-2]
23.94s call     tests/core/test_results.py::test_result_reduce_ddp
23.87s call     tests/plugins/test_ddp_spawn_plugin.py::test_ddp_spawn_configure_ddp
22.44s call     tests/trainer/test_trainer.py::test_error_handling_all_stages[ddp_spawn-2]
20.90s call     tests/callbacks/test_quantization.py::test_quantization[True-False-histogram]
20.70s call     tests/callbacks/test_quantization.py::test_quantization[True-True-histogram]
Compare that to the 1.9 workflow:
64.49s call     tests/trainer/test_trainer.py::test_model_in_correct_mode_during_stages[ddp_cpu-2]
41.89s call     tests/utilities/test_auto_restart.py::test_fast_forward_sampler_with_distributed_sampler_and_iterative_dataset
32.96s call     tests/plugins/test_ddp_spawn_plugin.py::test_ddp_spawn_configure_ddp
32.09s call     tests/plugins/test_ddp_spawn_plugin.py::test_ddp_cpu
22.45s call     tests/profiler/test_profiler.py::test_simple_profiler_distributed_files
21.72s call     tests/models/test_horovod.py::test_horovod_cpu_clip_grad_by_value
19.48s call     tests/models/test_horovod.py::test_horovod_cpu
19.20s call     tests/trainer/test_data_loading.py::test_dataloader_warnings[1]
18.29s call     tests/trainer/test_trainer.py::test_error_handling_all_stages[ddp_spawn-2]
17.85s call     tests/deprecated_api/test_remove_1-7.py::test_v1_7_0_deprecate_add_get_queue
17.81s call     tests/plugins/test_ddp_spawn_plugin.py::test_ddp_spawn_add_get_queue
14.95s call     tests/callbacks/test_pruning.py::test_pruning_callback_ddp_cpu
13.21s call     tests/trainer/test_trainer.py::test_trainer_predict_ddp_cpu
13.12s call     tests/callbacks/test_quantization.py::test_quantization[False-True-average]
11.73s call     tests/models/test_horovod.py::test_horovod_cpu_implicit
11.48s call     tests/trainer/test_trainer.py::test_fit_test_synchronization
11.42s call     tests/callbacks/test_quantization.py::test_quantization[False-False-histogram]
10.71s call     tests/helpers/test_models.py::test_models[None-ParityModuleMNIST]
10.69s call     tests/callbacks/test_early_stopping.py::test_multiple_early_stopping_callbacks[callbacks2-3-False-ddp_cpu-2]
10.02s call     tests/callbacks/test_quantization.py::test_quantization[True-True-histogram]
10.01s call     tests/callbacks/test_quantization.py::test_quantization[True-False-average]
9.80s call     tests/callbacks/test_early_stopping.py::test_multiple_early_stopping_callbacks[callbacks6-3-True-ddp_cpu-2]
9.69s call     tests/callbacks/test_quantization.py::test_quantization[True-True-average]
9.44s call     tests/callbacks/test_quantization.py::test_quantization[True-False-histogram]
9.25s call     tests/callbacks/test_quantization.py::test_quantization[False-True-histogram]

Adrian
It's entirely possible this is a problem with our test suite. I am posting here in case someone has an idea how to attack this problem.
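
If anyone wants to poke at this locally, one way to start is to run the slowest offender on its own and watch memory while it runs (a sketch; the test id is taken from the durations list above, and numbers will differ outside the CI container):

python -m pytest "tests/helpers/test_models.py::test_models[None-ParityModuleMNIST]" -v --durations=0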

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

@awaelchli awaelchli added the argparse and ci (Continuous Integration) labels and removed the argparse label Oct 22, 2021
@Borda (Collaborator) commented Oct 22, 2021

seems the fix is working as expected :]

@awaelchli awaelchli added the priority: 0 High priority task label Oct 22, 2021
@awaelchli awaelchli marked this pull request as ready for review October 22, 2021 09:37
@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@awaelchli awaelchli changed the title from "Attempt to fix killed conda 1.10 CI workflow" to "Fix killed conda 1.10 CI workflow" Oct 22, 2021
@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@mergify mergify bot added the ready PRs ready to be merged label Oct 22, 2021
@mergify mergify bot requested a review from a team October 22, 2021 09:38
@Lightning-AI Lightning-AI deleted a comment from github-actions bot Oct 22, 2021
@mergify mergify bot requested a review from a team October 22, 2021 09:41

@tchaton tchaton left a comment

LGTM !

Co-authored-by: Carlos Mocholí <[email protected]>
@tchaton tchaton enabled auto-merge (squash) October 22, 2021 09:44
@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>


@akihironitta akihironitta left a comment

@awaelchli (Contributor, Author) commented Oct 22, 2021

Yes, it worked before.
I am devastated.
Random failures.

@akihironitta akihironitta removed the ready PRs ready to be merged label Oct 22, 2021
@mergify mergify bot added the ready PRs ready to be merged label Oct 22, 2021
@carmocca carmocca disabled auto-merge October 22, 2021 11:36
-container: pytorchlightning/pytorch_lightning:base-conda-py${{ matrix.python-version }}-torch${{ matrix.pytorch-version }}
+container:
+  image: pytorchlightning/pytorch_lightning:base-conda-py${{ matrix.python-version }}-torch${{ matrix.pytorch-version }}
+  options: --shm-size=1G
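
For context on the change above: Docker containers get a 64 MB /dev/shm by default, and PyTorch DataLoader workers pass tensors between processes through shared memory, so an exhausted /dev/shm can get multiprocessing-heavy tests killed; --shm-size=1G raises that limit. A quick check that the option took effect (sketch, run inside the job container):

df -h /dev/shm   # should report 1.0G instead of the 64M default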

I remember this worked in Bolts a while back by setting --ipc=host (which is another way of increasing shm, as stated in PyTorch's readme) instead of setting a specific size. Shall we try this?

Suggested change
-  options: --shm-size=1G
+  options: --ipc=host
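
For anyone comparing the two options outside of GitHub Actions, the docker-level equivalent looks roughly like this (a sketch; the image tag values are illustrative, not the exact matrix expansion):

# --shm-size gives the container a fixed-size private /dev/shm
docker run --rm --shm-size=1G pytorchlightning/pytorch_lightning:base-conda-py3.7-torch1.10 df -h /dev/shm
# --ipc=host instead shares the host's IPC namespace (and its /dev/shm) with the container
docker run --rm --ipc=host pytorchlightning/pytorch_lightning:base-conda-py3.7-torch1.10 df -h /dev/shm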


Just applied the change here. Let's see if it works...


Ok, it didn't work either...

tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[CSVLogger] PASSED [ 31%]
/__w/_temp/bdfbf602-7035-441f-aa67-5db678b2f0ea.sh: line 2:    93 Killed                  coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --durations=50 --junitxml=junit/test-results-Linux-torch1.10.xml
tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[MLFlowLogger] 
Error: Process completed with exit code 137.

https://github.com/PyTorchLightning/pytorch-lightning/runs/3975236048?check_suite_focus=true#step:6:844

@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

1 similar comment
@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@awaelchli awaelchli closed this Oct 22, 2021
@awaelchli awaelchli deleted the ci/110-issue branch October 22, 2021 20:04
@daniellepintz (Contributor)

I am seeing the same issue in #10113 and #10112. In the second one it is also timing out for conda (3.7, 1.9) (https://github.com/PyTorchLightning/pytorch-lightning/runs/3992152549?check_suite_focus=true), so it seems the issue is not specific to 1.10.

Do we have an issue tracking this?

@akihironitta (Contributor) commented Oct 25, 2021

@daniellepintz I don't think we do. I'm creating one now. Created #10129
