Conversation

@awaelchli (Contributor) commented Oct 22, 2021

What does this PR do?

We have the conda 1.10 workflow crashing randomly: https://github.com/PyTorchLightning/pytorch-lightning/runs/3971666315?check_suite_focus=true

This is the error:

tests/helpers/test_models.py::test_models[None-ParityModuleRNN] PASSED   [ 31%]
/__w/_temp/2a0a0b32-902c-4e18-bf17-5c09efcb74cf.sh: line 2:    90 Killed                  coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --durations=50 --junitxml=junit/test-results-Linux-torch1.10.xml
tests/helpers/test_models.py::test_models[None-ParityModuleMNIST] 
Error: Process completed with exit code 137.

RIP
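
For context: exit code 137 is 128 + SIGKILL (signal 9), which in a containerized CI job usually means the process was killed by the kernel OOM killer or hit a container memory/shared-memory limit. A rough way to inspect the situation from inside the job container (just a sketch; cgroup paths differ between v1 and v2):

df -h /dev/shm                                   # shared memory available to the container
free -h                                          # overall memory pressure
cat /sys/fs/cgroup/memory/memory.limit_in_bytes  # memory limit on cgroup v1 (use /sys/fs/cgroup/memory.max on v2)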

This is the list of slowest tests in the 1.10 workflow:
135.69s call     tests/helpers/test_models.py::test_models[None-ParityModuleMNIST]
54.44s call     tests/utilities/test_auto_restart.py::test_fast_forward_sampler_with_distributed_sampler_and_iterative_dataset
38.72s call     tests/callbacks/test_quantization.py::test_quantization[False-False-average]
29.42s call     tests/deprecated_api/test_remove_1-7.py::test_v1_7_0_deprecate_add_get_queue
29.23s call     tests/callbacks/test_quantization.py::test_quantization[False-True-average]
28.83s call     tests/callbacks/test_quantization.py::test_quantization[True-True-average]
28.43s call     tests/checkpointing/test_torch_saving.py::test_model_torch_save_ddp_cpu
26.92s call     tests/callbacks/test_quantization.py::test_quantization[False-True-histogram]
26.00s call     tests/callbacks/test_quantization.py::test_quantization[True-False-average]
25.44s call     tests/checkpointing/test_model_checkpoint.py::test_model_checkpoint_no_extraneous_invocations
24.25s call     tests/trainer/test_trainer.py::test_model_in_correct_mode_during_stages[ddp_cpu-2]
23.94s call     tests/core/test_results.py::test_result_reduce_ddp
23.87s call     tests/plugins/test_ddp_spawn_plugin.py::test_ddp_spawn_configure_ddp
22.44s call     tests/trainer/test_trainer.py::test_error_handling_all_stages[ddp_spawn-2]
20.90s call     tests/callbacks/test_quantization.py::test_quantization[True-False-histogram]
20.70s call     tests/callbacks/test_quantization.py::test_quantization[True-True-histogram]
Compare that to the 1.9 workflow:
64.49s call     tests/trainer/test_trainer.py::test_model_in_correct_mode_during_stages[ddp_cpu-2]
41.89s call     tests/utilities/test_auto_restart.py::test_fast_forward_sampler_with_distributed_sampler_and_iterative_dataset
32.96s call     tests/plugins/test_ddp_spawn_plugin.py::test_ddp_spawn_configure_ddp
32.09s call     tests/plugins/test_ddp_spawn_plugin.py::test_ddp_cpu
22.45s call     tests/profiler/test_profiler.py::test_simple_profiler_distributed_files
21.72s call     tests/models/test_horovod.py::test_horovod_cpu_clip_grad_by_value
19.48s call     tests/models/test_horovod.py::test_horovod_cpu
19.20s call     tests/trainer/test_data_loading.py::test_dataloader_warnings[1]
18.29s call     tests/trainer/test_trainer.py::test_error_handling_all_stages[ddp_spawn-2]
17.85s call     tests/deprecated_api/test_remove_1-7.py::test_v1_7_0_deprecate_add_get_queue
17.81s call     tests/plugins/test_ddp_spawn_plugin.py::test_ddp_spawn_add_get_queue
14.95s call     tests/callbacks/test_pruning.py::test_pruning_callback_ddp_cpu
13.21s call     tests/trainer/test_trainer.py::test_trainer_predict_ddp_cpu
13.12s call     tests/callbacks/test_quantization.py::test_quantization[False-True-average]
11.73s call     tests/models/test_horovod.py::test_horovod_cpu_implicit
11.48s call     tests/trainer/test_trainer.py::test_fit_test_synchronization
11.42s call     tests/callbacks/test_quantization.py::test_quantization[False-False-histogram]
10.71s call     tests/helpers/test_models.py::test_models[None-ParityModuleMNIST]
10.69s call     tests/callbacks/test_early_stopping.py::test_multiple_early_stopping_callbacks[callbacks2-3-False-ddp_cpu-2]
10.02s call     tests/callbacks/test_quantization.py::test_quantization[True-True-histogram]
10.01s call     tests/callbacks/test_quantization.py::test_quantization[True-False-average]
9.80s call     tests/callbacks/test_early_stopping.py::test_multiple_early_stopping_callbacks[callbacks6-3-True-ddp_cpu-2]
9.69s call     tests/callbacks/test_quantization.py::test_quantization[True-True-average]
9.44s call     tests/callbacks/test_quantization.py::test_quantization[True-False-histogram]
9.25s call     tests/callbacks/test_quantization.py::test_quantization[False-True-histogram]

Adrian
It's entirely possible this is a problem with our test suite. I am posting here in case someone has an idea how to attack this problem.
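
If anyone wants to poke at this locally, one way to start is to run the slowest offender on its own and watch memory while it runs (a sketch; the test id is taken from the durations list above, and numbers will differ outside the CI container):

python -m pytest "tests/helpers/test_models.py::test_models[None-ParityModuleMNIST]" -v --durations=0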

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

@awaelchli awaelchli added the argparse and ci (Continuous Integration) labels and removed the argparse label Oct 22, 2021
@Borda (Collaborator) commented Oct 22, 2021

seems the fix is working as expected :]

@awaelchli awaelchli added the priority: 0 High priority task label Oct 22, 2021
@awaelchli awaelchli marked this pull request as ready for review October 22, 2021 09:37
@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@awaelchli awaelchli changed the title from "Attempt to fix killed conda 1.10 CI workflow" to "Fix killed conda 1.10 CI workflow" Oct 22, 2021
@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@mergify mergify bot added the ready PRs ready to be merged label Oct 22, 2021
@mergify mergify bot requested a review from a team October 22, 2021 09:38
@Lightning-AI Lightning-AI deleted a comment from github-actions bot Oct 22, 2021
@mergify mergify bot requested a review from a team October 22, 2021 09:41

@tchaton tchaton left a comment

LGTM !

Co-authored-by: Carlos Mocholí <[email protected]>
@tchaton tchaton enabled auto-merge (squash) October 22, 2021 09:44
@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>


@akihironitta akihironitta left a comment

@awaelchli (Contributor, Author) commented Oct 22, 2021

Yes, it worked before.
I am devastated.
Random failures.

@akihironitta akihironitta removed the ready PRs ready to be merged label Oct 22, 2021
@mergify mergify bot added the ready PRs ready to be merged label Oct 22, 2021
@carmocca carmocca disabled auto-merge October 22, 2021 11:36
-container: pytorchlightning/pytorch_lightning:base-conda-py${{ matrix.python-version }}-torch${{ matrix.pytorch-version }}
+container:
+  image: pytorchlightning/pytorch_lightning:base-conda-py${{ matrix.python-version }}-torch${{ matrix.pytorch-version }}
+  options: --shm-size=1G
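
For context on the change above: Docker containers get a 64 MB /dev/shm by default, and PyTorch DataLoader workers pass tensors between processes through shared memory, so an exhausted /dev/shm can get multiprocessing-heavy tests killed; --shm-size=1G raises that limit. A quick check that the option took effect (sketch, run inside the job container):

df -h /dev/shm   # should report 1.0G instead of the 64M default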

I remember this worked in Bolts a while back by setting --ipc=host (which is another way of increasing shm, as stated in PyTorch's readme) instead of setting a specific size. Shall we try this?

Suggested change
-  options: --shm-size=1G
+  options: --ipc=host
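
For anyone comparing the two options outside of GitHub Actions, the docker-level equivalent looks roughly like this (a sketch; the image tag values are illustrative, not the exact matrix expansion):

# --shm-size gives the container a fixed-size private /dev/shm
docker run --rm --shm-size=1G pytorchlightning/pytorch_lightning:base-conda-py3.7-torch1.10 df -h /dev/shm
# --ipc=host instead shares the host's IPC namespace (and its /dev/shm) with the container
docker run --rm --ipc=host pytorchlightning/pytorch_lightning:base-conda-py3.7-torch1.10 df -h /dev/shm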


Just applied the change here. Let's see if it works...


Ok, it didn't work either...

tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[CSVLogger] PASSED [ 31%]
/__w/_temp/bdfbf602-7035-441f-aa67-5db678b2f0ea.sh: line 2:    93 Killed                  coverage run --source pytorch_lightning -m pytest pytorch_lightning tests -v --durations=50 --junitxml=junit/test-results-Linux-torch1.10.xml
tests/loggers/test_all.py::test_logger_created_on_rank_zero_only[MLFlowLogger] 
Error: Process completed with exit code 137.

https://github.com/PyTorchLightning/pytorch-lightning/runs/3975236048?check_suite_focus=true#step:6:844

@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

1 similar comment
@github-actions

Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>

@awaelchli awaelchli closed this Oct 22, 2021
@awaelchli awaelchli deleted the ci/110-issue branch October 22, 2021 20:04
@daniellepintz (Contributor)

I am seeing the same issue in #10113 and #10112. In the second one it is also timing out for conda (3.7, 1.9) (https://github.com/PyTorchLightning/pytorch-lightning/runs/3992152549?check_suite_focus=true), so it seems the issue is not specific to 1.10.

Do we have an issue tracking this?

@akihironitta (Contributor) commented Oct 25, 2021

@daniellepintz I don't think we do. I'm creating one now. Created #10129
