@akihironitta akihironitta commented Mar 17, 2022

What does this PR do?

Follows up #12368 and closes #12314.

Does your PR introduce any breaking changes? If yes, please list them.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

Comment on lines 117 to 120
horovodrun --check-build && \
pip uninstall -y horovod && \
pip install --no-cache-dir -r ./requirements/horovod.txt && \
horovodrun --check-build && \
Contributor Author

Reinstalling it somehow fixes the error...🤔

result here
root@b43ef4616cac:/# horovodrun --check-build
Horovod v0.24.2:

Available Frameworks:
    [ ] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [ ] MPI
    [ ] Gloo

Available Tensor Operations:
    [ ] NCCL
    [ ] DDL
    [ ] CCL
    [ ] MPI
    [ ] Gloo
root@b43ef4616cac:/# python -c "import horovod.torch"
Extension horovod.torch has not been built: /usr/local/lib/python3.7/dist-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still available.
root@b43ef4616cac:/# pip uninstall -y horovod
...
root@b43ef4616cac:/# pip install --no-cache-dir horovod
...
root@b43ef4616cac:/# horovodrun --check-build
Horovod v0.24.2:

Available Frameworks:
    [ ] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [ ] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [ ] MPI
    [X] Gloo
root@b43ef4616cac:/# python -c "import horovod.torch"
root@b43ef4616cac:/#
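
The before/after comparison above can also be checked mechanically. Below is a minimal sketch, not part of this PR, that parses the text printed by `horovodrun --check-build` and reports which expected components are not marked `[X]`. The helper names `parse_check_build` and `missing_components` and the `REQUIRED` mapping are hypothetical; `REQUIRED` simply mirrors the boxes that are ticked after the reinstall.

```python
import re

# Matches lines such as "    [X] PyTorch" or "    [ ] MPI".
_ITEM = re.compile(r"^\s*\[(?P<mark>[ X])\]\s+(?P<name>\S.*?)\s*$")


def parse_check_build(text):
    """Return {section: {component: built?}} from `horovodrun --check-build` text."""
    report = {}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        # Section headers look like "Available Controllers:".
        if stripped.startswith("Available ") and stripped.endswith(":"):
            section = stripped[len("Available "):-1]
            report[section] = {}
            continue
        match = _ITEM.match(line)
        if match and section is not None:
            report[section][match.group("name")] = match.group("mark") == "X"
    return report


def missing_components(report, required):
    """List (section, component) pairs from `required` that are not marked built."""
    return [
        (section, name)
        for section, names in required.items()
        for name in names
        if not report.get(section, {}).get(name, False)
    ]


# Output copied from the first (broken) run in the log above.
BROKEN = """\
Horovod v0.24.2:

Available Frameworks:
    [ ] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [ ] MPI
    [ ] Gloo

Available Tensor Operations:
    [ ] NCCL
    [ ] DDL
    [ ] CCL
    [ ] MPI
    [ ] Gloo
"""

# Assumption: the components this image is expected to provide,
# taken from the boxes ticked after the reinstall.
REQUIRED = {
    "Frameworks": ["PyTorch"],
    "Controllers": ["Gloo"],
    "Tensor Operations": ["NCCL", "Gloo"],
}

if __name__ == "__main__":
    print(missing_components(parse_check_build(BROKEN), REQUIRED))
```

For the broken output this lists Gloo under Controllers plus NCCL and Gloo under Tensor Operations. A check like this could fail an image build early instead of relying on someone reading the table by eye.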

Contributor

@akihironitta at this point, anything that makes it work is good enough :))
Did you push the image? You may need to retrigger the GPU CI job to pick up the changes

Contributor Author

Working on it!

@akihironitta
Contributor Author

akihironitta commented Mar 17, 2022

The built image has been successfully pushed to the hub, and GPU CI doesn't have the issue #12314 anymore, but I'm seeing another issue...

...
tests/callbacks/test_pruning.py::test_pruning_callback_ddp_spawn FAILED  [ 12%]
...
tests/callbacks/test_quantization.py::test_quantization[True-True-average] /__w/_temp/6c6a26ed-fe5c-4e9d-83d9-a3a6a6b6dabb.sh: line 1:   343 Segmentation fault      (core dumped) python -m coverage run --source pytorch_lightning -m pytest pytorch_lightning tests --ignore tests/benchmarks -v --junitxml=/__w/1/a/test-results.xml --durations=50
##[error]Bash exited with code '139'.

https://dev.azure.com/PytorchLightning/pytorch-lightning/_build/results?buildId=61463&view=logs&j=3afc50db-e620-5b81-6016-870a6976ad29&t=8b07f9df-ad34-5ead-2e0d-a2d35ff7ad3a
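
For readers wondering about the exit code in the log: shells conventionally report a process killed by signal N as status 128 + N, so 139 corresponds to signal 11. A minimal sketch for decoding such a status follows; the helper name `describe_exit_code` is hypothetical and only the standard library is used.

```python
import signal


def describe_exit_code(code):
    """Describe a shell exit status; values above 128 conventionally mean 128 + signal number."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # No signal with that number is defined on this platform.
            return f"exit status {code} (no matching signal)"
    return f"exit status {code}"


if __name__ == "__main__":
    print(describe_exit_code(139))
```

On Linux, signal 11 is SIGSEGV, which matches the "Segmentation fault (core dumped)" message printed just before the error line.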

@akihironitta akihironitta added the ci Continuous Integration label Mar 17, 2022
@akihironitta
Contributor Author

akihironitta commented Mar 17, 2022

UPDATE: Couldn't reproduce the above error with the exact same docker image in our cluster. Will try to debug it in the CI...

@awaelchli
Contributor

@akihironitta maybe we could try skipping the test to see whether this particular test is causing the problem or whether it is something else.

@Borda
Collaborator

Borda commented Mar 18, 2022

It seems I found the bug: it was my own typo from an earlier change, where the check for a correct build was missing. See #12318 (comment).

@Borda Borda mentioned this pull request Mar 18, 2022
12 tasks
@akihironitta akihironitta deleted the ci/fix-cuda-horovod branch March 18, 2022 09:18
Development

Successfully merging this pull request may close these issues.

AttributeError: module 'horovod.torch' has no attribute 'nccl_built'

3 participants