[wip] fix horovod installation #12353
11883f7 to 9fa4e21
dockers/base-cuda/Dockerfile
Outdated
horovodrun --check-build && \
pip uninstall -y horovod && \
pip install --no-cache-dir -r ./requirements/horovod.txt && \
horovodrun --check-build && \
Reinstalling it somehow fixes the error...🤔
result here
root@b43ef4616cac:/# horovodrun --check-build
Horovod v0.24.2:
Available Frameworks:
[ ] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[ ] MPI
[ ] Gloo
Available Tensor Operations:
[ ] NCCL
[ ] DDL
[ ] CCL
[ ] MPI
[ ] Gloo
root@b43ef4616cac:/# python -c "import horovod.torch"
Extension horovod.torch has not been built: /usr/local/lib/python3.7/dist-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-37m-x86_64-linux-gnu.so not found
If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.
Warning! MPI libs are missing, but python applications are still available.
root@b43ef4616cac:/# pip uninstall -y horovod
...
root@b43ef4616cac:/# pip install --no-cache-dir horovod
...
root@b43ef4616cac:/# horovodrun --check-build
Horovod v0.24.2:
Available Frameworks:
[ ] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[ ] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[ ] MPI
[X] Gloo
root@b43ef4616cac:/# python -c "import horovod.torch"
root@b43ef4616cac:/#
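The log above shows `horovodrun --check-build` printing a report in which only PyTorch is ticked, while the actual import fails. A minimal sketch of how such a report could be checked mechanically is given below. The function names `parse_check_build` and `missing_features` are hypothetical, and which features count as required (here Gloo and NCCL, as in the second report) is an assumption, not something the PR specifies.

```python
import re

# Matches report entries such as "[X] NCCL" or "    [ ] MPI".
_ENTRY = re.compile(r"^\s*\[(?P<mark>[ xX])\]\s+(?P<name>.+?)\s*$")


def parse_check_build(report):
    """Return {section: {feature: built}} parsed from the report text."""
    sections = {}
    current = None
    for line in report.splitlines():
        entry = _ENTRY.match(line)
        if entry and current is not None:
            # A ticked box ("[X]") means the feature was compiled in.
            sections[current][entry["name"]] = entry["mark"].lower() == "x"
        elif "Available" in line and line.strip().endswith(":"):
            # Section headers look like "Available Controllers:".
            current = line.strip().rstrip(":")
            sections[current] = {}
    return sections


def missing_features(report, required):
    """List "section/feature" entries from `required` that are not built."""
    parsed = parse_check_build(report)
    return [
        f"{section}/{name}"
        for section, names in required.items()
        for name in names
        if not parsed.get(section, {}).get(name, False)
    ]
```

In a build script, the report text could be captured with `subprocess.run(["horovodrun", "--check-build"], capture_output=True, text=True).stdout`, and the build failed whenever `missing_features` returns a non-empty list. Checking the ticked boxes rather than relying on the command succeeding may catch the situation in the first report, where the command ran but no controller or tensor operation was available.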
@akihironitta at this point, anything that makes it work is good enough :))
Did you push the image? You may need to retrigger the GPU CI job to pick up the changes
Working on it!
The built image has been successfully pushed to the hub, and GPU CI no longer hits issue #12314, but I'm seeing another issue...
UPDATE: Couldn't reproduce the above error with the exact same Docker image in our cluster. Will try to debug it in the CI...
@akihironitta maybe we could try skipping the test to see whether this particular test is causing the problem or something else is.
It seems I found the bug: a typo I made earlier meant the check for a correct build was missing, see #12318 (comment)
What does this PR do?
Follows up #12368 and closes #12314.
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃