
Shared library loading logic breaks when CUDA packages are installed in a non-standard location #101314


🐛 Describe the bug

tl;dr: Some CUDA libraries are distributed alongside Torch via PyPI packages such as nvidia-cudnn-cu11, nvidia-cusparse-cu11, and so on. Torch's __init__.py has various tricks to find and load these libraries, but one of these tricks breaks when Torch is installed in a different location from the nvidia-* packages. This could be fixed by linking all of Torch's CUDA dependencies into libtorch_global_deps.so.
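
For reference, the nvidia-* helper wheels can be listed straight from the environment. This is just a quick sketch, assuming their .dist-info metadata is discoverable by importlib.metadata:

from importlib import metadata

# Enumerate the nvidia-* wheels (and torch itself) installed in this environment,
# together with the directory each one was installed into.
for dist in metadata.distributions():
    name = dist.metadata["Name"] or ""
    if name == "torch" or name.startswith("nvidia-"):
        print(name, dist.version, dist.locate_file(""))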


Longer version:

I'm using the Torch wheels from PyPI with the pants build system, which creates Python environments with a slightly unusual layout: each package ends up in its own directory, rather than everything landing in site-packages as it would in a virtualenv. This causes problems when I attempt to import PyTorch 2.0.0; the traceback and a small ctypes reproduction follow:

ImportError                               Traceback (most recent call last)
<ipython-input-20-eb42ca6e4af3> in <cell line: 1>()
----> 1 import torch

~/.cache/pants/named_caches/pex_root/installed_wheels/6befaad784004b7af357e3d87fa0863c1f642866291f12a4c2af2de435e8ac5c/torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl/torch/__init__.py in <module>
--> 239     from torch._C import *  # noqa: F403
    240 
    241 # Appease the type checker; ordinarily this binding is inserted by the

ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory
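
The failure can also be reproduced with ctypes alone, without going through torch/__init__.py. A minimal sketch, locating torch via importlib so the broken import itself isn't needed:

import ctypes
import importlib.util
import os

# Find the installed torch package without importing it.
torch_dir = os.path.dirname(importlib.util.find_spec("torch").origin)
global_deps = os.path.join(torch_dir, "lib", "libtorch_global_deps.so")

ctypes.CDLL(global_deps, mode=ctypes.RTLD_GLOBAL)  # loads fine on my system
ctypes.CDLL("libcudnn.so.8")  # OSError: cannot open shared object file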

I think this may point at an issue with the shared library loading logic in Torch. Specifically, _load_global_deps() in Torch's __init__.py first attempts to load the global deps from libtorch_global_deps.so, and only falls back to loading the individual CUDA libraries if that CDLL() call fails:

# See Note [Global dependencies]
def _load_global_deps():
    # ... snip ...

    lib_name = 'libtorch_global_deps' + ('.dylib' if platform.system() == 'Darwin' else '.so')
    here = os.path.abspath(__file__)
    lib_path = os.path.join(os.path.dirname(here), 'lib', lib_name)

    try:
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    except OSError as err:
        cuda_libs: Dict[str, str] = {
            'cublas': 'libcublas.so.*[0-9]',
            'cudnn': 'libcudnn.so.*[0-9]',
            'cuda_nvrtc': 'libnvrtc.so.*[0-9].*[0-9]',
            'cuda_runtime': 'libcudart.so.*[0-9].*[0-9]',
            'cuda_cupti': 'libcupti.so.*[0-9].*[0-9]',
            'cufft': 'libcufft.so.*[0-9]',
            'curand': 'libcurand.so.*[0-9]',
            'cusolver': 'libcusolver.so.*[0-9]',
            'cusparse': 'libcusparse.so.*[0-9]',
            'nccl': 'libnccl.so.*[0-9]',
            'nvtx': 'libnvToolsExt.so.*[0-9]',
        }
        is_cuda_lib_err = [lib for lib in cuda_libs.values() if(lib.split('.')[0] in err.args[0])]
        # ... some more logic to load libs by looking through `sys.path` ...

On my system, the CDLL() call succeeds at loading torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl/torch/lib/libtorch_global_deps.so, so it returns immediately without attempting to load the libraries in the cuda_libs dict. However, that .so file only links to a subset of the libraries listed above:

$ ldd /long/path/to/torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl/torch/lib/libtorch_global_deps.so
        linux-vdso.so.1 (0x00007ffe3b7d1000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f6d85c92000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6d85b41000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f6d85b3b000)
        libcurand.so.10 => /lib/x86_64-linux-gnu/libcurand.so.10 (0x00007f6d7ff4b000)
        libcufft.so.10 => /lib/x86_64-linux-gnu/libcufft.so.10 (0x00007f6d774be000)
        libcublas.so.11 => /lib/x86_64-linux-gnu/libcublas.so.11 (0x00007f6d6dd40000)
        libcublasLt.so.11 => /lib/x86_64-linux-gnu/libcublasLt.so.11 (0x00007f6d58cda000)
        libcudart.so.11.0 => /lib/x86_64-linux-gnu/libcudart.so.11.0 (0x00007f6d58a34000)
        libnvToolsExt.so.1 => /lib/x86_64-linux-gnu/libnvToolsExt.so.1 (0x00007f6d5882a000)
        libgomp-a34b3233.so.1 => /long/path/to/torch-2.0.0-cp39-cp39-manylinux1_x86_64.whl/torch/lib/libgomp-a34b3233.so.1 (0x00007f6d58600000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6d5840e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6d85cd9000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6d58404000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6d583e7000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6d58205000)

Some libraries from cuda_libs are missing from the ldd output. This is fine when the nvidia-* Python packages are installed in the same directory as Torch, because the dynamic linker can use Torch's RPATH to find them. Specifically, the RPATH contains a series of relative paths to the nvidia libraries, which look like this:

$ORIGIN/../../nvidia/cublas/lib:$ORIGIN/../../nvidia/cuda_cupti/lib:$ORIGIN/../../nvidia/cuda_nvrtc/lib:$ORIGIN/../../nvidia/cuda_runtime/lib:$ORIGIN/../../nvidia/cudnn/lib:$ORIGIN/../../nvidia/cufft/lib:$ORIGIN/../../nvidia/curand/lib:$ORIGIN/../../nvidia/cusolver/lib:$ORIGIN/../../nvidia/cusparse/lib:$ORIGIN/../../nvidia/nccl/lib:$ORIGIN/../../nvidia/nvtx/lib:$ORIGIN

Unfortunately, these relative paths do not resolve when Torch is installed in a different directory from the nvidia-* packages, which is the case for me.
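
To make that concrete: $ORIGIN expands to torch/lib, so the $ORIGIN/../../nvidia/cudnn/lib entry resolves to a nvidia/ directory sitting next to the torch package. A rough sketch for checking where that entry lands on a given install (again via importlib, since import torch fails here):

import importlib.util
import os

torch_pkg = os.path.dirname(importlib.util.find_spec("torch").origin)

# "$ORIGIN/../../nvidia/cudnn/lib" with $ORIGIN = <torch_pkg>/lib:
rpath_target = os.path.normpath(
    os.path.join(torch_pkg, "lib", "..", "..", "nvidia", "cudnn", "lib"))
print(rpath_target)
print(os.path.isdir(rpath_target))  # True in a virtualenv; False in my pants environment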

__init__.py already has the logic needed to fix this problem: it can scan sys.path for the missing libraries. However, that logic is currently only triggered when loading libtorch_global_deps.so fails. When I modify the code to always look for these libraries, I can import PyTorch again:

    try:
        ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
        raise OSError("libcudnn libnvrtc libcupti libcusolver libcusparse libnccl")  # always look for these libraries
    except OSError as err:
        cuda_libs: Dict[str, str] = {
            # ... etc. ...

Ideally __init__.py should use a more robust test to determine whether libcudnn and friends can be loaded. Probably the easiest fix is to link all the libs from cuda_libs into libtorch_global_deps.
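
For the first option, here is a rough sketch of what an unconditional fallback could look like. The helper name is mine, not torch's; it just mirrors the existing sys.path-scanning logic so that libraries the RPATH cannot resolve are preloaded with RTLD_GLOBAL before torch._C is imported:

import ctypes
import glob
import os
import sys

def preload_cuda_lib(lib_folder: str, lib_pattern: str) -> bool:
    """Find nvidia/<lib_folder>/lib/<lib_pattern> on sys.path and dlopen it globally."""
    for path in sys.path:
        candidates = glob.glob(os.path.join(path, "nvidia", lib_folder, "lib", lib_pattern))
        if candidates:
            ctypes.CDLL(candidates[0], mode=ctypes.RTLD_GLOBAL)
            return True
    return False

# Same mapping as the cuda_libs dict above (trimmed here for brevity).
for folder, pattern in {
    "cudnn": "libcudnn.so.*[0-9]",
    "cusparse": "libcusparse.so.*[0-9]",
    "nccl": "libnccl.so.*[0-9]",
}.items():
    preload_cuda_lib(folder, pattern)

The second option (linking everything into libtorch_global_deps.so) would mean the CDLL() call fails whenever any of these libraries is missing, which is exactly the signal the existing fallback keys on.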

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.9.5 (default, Nov 23 2021, 15:27:38) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-125-generic-x86_64-with-glibc2.31
Is CUDA available: N/A
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000
GPU 4: NVIDIA RTX A6000
GPU 5: NVIDIA RTX A6000
GPU 6: NVIDIA RTX A6000
GPU 7: NVIDIA RTX A6000

Nvidia driver version: 510.60.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 3249.791
CPU max MHz: 2450.0000
CPU min MHz: 1500.0000
BogoMIPS: 4900.34
Virtualization: AMD-V
L1d cache: 4 MiB
L1i cache: 4 MiB
L2 cache: 64 MiB
L3 cache: 512 MiB
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and _user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] flake8==3.7.9
[pip3] numpy==1.17.4
[conda] No relevant packages


Labels: module: bazel, topic: build, triaged
