"CUDA error, AssertionError: Tensor-likes are not close!" on model test for cuda-resnet101 #7618

@atalman

🐛 Describe the bug

Similar to: #7143

After switching CI from CUDA 11.7 to CUDA 11.8, unit tests on Linux fail:
#7616

2023-05-23T13:41:31.5794788Z =================================== FAILURES ===================================
2023-05-23T13:41:31.5795148Z __________________ test_classification_model[cuda-resnet101] ___________________
2023-05-23T13:41:31.5795459Z Traceback (most recent call last):
2023-05-23T13:41:31.5795737Z   File "/work/test/test_models.py", line 705, in test_classification_model
2023-05-23T13:41:31.5796046Z     _assert_expected(out.cpu(), model_name, prec=prec)
2023-05-23T13:41:31.5796368Z   File "/work/test/test_models.py", line 155, in _assert_expected
2023-05-23T13:41:31.5796725Z     torch.testing.assert_close(output, expected, rtol=rtol, atol=atol, check_dtype=False, check_device=False)
2023-05-23T13:41:31.5797207Z   File "/opt/conda/envs/ci/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1511, in assert_close
2023-05-23T13:41:31.5797525Z     raise error_metas[0].to_error(msg)
2023-05-23T13:41:31.5797814Z AssertionError: Tensor-likes are not close!
2023-05-23T13:41:31.5797970Z 
2023-05-23T13:41:31.5798060Z Mismatched elements: 1 / 50 (2.0%)
2023-05-23T13:41:31.5798341Z Greatest absolute difference: 5.10198974609375 at index (0, 22) (up to 0.2 allowed)
2023-05-23T13:41:31.5798665Z Greatest relative difference: 0.2689853608608246 at index (0, 22) (up to 0.2 allowed)
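
To make the size of the mismatch easier to read, here is a rough sketch of the closeness rule documented for `torch.testing.assert_close`, `|actual - expected| <= atol + rtol * |expected|`, applied to the numbers in the log. It uses plain Python rather than torch. It assumes the reported relative difference is the absolute difference divided by `|expected|`, and that `rtol` and `atol` are both 0.2, as the "up to 0.2 allowed" messages suggest. The variable names are illustrative and not taken from `test_models.py`.

```python
# Values copied from the CI log for index (0, 22).
abs_diff = 5.10198974609375
rel_diff = 0.2689853608608246

# Assumption: both tolerances are 0.2, per "up to 0.2 allowed" in the log.
rtol = 0.2
atol = 0.2

# Assumption: rel_diff == abs_diff / |expected|, so |expected| can be recovered.
expected_mag = abs_diff / rel_diff

# Documented criterion: |actual - expected| <= atol + rtol * |expected|
allowed = atol + rtol * expected_mag
is_close = abs_diff <= allowed

print(f"|expected| ~ {expected_mag:.3f}")
print(f"allowed difference ~ {allowed:.3f}")
print(f"observed difference = {abs_diff:.3f}")
print(f"within tolerance: {is_close}")
```

Under these assumptions the single mismatched element is off by more than one unit beyond what the combined tolerance permits, so this is not a borderline failure.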

Versions

nightly 2.1.0

cc @pmeier @NicolasHug @ptrblck @malfet @ngimel
