@daisyden
Owner

Fixes #ISSUE_NUMBER

test/test_ops.py Outdated
Comment on lines 111 to 116
if TEST_XPU:
    any_common_cpu_device_one = OpDTypes.any_common_cpu_xpu_one
else:
    any_common_cpu_device_one = OpDTypes.any_common_cpu_cuda_one


This would be better as a utility function than as an if-else block.

Owner Author

@daisyden daisyden May 22, 2024

Updated with this function:
def any_common_cpu_device_one():
    return OpDTypes.any_common_cpu_xpu_one if TEST_XPU else OpDTypes.any_common_cpu_cuda_one

test/test_ops.py Outdated
in ("signal_windows_exponential", "signal_windows_bartlett")
and dtype == torch.float64
and "cuda" in device
and ("cuda" in device or "xpu" in device)

Abstract the check into a utility function, like has_gpu_device.

def has_gpu_device(devices: List[str]):
    return "cuda" in devices or "xpu" in devices

Owner Author

def has_gpu_device(devices: List[str]):
    return "cuda" in devices or "xpu" in devices

Updated with this.
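
One point worth checking in this helper: the original condition tests a single device string such as "cuda:0" by substring, whereas `"cuda" in devices` on a `List[str]` tests exact list membership. A minimal sketch of one way to cover both cases, assuming device strings follow the usual "type" or "type:index" form. The names `is_gpu_device` and `GPU_DEVICE_TYPES` are hypothetical, and the sketch compares the device type exactly rather than by substring:

```python
from typing import Iterable

# Assumption: these are the device types treated as GPUs in this test file.
GPU_DEVICE_TYPES = ("cuda", "xpu")


def is_gpu_device(device: str) -> bool:
    """Return True if a device string such as "cuda:0" or "xpu" names a GPU."""
    # Compare only the part before the optional ":index" suffix.
    return device.split(":", 1)[0] in GPU_DEVICE_TYPES


def has_gpu_device(devices: Iterable[str]) -> bool:
    """Return True if any device string in the iterable names a GPU."""
    return any(is_gpu_device(d) for d in devices)
```

With this split, a call site holding one device string can use `is_gpu_device(device)`, and a call site holding several can use `has_gpu_device(devices)`.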

floating_and_complex_types,
floating_and_complex_types_and,
floating_types,
empty_types,

What purpose does this serve?

Owner Author

removed.

supports_rhs_python_scalar=True, # Whether the operator allows Tensor x scalar inputs
supports_one_python_scalar=False, # Whether the operator allows scalar x tensor and tensor x scalar inputs
supports_two_python_scalars=False, # Whether the operator allows scalar x scalar inputs
skipXPU=True,

It may not make sense to add an XPU-specific parameter like this.

Owner Author

@daisyden daisyden May 21, 2024

Can we replace it with "skipDevices=[xpu]", so that other devices can also use it to skip an op?

Owner Author

@daisyden daisyden May 22, 2024

Updated the skip mechanism.
The OpInfo class defines two fields:

    # skip the test for a device
    skip_device: Tuple = tuple()

    # enable the test for a device
    enable_skipped_device: Tuple = tuple()

By default skip_device=('xpu',). If enable_skipped_device=('xpu',) is defined in op_db, xpu is enabled; otherwise xpu is added to skips.

        self.skip_device = ('xpu',)
        for device in (set(self.skip_device).difference(set(self.enable_skipped_device))):
            if self.skips is not None:
                self.skips = (*self.skips, DecorateInfo(unittest.skip, device_type=device, dtypes=None))
            else:
                # trailing comma keeps skips a tuple
                self.skips = (DecorateInfo(unittest.skip, device_type=device, dtypes=None),)

enable_skipped_device is passed to the OpInfo __init__ and __post_init__ through kwargs, so there is no need to pass it manually in every inherited class such as BinaryUfuncInfo.
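
The two-field mechanism described above can be sketched in isolation. This is only an illustration under stated assumptions: `DecorateInfo` here is a hypothetical stand-in reduced to the fields the sketch needs, and `OpInfoSketch` is not the real `OpInfo` class.

```python
import unittest
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DecorateInfo:
    """Hypothetical stand-in for the real DecorateInfo, reduced for this sketch."""
    decorator: object
    device_type: Optional[str] = None
    dtypes: Optional[tuple] = None


@dataclass
class OpInfoSketch:
    """Hypothetical, minimal model of the skip fields discussed in this thread."""
    name: str
    skips: Optional[Tuple[DecorateInfo, ...]] = None
    skip_device: Tuple[str, ...] = ("xpu",)        # devices skipped by default
    enable_skipped_device: Tuple[str, ...] = ()    # devices explicitly re-enabled

    def __post_init__(self):
        # Devices that are skipped by default and not re-enabled for this op.
        still_skipped = sorted(set(self.skip_device) - set(self.enable_skipped_device))
        for device in still_skipped:
            info = DecorateInfo(unittest.skip("not enabled on this device"), device_type=device)
            # Always keep skips a tuple, whether or not it was None before.
            self.skips = (*(self.skips or ()), info)
```

An op declared as `OpInfoSketch("add")` ends up with one skip entry for xpu, while `OpInfoSketch("add", enable_skipped_device=("xpu",))` ends up with none.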

Owner Author

The design has changed: a util function and xpu_op_db.yaml are now used to skip unsupported ops.

test/test_ops.py Outdated
elif torch.cuda.is_available():
    wrong_device = "cuda"
elif torch.xpu.is_available():
    # Daisy ????

What does it mean?

Owner Author

I will remove the comments.

test/test_ops.py Outdated
Comment on lines 2681 to 2686
instantiate_device_type_tests(TestCommon, globals(), only_for="xpu")
#instantiate_device_type_tests(TestCompositeCompliance, globals(), only_for="xpu")
#instantiate_device_type_tests(TestMathBits, globals())
#instantiate_device_type_tests(TestRefsOpsInfo, globals(), only_for="cpu")
#instantiate_device_type_tests(TestFakeTensor, globals())
#instantiate_device_type_tests(TestTags, globals())

I suppose the tests should include XPU rather than run only on XPU, right?

Owner Author

Yes, the "only_for" is just for testing.

Comment on lines 1378 to 1384
@wraps(fn)
def only_fn(self, *args, **kwargs):
    if self.device_type not in ('cuda', 'xpu'):
        reason = f"onlyCUDAAndXPU: doesn't run on {self.device_type}"
        raise unittest.SkipTest(reason)

    return fn(self, *args, **kwargs)

Regarding this logic, we can refine onlyOn by extending it to support a list of devices. The code could be as follows.

class onlyOn:

    def __init__(self, device_type: Union[str, List[str]]):
        self.device_types = []
        if isinstance(device_type, str):
            self.device_types.append(device_type)
        else:
            assert isinstance(device_type, list)
            self.device_types = device_type

    def __call__(self, fn):

        @wraps(fn)
        def only_fn(slf, *args, **kwargs):
            if slf.device_type not in self.device_types:
                reason = f"Only runs on {self.device_types}"
                raise unittest.SkipTest(reason)

            return fn(slf, *args, **kwargs)

        return only_fn

With the refined onlyOn, the above code could be written as follows.

def onlyCUDAAndXPU(fn):
    # return the wrapped function so the decorator does not yield None
    return onlyOn(["cuda", "xpu"])(fn)
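
To see how the list-based decorator behaves end to end, here is a small self-contained sketch. It restates the suggested `onlyOn` in condensed form and exercises it with `FakeDeviceTest`, a hypothetical stand-in for a device-specific test instance (it is not part of the PyTorch test framework).

```python
import unittest
from functools import wraps
from typing import List, Union


class onlyOn:
    """Skip a test unless the instance's device_type is in the allowed list."""

    def __init__(self, device_type: Union[str, List[str]]):
        # Accept either a single device type or a list of them.
        self.device_types = [device_type] if isinstance(device_type, str) else list(device_type)

    def __call__(self, fn):
        @wraps(fn)
        def only_fn(slf, *args, **kwargs):
            if slf.device_type not in self.device_types:
                raise unittest.SkipTest(f"Only runs on {self.device_types}")
            return fn(slf, *args, **kwargs)

        return only_fn


def onlyCUDAAndXPU(fn):
    # The wrapped function must be returned for the decorator to work.
    return onlyOn(["cuda", "xpu"])(fn)


class FakeDeviceTest:
    """Hypothetical stand-in for a device-specific test instance."""

    def __init__(self, device_type: str):
        self.device_type = device_type

    @onlyCUDAAndXPU
    def test_something(self):
        return "ran"
```

Calling `FakeDeviceTest("xpu").test_something()` runs the body, while the same call on a `"cpu"` instance raises `unittest.SkipTest`.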

Owner Author

Thanks, will do the update.

supports_sparse_bsc=True,
supports_forward_ad=True),
supports_forward_ad=True,
enable_skipped_device=('xpu',)),

Is it possible to provide a utility function dedicated to xpu that enables particular operations to be tested on xpu? Otherwise, we have to add more code as more operations are enabled, and once XPU op coverage is good enough, that code becomes useless.

Owner Author

Added enable_skipped_op_dict in common_utils.py to read the xpu_op_db.yaml file. Added enable_skipped_device in OpInfo to skip the unsupported op.

Comment on lines 686 to 691
# skip the test for a device
skip_device: Tuple = tuple()

# enable the test for a device
enable_skipped_device: Tuple = tuple()

The information is per op, so why do we need to define two variables?

Owner Author

The design has changed: xpu tests are enabled through xpu_op_db.yaml, so no new fields are needed in OpInfo.

daisyden added 2 commits May 26, 2024 02:32
… support multiple devices. Support PYTORCH_TESTING_DEVICE_ONLY_FOR='xpu' to enable xpu test
@daisyden
Owner Author

daisyden commented May 26, 2024

Tested on all the classes of test_ops.py, with command:

PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu" PYTORCH_ENABLE_XPU_FALLBACK=1 PYTORCH_TEST_WITH_SLOW=1 pytest -v test_ops.py

The log is attached.
0526_4.log
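
The `PYTORCH_TESTING_DEVICE_ONLY_FOR` variable in the command above restricts the run to the named device type. A rough sketch of how such a filter could work, assuming the value is a comma-separated list of device types and that an unset variable means "no restriction". The function `filter_device_types` is hypothetical and is not the actual test-framework code.

```python
import os
from typing import List

# The variable name is taken from the command above.
ONLY_FOR_KEY = "PYTORCH_TESTING_DEVICE_ONLY_FOR"


def filter_device_types(available: List[str]) -> List[str]:
    """Keep only the device types listed in the env var; keep all if it is unset."""
    raw = os.environ.get(ONLY_FOR_KEY, "")
    # Assumption: multiple device types are separated by commas.
    wanted = [d.strip() for d in raw.split(",") if d.strip()]
    if not wanted:
        return list(available)
    return [d for d in available if d in wanted]
```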

daisyden pushed a commit that referenced this pull request May 31, 2024
pytorch#126677)

…destruction of tensors cached by autocast

## Root Cause
An out-of-tree device extension is loaded after torch (it lives in a different .so), so the global variable `cached_casts` may be constructed before the caching allocator and then destructed in reverse order at exit.

## Fix
Lazily initialize `cached_casts` to correct the order.

## How to Reproduce && Test
Modify the test case `TestAutocastGPU.test_cast_cache_is_global` in test/test_autocast.py to run on your out-of-tree device. You will see the following failure at the end of the test.
```bash
----------------------------------------------------------------------
Ran 1 test in 4.812s

OK
free: 0x30080ff44000400
terminate called after throwing an instance of 'c10::Error'
  what():  invalid device pointer: 0x30080ff44000400
Exception raised from free at /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/framework/core/caching_allocator.cpp:1609 (most recent call first):
frame #0: <unknown function> + 0x118fe1 (0x7ffaef4d3fe1 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11b1c4 (0x7ffaef4d61c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #2: <unknown function> + 0x117677 (0x7ffaef4d2677 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #3: <unknown function> + 0x11a2bf (0x7ffaef4d52bf in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #4: <unknown function> + 0x11a186 (0x7ffaef4d5186 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #5: <unknown function> + 0x119fde (0x7ffaef4d4fde in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #6: <unknown function> + 0x119d2e (0x7ffaef4d4d2e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #7: <unknown function> + 0x119be0 (0x7ffaef4d4be0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #8: <unknown function> + 0x119977 (0x7ffaef4d4977 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #9: <unknown function> + 0x119313 (0x7ffaef4d4313 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #10: <unknown function> + 0x118b4c (0x7ffaef4d3b4c in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #11: c10::Error::Error(c10::SourceLocation, std::string) + 0x34 (0x7ffaef4d27c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #12: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x7f (0x7ffaef4d04ed in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #13: torch_mlu::MLUCachingAllocator::Native::NativeCachingAllocator::free(void*) + 0xe6 (0x7ff9a8eeb112 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)
frame #14: torch_mlu::MLUCachingAllocator::Native::local_raw_delete(void*) + 0x3b (0x7ff9a8ed9480 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)
frame #15: std::unique_ptr<void, void (*)(void*)>::~unique_ptr() + 0x50 (0x7ffb0a5ea322 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x1269890 (0x7ffb0a5e4890 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0x1269928 (0x7ffb0a5e4928 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0x127572c (0x7ffb0a5f072c in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x1275758 (0x7ffb0a5f0758 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #20: <unknown function> + 0xb9bc7 (0x7ffaef474bc7 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #21: <unknown function> + 0xb97bc (0x7ffaef4747bc in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #22: <unknown function> + 0xdbc50 (0x7ffaef496c50 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #23: c10::TensorImpl::~TensorImpl() + 0x82 (0x7ffaef49157e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #24: c10::TensorImpl::~TensorImpl() + 0x1c (0x7ffaef4915aa in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #25: <unknown function> + 0x2f596d9 (0x7ffaf24fc6d9 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0x2f589c2 (0x7ffaf24fb9c2 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #27: <unknown function> + 0x2f57b92 (0x7ffaf24fab92 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x2f5c228 (0x7ffaf24ff228 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x30f3f70 (0x7ffaf2696f70 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #30: <unknown function> + 0x30f3f90 (0x7ffaf2696f90 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x30f5004 (0x7ffaf2698004 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x30f5024 (0x7ffaf2698024 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x31207f0 (0x7ffaf26c37f0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #34: <unknown function> + 0x3120814 (0x7ffaf26c3814 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x30f51e8 (0x7ffaf26981e8 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #36: <unknown function> + 0x30f5148 (0x7ffaf2698148 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #37: <unknown function> + 0x316ecea (0x7ffaf2711cea in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #38: <unknown function> + 0x468a7 (0x7ffb0c9ed8a7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #39: on_exit + 0 (0x7ffb0c9eda60 in /lib/x86_64-linux-gnu/libc.so.6)
<omitting python frames>
frame #47: __libc_start_main + 0xf3 (0x7ffb0c9cb083 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

```

Pull Request resolved: pytorch#126677
Approved by: https://github.com/ezyang
daisyden pushed a commit that referenced this pull request Nov 15, 2024
…ytorch#139659)

### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with the user's program or library stack.

### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.

The call-time stack is recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito). In `ProcessGroupNCCL`, we only track and report the Python part, so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.

Pull Request resolved: pytorch#139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
@daisyden daisyden merged commit b87ca3f into daisyden/baseline Nov 21, 2024
50 checks passed
daisyden pushed a commit that referenced this pull request Nov 22, 2024
See pytorch#140725 (comment)
Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: pytorch#141296
Approved by: https://github.com/huydhn
daisyden pushed a commit that referenced this pull request Dec 25, 2024
…143550)

# Motivation
Fix pytorch#143543

# Solution
We should raise a Python exception instead of aborting...

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
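
The behavior change can be pictured as a range check that raises a catchable error instead of terminating the process. A minimal pure-Python sketch, assuming the valid range is the half-open interval `[0, device_count)` as the message above states. The helper `check_device_index` is hypothetical and only mirrors the message format shown above; it is not the actual C++ check.

```python
def check_device_index(device_index: int, device_count: int) -> int:
    """Hypothetical helper: validate an index against [0, device_count)."""
    if not 0 <= device_index < device_count:
        # Raising lets the caller handle the error instead of the process aborting.
        raise RuntimeError(
            f"The device index is out of range. It must be in [0, {device_count}), "
            f"but got {device_index}."
        )
    return device_index
```

A caller can then wrap the call in `try`/`except RuntimeError` and recover, which is not possible when the process aborts.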

Pull Request resolved: pytorch#143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
daisyden pushed a commit that referenced this pull request Mar 3, 2025
…pytorch#144120) (pytorch#146372)

Summary:

# Summary

### Sticky points

CUDA graph RNG handling has changed and deviates from the original implementation. We will be left with a dangling 'offset' value and confusing naming due to BC.

## Dependencies
- Flash PR: Dao-AILab/flash-attention#1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions, which are not a real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output`

Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: pytorch#146372
Approved by: https://github.com/jbschlosser
PenghuiCheng pushed a commit that referenced this pull request Jun 4, 2025
Which inherits from `RuntimeError` and contains `error_code`, which in the case of CUDA should contain the error returned by `cudaGetLastError`.

`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's  [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely
- Convert the C string into a Python string with `PyUnicode_FromString`
- Create a new exception object using `PyObject_CallOneArg`, just like it's done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32)
- Set the `error_code` property using `PyObject_SetAttrString`
- Decref all temporary references
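
The steps listed above have a straightforward pure-Python analogue, sketched here for orientation only. `AcceleratorErrorSketch` and `new_accelerator_error` are hypothetical names, and the sketch does not use the C API or reproduce the actual implementation.

```python
class AcceleratorErrorSketch(RuntimeError):
    """Hypothetical stand-in for an exception type that carries an error code."""


def new_accelerator_error(message: str, error_code: int) -> AcceleratorErrorSketch:
    # Build the exception object from the message string
    # (the analogue of creating the Python string and calling the type with it).
    exc = AcceleratorErrorSketch(message)
    # Attach the error code as an attribute (the analogue of the setattr step).
    exc.error_code = error_code
    # No manual reference counting is needed in pure Python.
    return exc
```

Because the class derives from `RuntimeError`, existing `except RuntimeError` handlers keep working while new code can read `error_code`.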

Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'

import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```

which produces the following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0

Captured error code is  710
```
Pull Request resolved: pytorch#152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: pytorch#154436
PenghuiCheng pushed a commit that referenced this pull request Jun 5, 2025
Use uint64_t index types to avoid
```
 torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
    #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
    #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
    #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
    #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```
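A quick arithmetic check of the UBSan message above, as a minimal sketch in plain Python (the helper `wrapping_add_u64` is a hypothetical name for illustration, not something from the PR). It only shows that the two reported operands overflow a signed 64-bit `long` while still fitting in 64 unsigned bits, and that unsigned addition wraps instead of being undefined.

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1

# Operands copied from the UBSan message above.
a = 9223365439786057728
b = 13194139533312

# Python integers are arbitrary precision, so this sum is exact.
total = a + b

# The sum does not fit in a signed 64-bit `long` ...
overflows_signed = total > INT64_MAX
# ... but it does fit in an unsigned 64-bit value.
fits_unsigned = total <= UINT64_MAX


def wrapping_add_u64(x: int, y: int) -> int:
    """Model uint64_t addition, which wraps modulo 2**64 (hypothetical helper)."""
    return (x + y) & UINT64_MAX


print(overflows_signed, fits_unsigned, wrapping_add_u64(a, b) == total)
```

Signed overflow is undefined behavior in C++, while unsigned arithmetic is defined to wrap, which is presumably why switching the index type silences the sanitizer.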

Pull Request resolved: pytorch#154809
Approved by: https://github.com/soulitzer
daisyden pushed a commit that referenced this pull request Jun 10, 2025
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b

Even though a check for an empty tensor list exists in `at::cat`, a crash might happen earlier, while resolving a named dimension to a position by calling `dimname_to_position(tensors[0], dim)`; see the backtrace below:
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556 	  bool has_names() const {
   557 	    // If a user is using unnamed tensors, then we can short-circuit right here.
   558 	    // Otherwise, impl::has_names attempts to retrieve names.
-> 559 	    if (!impl_->has_named_tensor_meta()) {
   560 	      return false;
   561 	    }
   562 	    return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20  	int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21  	  TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22  	      "Please look up dimensions by name, got: name = None.");
-> 23  	  TORCH_CHECK(tensor.has_names(),
   24  	      "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25  	  const auto names = tensor.names();
   26
```
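The ordering problem can be sketched in plain Python, assuming a toy model rather than the real ATen code: `FakeTensor`, `cat_dim_position`, and the error text are hypothetical names used only for illustration. The point is that the emptiness check has to run before anything touches `tensors[0]`.

```python
from typing import Optional, Sequence


class FakeTensor:
    """Hypothetical stand-in for a named tensor; it only carries dimension names."""

    def __init__(self, names: Sequence[Optional[str]]):
        self.names = tuple(names)


def dimname_to_position(tensor: FakeTensor, dim: str) -> int:
    # Toy version of the lookup: map a dimension name to its index.
    if dim not in tensor.names:
        raise ValueError(f"Name {dim!r} not found in {tensor.names}.")
    return tensor.names.index(dim)


def cat_dim_position(tensors: Sequence[FakeTensor], dim: str) -> int:
    # Guard first: touching tensors[0] on an empty list is the analogue of
    # the null dereference in the backtrace above.
    if len(tensors) == 0:
        raise ValueError("expected a non-empty list of Tensors")
    return dimname_to_position(tensors[0], dim)


print(cat_dim_position([FakeTensor(["N", "C"])], "C"))
```

`ValueError` is used here because `TORCH_CHECK_VALUE`, mentioned in the TODOs, surfaces as `ValueError` on the Python side.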

TODOs:
 - Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
 - Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes pytorch#155306
Pull Request resolved: pytorch#155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: pytorch#155382
daisyden pushed a commit that referenced this pull request Jun 25, 2025
…torch#156600)

Don't call `sum()` on a tensor that is default constructed.

Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:

```
Traceback (most recent call last):
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
    method(*args, **kwargs)
  File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
    ln_out_cuda.backward(grad_output_cuda)
  File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0

```

Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
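
The shape of that guard can be sketched in plain Python, with `None` standing in for a default-constructed (undefined) tensor and nested lists standing in for a 2-D buffer. `column_sum` and `finalize` are hypothetical names; this is not the CUDA code from the PR.

```python
from typing import List, Optional

Matrix = List[List[float]]


def column_sum(partial: Matrix) -> List[float]:
    """Sum over dim 0, the analogue of `partial.sum(0)`."""
    return [sum(col) for col in zip(*partial)]


def finalize(partial: Optional[Matrix]) -> Optional[List[float]]:
    # None models an undefined tensor: skip both the reduction and the
    # assignment instead of reducing something that has no device.
    if partial is None:
        return None
    return column_sum(partial)


print(finalize([[1.0, 2.0], [3.0, 4.0]]), finalize(None))
```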
Pull Request resolved: pytorch#156600
Approved by: https://github.com/eqy, https://github.com/ngimel
pytorchmergebot pushed a commit that referenced this pull request Jul 24, 2025
For a tensor with a non-zero storage offset, the offset must be multiplied by the element size.

Add a regression test that creates a Tensor in an array of 6 elements with offset 3, which before the fix crashed with
```
C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15
Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib)
frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib)
frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib)
frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib)
frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib)
frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib)
frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib)
frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib)
frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib)
frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib)
frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic)
frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic)
```
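The numbers in the message can be re-derived with a small sketch in plain Python. `required_storage_bytes` is a hypothetical helper that follows the usual strided-layout formula; the "buggy" computation at the end is only an assumption that happens to reproduce the reported 15, not something the source confirms.

```python
from typing import Sequence


def required_storage_bytes(
    sizes: Sequence[int],
    strides: Sequence[int],
    storage_offset: int,
    itemsize: int,
) -> int:
    """Bytes needed to hold the last addressable element of a strided view."""
    if any(n == 0 for n in sizes):
        return 0  # an empty view needs no storage
    # Index (in elements) of the furthest element reachable through the view.
    last_index = storage_offset + sum((n - 1) * st for n, st in zip(sizes, strides))
    return (last_index + 1) * itemsize


# Values from the error message: sizes [3, 3], strides [0, 1], offset 3, itemsize 4.
needed = required_storage_bytes([3, 3], [0, 1], 3, 4)

# A 3-element float tensor at offset 3 inside a 6-element array.
numel, offset, itemsize = 3, 3, 4
storage_with_scaled_offset = (numel + offset) * itemsize  # offset counted in bytes
storage_with_raw_offset = numel * itemsize + offset       # assumption: offset left in elements

print(needed, storage_with_scaled_offset, storage_with_raw_offset)
```

With the offset scaled by the element size the storage is large enough for the expanded view; with the raw offset it comes up short, which matches the out-of-bounds report.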
Pull Request resolved: pytorch#158690
Approved by: https://github.com/angelayi
daisyden pushed a commit that referenced this pull request Sep 19, 2025
)

Summary:
This diff fixes two things which come up when testing a tgif-published pt2 model remote net:
1. Updates isSameDevice to handle meta device to avoid this error:
```
what():  Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
```

2. Updates the xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce that the device is the same for the old weight and the new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on the "meta" device and replace them with the correct weights on a real device from the xl_weights folder. Therefore, the new weight and old weight will always have different devices, and the device check is invalid. I don't think we've run into this so far because non-TBE xl weights have not been thoroughly tested until now.
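
A rough sketch of the relaxed check described in item 2, in plain Python with devices modeled as strings. `validate_replacement` is a hypothetical function, and the early return for "meta" is an assumption about how the fix behaves; the real logic lives in Weights.cpp.

```python
def validate_replacement(name: str, old_device: str, new_device: str) -> None:
    """Raise if a replacement weight lands on an unexpected device."""
    # Assumption: a "meta" placeholder carries no real data, so a weight on
    # any real device may replace it without a device-equality check.
    if old_device == "meta":
        return
    if old_device != new_device:
        raise RuntimeError(
            f"Mismatched device for {name}: {old_device} vs {new_device}"
        )


# Replacing a meta placeholder with a CPU weight is accepted.
validate_replacement("merge.user_tower.linear.weight", "meta", "cpu")
print("ok")
```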

Test Plan:
Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error:
```
Unsupported device typemeta and meta
```
Then after change #1 and before change #2 we get:
```
what():  Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```
After both changes, the run is successful.
Command:
```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0"  --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```

Rollback Plan:

Differential Revision: D80713052

Pull Request resolved: pytorch#162842
Approved by: https://github.com/henryoier
daisyden pushed a commit that referenced this pull request Oct 16, 2025
…rch#165479)

These happen when building with CMAKE_BUILD_TYPE=RelWithAssert

This should fix two types of failures that started with pytorch#163665

Disclaimer that I used a lot of AI since I don't know how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now)

The first type is

Truncated:
```
    default_pg, _ = _new_process_group_helper(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper
    backend_class = creator_fn(dist_backend_opts, backend_options)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg
    return FakeProcessGroup._create_internal(
RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero.
Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0
#8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0
#9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0
#10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)pytorch#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)pytorch#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
```
and I fix it here by getting rid of `DontIncreaseRefcount` and using `make_intrusive` to do the refcount handling instead. However, I also had to make the constructor public, which I think is not good, based on the reasoning of the original PR

The other type is
```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import
    self.assertEqual(out, "")
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != ''
- /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode.
-   if is_available() and not torch._C._c10d_init():

To execute this test, run the following from the base repo dir:
    python test/test_testing.py TestImports.test_no_warning_on_import
```
which I fix by getting rid of the `__init__` which I think is ok since it'll just error if you try to make one?

Pull Request resolved: pytorch#165479
Approved by: https://github.com/ezyang
daisyden pushed a commit that referenced this pull request Oct 23, 2025
Previously g3 = NVIDIA Tesla M60
Now g6 = NVIDIA L4
Also change cuda arch list accordingly

Pros:
More memory, newer GPU

Cons:
That was one of the few remaining tests on g3 runners, so we probably lost coverage?

We can probably run more tests in parallel now but I'm not going to do that here

Disabled a bunch of sparse tests and nestedtensor tests that were previously skipped due to not having sufficient hardware?  They are now failing with
```
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3293, in wrapper
    method(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3292, in wrapper
    with policy():
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2532, in __enter__
    self.beforeStreams[-1].synchronize()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/streams.py", line 105, in synchronize
    super().synchronize()
torch.AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from stream_synchronize at /var/lib/jenkins/workspace/c10/cuda/CUDAFunctions.h:120 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) [clone .cold] from CUDAException.cpp:0
#7 THCPStream_synchronize(_object*, _object*) from Stream.cpp:0
#8 cfunction_vectorcall_NOARGS from /usr/local/src/conda/python-3.10.14/Objects/methodobject.c:489
#9 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#10 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
#11 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.14/Include/cpython/abstract.h:114
#12 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.14/Include/internal/pycore_ceval.h:46
```
When run with `CUDA_LAUNCH_BLOCKING=1`, I got a ton of output like
```

/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [2,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [5,3,0], thread: [3,7,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,0,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,1,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [2,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,2,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [0,3,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,3,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [1,4,0] Assertion `value < upper_bound` failed.
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/integer_subbyte.h:124: cutlass::integer_subbyte<Bits, Signed>::integer_subbyte(unsigned int) [with int Bits = 2; __nv_bool Signed = false]: block: [3,8,0], thread: [3,4,0] Assertion `value < upper_bound` failed.
```

Pull Request resolved: pytorch#165158
Approved by: https://github.com/seemethere