forked from pytorch/pytorch
Daisyden/upstream #2
Closed
Conversation
Differential Revision: [D63851216](https://our.internmc.facebook.com/intern/diff/D63851216) Pull Request resolved: pytorch#136999 Approved by: https://github.com/leslie-fang-intel, https://github.com/chenyang78, https://github.com/hl475
This PR contains multiple fixes for issue pytorch#135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call. As its name suggests, it may initialize a CUDA context.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)   <-- no additional context yet
del work        <-- additional context shows up
```

### Debug process
We chased it down to the destruction of a `Future` object -- a member variable of `Work` -- and then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When `events_` is destroyed, we end up at:
https://github.com/pytorch/pytorch/blob/1f3a79379012b408e0375e81fe9205dcba5e34ba/c10/cuda/impl/CUDAGuardImpl.h#L106-L121
When there is no "preset" CUDA context (**which is the case for the Python garbage collector**), line 112, `c10::cuda::GetDevice(&orig_device)`, will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 -- **that's how ranks 1, 2, ... can create an extra context on device 0!**

### Solution
This PR adds an explicit destructor to `Future`. In this destructor, each event is destroyed under a device guard.

## Test
Added `test_extra_cuda_context`, implemented via
- `pynvml` (if available), or
- a memory-consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: pytorch#135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
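A minimal sketch of the kind of check the new `test_extra_cuda_context` performs when `pynvml` is available (hedged: this is not the test's actual code; the helper below is hypothetical):

```python
import os
import pynvml

def assert_no_extra_context(my_device: int, world_size: int) -> None:
    """Verify this process holds a CUDA context only on its own device."""
    pynvml.nvmlInit()
    try:
        pid = os.getpid()
        for dev in range(world_size):
            handle = pynvml.nvmlDeviceGetHandleByIndex(dev)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            has_ctx = any(p.pid == pid for p in procs)
            # Before the fix, ranks 1, 2, ... would also show up on device 0.
            assert has_ctx == (dev == my_device), (
                f"unexpected CUDA context state on device {dev}"
            )
    finally:
        pynvml.nvmlShutdown()
```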
Use the latest clang. Pull Request resolved: pytorch#128763 Approved by: https://github.com/malfet
Pull Request resolved: pytorch#136960 Approved by: https://github.com/guilhermeleobas, https://github.com/jansel
…torch#137135) NumPy now throws an OverflowError when trying to create np.uint64(-1) Pull Request resolved: pytorch#137135 Approved by: https://github.com/Skylion007
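For reference, a minimal illustration of the behavior change the test had to adapt to (assuming NumPy 2.x; older NumPy silently wrapped the value):

```python
import numpy as np

try:
    np.uint64(-1)          # NumPy 2.x: raises OverflowError
except OverflowError:
    pass

# An explicit wrap-around conversion still works.
wrapped = np.array(-1).astype(np.uint64)
assert int(wrapped) == np.iinfo(np.uint64).max  # 18446744073709551615
```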
Summary: Link CPU pins function in MTIA hooks to the host allocator implementation Test Plan: signals unit test in D63709424 Differential Revision: D63352770 Pull Request resolved: pytorch#137283 Approved by: https://github.com/egienvalue
…137257) One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective. Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving. Pull Request resolved: pytorch#137257 Approved by: https://github.com/Chillee
Pull Request resolved: pytorch#136961 Approved by: https://github.com/jansel
) This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their respective inplace versions). These functions only had refs implementations, which was the root cause of significant overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA backend. Running the meta functions resulted in the following improvements:
- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923

Pull Request resolved: pytorch#136909 Approved by: https://github.com/jansel
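A small illustration of what a meta function provides (hedged; this snippet does not measure the speedup above): with meta kernels registered, these ops can run on `meta` tensors, computing only output metadata, which is what tracing backends such as PyTorch/XLA rely on.

```python
import torch

a = torch.empty(1024, device="meta")
b = torch.empty(1024, device="meta")
t = torch.empty(1024, device="meta")

# No data is touched here; only output shape/dtype/device are computed.
out1 = torch.lerp(a, b, 0.5)
out2 = torch.addcmul(a, b, t, value=2.0)
out3 = torch.addcdiv(a, b, t, value=0.5)
print(out1.shape, out2.dtype, out3.device)  # torch.Size([1024]) torch.float32 meta
```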
…ytorch#136899)" This reverts commit 4f93de8. Reverted pytorch#136899 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#136899 (comment)))
Pull Request resolved: pytorch#137181 Approved by: https://github.com/Skylion007
We didn't support multiple levels of vmap. The main problem is that, during the batching rule, we need to exclude the vmap dispatch key (FuncTorchBatched), just as our C++ batching rules do. Test Plan: - new test Pull Request resolved: pytorch#137306 Approved by: https://github.com/Chillee
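For context, a sketch of what "multiple levels of vmap" means from the user's side (hedged: the actual fix and regression test concern the Python batching-rule internals; this snippet only shows two nested `vmap` levels):

```python
import torch
from torch.func import vmap

def dot(x, y):
    return (x * y).sum()

xs = torch.randn(3, 4)
ys = torch.randn(5, 4)

# Outer vmap over xs, inner vmap over ys: both FuncTorchBatched levels
# must be handled correctly for this nesting to work.
pairwise = vmap(lambda x: vmap(lambda y: dot(x, y))(ys))(xs)
print(pairwise.shape)  # torch.Size([3, 5])
```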
Title Differential Revision: [D60432217](https://our.internmc.facebook.com/intern/diff/D60432217/) Pull Request resolved: pytorch#132703 Approved by: https://github.com/tarun292
…moved (pytorch#136835) When the stub file `nn/parallel/distributed.pyi` was removed (pytorch#88701), some types that previously existed became unavailable. This pull request adds them back. Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints makes them nicer. Pull Request resolved: pytorch#136835 Approved by: https://github.com/kwen2501
…ntion create_block_mask dynamic shapes (pytorch#137163) Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#137163 Approved by: https://github.com/Chillee
Summary: When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`. I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call. We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call. Test Plan: test_export Differential Revision: D63773534 Pull Request resolved: pytorch#137230 Approved by: https://github.com/avikchaudhuri
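A hedged sketch of the user-facing surface involved (the module and shapes below are made up): the invariant established here is that marker side effects such as `_dynamo_dynamic_indices` no longer linger on the sample inputs once `export()` returns.

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin()

x = torch.randn(4, 8)
ep = export(M(), (x,), dynamic_shapes={"x": {0: Dim.AUTO}})

# After this change, no dynamo decorator attributes are left behind on the
# user's input tensor once export() finishes.
assert not hasattr(x, "_dynamo_dynamic_indices")
```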
…24485) This follows pytorch#119449 to make setenv thread-safe. Pull Request resolved: pytorch#124485 Approved by: https://github.com/eqy
…torch#137236)

Summary: Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](https://github.com/triton-lang/triton/blob/762a7d197c4ea68e6e3a7895b5343a4afe894d0d/python/triton/runtime/autotuner.py#L106) in the Triton code includes those special configs (names and values) among the potential arguments to the kernel. [Here](https://github.com/triton-lang/triton/blob/762a7d197c4ea68e6e3a7895b5343a4afe894d0d/python/triton/runtime/jit.py#L613) some of those may be included in the actual kernel arguments, given that their names are present among the kernel parameters. This PR replicates this behavior in user-defined Triton kernel compilation in PT2.

Resolves pytorch#136550.

Test Plan:
```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params inductor [] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] .inductor [] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] .inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] .inductor [] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] .inductor [] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] .inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)] inline_call [] stats [('calls_captured', 2), ('unique_graphs', 1)] aot_autograd [('total', 1), ('ok', 1)] .
----------------------------------------------------------------------
Ran 6 tests in 6.283s

OK
```
Pull Request resolved: pytorch#137236
Approved by: https://github.com/zou3519
…Kernel (pytorch#136331)" This reverts commit 592e3a3. Reverted pytorch#136331 on behalf of https://github.com/albanD due to Breaks aarch64 builds, see link below ([comment](pytorch#136331 (comment)))
This reverts commit a93d387. Reverted pytorch#135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](pytorch#135273 (comment)))
…orch#136909)" This reverts commit e4b98b1. Reverted pytorch#136909 on behalf of https://github.com/albanD due to breaks trunk jobs ([comment](pytorch#136909 (comment)))
Summary: Fixes pytorch#136209. Because _scaled_mm has an out variant, the generated cpp fallback call should call _scaled_mm_out. The ABI-compatible mode needs more work. Differential Revision: [D63757728](https://our.internmc.facebook.com/intern/diff/D63757728) Pull Request resolved: pytorch#137008 Approved by: https://github.com/hl475
Summary: Similar to pytorch#137008, but for supporting _scaled_mm in the ABI-compatible mode. Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729) Pull Request resolved: pytorch#137132 Approved by: https://github.com/chenyang78 ghstack dependencies: pytorch#137008
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#137320 Approved by: https://github.com/Skylion007
raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work. Pull Request resolved: pytorch#131114 Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD Co-authored-by: Nichols A. Romero <[email protected]>
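A hedged usage note: the knob in question is the long-standing debugging environment variable for bypassing the CUDA caching allocator (not something introduced here), and a GPU is required for the snippet to exercise it.

```python
import os

# Must be set before CUDA is initialized; before this PR, allocations routed
# through raw_alloc (cuDNN, MIOpen, thrust, TunableOp workspaces) still went
# through the caching allocator even with this set.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch

x = torch.randn(1024, device="cuda")  # served without the caching allocator
```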
### Context This PR allows CUTLASS kernels usage in AOTI. It does this by: * For any CUTLASS kernels that win during autotuning, compile them as a .so & .o * When creating the final model .so, link all the CUTLASS kernels .o files * Make sure we codegen things correctly (argument dtypes and specify extern "C" linking for the CUTLASS kernel) ### Example https://gist.github.com/ColinPeppler/e834fa2255c37e9444b6d540bf7bd04d#file-model-cpp-L548-L549 ``` TORCH_LOGS="+output_code" python test/inductor/test_cutlass_backend.py -v -k test_max_autotune_cutlass_backend_regular_mm ``` Pull Request resolved: pytorch#134379 Approved by: https://github.com/tenpercent, https://github.com/chenyang78
Summary: hardcode "val" field for autocast (similar to set_grad_enabled), to bypass the verifier check. Test Plan: CI Differential Revision: D63345767 Pull Request resolved: pytorch#137287 Approved by: https://github.com/angelayi
…ytorch#137231)

Summary: We added a unit test for the recently added pad_mm pattern in customized optimus (D63040455), which resolves the long-computation-kernel issue for BF16 on A100.

Test Plan:
```
buck2 test mode/opt //caffe2/test/inductor:pad_mm -- test_pad_mm_bf16
```
Buck UI: https://www.internalfb.com/buck2/4dd4c90c-4a2a-4859-923c-a4008f78a1cd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9851624237127136
Network: Up: 100KiB Down: 4.3GiB (reSessionID-87f11454-d920-47af-9af5-39ca0572b7c6)
Jobs completed: 7079. Time elapsed: 3:34.3s. Cache hits: 99%. Commands: 7061 (cached: 7024, remote: 19, local: 18)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D63794727
Pull Request resolved: pytorch#137231
Approved by: https://github.com/henrylhtsang
The red dotted line is 1.5 <img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517"> expected taken from the average. <img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de"> Pull Request resolved: pytorch#136573 Approved by: https://github.com/ezyang
Differential Revision: [D63864743](https://our.internmc.facebook.com/intern/diff/D63864743) Pull Request resolved: pytorch#137310 Approved by: https://github.com/avikchaudhuri
…04361 (pytorch#137637) Pull Request resolved: pytorch#137637 Approved by: https://github.com/albanD
…enchmarks (pytorch#137541) Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky, with 8% detected variance, so I set it up with a 20% threshold (8 * 2). The others are stable within ±1.5%. <img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf"> <img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68"> Pull Request resolved: pytorch#137541 Approved by: https://github.com/aorenste
…er (pytorch#137308) Fixes pytorch#115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about. The PR improves the documentation and CMake's behavior for ROCM builds. - Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`. - CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH` Pull Request resolved: pytorch#137308 Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
…ytorch#137654) Fixed issue where nn.Transformer().generate_square_subsequent_mask() doesn't respect set_default_device() and set_default_dtype(). Fixes pytorch#137186 Pull Request resolved: pytorch#137654 Approved by: https://github.com/mikaylagawarecki
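A small before/after illustration of the fix (hedged; the output comments assume the patched behavior described above):

```python
import torch
import torch.nn as nn

torch.set_default_dtype(torch.float64)
# torch.set_default_device("cuda")  # likewise respected after the fix, if a GPU is present

mask = nn.Transformer.generate_square_subsequent_mask(4)
print(mask.dtype)   # torch.float64 with this fix; previously always torch.float32
print(mask.device)  # follows torch.set_default_device() with this fix
```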
Summary: Fix the sequence number in the execution trace dump for matching between a collective/p2p op and its wait in execution trace replay. `ProcessGroupNCCL` has two sequence number counters, `seqCollective_` and `seqP2P_`. https://github.com/pytorch/pytorch/blob/b18ba9419e7062acbd49bef5c388e1b1d6a170dc/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp#L1188-L1191 However, `WorkNCCL` only has one sequence number member, `seq_`. https://github.com/pytorch/pytorch/blob/b18ba9419e7062acbd49bef5c388e1b1d6a170dc/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp#L387 We need to match collective and p2p ops with their waits separately. facebookresearch/param@29b5a46 Depends on: pytorch#135132 Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test Differential Revision: Pull Request resolved: pytorch#134578 Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
Summary: Remove some stale items Test Plan: CI Differential Revision: D64133246 Pull Request resolved: pytorch#137634 Approved by: https://github.com/hl475
Remove the `static` keyword from the vec-register constants in the exp_u20 implementation. With the BF16 input shapes of BertLarge, the SDPA kernel improves from 5.1ms to 4.7ms on SPR with 56 threads. Pull Request resolved: pytorch#137571 Approved by: https://github.com/jgong5
…v2 (pytorch#137149) during auto_functionalize_v2 if we encounter a view such that size() stride() and storage_offset() matches the base we create a view that is regenerated by calling aten.alias instead of as_strided for better performance. Pull Request resolved: pytorch#137149 Approved by: https://github.com/zou3519
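A hedged sketch of the condition being exploited: when a view has the same size, stride, and storage offset as its base, regenerating it via `aten.alias` is equivalent to the general `as_strided` call, but cheaper.

```python
import torch

base = torch.randn(4, 4)
view = base.view(4, 4)  # same size/stride/storage_offset as base

regenerated_fast = torch.ops.aten.alias(base)
regenerated_general = torch.ops.aten.as_strided(
    base, view.size(), view.stride(), view.storage_offset()
)
assert torch.equal(regenerated_fast, regenerated_general)
assert regenerated_fast.data_ptr() == base.data_ptr()  # still shares storage
```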
Summary: Fixing a warning, so we can enable it globally. Test Plan: Sandcastle-only, no runtime changes. Differential Revision: D64122115 Pull Request resolved: pytorch#137619 Approved by: https://github.com/Skylion007
…uffer() to 0 (pytorch#137569) It seems that there's a bug in `TensorMaker` - it would treat `storage_offset` as bytes when calculating the storage size, but as numel when setting the tensor `storage_offset`. This seems to be causing tensors returned by get_buffer() with non-0 offset to report wrong storage size. Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer. Pull Request resolved: pytorch#137569 Approved by: https://github.com/weifengpy ghstack dependencies: pytorch#137567
**Summary** Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion. Pull Request resolved: pytorch#135101 Approved by: https://github.com/jgong5, https://github.com/jansel
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes pytorch#89492 Fixes pytorch#75240 Pull Request resolved: pytorch#136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby, https://github.com/eqy
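A usage sketch (hedged; requires a CUDA device): with deterministic algorithms enabled, `cumsum` on CUDA now dispatches to the decomposition, so repeated runs produce identical results.

```python
import torch

torch.use_deterministic_algorithms(True)

x = torch.randn(1 << 20, device="cuda")
a = torch.cumsum(x, dim=0)
b = torch.cumsum(x, dim=0)
assert torch.equal(a, b)  # bit-identical under deterministic mode
```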
…h#136867) Pull Request resolved: pytorch#136867 Approved by: https://github.com/Chillee
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants. Test Plan: fixed test Differential Revision: D64081786 Pull Request resolved: pytorch#137547 Approved by: https://github.com/tugsbayasgalan
…rch#137606) Some unit tests were failing relating to argmin_vec/argmax_vec due to a bug in GCC affecting versions <= 12 on aarch64 platforms with SVE https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 Fixes pytorch#137597 Pull Request resolved: pytorch#137606 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <[email protected]>
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the format of the rocm rpm suite (the header is in include/rccl/rccl.h), however the source code is just src/rccl.h. Using the rccl/rccl.h will make us find the rpm's header but not the src code's header. Test Plan: buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom data_loader.dataset.table_ds=[2024-09-04] data_loader.dataset.batch_size=512 max_ind_range=10 w/o this diff, it'll show 2.18 nccl version Differential Revision: D62371434 Pull Request resolved: pytorch#135472 Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
…when unbacked (pytorch#137097)" This reverts commit 4304c68. Reverted pytorch#137097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it seems to increase the compilation time a lot causing some jobs to timeout ([comment](pytorch#137097 (comment)))
daisyden pushed a commit that referenced this pull request on Nov 15, 2024
…ytorch#139659)

### Motivation
Today, the watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with the user's program or library stack.

### This PR
This PR gives the watchdog the ability to report the call-time stack of the collective, so that it is easier to track the error back to the program's behavior. The call-time stack is recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito). In `ProcessGroupNCCL`, we only track / report the Python part so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).
```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` turns on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```
From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.

Pull Request resolved: pytorch#139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
daisyden pushed a commit that referenced this pull request on Nov 21, 2024
Summary: OSS flight recorder does not work because we renamed `trace_dir` to `folder` in the internal version to reuse the loader. Fixes item #2 in reported issue: pytorch#140879 Test Plan: BEFORE: ``` ❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node1_ tabulate is not installed. Proceeding without it. Traceback (most recent call last): File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module> main() File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 44, in main details, version = read_dir(args) File "/home/cpio/local/pytorch/tools/flight_recorder/components/loader.py", line 89, in read_dir assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}" AttributeError: 'Namespace' object has no attribute 'folder' ``` AFTER: ``` python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_ tabulate is not installed. Proceeding without it. Traceback (most recent call last): File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module> main() File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main db = build_db(details, args, version) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db check_no_missing_dump_files(entries, memberships) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files dumps_ranks == all_ranks AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119} ❯ git status fatal: not a git repository (or any parent up to mount point /data/users/cpio) Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set). ❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_ tabulate is not installed. Proceeding without it. 
Traceback (most recent call last): File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module> main() File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main db = build_db(details, args, version) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db check_no_missing_dump_files(entries, memberships) File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files dumps_ranks == all_ranks AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119} ``` Differential Revision: D66117013 Pull Request resolved: pytorch#140973 Approved by: https://github.com/Skylion007, https://github.com/fduwjj
daisyden pushed a commit that referenced this pull request on Nov 22, 2024
See pytorch#140725 (comment)

Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`:
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```
Pull Request resolved: pytorch#141296
Approved by: https://github.com/huydhn
daisyden pushed a commit that referenced this pull request on Dec 25, 2024
…143550) # Motivation Fix pytorch#143543 # Solution We should raise python exception instead of aborting... # Additional Context without this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) terminate called after throwing an instance of 'c10::Error' what(): device is out of range, device is 2, total number of device is 2. Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) <omitting python frames> frame pytorch#20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6) frame pytorch#21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6) Aborted (core dumped) ``` with this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream return torch._C._accelerator_getStream(device_index) RuntimeError: The device index is out of range. It must be in [0, 2), but got 2. ``` Pull Request resolved: pytorch#143550 Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
daisyden pushed a commit that referenced this pull request on Mar 3, 2025
…pytorch#144120) (pytorch#146372) Summary: # Summary ### Sticky points Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC ## Dependencies - Flash PR: Dao-AILab/flash-attention#1419 ### Other Points - The BC linter is complaining about losing generate.py and its functions which is not real BC surface cc albanD imported-using-ghimport Test Plan: Imported from OSS Building in dev `buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output ` I and Nming the .so I do see that the flash symbols are correctly named: ``` 0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const 0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const 0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const 0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const 0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const ``` Reviewed By: vkuzo Differential Revision: D68502879 Pulled By: drisspg Pull Request resolved: pytorch#146372 Approved by: https://github.com/jbschlosser
daisyden pushed a commit that referenced this pull request on Apr 1, 2025
Summary:
fix another combo kernel logging error:
File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 2036, in _init
self.create_combo_kernel_nodes(num_ck_nodes=None)
File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 3068, in create_combo_kernel_nodes
log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes)
Message: 'ComboKernels: Generating with num_ck_nodes = %d...'
Arguments: (None,)
Test Plan:
Verified in test_combo_kernel.py
the logging error went away.
Differential Revision: D71655949
Pull Request resolved: pytorch#149772
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007
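For reference, a minimal reproduction of the logging pitfall fixed above (hedged: the logger name and message are illustrative): `%d` cannot format `None`, so the logging machinery reports the "Message / Arguments" error shown in the traceback; formatting with `%s` (or guarding on `None`) avoids it.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("torch._inductor.scheduler")

num_ck_nodes = None

# Broken: %d cannot format None, so the handler prints a "--- Logging error ---"
# block with the Message/Arguments lines seen above.
log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes)

# One possible fix (a sketch, not necessarily the exact patch): %s accepts None.
log.debug("ComboKernels: Generating with num_ck_nodes = %s...", num_ck_nodes)
```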
PenghuiCheng pushed a commit that referenced this pull request on Jun 5, 2025
Use uint64_t index types to avoid
```
torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
#0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
#1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
#2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
#3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```
Pull Request resolved: pytorch#154809
Approved by: https://github.com/soulitzer
daisyden pushed a commit that referenced this pull request on Jun 10, 2025
Vibe-coded with Codex after collecting a backtrace; see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b

Even though a check for an empty tensor list exists in `at::cat`, a crash might happen while resolving the named dimension to a position, via the call `dimname_to_position(tensors[0], dim)`; see the backtrace below:
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556    bool has_names() const {
   557      // If a user is using unnamed tensors, then we can short-circuit right here.
   558      // Otherwise, impl::has_names attempts to retrieve names.
-> 559      if (!impl_->has_named_tensor_meta()) {
   560        return false;
   561      }
   562      return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20   int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21     TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22         "Please look up dimensions by name, got: name = None.");
-> 23     TORCH_CHECK(tensor.has_names(),
   24         "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25     const auto names = tensor.names();
   26
```
TODOs:
- Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
- Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes pytorch#155306
Pull Request resolved: pytorch#155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: pytorch#155382
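A hedged Python-level reproduction of the crash path described above (the dimension name is made up): concatenating an empty list while addressing the dimension by name used to dereference `tensors[0]` before the emptiness check; after the fix a proper error is raised instead.

```python
import torch

try:
    # Empty tensor list + named dimension: previously this reached
    # dimname_to_position(tensors[0], dim) and crashed; now it raises cleanly.
    torch.cat([], dim="C")
except (RuntimeError, ValueError) as exc:
    print(exc)
```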
pytorchmergebot pushed a commit that referenced this pull request on Jul 24, 2025
For tensor with non-zero offset, it must be multiplied by element size Add regression test by creating Tensor in array of 6 elements with offset 3, which before the fix crashed with ``` C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15 Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib) frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib) frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib) frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib) frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib) frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib) frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib) frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib) frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib) frame #10: 
c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib) frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib) frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic) frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic) ``` Pull Request resolved: pytorch#158690 Approved by: https://github.com/angelayi
daisyden pushed a commit that referenced this pull request on Sep 19, 2025
) Summary: This diff fixes two things which come up when testing a tgif-published pt2 model remote net: 1) Updates isSameDevice to handle meta device to avoid this error: ``` what(): Unsupported device typemeta and meta Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20 ``` 2. Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce the device is the same for an old weight and new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on "meta" device and then replace them with their correct weight with real device from xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid. I don't think we've run into this so far bc non-TBE xl weights have not been thoroughly tested until now. Test Plan: Run MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get error: ``` Unsupported device typemeta and meta ``` Then after change #1 and before change #2 we get: ``` what(): Mismatched device for merge.user_tower.linear.weight: meta vs cpu Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374 ``` After change run is successful Command: ``` MODEL_ENTITY_ID=921242082 SNAPSHOT_ID=1269 module_name=merge SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt ``` Rollback Plan: Differential Revision: D80713052 Pull Request resolved: pytorch#162842 Approved by: https://github.com/henryoier