
Conversation

@daisyden
Owner

Fixes #ISSUE_NUMBER

desertfire and others added 30 commits October 4, 2024 00:35
This PR contains multiple fixes for issue pytorch#135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
We chased it down to the destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we end up in:
https://github.com/pytorch/pytorch/blob/1f3a79379012b408e0375e81fe9205dcba5e34ba/c10/cuda/impl/CUDAGuardImpl.h#L106-L121

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, each event is destroyed under a guard for its own device.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
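
For reference, here is a minimal sketch of the kind of check the `pynvml` path performs (the helper name and assertions are illustrative, not the test's actual code):

```python
import pynvml

def num_processes_with_context(device_index: int) -> int:
    # Count processes that hold a CUDA context on the given device via NVML.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return len(pynvml.nvmlDeviceGetComputeRunningProcesses(handle))
    finally:
        pynvml.nvmlShutdown()

# In the distributed test, after the collective and after `del work`, each rank
# would assert that only rank 0's process holds a context on device 0, e.g.:
# assert num_processes_with_context(0) == 1
```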

Pull Request resolved: pytorch#135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
…torch#137135)

NumPy now throws an OverflowError when trying to create np.uint64(-1)

Pull Request resolved: pytorch#137135
Approved by: https://github.com/Skylion007
Summary: Link CPU pins function in MTIA hooks to the host allocator implementation

Test Plan:
signals
unit test in D63709424

Differential Revision: D63352770

Pull Request resolved: pytorch#137283
Approved by: https://github.com/egienvalue
…137257)

One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective.

Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving.

Pull Request resolved: pytorch#137257
Approved by: https://github.com/Chillee
)

This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).

These functions only had refs implementations, which was the root cause of significant
overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA
backend. Running the meta functions instead resulted in the following improvements:

- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923
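
As a rough illustration (a sketch, not the PR's benchmark), meta functions let shape and dtype propagation run without touching real data, which is what lazy backends such as PyTorch/XLA rely on while tracing:

```python
import torch

# With meta kernels registered, these calls only propagate shapes/dtypes;
# no real computation or data movement happens.
a = torch.empty(1024, device="meta")
b = torch.empty(1024, device="meta")

out1 = torch.lerp(a, b, 0.5)
out2 = torch.addcmul(a, b, b, value=2.0)
out3 = torch.addcdiv(a, b, b, value=0.5)

print(out1.shape, out1.device)  # torch.Size([1024]) meta
print(out2.dtype, out3.dtype)   # torch.float32 torch.float32
```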

Pull Request resolved: pytorch#136909
Approved by: https://github.com/jansel
We didn't support multiple levels of vmap. The main problem is that, during
the batching rule, we need to exclude the vmap dispatch key
(FuncTorchBatched), the same way our C++ batching rules do.
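
For context, nesting vmap looks like this (a trivial sketch of two levels of vmap in general; the PR's fix specifically concerns Python batching rules hitting the FuncTorchBatched key):

```python
import torch
from torch.func import vmap

def f(x):
    return x.sin()

x = torch.randn(3, 5)
# The outer vmap maps over dim 0; the inner vmap maps over each row.
y = vmap(vmap(f))(x)
assert torch.allclose(y, x.sin())
```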

Test Plan:
- new test
Pull Request resolved: pytorch#137306
Approved by: https://github.com/Chillee
…moved (pytorch#136835)

When the stub file `nn/parallel/distributed.pyi` was removed (pytorch#88701), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints makes them nicer.

Pull Request resolved: pytorch#136835
Approved by: https://github.com/kwen2501
…ntion create_block_mask dynamic shapes (pytorch#137163)

Fixes #ISSUE_NUMBER

Pull Request resolved: pytorch#137163
Approved by: https://github.com/Chillee
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.

I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.

We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.
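
A small sketch of the resulting invariant (illustrative; the real coverage lives in test_export):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

x = torch.randn(4, 8)
ep = export(M(), (x,), dynamic_shapes={"x": {0: Dim.AUTO}})

# The dynamo decorator attributes used internally to mark dynamic dims should
# not leak onto the user's input tensor once export() returns.
assert not hasattr(x, "_dynamo_dynamic_indices")
```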

Test Plan: test_export

Differential Revision: D63773534

Pull Request resolved: pytorch#137230
Approved by: https://github.com/avikchaudhuri
…torch#137236)

Summary:

Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](https://github.com/triton-lang/triton/blob/762a7d197c4ea68e6e3a7895b5343a4afe894d0d/python/triton/runtime/autotuner.py#L106) in the Triton code includes those special configs (names and values) among the potential arguments to the kernel. [Here](https://github.com/triton-lang/triton/blob/762a7d197c4ea68e6e3a7895b5343a4afe894d0d/python/triton/runtime/jit.py#L613) some of those may be included in the actual kernel arguments, given that their names are present among the kernel parameters.

This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves pytorch#136550.
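
As a hedged sketch of the eager-Triton behavior being replicated (hypothetical kernel, not the test's kernel):

```python
import triton
import triton.language as tl

@triton.autotune(configs=[triton.Config({"BLOCK": 128}, num_warps=4, num_stages=2)],
                 key=["n"])
@triton.jit
def scale_kernel(x_ptr, out_ptr, n, num_warps: tl.constexpr, BLOCK: tl.constexpr):
    # Because `num_warps` appears among the kernel parameters, the autotuner
    # forwards the config's num_warps value as a regular constexpr argument.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * num_warps, mask=mask)
```

With this PR, calling such a kernel under torch.compile passes the special config values the same way.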

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params
inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 6 tests in 6.283s

OK
```

Pull Request resolved: pytorch#137236
Approved by: https://github.com/zou3519
…Kernel (pytorch#136331)"

This reverts commit 592e3a3.

Reverted pytorch#136331 on behalf of https://github.com/albanD due to Breaks aarch64 builds, see link below ([comment](pytorch#136331 (comment)))
Summary: Fixes pytorch#136209. Because _scaled_mm has an out variant, the generated cpp fallback call should call _scaled_mm_out. The ABI-compatible mode needs more work.

Differential Revision: [D63757728](https://our.internmc.facebook.com/intern/diff/D63757728)
Pull Request resolved: pytorch#137008
Approved by: https://github.com/hl475
Summary: Similar to pytorch#137008, but for supporting _scaled_mm in the ABI-compatible mode.

Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729)
Pull Request resolved: pytorch#137132
Approved by: https://github.com/chenyang78
ghstack dependencies: pytorch#137008
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.
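
A usage sketch (the environment variable shown is the standard switch for disabling the caching allocator; it must be set before CUDA is initialized):

```python
import os

# Disable the CUDA caching allocator. With this PR, raw_alloc callers
# (cuDNN, MIOpen, Thrust, TunableOp) also bypass the cache instead of
# only the regular tensor allocations.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # import after setting the env var

x = torch.randn(1024, device="cuda")  # served directly by cudaMalloc
```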

Pull Request resolved: pytorch#131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <[email protected]>
### Context
This PR allows CUTLASS kernels usage in AOTI. It does this by:
* For any CUTLASS kernels that win during autotuning, compile them as a .so & .o
* When creating the final model .so, link all the CUTLASS kernels .o files
* Make sure we codegen things correctly (argument dtypes and specify extern "C" linking for the CUTLASS kernel)

### Example
https://gist.github.com/ColinPeppler/e834fa2255c37e9444b6d540bf7bd04d#file-model-cpp-L548-L549
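
A hedged sketch of how this path might be exercised (the Inductor config knobs are real; the AOTI entry point and model are illustrative):

```python
import torch
import torch._inductor.config as inductor_config

# Restrict GEMM autotuning to the CUTLASS backend so the winning kernels are
# CUTLASS ones; they then get compiled and linked into the final model .so.
inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "CUTLASS"

class MM(torch.nn.Module):
    def forward(self, a, b):
        return a @ b

a = torch.randn(512, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 512, device="cuda", dtype=torch.float16)
so_path = torch._export.aot_compile(MM().cuda(), (a, b))
print(so_path)
```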

```
TORCH_LOGS="+output_code" python test/inductor/test_cutlass_backend.py -v -k test_max_autotune_cutlass_backend_regular_mm
```

Pull Request resolved: pytorch#134379
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
Summary: hardcode "val" field for autocast (similar to set_grad_enabled), to bypass the verifier check.

Test Plan: CI

Differential Revision: D63345767

Pull Request resolved: pytorch#137287
Approved by: https://github.com/angelayi
…ytorch#137231)

Summary: We added a unit test for the recently added pad_mm pattern in customized optimus D63040455, which resolves the long-computation-kernel issue for BF16 on A100.

Test Plan:
```
buck2 test mode/opt //caffe2/test/inductor:pad_mm -- test_pad_mm_bf16
```

Buck UI: https://www.internalfb.com/buck2/4dd4c90c-4a2a-4859-923c-a4008f78a1cd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9851624237127136
Network: Up: 100KiB  Down: 4.3GiB  (reSessionID-87f11454-d920-47af-9af5-39ca0572b7c6)
Jobs completed: 7079. Time elapsed: 3:34.3s.
Cache hits: 99%. Commands: 7061 (cached: 7024, remote: 19, local: 18)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D63794727

Pull Request resolved: pytorch#137231
Approved by: https://github.com/henrylhtsang
The red dotted line is 1.5

<img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517">

The expected value is taken from the average.
<img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de">

Pull Request resolved: pytorch#136573
Approved by: https://github.com/ezyang
janeyx99 and others added 18 commits October 10, 2024 01:23
…enchmarks (pytorch#137541)

Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky, with 8% detected variance;
I set it up with a 20% threshold (8% x 2, plus a bit of margin).
The others are stable within +-1.5%.

<img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf">

<img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68">

Pull Request resolved: pytorch#137541
Approved by: https://github.com/aorenste
…er (pytorch#137308)

Fixes pytorch#115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about.

The PR improves the documentation and CMake's behavior for ROCM builds.

- Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`

Pull Request resolved: pytorch#137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
…ytorch#137654)

Fixed an issue where `nn.Transformer().generate_square_subsequent_mask()` did not respect `set_default_device()` and `set_default_dtype()`.
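
A quick sketch of the fixed behavior (assumes a CUDA build; the point is that the mask now follows the defaults):

```python
import torch

torch.set_default_device("cuda")
torch.set_default_dtype(torch.float64)

mask = torch.nn.Transformer.generate_square_subsequent_mask(4)
print(mask.device, mask.dtype)  # with the fix: cuda:0 torch.float64
```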

Fixes pytorch#137186

Pull Request resolved: pytorch#137654
Approved by: https://github.com/mikaylagawarecki
Summary:
Fix sequence number in execution trace dump for matching between collective/p2p op and wait in execution trace replay.

`ProcessGroupNCCL` has two sequence number counters, `seqCollective_` and `seqP2P_`:
https://github.com/pytorch/pytorch/blob/b18ba9419e7062acbd49bef5c388e1b1d6a170dc/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp#L1188-L1191
However, `WorkNCCL` only has one sequence number member, `seq_`: https://github.com/pytorch/pytorch/blob/b18ba9419e7062acbd49bef5c388e1b1d6a170dc/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp#L387
We need to match collective and p2p with wait separately.
facebookresearch/param@29b5a46

Depend on: pytorch#135132

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision:

Pull Request resolved: pytorch#134578
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
Summary: Remove some stale items

Test Plan: CI

Differential Revision: D64133246

Pull Request resolved: pytorch#137634
Approved by: https://github.com/hl475
Remove the `static` keyword from the vec-register constants in the `exp_u20` implementation. With a BF16 input of BertLarge shape, the SDPA kernel improves from 5.1ms to 4.7ms on SPR with 56 threads.

Pull Request resolved: pytorch#137571
Approved by: https://github.com/jgong5
…v2 (pytorch#137149)

During auto_functionalize_v2, if we encounter a view whose size(), stride(), and storage_offset() match the base,
we regenerate the view by calling aten.alias instead of as_strided, for better performance.

Pull Request resolved: pytorch#137149
Approved by: https://github.com/zou3519
Summary: Fixing a warning, so we can enable it globally.

Test Plan: Sandcastle-only, no runtime changes.

Differential Revision: D64122115

Pull Request resolved: pytorch#137619
Approved by: https://github.com/Skylion007
…uffer() to 0 (pytorch#137569)

It seems that there's a bug in `TensorMaker`: it treats `storage_offset` as bytes when calculating the storage size, but as numel when setting the tensor's `storage_offset`. This seems to be causing tensors returned by `get_buffer()` with a non-zero offset to report the wrong storage size.

Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.

Pull Request resolved: pytorch#137569
Approved by: https://github.com/weifengpy
ghstack dependencies: pytorch#137567
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.

Pull Request resolved: pytorch#135101
Approved by: https://github.com/jgong5, https://github.com/jansel
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.
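
Sketch of the user-facing effect:

```python
import torch

torch.use_deterministic_algorithms(True)

x = torch.randn(1 << 20, device="cuda")
# With this change, the CUDA path dispatches to the decomposition, so repeated
# runs produce identical results.
y = torch.cumsum(x, dim=0)
```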

Fixes pytorch#89492
Fixes pytorch#75240

Pull Request resolved: pytorch#136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby, https://github.com/eqy
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.

Test Plan: fixed test

Differential Revision: D64081786

Pull Request resolved: pytorch#137547
Approved by: https://github.com/tugsbayasgalan
…rch#137606)

Some unit tests relating to argmin_vec/argmax_vec were failing due to a bug in GCC affecting versions <= 12 on aarch64 platforms with SVE:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

Fixes pytorch#137597

Pull Request resolved: pytorch#137606
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <[email protected]>
Summary: We hipify NCCLUtils.h so that nccl.h maps to rccl/rccl.h. This follows the layout of the ROCm RPM suite (the header is at include/rccl/rccl.h); in the source tree, however, the header is just src/rccl.h. Using rccl/rccl.h lets us find the RPM's header but not the source tree's header.

Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True  --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true  //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom   data_loader.dataset.table_ds=[2024-09-04]   data_loader.dataset.batch_size=512  max_ind_range=10

Without this diff, it shows NCCL version 2.18.

Differential Revision: D62371434

Pull Request resolved: pytorch#135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
…when unbacked (pytorch#137097)"

This reverts commit 4304c68.

Reverted pytorch#137097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it seems to increase the compilation time a lot causing some jobs to timeout ([comment](pytorch#137097 (comment)))
@daisyden daisyden closed this Oct 22, 2024
@daisyden daisyden deleted the daisyden/upstream branch October 22, 2024 02:57
daisyden pushed a commit that referenced this pull request Nov 15, 2024
…ytorch#139659)

### Motivation
Today, the watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with the user's program or library stack.

### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.

The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge.

Pull Request resolved: pytorch#139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
daisyden pushed a commit that referenced this pull request Nov 21, 2024
Summary:
OSS flight recorder does not work because we renamed `trace_dir` to `folder` in the internal version to reuse the loader.

Fixes item #2 in reported issue:
pytorch#140879

Test Plan:
BEFORE:
```
❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node1_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 44, in main
    details, version = read_dir(args)
  File "/home/cpio/local/pytorch/tools/flight_recorder/components/loader.py", line 89, in read_dir
    assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}"
AttributeError: 'Namespace' object has no attribute 'folder'
```

AFTER:
```
python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main
    db = build_db(details, args, version)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db
    check_no_missing_dump_files(entries, memberships)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files
    dumps_ranks == all_ranks
AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119}
```

Differential Revision: D66117013

Pull Request resolved: pytorch#140973
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
daisyden pushed a commit that referenced this pull request Nov 22, 2024
See pytorch#140725 (comment)
Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`:
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: pytorch#141296
Approved by: https://github.com/huydhn
daisyden pushed a commit that referenced this pull request Dec 25, 2024
…143550)

# Motivation
Fix pytorch#143543

# Solution
We should raise a Python exception instead of aborting.

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```

Pull Request resolved: pytorch#143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
daisyden pushed a commit that referenced this pull request Mar 3, 2025
…pytorch#144120) (pytorch#146372)

Summary:

# Summary

### Sticky points

Cuda-graph rng handling has changed / deviated from the original implementation. We will be left with a dangling 'offset' value and confusing naming due to BC.

## Dependencies
- Flash PR: Dao-AILab/flash-attention#1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions, which is not a real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: pytorch#146372
Approved by: https://github.com/jbschlosser
daisyden pushed a commit that referenced this pull request Apr 1, 2025
Summary:
fix another combo kernel logging error:

```
  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 2036, in _init
    self.create_combo_kernel_nodes(num_ck_nodes=None)
  File "/home/guorachel/local/fbsource/buck-out/v2/gen/fbcode/4bcbfa3ef39dbd6f/caffe2/test/inductor/__combo_kernels__/combo_kernels#link-tree/torch/_inductor/scheduler.py", line 3068, in create_combo_kernel_nodes
    log.debug("ComboKernels: Generating with num_ck_nodes = %d...", num_ck_nodes)
Message: 'ComboKernels: Generating with num_ck_nodes = %d...'
Arguments: (None,)
```

Test Plan:
Verified in test_combo_kernel.py that the logging error went away.

Differential Revision: D71655949

Pull Request resolved: pytorch#149772
Approved by: https://github.com/ColinPeppler, https://github.com/Skylion007
PenghuiCheng pushed a commit that referenced this pull request Jun 5, 2025
Use uint64_t index types to avoid
```
 torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
    #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
    #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
    #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
    #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```

Pull Request resolved: pytorch#154809
Approved by: https://github.com/soulitzer
daisyden pushed a commit that referenced this pull request Jun 10, 2025
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b

Even though a check for an empty tensor list exists in `at::cat`, a crash might happen while resolving a named dimension to a position, by calling `dimname_to_position(tensors[0], dim)`; see the backtrace below:
```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556 	  bool has_names() const {
   557 	    // If a user is using unnamed tensors, then we can short-circuit right here.
   558 	    // Otherwise, impl::has_names attempts to retrieve names.
-> 559 	    if (!impl_->has_named_tensor_meta()) {
   560 	      return false;
   561 	    }
   562 	    return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20  	int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21  	  TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22  	      "Please look up dimensions by name, got: name = None.");
-> 23  	  TORCH_CHECK(tensor.has_names(),
   24  	      "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25  	  const auto names = tensor.names();
   26
```
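
A hypothetical repro sketch (assuming the crash path is reachable from Python via `torch.cat` with an empty list and a named dimension; the real regression test lives in test_tensor_creation.py):

```python
import torch

# Before the fix, resolving the named dim dereferenced tensors[0] on an empty
# list and crashed; after the fix this raises a regular Python error instead.
try:
    torch.cat([], dim="C")
except (RuntimeError, ValueError) as e:
    print("got expected error:", e)
```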

TODOs:
 - Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
 - Replace  `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes pytorch#155306
Pull Request resolved: pytorch#155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: pytorch#155382
pytorchmergebot pushed a commit that referenced this pull request Jul 24, 2025
For a tensor with a non-zero storage offset, the offset must be multiplied by the element size.

Add a regression test that creates a Tensor over an array of 6 elements with offset 3; before the fix this crashed with
```
C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15
Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib)
frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib)
frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib)
frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib)
frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib)
frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib)
frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib)
frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib)
frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib)
frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib)
frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic)
frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic)
```
Pull Request resolved: pytorch#158690
Approved by: https://github.com/angelayi
daisyden pushed a commit that referenced this pull request Sep 19, 2025
)

Summary:
This diff fixes two things which come up when testing a tgif-published pt2 model remote net:
1) Updates isSameDevice to handle meta device to avoid this error:
```
what():  Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
```

2) Updates the xl weight v2 loading logic in Weights.cpp to handle non-TBE xl weights. Today, we enforce that the device is the same for an old weight and a new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on the "meta" device and replace them with the correct weight on a real device from the xl_weights folder. Therefore, the new weight and the old weight will always have different devices, and the device check is invalid. I don't think we've run into this so far because non-TBE xl weights had not been thoroughly tested until now.

Test Plan:
Run the MRS you model merge net, which uses non-TBE xl weights. Confirm that before change #1 we get the error:
```
Unsupported device typemeta and meta
```
Then after change #1 and before change #2 we get:
```
what():  Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```
After change run is successful
Command:
```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge|cuda0"  --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```

Rollback Plan:

Differential Revision: D80713052

Pull Request resolved: pytorch#162842
Approved by: https://github.com/henryoier