Conversation

@petrex petrex commented Oct 18, 2024

No description provided.

@petrex petrex changed the title sparse marlin Sparse Marlin Kernel Oct 18, 2024
@petrex petrex marked this pull request as draft October 18, 2024 03:05
```
const int BYTES = 16;
int src_in_bytes = (zfill ? 0 : BYTES);
#ifdef USE_ROCM
uint32_t smem = static_cast<uint32_t>(__builtin_amdgcn_s_getpc());
```
Collaborator:
I guess we are not using `smem_ptr` here; we need to find a proper equivalent.

Owner Author:

Thanks, I have a fix following your suggestion.
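
For reference, a minimal sketch of the kind of replacement discussed above: the CUDA path obtains a 32-bit shared-memory address from `smem_ptr` via `__cvta_generic_to_shared`, and one possible ROCm equivalent is a plain address cast, assuming the pointer targets LDS and the address fits in 32 bits. This is an illustration only, not necessarily the fix that landed.

```
// Illustrative only; the actual fix in this PR may differ.
#ifdef USE_ROCM
// Assumes smem_ptr points into shared memory (LDS) and the address
// can be represented as a 32-bit offset on the target architecture.
uint32_t smem = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(smem_ptr));
#else
// CUDA: convert a generic pointer into the shared address space.
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
#endif
```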

@petrex petrex force-pushed the rocm_sparse_marlin branch from 00bc94d to d2c7ce4 on January 6, 2025 22:06
@petrex petrex force-pushed the rocm_sparse_marlin branch from 662bfe7 to a4e8c30 on January 8, 2025 23:06
@petrex petrex force-pushed the rocm_sparse_marlin branch from 1f3b773 to 08d1cfb on January 9, 2025 22:34
jainapurva and others added 11 commits January 9, 2025 15:13
Differential Revision: D67982501

Pull Request resolved: pytorch#1532
Differential Revision: D67777662

Pull Request resolved: pytorch#1490
* Add run_tutorials github action and fix existing errors

Summary:
Added a GHA button for release oncall to check that tutorial code is runnable.
It can also be enabled by adding the tag `ciflow/tutorials`.

Test Plan:
CI GitHub Action

* add yml

* add script

* revert profile changes
* Add support for eager mode performance

Summary:

Added "compile" filed to "extra_info" that allows us to record eager mode performance as well

Context: eager, eager + compile, and eager + compile + autoquant can all have performance improvements/changes over time, so we want to track:

(1) eager perf on some previous date (configurable by user)
(2) current eager perf
(3) current compile perf
(4) current autoquant + compile perf
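
For illustration, a hedged sketch of what such a record might look like; only the "compile" key inside "extra_info" is described by this commit, the other field names are assumptions:

```
# Hypothetical shape of a benchmark record; illustrative only.
record = {
    "speedup": 1.0,  # hypothetical metric field
    "extra_info": {
        "compile": False,  # False: eager perf; True: compiled perf
    },
}
```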

Test Plan:
tested locally:
https://gist.github.com/jerryzh168/2a15322b0c8f40f35e52956837c67fec

* move min_sqnr

* format

* remove redundant headers

* add upload_to_s3 script

* format
Summary: Removes temp build artifacts from experimental.  Now the kernels are built and loaded with `USE_CPP=1 pip install .` from ao.

Reviewed By: jerryzh168

Differential Revision: D67807207
* Add convert path for quantize_ QAT API

Summary: pytorch#1415 added a quantize_
QAT API for the prepare path. This commit adds the remaining
convert path for users to actually perform end-to-end QAT using
the quantize_ API. The new flow will look like:

```
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)

quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan:
python test/quantization/test_qat.py -k test_quantize_api_convert_path

[ghstack-poisoned]

* Update on "Add convert path for quantize_ QAT API"


Summary: pytorch#1415 added a quantize_
QAT API for the prepare path. This commit adds the remaining
convert path for users to actually perform end-to-end QAT using
the quantize_ API. The new flow will look like:

```
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)

quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan:
python test/quantization/test_qat.py -k test_quantize_api_convert_path

[ghstack-poisoned]

* Update on "Add convert path for quantize_ QAT API"


Summary: pytorch#1415 added a quantize_
QAT API for the prepare path. This commit adds the remaining
convert path for users to actually perform end-to-end QAT using
the quantize_ API. The new flow will look like:

```
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)

quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan:
python test/quantization/test_qat.py -k test_quantize_api_convert_path

[ghstack-poisoned]

* Update on "Add convert path for quantize_ QAT API"


Summary: pytorch#1415 added a quantize_
QAT API for the prepare path. This commit adds the remaining
convert path for users to actually perform end-to-end QAT using
the quantize_ API. The new flow will look like:

```
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)

quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan:
python test/quantization/test_qat.py -k test_quantize_api_convert_path

[ghstack-poisoned]
* Add convert path for quantize_ QAT API

* Update QAT READMEs using new APIs

Add references to new QAT APIs including `quantize_`,
`FakeQuantizedX`, and the new embedding Quantizers and
ComposableQATQuantizer. Also link to new QAT + LoRA recipe
in torchtune.

[ghstack-poisoned]

* Update base for Update on "Update QAT READMEs using new APIs"

* Update base for Update on "Update QAT READMEs using new APIs"

* Update base for Update on "Update QAT READMEs using new APIs"

* Update base for Update on "Update QAT READMEs using new APIs"

* Update base for Update on "Update QAT READMEs using new APIs"
HDCharles and others added 30 commits February 28, 2025 10:30
…h#1794)

* Updating CUDA 12.1/12.4 to 12.4/12.6 to reflect the current state

We haven't released 12.1 binaries since 0.7.0:

https://download.pytorch.org/whl/torchao/
https://download.pytorch.org/whl/nightly/torchao/

* Update README.md

Co-authored-by: Andrey Talman <[email protected]>

* Update README.md

---------

Co-authored-by: Andrey Talman <[email protected]>
* Fixing DORA imports

Summary: these imports were pointing at nothing

Test Plan: python test/dora/test_dora_fusion.py

* fixing lint issues

stack-info: PR: pytorch#1530, branch: drisspg/stack/26
* bugfix clean_release_notes.py

1) The developer's name needs to be consistent or else it won't find that dict entry

2) Need to handle regex escape characters
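
As an illustration of the second point, a minimal sketch (the name and pattern are hypothetical) of escaping a string before embedding it in a regex:

```
import re

# A name containing regex metacharacters would break a naive pattern;
# re.escape() neutralizes them before interpolation.
name = "foo.bar (baz)"  # hypothetical developer name
pattern = re.compile(rf"by {re.escape(name)}")
```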

* Update clean_release_notes.py
…layout" (pytorch#1803)

Revert "Add support for copy_ for plain layout and tensor core tiled layout (…"

This reverts commit 79e3366.
* add float8 training benchmarking scripts

* move to benchmarks/float8/training
* Silence loud commit

* Update intmm.py
Revert "Use exp2 for mx scaling (pytorch#1530)"

This reverts commit 890e0ac.
* CPUOffload: only offload parameters above a certain size
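
A minimal sketch of the size-threshold idea (the threshold name and value are assumptions):

```
# Hypothetical: only offload parameters whose element count exceeds a cutoff,
# since moving tiny tensors to CPU costs more in transfers than it saves.
MIN_OFFLOAD_NUMEL = 1024

def should_offload(param) -> bool:
    return param.numel() >= MIN_OFFLOAD_NUMEL
```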

* lint

* ruff

---------

Co-authored-by: Mark Saroufim <[email protected]>
* update typehint

Signed-off-by: Masaki Kozuki <[email protected]>

* Update float8_linear_utils.py

---------

Signed-off-by: Masaki Kozuki <[email protected]>
Co-authored-by: Mark Saroufim <[email protected]>
…rch#1789)

* Update

[ghstack-poisoned]
* init

* up
fix float8nocompile ci workflow
This pull request introduces support for ROCm (Radeon Open Compute) in addition to CUDA for GPU acceleration. The changes primarily focus on enabling the build and execution of ROCm-specific code paths alongside the existing CUDA paths.

In this PR, I use `tensor_core_tiled_layout` as a proof of concept, but it generalizes to other kernels (for example, fp6_llm or sparse_marlin) with minimal effort. Feedback is welcome.

Co-author: @lcskrishna

## Features

### ROCm Support Integration

* [`setup.py`](diffhunk://#diff-60f61ab7a8d1910d86d9fda2261620314edcae5894d5aaa236b821c7256badd7R49-R53): Added detection for ROCm and adjusted the logic for compiling GPU extensions based on the availability of CUDA or ROCm.
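
A hedged sketch of what such detection can look like (not necessarily the exact logic in `setup.py`):

```
import torch
from torch.utils.cpp_extension import CUDA_HOME, ROCM_HOME

# torch.version.hip is non-None on ROCm builds of PyTorch,
# torch.version.cuda on CUDA builds.
use_rocm = (torch.version.hip is not None) and (ROCM_HOME is not None)
use_cuda = (torch.version.cuda is not None) and (CUDA_HOME is not None)
build_gpu_extensions = use_rocm or use_cuda
```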
### Conditional Compilation for ROCm

* [`torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu`](diffhunk://#diff-29bb1a2fd9317c74c807a7f558f5de755af0def91b9a49c81c409f8e76f736ddL1-R1): Introduced conditional compilation directives to include ROCm-specific headers and adjust constants and operations for ROCm. 
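
An illustrative sketch of this style of guard (the constants and headers here are assumptions, not the exact diff):

```
// Hypothetical example of platform guards in a kernel source file.
#if defined(USE_ROCM)
#include <hip/hip_runtime.h>
constexpr int kWarpSize = 64;  // CDNA wavefronts are 64 lanes wide
#else
constexpr int kWarpSize = 32;  // CUDA warps are 32 lanes wide
#endif
```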

These changes ensure that the codebase can compile and run efficiently on both CUDA and ROCm platforms, leveraging the best available GPU acceleration technology.

## Usage
With a ROCm PyTorch nightly docker image, simply run `PYTORCH_ROCM_ARCH=gfx942 python setup.py install`.
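
After installation, a quick hedged sanity check (illustrative only; the specific ops to exercise vary by kernel):

```
import torch
import torchao  # loads the compiled extension if it built

# On a ROCm build of PyTorch, torch.version.hip is a version string.
print("HIP:", torch.version.hip)
print("Arch:", torch.cuda.get_device_properties(0).gcnArchName)
```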

## Next

- [ ] AMD-specific unit tests (tensor_core_tiled_layout)
- [ ] Workload- and platform-specific optimization (tensor_core_tiled_layout)
- Update GPU architecture check to use `gcnArchName` instead of `name`
- Modify architecture compatibility check to use `in` instead of exact match (see the sketch after this list)
- Remove redundant ROCm GPU architecture check
- Simplify source file collection for CUDA and ROCm extensions
- Conditionally remove CUTLASS-based kernels when not using CUTLASS
- Clean up redundant path and source filtering logic
- Use `cwd` consistently for path resolution
- Enhance GPU support detection and reporting
- Add more informative logging for source file compilation
- Refine conditional compilation logic for CUDA and ROCm
- Provide clearer messages about build configuration
- Enhance source file discovery with informative print statements
- Add debug logging to help diagnose source file collection issues
- Improve visibility into CUDA and ROCm source file detection process
- Include additional checks and logging for edge cases in source file discovery
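
A sketch of the architecture check described in the first two bullets (the supported-architecture list is hypothetical):

```
import torch

# gcnArchName looks like "gfx942:sramecc+:xnack-", so substring matching
# is more robust than exact equality against the device name.
arch = torch.cuda.get_device_properties(0).gcnArchName
supported = any(a in arch for a in ("gfx90a", "gfx942"))  # hypothetical list
```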