forked from pytorch/ao
Sparse Marlin Kernel #2
Draft
petrex wants to merge 194 commits into rocm_enablement_staging from rocm_sparse_marlin
Conversation
lcskrishna reviewed on Oct 18, 2024:
```cpp
const int BYTES = 16;
int src_in_bytes = (zfill ? 0 : BYTES);
#ifdef USE_ROCM
uint32_t smem = static_cast<uint32_t>(__builtin_amdgcn_s_getpc());
```
lcskrishna (Collaborator) commented:
I guess we are not using `smem_ptr` here; we need to find a proper equivalent.
petrex (Owner, Author) replied:
Thanks, I have a fix following your suggestion.
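The fix itself is not shown in this excerpt. As a hedged sketch only, one plausible "proper equivalent" derives the 32-bit shared-memory (LDS) address from `smem_ptr` rather than from the program counter; this assumes LDS pointers fit in 32 bits on AMDGPU and may not match the actual commit:

```cpp
#ifdef USE_ROCM
// Sketch (assumption, not the PR's verified fix): truncate the generic
// pointer to a 32-bit LDS address instead of reading the program counter.
uint32_t smem = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(smem_ptr));
#else
// CUDA path: convert a generic pointer into the shared address space.
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
#endif
```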
Force pushes during review:
- 00bc94d to d2c7ce4
- 662bfe7 to a4e8c30
- 1f3b773 to 08d1cfb

Commits:
Differential Revision: D67982501 Pull Request resolved: pytorch#1532
Differential Revision: D67777662 Pull Request resolved: pytorch#1490
* Add run_tutorials github action and fix existing errors
Summary: Added a GHA button for the release oncall to check that tutorial code is runnable; it can also be enabled by adding the tag `ciflow/tutorials`.
Test Plan: CI github action
* add yml
* add script
* revert profile changes
* Add support for eager mode performance
Summary: Added a "compile" field to "extra_info" that allows us to record eager mode performance as well. Context: eager, eager + compile, and eager + compile + autoquant can all have performance improvements/changes over time, so we want to track (1) eager perf on some previous date (configurable by the user), (2) current eager perf, (3) current compile perf, and (4) current autoquant + compile perf.
Test Plan: tested locally: https://gist.github.com/jerryzh168/2a15322b0c8f40f35e52956837c67fec
* move min_sqnr
* format
* remove redundant headers
* add upload_to_s3 script
* format
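As a hedged illustration only (the commit log does not show the schema), the "compile" field might tag results along these lines; all keys and values here are assumptions:

```python
# Hypothetical benchmark records distinguishing the tracked configurations;
# the exact "extra_info" schema is an assumption, not taken from the commit.
records = [
    {"run": "eager (baseline date)", "extra_info": {"compile": False}},
    {"run": "eager (current)", "extra_info": {"compile": False}},
    {"run": "compile (current)", "extra_info": {"compile": True}},
    {"run": "autoquant + compile (current)", "extra_info": {"compile": True}},
]
```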
Summary: Removes temp build artifacts from experimental. Now the kernels are built and loaded with `USE_CPP=1 pip install .` from ao.
Reviewed By: jerryzh168
Differential Revision: D67807207
* Add convert path for quantize_ QAT API
Summary: pytorch#1415 added a quantize_ QAT API for the prepare path. This commit adds the remaining convert path for users to actually perform end-to-end QAT using the quantize_ API. The new flow will look like:

```python
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)
quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan: python test/quantization/test_qat.py -k test_quantize_api_convert_path
[ghstack-poisoned]

* Update QAT READMEs using new APIs
Add references to new QAT APIs including `quantize_`, `FakeQuantizedX`, and the new embedding Quantizers and ComposableQATQuantizer. Also link to new QAT + LoRA recipe in torchtune.
[ghstack-poisoned]
* Updating CUDA 12.1/12.4 to 12.4/12.6 to reflect current state (pytorch#1794). We haven't released 12.1 binaries since 0.7.0: https://download.pytorch.org/whl/torchao/ and https://download.pytorch.org/whl/nightly/torchao/
* Update README.md
Co-authored-by: Andrey Talman <[email protected]>
* Fixing DORA imports
Summary: these imports were pointing at nothing.
Test Plan: python test/dora/test_dora_fusion.py
* fixing lint issues
stack-info: PR: pytorch#1530, branch: drisspg/stack/26
* bugfix clean_release_notes.py: (1) developer names need to be consistent or else it won't find that dict entry; (2) need to handle regex escape characters (a sketch follows)
* Update clean_release_notes.py
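A minimal sketch of the second fix, assuming the script embeds a developer name in a regular expression; the name and pattern are illustrative:

```python
import re

author = "j.doe"  # illustrative name containing a regex metacharacter
# re.escape makes the "." match literally instead of acting as a wildcard.
pattern = re.compile(re.escape(author))
assert pattern.search("release notes written by j.doe")
```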
…layout" (pytorch#1803) Revert "Add support for copy_ for plain layout and tensor core tiled layout (…" This reverts commit 79e3366.
* add float8 training benchmarking scripts
* move to benchmarks/float8/training
* Silence loud commit * Update intmm.py
WIP delete DORA
Revert "Use exp2 for mx scaling (pytorch#1530)" This reverts commit 890e0ac.
* up * up
* CPUOffload: only offload parameters above a certain size (see the sketch below)
* lint
* ruff
Co-authored-by: Mark Saroufim <[email protected]>
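A hedged sketch of the size-gated offload decision described in that commit; the function name and threshold are assumptions, not the PR's actual code:

```python
import torch

def should_offload(param: torch.nn.Parameter, min_numel: int = 1_000_000) -> bool:
    # Offloading tiny parameters costs more in transfer overhead than it saves
    # in GPU memory, so only parameters above a size threshold move to CPU.
    return param.numel() >= min_numel
```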
* update typehint Signed-off-by: Masaki Kozuki <[email protected]> * Update float8_linear_utils.py --------- Signed-off-by: Masaki Kozuki <[email protected]> Co-authored-by: Mark Saroufim <[email protected]>
* Update [ghstack-poisoned] (pytorch#1789)
* init * up * up * up * up * up * up * up * up * up
fix float8nocompile ci workflow
This pull request introduces support for ROCm (Radeon Open Compute) in addition to CUDA for GPU acceleration. The changes primarily focus on enabling the build and execution of ROCm-specific code paths alongside existing CUDA paths. In this PR, I use `tensor_core_tiled_layout` as a proof of concept, but it generalizes to other kernels (for example, fp6_llm or sparse_marlin) with minimal effort. Feedback is welcome.

Co-author: @lcskrishna

## Features

### ROCm Support Integration
* [`setup.py`](diffhunk://#diff-60f61ab7a8d1910d86d9fda2261620314edcae5894d5aaa236b821c7256badd7R49-R53): Added detection for ROCm and adjusted the logic for compiling GPU extensions based on the availability of CUDA or ROCm (a sketch of this detection follows below).

### Conditional Compilation for ROCm
* [`torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu`](diffhunk://#diff-29bb1a2fd9317c74c807a7f558f5de755af0def91b9a49c81c409f8e76f736ddL1-R1): Introduced conditional compilation directives to include ROCm-specific headers and adjust constants and operations for ROCm.

These changes ensure that the codebase can compile and run efficiently on both CUDA and ROCm platforms, leveraging the best available GPU acceleration technology.

## Usage
With a ROCm PyTorch nightly docker image, simply run `PYTORCH_ROCM_ARCH=gfx942 python setup.py install`.

## Next
- [ ] AMD-specific unit tests (tensor_core_tiled_layout)
- [ ] workload- and platform-specific optimization (tensor_core_tiled_layout)
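A minimal sketch of the kind of detection `setup.py` gains here, using PyTorch's standard extension helpers; the variable names are illustrative and may not match the PR:

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME, ROCM_HOME

# ROCm builds of PyTorch report a HIP version; CUDA builds expose CUDA_HOME.
use_rocm = torch.version.hip is not None and ROCM_HOME is not None
use_cuda = torch.version.cuda is not None and CUDA_HOME is not None
build_gpu_extension = use_rocm or use_cuda
```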
- Update GPU architecture check to use `gcnArchName` instead of `name`
- Modify architecture compatibility check to use `in` instead of exact match (see the sketch below)
- Remove redundant ROCm GPU architecture check
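A hedged sketch of the substring-based architecture check described above; `gcnArchName` is available on ROCm builds of PyTorch, and the target architecture here is illustrative:

```python
import torch

# On ROCm, gcnArchName may carry feature suffixes such as
# "gfx942:sramecc+:xnack-", so a substring test beats exact equality.
props = torch.cuda.get_device_properties(0)
is_supported = "gfx942" in props.gcnArchName
```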
- Simplify source file collection for CUDA and ROCm extensions (a sketch follows)
- Conditionally remove CUTLASS-based kernels when not using CUTLASS
- Clean up redundant path and source filtering logic
- Use `cwd` consistently for path resolution
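A minimal sketch of this style of source collection, assuming a recursive glob over the CUDA source tree; the paths and the `use_cutlass` flag are illustrative:

```python
import glob
import os

cwd = os.path.abspath(os.path.dirname(__file__))
sources = glob.glob(
    os.path.join(cwd, "torchao", "csrc", "cuda", "**", "*.cu"), recursive=True
)

use_cutlass = False  # illustrative flag; e.g. disabled when CUTLASS is unavailable
if not use_cutlass:
    # Drop CUTLASS-based kernels when building without CUTLASS.
    sources = [s for s in sources if "cutlass" not in s.lower()]
```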
- Enhance GPU support detection and reporting
- Add more informative logging for source file compilation
- Refine conditional compilation logic for CUDA and ROCm
- Provide clearer messages about build configuration

- Enhance source file discovery with informative print statements
- Add debug logging to help diagnose source file collection issues
- Improve visibility into CUDA and ROCm source file detection process
- Include additional checks and logging for edge cases in source file discovery