Conversation

@petrex petrex commented Oct 18, 2024

No description provided.

@petrex petrex changed the title sparse marlin Sparse Marlin Kernel Oct 18, 2024
@petrex petrex marked this pull request as draft October 18, 2024 03:05
```
const int BYTES = 16;
int src_in_bytes = (zfill ? 0 : BYTES);
#ifdef USE_ROCM
uint32_t smem = static_cast<uint32_t>(__builtin_amdgcn_s_getpc());
```
Collaborator:
I guess we are not using `smem_ptr` here; we need to find a proper equivalent.

Owner Author:

Thanks, I have a fix following your suggestion.
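
For reference, a minimal sketch of the kind of replacement discussed above: the CUDA path obtains a 32-bit shared-memory address from `smem_ptr` via `__cvta_generic_to_shared`, and one possible ROCm equivalent is a plain address cast, assuming the pointer targets LDS and the address fits in 32 bits. This is an illustration only, not necessarily the fix that landed.

```
// Illustrative only; the actual fix in this PR may differ.
#ifdef USE_ROCM
// Assumes smem_ptr points into shared memory (LDS) and the address
// can be represented as a 32-bit offset on the target architecture.
uint32_t smem = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(smem_ptr));
#else
// CUDA: convert a generic pointer into the shared address space.
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
#endif
```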

@petrex petrex force-pushed the rocm_sparse_marlin branch from 00bc94d to d2c7ce4 on January 6, 2025 22:06
@petrex petrex force-pushed the rocm_sparse_marlin branch from 662bfe7 to a4e8c30 on January 8, 2025 23:06
@petrex petrex force-pushed the rocm_sparse_marlin branch from 1f3b773 to 08d1cfb on January 9, 2025 22:34
jainapurva and others added 11 commits January 9, 2025 15:13
Differential Revision: D67982501

Pull Request resolved: pytorch#1532
Differential Revision: D67777662

Pull Request resolved: pytorch#1490
* Add run_tutorials github action and fix existing errors

Summary:
Added a GHA button for release oncall to check that tutorial code is runnable.
It can also be enabled by adding the tag `ciflow/tutorials`.

Test Plan:
CI GitHub Action

* add yml

* add script

* revert profile changes
* Add support for eager mode performance

Summary:

Added "compile" filed to "extra_info" that allows us to record eager mode performance as well

Context: eager, eager + compile, and eager + compile + autoquant can all have performance improvements/changes over time, so we want to track:

(1) eager perf on some previous date (configurable by user)
(2) current eager perf
(3) current compile perf
(4) current autoquant + compile perf
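
For illustration, a hedged sketch of what such a record might look like; only the "compile" key inside "extra_info" is described by this commit, the other field names are assumptions:

```
# Hypothetical shape of a benchmark record; illustrative only.
record = {
    "speedup": 1.0,  # hypothetical metric field
    "extra_info": {
        "compile": False,  # False: eager perf; True: compiled perf
    },
}
```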

Test Plan:
tested locally:
https://gist.github.com/jerryzh168/2a15322b0c8f40f35e52956837c67fec

* move min_sqnr

* format

* remove redundant headers

* add upload_to_s3 script

* format
Summary: Removes temp build artifacts from experimental.  Now the kernels are built and loaded with `USE_CPP=1 pip install .` from ao.

Reviewed By: jerryzh168

Differential Revision: D67807207
* Add convert path for quantize_ QAT API

Summary: pytorch#1415 added a quantize_
QAT API for the prepare path. This commit adds the remaining
convert path for users to actually perform end-to-end QAT using
the quantize_ API. The new flow will look like:

```
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)

quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan:
python test/quantization/test_qat.py -k test_quantize_api_convert_path

[ghstack-poisoned]

* Update on "Add convert path for quantize_ QAT API"


Summary: pytorch#1415 added a quantize_
QAT API for the prepare path. This commit adds the remaining
convert path for users to actually perform end-to-end QAT using
the quantize_ API. The new flow will look like:

```
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)

quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan:
python test/quantization/test_qat.py -k test_quantize_api_convert_path

[ghstack-poisoned]

* Update on "Add convert path for quantize_ QAT API"


Summary: pytorch#1415 added a quantize_
QAT API for the prepare path. This commit adds the remaining
convert path for users to actually perform end-to-end QAT using
the quantize_ API. The new flow will look like:

```
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)

quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan:
python test/quantization/test_qat.py -k test_quantize_api_convert_path

[ghstack-poisoned]

* Update on "Add convert path for quantize_ QAT API"


Summary: pytorch#1415 added a quantize_
QAT API for the prepare path. This commit adds the remaining
convert path for users to actually perform end-to-end QAT using
the quantize_ API. The new flow will look like:

```
from torchao.quantization import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)
from torchao.quantization.qat import (
    FakeQuantizeConfig,
    from_intx_quantization_aware_training,
    intx_quantization_aware_training,
)

activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=32)
quantize_(
    my_model,
    intx_quantization_aware_training(activation_config, weight_config),
)

quantize_(my_model, from_intx_quantization_aware_training())
quantize_(my_model, int8_dynamic_activation_int4_weight(group_size=32))
```

Test Plan:
python test/quantization/test_qat.py -k test_quantize_api_convert_path

[ghstack-poisoned]
* Add convert path for quantize_ QAT API

* Update QAT READMEs using new APIs

Add references to new QAT APIs including `quantize_`,
`FakeQuantizedX`, and the new embedding Quantizers and
ComposableQATQuantizer. Also link to new QAT + LoRA recipe
in torchtune.

[ghstack-poisoned]

* Update base for Update on "Update QAT READMEs using new APIs"

* Update base for Update on "Update QAT READMEs using new APIs"

* Update base for Update on "Update QAT READMEs using new APIs"

* Update base for Update on "Update QAT READMEs using new APIs"

* Update base for Update on "Update QAT READMEs using new APIs"
HDCharles and others added 30 commits February 28, 2025 10:30
…h#1794)

* Updating CUDA 12.1/12.4 to 12.4/12.6 to reflect the current state

We haven't released 12.1 binaries since 0.7.0:

https://download.pytorch.org/whl/torchao/
https://download.pytorch.org/whl/nightly/torchao/

* Update README.md

Co-authored-by: Andrey Talman <[email protected]>

* Update README.md

---------

Co-authored-by: Andrey Talman <[email protected]>
* Fixing DORA imports

Summary: these imports were pointing at nothing

Test Plan: python test/dora/test_dora_fusion.py

* fixing lint issues

stack-info: PR: pytorch#1530, branch: drisspg/stack/26
* bugfix clean_release_notes.py

1) The developer's name needs to be consistent or else it won't find that dict entry

2) Need to handle regex escape characters
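
As an illustration of the second point, a minimal sketch (the name and pattern are hypothetical) of escaping a string before embedding it in a regex:

```
import re

# A name containing regex metacharacters would break a naive pattern;
# re.escape() neutralizes them before interpolation.
name = "foo.bar (baz)"  # hypothetical developer name
pattern = re.compile(rf"by {re.escape(name)}")
```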

* Update clean_release_notes.py
…layout" (pytorch#1803)

Revert "Add support for copy_ for plain layout and tensor core tiled layout (…"

This reverts commit 79e3366.
* add float8 training benchmarking scripts

* move to benchmarks/float8/training
* Silence loud commit

* Update intmm.py
Revert "Use exp2 for mx scaling (pytorch#1530)"

This reverts commit 890e0ac.
* CPUOffload: only offload parameters above a certain size
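
A minimal sketch of the size-threshold idea (the threshold name and value are assumptions):

```
# Hypothetical: only offload parameters whose element count exceeds a cutoff,
# since moving tiny tensors to CPU costs more in transfers than it saves.
MIN_OFFLOAD_NUMEL = 1024

def should_offload(param) -> bool:
    return param.numel() >= MIN_OFFLOAD_NUMEL
```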

* lint

* ruff

---------

Co-authored-by: Mark Saroufim <[email protected]>
* update typehint

Signed-off-by: Masaki Kozuki <[email protected]>

* Update float8_linear_utils.py

---------

Signed-off-by: Masaki Kozuki <[email protected]>
Co-authored-by: Mark Saroufim <[email protected]>
…rch#1789)

* Update

[ghstack-poisoned]
* init

* up
fix float8nocompile ci workflow
This pull request introduces support for ROCm (Radeon Open Compute) in addition to CUDA for GPU acceleration. The changes primarily focus on enabling the build and execution of ROCm-specific code paths alongside the existing CUDA paths.

In this PR, I use `tensor_core_tiled_layout` as a proof of concept, but it generalizes to other kernels (for example, fp6_llm or sparse_marlin) with minimal effort. Feedback is welcome.

Co-author: @lcskrishna

## Features

### ROCm Support Integration

* [`setup.py`](diffhunk://#diff-60f61ab7a8d1910d86d9fda2261620314edcae5894d5aaa236b821c7256badd7R49-R53): Added detection for ROCm and adjusted the logic for compiling GPU extensions based on the availability of CUDA or ROCm.
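
A hedged sketch of what such detection can look like (not necessarily the exact logic in `setup.py`):

```
import torch
from torch.utils.cpp_extension import CUDA_HOME, ROCM_HOME

# torch.version.hip is non-None on ROCm builds of PyTorch,
# torch.version.cuda on CUDA builds.
use_rocm = (torch.version.hip is not None) and (ROCM_HOME is not None)
use_cuda = (torch.version.cuda is not None) and (CUDA_HOME is not None)
build_gpu_extensions = use_rocm or use_cuda
```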
### Conditional Compilation for ROCm

* [`torchao/csrc/cuda/tensor_core_tiled_layout/tensor_core_tiled_layout.cu`](diffhunk://#diff-29bb1a2fd9317c74c807a7f558f5de755af0def91b9a49c81c409f8e76f736ddL1-R1): Introduced conditional compilation directives to include ROCm-specific headers and adjust constants and operations for ROCm. 
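
An illustrative sketch of this style of guard (the constants and headers here are assumptions, not the exact diff):

```
// Hypothetical example of platform guards in a kernel source file.
#if defined(USE_ROCM)
#include <hip/hip_runtime.h>
constexpr int kWarpSize = 64;  // CDNA wavefronts are 64 lanes wide
#else
constexpr int kWarpSize = 32;  // CUDA warps are 32 lanes wide
#endif
```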

These changes ensure that the codebase can compile and run efficiently on both CUDA and ROCm platforms, leveraging the best available GPU acceleration technology.

## Usage
With a ROCm PyTorch nightly docker image, simply run `PYTORCH_ROCM_ARCH=gfx942 python setup.py install`.
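
After installation, a quick hedged sanity check (illustrative only; the specific ops to exercise vary by kernel):

```
import torch
import torchao  # loads the compiled extension if it built

# On a ROCm build of PyTorch, torch.version.hip is a version string.
print("HIP:", torch.version.hip)
print("Arch:", torch.cuda.get_device_properties(0).gcnArchName)
```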

## Next

- [ ] AMD-specific unit tests (tensor_core_tiled_layout)
- [ ] Workload- and platform-specific optimization (tensor_core_tiled_layout)
- Update GPU architecture check to use `gcnArchName` instead of `name`
- Modify architecture compatibility check to use `in` instead of exact match (see the sketch after this list)
- Remove redundant ROCm GPU architecture check
- Simplify source file collection for CUDA and ROCm extensions
- Conditionally remove CUTLASS-based kernels when not using CUTLASS
- Clean up redundant path and source filtering logic
- Use `cwd` consistently for path resolution
- Enhance GPU support detection and reporting
- Add more informative logging for source file compilation
- Refine conditional compilation logic for CUDA and ROCm
- Provide clearer messages about build configuration
- Enhance source file discovery with informative print statements
- Add debug logging to help diagnose source file collection issues
- Improve visibility into CUDA and ROCm source file detection process
- Include additional checks and logging for edge cases in source file discovery
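
A sketch of the architecture check described in the first two bullets (the supported-architecture list is hypothetical):

```
import torch

# gcnArchName looks like "gfx942:sramecc+:xnack-", so substring matching
# is more robust than exact equality against the device name.
arch = torch.cuda.get_device_properties(0).gcnArchName
supported = any(a in arch for a in ("gfx90a", "gfx942"))  # hypothetical list
```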