Description
Hi. Currently I'm trying to implement some large language models (LLMs) with TorchSharp and have a nice demo working (here). But when moving forward to more features, I found a few capabilities required for LLMs are missing:
Custom operators
LLMs heavily depend on custom operators such as flash attention, RMS norm, and GPTQ int4 matmul for faster inference and reduced model size through quantization.
PyTorch allows defining custom operators from native C++ and CUDA source files in two ways: pybind11 and the torch library API (TORCH_LIBRARY). The latter works fine with torch.jit.script and could potentially work with TorchSharp's torch.jit.compile and torch.ops.xxx, but loading such a library requires calling a native torch method. TorchSharp might also want some specialized modules for custom ops.
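For reference, here is a minimal sketch of the TORCH_LIBRARY route; the namespace myops, the op name rms_norm, and the plain CPU reference kernel are just illustrative assumptions, not a fast fused implementation. Once the shared library is loaded, the op becomes resolvable from TorchScript as torch.ops.myops.rms_norm, which is the mechanism TorchSharp could potentially surface.

```cpp
#include <torch/torch.h>
#include <torch/library.h>

// Illustrative RMS norm reference kernel (not a fused/optimized implementation).
torch::Tensor rms_norm(const torch::Tensor& x, const torch::Tensor& weight, double eps) {
  // Normalize by the root mean square over the last dimension, then scale by weight.
  auto variance = x.pow(2).mean({-1}, /*keepdim=*/true);
  return x * torch::rsqrt(variance + eps) * weight;
}

// Register the op in a custom namespace; after the .so/.dll is loaded it is
// visible to TorchScript (and potentially to a TorchSharp torch.ops binding)
// as torch.ops.myops.rms_norm.
TORCH_LIBRARY(myops, m) {
  m.def("rms_norm(Tensor x, Tensor weight, float eps) -> Tensor", &rms_norm);
}
```

On the managed side, the missing piece is essentially a way to load this native library from TorchSharp and dispatch calls through torch.ops.*.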
By the way, openai/triton uses MLIR and LLVM to build custom ops, but it is almost entirely bound to Python.
NCCL ops
I've also tried to implement a thread-based distributed approach with TorchSharp (here). The required communication ops are broadcast, scatter, gather, and all-gather. I'm currently implementing them with the naive copy_ operator, but they are very slow. Would it be possible to provide these NCCL-related ops?
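To illustrate what the copy-based fallback looks like, here is a sketch in libtorch C++ (naive_broadcast is a hypothetical helper name, not an existing API): a broadcast is emulated with pairwise device-to-device copies, which is exactly the loop an NCCL broadcast collective would replace with a single, much faster call.

```cpp
#include <torch/torch.h>
#include <vector>

// Hypothetical helper: replicate `src` (living on GPU 0) onto every other GPU
// with pairwise device-to-device copies. This is the copy_-based fallback
// described above; an NCCL broadcast would perform the same transfer as one
// collective, which is typically much faster for large tensors.
std::vector<torch::Tensor> naive_broadcast(const torch::Tensor& src, int world_size) {
  std::vector<torch::Tensor> replicas{src};
  for (int rank = 1; rank < world_size; ++rank) {
    auto dst = torch::empty_like(
        src, src.options().device(torch::Device(torch::kCUDA, rank)));
    dst.copy_(src, /*non_blocking=*/true);  // pairwise copy; all transfers share the source GPU's bandwidth
    replicas.push_back(dst);
  }
  return replicas;
}
```

Scatter, gather, and all-gather can be emulated the same way with copy_, which is why native NCCL bindings would make a real difference here.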