
Custom CPU and CUDA operators support #1081

Description

@K024

Hi. I'm currently implementing some large language models (LLMs) with TorchSharp and have a working demo (here). But when moving on to more advanced functionality, I found two features missing that LLMs require:

Custom operators

LLMs heavily depend on custom operators like flash attention, RMS norm, and GPTQ int4 matmul for faster inference and reduced model size through quantization.
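
For reference, something like RMS norm is easy to express with stock ATen ops; the value of a custom operator is fusing the whole computation into one kernel. A minimal unfused sketch (my own composition, assuming the usual `eps`/`weight` formulation):

```cpp
#include <torch/torch.h>

// RMSNorm over the last dimension: y = x / sqrt(mean(x^2) + eps) * weight.
// An unfused reference built from stock ATen ops; a custom CUDA kernel
// would compute the same thing in a single pass over the data.
torch::Tensor rms_norm(const torch::Tensor& x,
                       const torch::Tensor& weight,
                       double eps = 1e-6) {
  auto variance = x.pow(2).mean({-1}, /*keepdim=*/true);
  return x * torch::rsqrt(variance + eps) * weight;
}
```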

PyTorch allows defining custom operators from native C++ and CUDA source files in two ways: pybind11 bindings and the torch library registration API. The latter works with torch.jit.script and could potentially work with TorchSharp's torch.jit.compile and torch.ops.xxx. However, loading such a library requires calling a torch native method. TorchSharp might also provide some specialized modules for custom ops.
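
To make that registration path concrete, here is a minimal sketch of the torch library route using the `rms_norm` reference above. The `myops` namespace is an illustrative assumption, and a real extension would register fused CPU/CUDA kernels per dispatch key rather than a single composite fallback:

```cpp
#include <torch/torch.h>

torch::Tensor rms_norm(const torch::Tensor& x,
                       const torch::Tensor& weight,
                       double eps) {
  auto variance = x.pow(2).mean({-1}, /*keepdim=*/true);
  return x * torch::rsqrt(variance + eps) * weight;
}

// Declare the operator schema; the dispatcher uses it to type-check callers.
TORCH_LIBRARY(myops, m) {
  m.def("rms_norm(Tensor x, Tensor weight, float eps) -> Tensor");
}

// Register one implementation covering all backends. Fused kernels would
// instead be registered under the CPU / CUDA dispatch keys.
TORCH_LIBRARY_IMPL(myops, CompositeExplicitAutograd, m) {
  m.impl("rms_norm", &rms_norm);
}
```

Once the compiled shared library is loaded into the process (torch.ops.load_library in Python), the op resolves as torch.ops.myops.rms_norm; the missing piece in TorchSharp is essentially that one native loading entry point.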

As an aside, openai/triton uses MLIR and LLVM to generate custom ops, but it is almost entirely bound to Python.

NCCL ops

I've also tried to implement a thread-based distributed approach with TorchSharp (here). The required communication ops are broadcast, scatter, gather, and all-gather. I'm currently implementing them with the naive copy_ operator, but they are very slow. Would it be possible to provide these NCCL-related ops?
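
To illustrate, a native binding for these could be a thin wrapper over the public NCCL C API. A minimal broadcast sketch, assuming one communicator and stream per device and float32 tensors already resident on their GPUs (the wrapper name and setup are hypothetical; the ncclBroadcast/ncclGroup* calls are the real API):

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <torch/torch.h>
#include <vector>

// Broadcast tensors[root] to every participating device, in place.
// Communicators would typically come from ncclCommInitAll / ncclCommInitRank.
void nccl_broadcast(std::vector<torch::Tensor>& tensors, int root,
                    std::vector<ncclComm_t>& comms,
                    std::vector<cudaStream_t>& streams) {
  ncclGroupStart();  // batch the per-device calls into one launch
  for (size_t i = 0; i < tensors.size(); ++i) {
    ncclBroadcast(/*sendbuff=*/tensors[i].data_ptr(),
                  /*recvbuff=*/tensors[i].data_ptr(),
                  tensors[i].numel(), ncclFloat32, root,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();
}
```

All-gather follows the same shape with ncclAllGather, and scatter/gather can be built from grouped ncclSend/ncclRecv pairs.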
