Hi all, this issue will track the feature requests you've made to TensorRT-LLM & provide a place to see what TRT-LLM is currently working on.
Last update: Jan 14th, 2024
🚀 = in development
## Models
### Decoder Only
- 🚀 Zephyr-7B - Support for Zephyr 7B model #157
- DeciLM-7B - Support for other llm like Decilm? #853
- ChatGLM 3 - Support for ChatGLM3 plz #180, Support ChatGLM3 #270
- Mistral-7B - Mistral 7B support #49
- Mixtral-7B - Mixtral 8x 7B - MoE by Mistral AI #616
### Encoder / Encoder-Decoder
- DeBERTa - Support for DeBerta #174
- RoBERTa - [FEA] Support Roberta model #124
- 🚀 BART, mBART - Cross-attention returns wrong results #285, [Feature request] Support MBartForCausalLM Request!!! #360
- FLAN-T5 - How can I use flan-t5-base? #251, Cross-attention returns wrong results #285, [Feature request] Support soft_prompt or inputs_embeds? #310
### Multi-Modal
- BLIP2 + T5 - [Feature request] Support soft_prompt or inputs_embeds? #310, [feature request]Blip2 T5 support request #531
- LLaVa - Is it even possible to have multiple input layers #641
- Qwen-VL - Does the repo suport qwen-vl? #728
- Generic Vision Encoder + LLM Support - Is it even possible to have multiple input layers #641, [Feature request] Support soft_prompt or inputs_embeds? #310
- BLIP2
- Whisper - Support non LLM transformer networks #323
### Other
- YaRN - [Feature Request] support YaRN request #792
- Expert Caching - [Feature Request] Mixtral Offloading #849
- LoRA - Llama 2 with LoRA #68 (see the sketch after this list)
- Mixtral - Mixtral 8x 7B - MoE by Mistral AI #616
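
For a quick picture of what the LoRA request implies for the runtime: a minimal NumPy sketch of the low-rank update applied alongside a frozen weight. The dimensions, names, and `alpha` scaling here are illustrative assumptions, not TensorRT-LLM APIs.

```python
import numpy as np

# Hypothetical dimensions for one frozen projection matrix.
d_model, rank = 4096, 8
alpha = 16.0                                     # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, d_model))      # frozen base weight
A = rng.standard_normal((rank, d_model)) * 0.01  # trainable down-projection
B = np.zeros((d_model, rank))                    # trainable up-projection, zero-init

def lora_forward(x):
    # y = x W^T + (alpha / rank) * x A^T B^T: base path plus low-rank update.
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

y = lora_forward(rng.standard_normal((1, d_model)))
```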
## Features & Optimizations
- Context Chunking - [Feature request] Dynamic splitfuse from Deepspeed (2x throughput) #317
- Speculative Decoding (implementation done, documentation in progress) - Feature: Speculative sampling / Assisted Generation #169, Smaller available space for paged KV cache compared with vLLM #224, Falcon-40b build causing memory leaks and failure #226
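
For reference on the speculative decoding item: in its generic form, a cheap draft model proposes a few tokens and the target model verifies them in one batched pass, keeping the longest agreeing prefix. A greedy-acceptance sketch with stand-in model callables (not the TRT-LLM implementation):

```python
def speculative_step(target_argmax, draft_argmax, prompt, k=4):
    """One draft-then-verify round with greedy acceptance.

    target_argmax(tokens) -> the target model's next-token prediction at every
    position, in one batched pass; draft_argmax(tokens) -> one next token from
    the cheap draft model. Both are stand-ins, not TRT-LLM calls.
    """
    # 1. The draft model proposes k tokens autoregressively.
    ctx, draft = list(prompt), []
    for _ in range(k):
        t = draft_argmax(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model scores prompt + draft in a single forward pass.
    verify = target_argmax(prompt + draft)

    # 3. Keep draft tokens while they match the target, then take the
    #    target's own token at the first disagreement (always >= 1 token).
    accepted = []
    for i, t in enumerate(draft):
        if t != verify[len(prompt) - 1 + i]:
            break
        accepted.append(t)
    accepted.append(verify[len(prompt) - 1 + len(accepted)])
    return accepted

# Toy check: a "target" that always predicts token 7 and a draft that agrees.
tokens = speculative_step(lambda seq: [7] * len(seq), lambda seq: 7, [1, 2, 3])
assert tokens == [7, 7, 7, 7, 7]  # 4 accepted draft tokens + 1 bonus token
```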
### KV Cache
- Reuse KV Cache - [Feature reuqest] support interactive-generation #292, Add automatic reuse of common key value cache blocks between requests #620
- Attention Sinks (StreamingLLM, H2O) - Attention sink #104
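
For context on the attention-sink request: StreamingLLM's observation is that keeping a handful of initial "sink" tokens plus a recent window keeps the KV cache bounded without collapsing quality. A toy eviction sketch (a simplified assumption about the cache layout, not TRT-LLM's block manager):

```python
def evict_kv(cache, num_sink=4, window=1024):
    """Keep the first `num_sink` KV entries (attention sinks) plus the most
    recent `window` entries, dropping everything in between."""
    if len(cache) <= num_sink + window:
        return cache
    return cache[:num_sink] + cache[-window:]

# Usage: list entries stand in for per-token (key, value) pairs.
cache = list(range(2000))            # token positions 0..1999
cache = evict_kv(cache)
assert cache[:4] == [0, 1, 2, 3]     # sinks retained
assert len(cache) == 4 + 1024        # bounded cache size
```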
### Quantization
- StarCoder INT8 SQ - Feature request: Support SmoothQuant variant of StarCoder #324
- Qwen INT4 - [Feature request] AutoAWQ support #345
- INT8 Weight only - Support weight only quantization from bfloat16 to int8? #110
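
On the INT8 weight-only item: the common scheme stores weights as int8 with per-output-channel scales and folds the scale back in around the GEMM. A NumPy round-trip sketch (illustrative only, not the fused kernel path):

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-output-channel int8 quantization: W ~= scale[:, None] * W_q."""
    scale = np.abs(W).max(axis=1) / 127.0                          # one scale per row
    W_q = np.clip(np.round(W / scale[:, None]), -127, 127).astype(np.int8)
    return W_q, scale

def int8_matmul(x, W_q, scale):
    """Dequantize on the fly: fold the per-channel scale in after the GEMM."""
    return (x @ W_q.T.astype(np.float32)) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64)).astype(np.float32)
W_q, scale = quantize_int8(W)
x = rng.standard_normal((2, 64)).astype(np.float32)
err = np.abs(x @ W.T - int8_matmul(x, W_q, scale)).max()  # small quantization error
```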
### Sampling
- 🚀 Support `frequency_penalty` - Support for `frequency_penalty` #275
- Logit Manipulators - Add Transformers logits manipulators #241
- Combine `repetition` & `presence` penalties - Support for combining `repetition_penalty`, `presence_penalty` #274
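
For reference, the semantics these sampling requests point at, in OpenAI/CTRL terms: `frequency_penalty` scales with a token's repeat count, `presence_penalty` is a flat subtraction once a token has appeared, and `repetition_penalty` rescales the logit multiplicatively. A sketch over raw logits (assumed semantics, not TRT-LLM's kernels):

```python
from collections import Counter
import numpy as np

def apply_penalties(logits, generated,
                    frequency_penalty=0.0,
                    presence_penalty=0.0,
                    repetition_penalty=1.0):
    logits = logits.copy()
    for tok, n in Counter(generated).items():
        # OpenAI-style additive penalties.
        logits[tok] -= frequency_penalty * n   # grows with repetition count
        logits[tok] -= presence_penalty        # flat, once the token appeared
        # CTRL-style multiplicative repetition penalty.
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    return logits

logits = np.array([2.0, -1.0, 0.5])
penalized = apply_penalties(logits, generated=[0, 0, 2],
                            frequency_penalty=0.5, presence_penalty=0.3,
                            repetition_penalty=1.2)
```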
## Workflow
### Front-ends
- OpenAI compatible API - Provide an interface similar to OpenAI API #334 (example payload sketched after this list)
- Flag for end-of-stream - Flag indicate end of stream #240
- Load from Buffer - GptManager add support for loading from buffer #144
- Paged KV Cache Utilization Metric - How to know the utility of paged kv cache ? #512
- Log Probabilities - Return log probabilities for tokens #238
- Return only new tokens - How to get the newly generated tokens only? #227
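
On the OpenAI-compatible API item above: the payload such a front-end would accept is the standard completions schema. A hypothetical client call against a local server (the endpoint, port, and model name are assumptions, not a shipped TRT-LLM interface):

```python
import json
import urllib.request

# Hypothetical local endpoint following the OpenAI completions schema.
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps({
        "model": "llama-2-7b",
        "prompt": "TensorRT-LLM is",
        "max_tokens": 64,
        "temperature": 0.7,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```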
### Integrations
- 🚀 LlamaIndex
- 🚀 LangChain
- Mojo - Question about a Mojo Integration #556
### Usage / Installation
- pip install - waiting for pre-built wheel package #790
## Platform Support