Hi all, this issue will track the feature requests you've made to TensorRT-LLM & provide a place to see what TRT-LLM is currently working on.
Last update: Jan 14th, 2024
🚀 = in development
## Models
### Decoder Only
- 🚀 Zephyr-7B - Support for Zephyr 7B model #157
- DeciLM-7B - Support for other llm like Decilm? #853
- ChatGLM 3 - Support for ChatGLM3 plz #180, Support ChatGLM3 #270
- Mistral-7B - Mistral 7B support #49
- Mixtral-7B - Mixtral 8x 7B - MoE by Mistral AI #616
### Encoder / Encoder-Decoder
- DeBERTa - Support for DeBerta #174
- RoBERTa - [FEA] Support Roberta model #124
- 🚀 BART, mBART - Cross-attention returns wrong results #285, [Feature request] Support MBartForCausalLM Request!!! #360
- FLAN-T5 - How can I use flan-t5-base? #251, Cross-attention returns wrong results #285, [Feature request] Support soft_prompt or inputs_embeds? #310
### Multi-Modal
- BLIP2 + T5 - [Feature request] Support soft_prompt or inputs_embeds? #310, [feature request]Blip2 T5 support request #531
- LLaVa - Is it even possible to have multiple input layers #641
- Qwen-VL - Does the repo suport qwen-vl? #728
- Generic Vision Encoder + LLM Support - Is it even possible to have multiple input layers #641, [Feature request] Support soft_prompt or inputs_embeds? #310
- BLIP2
- Whisper - Support non LLM transformer networks #323
### Other
- YaRN - [Feature Request] support YaRN request #792
- Expert Caching - [Feature Request] Mixtral Offloading #849
- LoRA - Llama 2 with LoRA #68 (see the sketch after this list)
- Mixtral - Mixtral 8x 7B - MoE by Mistral AI #616
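
For a quick picture of what the LoRA request implies for the runtime: a minimal NumPy sketch of the low-rank update applied alongside a frozen weight. The dimensions, names, and `alpha` scaling here are illustrative assumptions, not TensorRT-LLM APIs.

```python
import numpy as np

# Hypothetical dimensions for one frozen projection matrix.
d_model, rank = 4096, 8
alpha = 16.0                                     # LoRA scaling hyperparameter

rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, d_model))      # frozen base weight
A = rng.standard_normal((rank, d_model)) * 0.01  # trainable down-projection
B = np.zeros((d_model, rank))                    # trainable up-projection, zero-init

def lora_forward(x):
    # y = x W^T + (alpha / rank) * x A^T B^T: base path plus low-rank update.
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

y = lora_forward(rng.standard_normal((1, d_model)))
```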
## Features & Optimizations
- Context Chunking - [Feature request] Dynamic splitfuse from Deepspeed (2x throughput) #317
- Speculative Decoding (implementation done, documentation in progress) - Feature: Speculative sampling / Assisted Generation #169, Smaller available space for paged KV cache compared with vLLM #224, Falcon-40b build causing memory leaks and failure #226
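
For reference on the speculative decoding item: in its generic form, a cheap draft model proposes a few tokens and the target model verifies them in one batched pass, keeping the longest agreeing prefix. A greedy-acceptance sketch with stand-in model callables (not the TRT-LLM implementation):

```python
def speculative_step(target_argmax, draft_argmax, prompt, k=4):
    """One draft-then-verify round with greedy acceptance.

    target_argmax(tokens) -> the target model's next-token prediction at every
    position, in one batched pass; draft_argmax(tokens) -> one next token from
    the cheap draft model. Both are stand-ins, not TRT-LLM calls.
    """
    # 1. The draft model proposes k tokens autoregressively.
    ctx, draft = list(prompt), []
    for _ in range(k):
        t = draft_argmax(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model scores prompt + draft in a single forward pass.
    verify = target_argmax(prompt + draft)

    # 3. Keep draft tokens while they match the target, then take the
    #    target's own token at the first disagreement (always >= 1 token).
    accepted = []
    for i, t in enumerate(draft):
        if t != verify[len(prompt) - 1 + i]:
            break
        accepted.append(t)
    accepted.append(verify[len(prompt) - 1 + len(accepted)])
    return accepted

# Toy check: a "target" that always predicts token 7 and a draft that agrees.
tokens = speculative_step(lambda seq: [7] * len(seq), lambda seq: 7, [1, 2, 3])
assert tokens == [7, 7, 7, 7, 7]  # 4 accepted draft tokens + 1 bonus token
```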
### KV Cache
- Reuse KV Cache - [Feature reuqest] support interactive-generation #292, Add automatic reuse of common key value cache blocks between requests #620
- Attention Sinks (StreamingLLM, H2O) - Attention sink #104
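
For context on the attention-sink request: StreamingLLM's observation is that keeping a handful of initial "sink" tokens plus a recent window keeps the KV cache bounded without collapsing quality. A toy eviction sketch (a simplified assumption about the cache layout, not TRT-LLM's block manager):

```python
def evict_kv(cache, num_sink=4, window=1024):
    """Keep the first `num_sink` KV entries (attention sinks) plus the most
    recent `window` entries, dropping everything in between."""
    if len(cache) <= num_sink + window:
        return cache
    return cache[:num_sink] + cache[-window:]

# Usage: list entries stand in for per-token (key, value) pairs.
cache = list(range(2000))            # token positions 0..1999
cache = evict_kv(cache)
assert cache[:4] == [0, 1, 2, 3]     # sinks retained
assert len(cache) == 4 + 1024        # bounded cache size
```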
### Quantization
- StarCoder INT8 SQ - Feature request: Support SmoothQuant variant of StarCoder #324
- Qwen INT4 - [Feature request] AutoAWQ support #345
- INT8 Weight only - Support weight only quantization from bfloat16 to int8? #110
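
On the INT8 weight-only item: the common scheme stores weights as int8 with per-output-channel scales and folds the scale back in around the GEMM. A NumPy round-trip sketch (illustrative only, not the fused kernel path):

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-output-channel int8 quantization: W ~= scale[:, None] * W_q."""
    scale = np.abs(W).max(axis=1) / 127.0                          # one scale per row
    W_q = np.clip(np.round(W / scale[:, None]), -127, 127).astype(np.int8)
    return W_q, scale

def int8_matmul(x, W_q, scale):
    """Dequantize on the fly: fold the per-channel scale in after the GEMM."""
    return (x @ W_q.T.astype(np.float32)) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64)).astype(np.float32)
W_q, scale = quantize_int8(W)
x = rng.standard_normal((2, 64)).astype(np.float32)
err = np.abs(x @ W.T - int8_matmul(x, W_q, scale)).max()  # small quantization error
```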
### Sampling
- 🚀 Support `frequency_penalty` - Support for `frequency_penalty` #275
- Logit Manipulators - Add Transformers logits manipulators #241
- Combine `repetition` & `presence` penalties - Support for combining `repetition_penalty`, `presence_penalty` #274
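
For reference, the semantics these sampling requests point at, in OpenAI/CTRL terms: `frequency_penalty` scales with a token's repeat count, `presence_penalty` is a flat subtraction once a token has appeared, and `repetition_penalty` rescales the logit multiplicatively. A sketch over raw logits (assumed semantics, not TRT-LLM's kernels):

```python
from collections import Counter
import numpy as np

def apply_penalties(logits, generated,
                    frequency_penalty=0.0,
                    presence_penalty=0.0,
                    repetition_penalty=1.0):
    logits = logits.copy()
    for tok, n in Counter(generated).items():
        # OpenAI-style additive penalties.
        logits[tok] -= frequency_penalty * n   # grows with repetition count
        logits[tok] -= presence_penalty        # flat, once the token appeared
        # CTRL-style multiplicative repetition penalty.
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    return logits

logits = np.array([2.0, -1.0, 0.5])
penalized = apply_penalties(logits, generated=[0, 0, 2],
                            frequency_penalty=0.5, presence_penalty=0.3,
                            repetition_penalty=1.2)
```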
## Workflow
### Front-ends
- OpenAI compatible API - Provide an interface similar to OpenAI API #334 (example payload sketched after this list)
- Flag for end-of-stream - Flag indicate end of stream #240
- Load from Buffer - GptManager add support for loading from buffer #144
- Paged KV Cache Utilization Metric - How to know the utility of paged kv cache ? #512
- Log Probabilities - Return log probabilities for tokens #238
- Return only new tokens - How to get the newly generated tokens only? #227
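
On the OpenAI-compatible API item above: the payload such a front-end would accept is the standard completions schema. A hypothetical client call against a local server (the endpoint, port, and model name are assumptions, not a shipped TRT-LLM interface):

```python
import json
import urllib.request

# Hypothetical local endpoint following the OpenAI completions schema.
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps({
        "model": "llama-2-7b",
        "prompt": "TensorRT-LLM is",
        "max_tokens": 64,
        "temperature": 0.7,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```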
### Integrations
- 🚀 LlamaIndex
- 🚀 LangChain
- Mojo - Question about a Mojo Integration #556
### Usage / Installation
- pip install - waiting for pre-built wheel package #790
## Platform Support