Figure 1: Timeline of Publicly Available and Unavailable Unified Multimodal Models. The models are categorized by their release years, from 2023 to 2025. Models underlined in the diagram represent any-to-any multimodal models, capable of handling inputs or outputs beyond text and image, such as audio, video, and speech. The timeline highlights the rapid growth in this field.
We are looking for both interns and full-time researchers to join our team, focusing on multimodal understanding, generation, reasoning, AI agents, and unified multimodal models. If you are interested in exploring these exciting areas, please reach out to us at [email protected].
This repository provides a comprehensive collection of resources related to unified multimodal models, featuring:
- A survey of advances, challenges, and timelines for unified models
- Categorized lists of diffusion-based, autoregressive (MLLM), and hybrid architectures for unified image–text understanding and generation
- Benchmarks for evaluating multimodal comprehension, image generation, and interleaved image–text tasks
- Representative datasets covering multimodal understanding, text-to-image synthesis, image editing, and interleaved interactions
It is designed to help researchers and practitioners explore, compare, and build state-of-the-art unified multimodal systems.
- Text-and-Image Unified Models
- Any-to-Any Multimodal Models
- Benchmarks for Evaluation
- Datasets
- Applications and Opportunities
Figure 2: Classification of Unified Multimodal Understanding and Generation Models. The models are divided into three main categories based on their backbone architecture: Diffusion, MLLM (AR), and MLLM (AR + Diffusion). Each category is further subdivided according to the encoding strategy employed, including Pixel Encoding, Semantic Encoding, Learnable Query Encoding, and Hybrid Encoding. We illustrate the architectural variations within these categories and their corresponding encoder-decoder configurations.
Name | Title | Venue | Date | Code | Demo |
---|---|---|---|---|---|
MMaDA | MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025/05/22 | Github | Demo |
Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | 2024/12/31 | Github | - |
Name | Title | Venue | Date | Code | Demo |
---|---|---|---|---|---|
Ming-Lite-Uni | Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction | arXiv | 2025/05/05 | Github | - |
Nexus-Gen | Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing | arXiv | 2025/04/30 | Github | Demo |
MetaQueries | Transfer between Modalities with MetaQueries | arXiv | 2025/04/08 | - | - |
SEED-X | SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024/04/22 | Github | Demo |
SEED-LLaMA | Making LLaMA SEE and Draw with SEED Tokenizer | ICLR | 2023/10/02 | Github | Demo |
SEED | Planting a SEED of Vision in Large Language Model | arXiv | 2023/07/16 | Github | Demo |
Name | Title | Venue | Date | Code | Demo |
---|---|---|---|---|---|
UniFluid | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv | 2025/03/17 | - | - |
OmniMamba | OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv | 2025/03/11 | Github | - |
Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | 2025/01/29 | Github | Demo |
Janus | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | 2024/10/17 | Github | Demo |
Name | Title | Venue | Date | Code | Demo |
---|---|---|---|---|---|
UniToken | UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding | CVPRW | 2025/04/06 | Github | - |
VARGPT-v1.1 | VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | arXiv | 2025/04/03 | Github | - |
ILLUME+ | ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | arXiv | 2025/04/02 | Github | - |
DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | 2025/03/18 | Github | - |
VARGPT | VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | 2025/01/21 | Github | - |
TokenFlow | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | CVPR | 2024/12/04 | Github | - |
MUSE-VL | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv | 2024/11/26 | - | - |
Name | Title | Venue | Date | Code | Demo |
---|---|---|---|---|---|
LMFusion | LMFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv | 2024/12/19 | - | - |
MonoFormer | MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv | 2024/09/24 | Github | - |
Show-o | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ICLR | 2024/08/22 | Github | Demo |
Transfusion | Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR | 2024/08/20 | Github | - |
Name | Title | Venue | Date | Code | Demo |
---|---|---|---|---|---|
BAGEL | Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025/05/20 | Github | Demo |
Mogao | Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation | arXiv | 2025/05/08 | - | - |
JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | 2024/11/12 | Github | Demo |
Name | Title | Venue | Date | Code | Demo |
---|---|---|---|---|---|
M2-omni | M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance | arXiv | 2025/02/26 | - | - |
OmniFlow | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | CVPR | 2024/12/02 | Github | - |
Spider | Spider: Any-to-Many Multimodal LLM | arXiv | 2024/11/14 | Github | - |
MIO | MIO: A Foundation Model on Multimodal Tokens | arXiv | 2024/09/26 | Github | - |
X-VILA | X-VILA: Cross-Modality Alignment for Large Language Model | arXiv | 2024/05/29 | - | - |
AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024/02/19 | Github | - |
Video-LaVIT | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ICML | 2024/02/05 | Github | - |
Unified-IO 2 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR | 2023/12/28 | Github | - |
NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | ICML | 2023/09/11 | Github | - |
Name | Paper | Venue | Date | Code |
---|---|---|---|---|
VTBench | VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation | arXiv | 2025/05/19 | Github |
UniBench | UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation | arXiv | 2025/05/15 | Github |
OpenING | OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation | CVPR | 2024/11/27 | Github |
ISG | Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment | ICLR | 2024/11/26 | Github |
MMIE | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | ICLR | 2024/10/14 | Github |
InterleavedBench | Holistic Evaluation for Interleaved Text-and-Image Generation | EMNLP | 2024/06/20 | HuggingFace |
OpenLEAF | OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | MM | 2023/10/01 | - |
Dataset | Samples | Paper | Venue | Date |
---|---|---|---|---|
AnyEdit | 2.5M | AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea | CVPR | 2024/11/24 |
OmniEdit | 1.2M | OmniEdit: Building Image Editing Generalist Models through Specialist Supervision | ICLR | 2024/11/11 |
UltraEdit | 4M | UltraEdit: Instruction-based Fine-Grained Image Editing at Scale | NeurIPS | 2024/07/07 |
SEED-Data-Edit | 3.7M | SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing | arXiv | 2024/05/07 |
HQ-Edit | 197K | HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing | arXiv | 2024/04/15 |
MagicBrush | 10K | MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing | NeurIPS | 2023/06/16 |
InstructP2P | 313K | InstructPix2Pix: Learning to Follow Image Editing Instructions | CVPR | 2022/11/17 |
Dataset | Samples | Paper | Venue | Date |
---|---|---|---|---|
CoMM | 227K | CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation | CVPR | 2024/06/15 |
OBELICS | 141M | OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents | NeurIPS | 2023/06/21 |
Multimodal C4 | 101.2M | Multimodal C4: An Open, Billion-Scale Corpus of Images Interleaved with Text | NeurIPS | 2023/04/14 |
Dataset | Samples | Paper | Venue | Date |
---|---|---|---|---|
SynCD | 95K | Generating Multi-Image Synthetic Data for Text-to-Image Customization | arXiv | 2025/02/03 |
X2I-subject-driven | 2.5M | OmniGen: Unified Image Generation | arXiv | 2024/12/14 |
Subjects200K | 200K | OminiControl: Minimal and Universal Control for Diffusion Transformer | arXiv | 2024/11/22 |
MultiGen-20M | 20M | UniControl: A Unified Diffusion Model for Controllable Visual Generation in the Wild | NeurIPS | 2023/05/18 |
LAION-Face | 50M | General Facial Representation Learning in a Visual-Linguistic Manner | CVPR | 2021/12/06 |
Name | Title | Venue | Date | Code | Demo |
---|---|---|---|---|---|
UniCTokens | UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens | arXiv | 2025/05/20 | Github | - |
T2I-R1 | T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | arXiv | 2025/05/01 | Github | - |
Fair-UMLLM | On Fairness of Unified Multimodal Large Language Model for Image Generation | arXiv | 2025/02/05 | - | - |
If you find this repo helpful for your research, please cite our paper:
@article{zhang2025unified,
title={Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities},
author={Zhang, Xinjie and Guo, Jintao and Zhao, Shanshan and Fu, Minghao and Duan, Lunhao and Wang, Guo-Hua and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu},
journal={arXiv preprint arXiv:2505.02567},
year={2025}
}