
Awesome Unified Multimodal Models

📚 Survey • 🤗 HF Repo

Figure 1: Timeline of Publicly Available and Unavailable Unified Multimodal Models. The models are categorized by their release years, from 2023 to 2025. Models underlined in the diagram represent any-to-any multimodal models, capable of handling inputs or outputs beyond text and image, such as audio, video, and speech. The timeline highlights the rapid growth in this field.

🔥 We Are Hiring!

We are looking for both interns and full-time researchers to join our team, focusing on multimodal understanding, generation, reasoning, AI agents, and unified multimodal models. If you are interested in exploring these exciting areas, please reach out to us at [email protected].

👉 What is This Repo for?

This repository provides a comprehensive collection of resources related to unified multimodal models, featuring:

  • A survey of advances, challenges, and timelines for unified models
  • Categorized lists of diffusion-based, autoregressive (MLLM), and hybrid (AR + diffusion) architectures for unified image–text understanding and generation
  • Benchmarks for evaluating multimodal comprehension, image generation, and interleaved image–text tasks
  • Representative datasets covering multimodal understanding, text-to-image synthesis, image editing, and interleaved interactions

This collection is designed to help researchers and practitioners explore, compare, and build state-of-the-art unified multimodal systems.

Awesome Papers & Datasets

Text-and-Image Unified Models

Figure 2: Classification of Unified Multimodal Understanding and Generation Models. The models are divided into three main categories based on their backbone architecture: Diffusion, MLLM (AR), and MLLM (AR + Diffusion). Each category is further subdivided according to the encoding strategy employed, including Pixel Encoding, Semantic Encoding, Learnable Query Encoding, and Hybrid Encoding. We illustrate the architectural variations within these categories and their corresponding encoder-decoder configurations.

Diffusion

| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| MMaDA | MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025/05/22 | GitHub | Demo |
| Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | 2024/12/31 | GitHub | - |

MLLM (AR)

b-1: Pixel Encoding
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| Selftok | Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | arXiv | 2025/05/12 | GitHub | - |
| TokLIP | TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation | arXiv | 2025/05/08 | GitHub | - |
| Harmon | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation | arXiv | 2025/03/27 | GitHub | Demo |
| UGen | UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning | arXiv | 2025/03/27 | - | - |
| SynerGen-VL | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | arXiv | 2024/12/12 | - | - |
| Liquid | Liquid: Language Models are Scalable and Unified Multi-modal Generators | arXiv | 2024/12/05 | GitHub | Demo |
| Orthus | Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | 2024/11/28 | GitHub | - |
| MMAR | MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | arXiv | 2024/10/14 | - | - |
| Emu3 | Emu3: Next-Token Prediction is All You Need | arXiv | 2024/09/27 | GitHub | Demo |
| ANOLE | ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | arXiv | 2024/07/08 | GitHub | - |
| Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | arXiv | 2024/05/16 | GitHub | - |
| LWM | World Model on Million-Length Video And Language With Blockwise RingAttention | ICLR | 2024/02/13 | GitHub | - |

b-2: Semantic Encoding
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| BLIP3-o | BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025/05/14 | GitHub | - |
| UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | 2025/02/27 | GitHub | Demo |
| MetaMorph | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv | 2024/12/18 | GitHub | - |
| ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | 2024/12/09 | - | - |
| PUMA | PUMA: Empowering Unified MLLM with Multi-granular Visual Generation | arXiv | 2024/10/17 | GitHub | - |
| VILA-U | VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR | 2024/09/06 | GitHub | Demo |
| Mini-Gemini | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024/03/27 | GitHub | Demo |
| MM-Interleaved | MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | arXiv | 2024/01/18 | GitHub | - |
| VL-GPT | VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv | 2023/12/14 | GitHub | - |
| Emu2 | Generative Multimodal Models are In-Context Learners | CVPR | 2023/12/10 | GitHub | Demo |
| DreamLLM | DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023/09/20 | GitHub | - |
| LaVIT | Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | ICLR | 2023/09/09 | GitHub | - |
| Emu | Emu: Generative Pretraining in Multimodality | ICLR | 2023/07/11 | GitHub | Demo |

b-3: Learnable Query Encoding
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| Ming-Lite-Uni | Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction | arXiv | 2025/05/05 | GitHub | - |
| Nexus-Gen | Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing | arXiv | 2025/04/30 | GitHub | Demo |
| MetaQueries | Transfer between Modalities with MetaQueries | arXiv | 2025/04/08 | - | - |
| SEED-X | SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024/04/22 | GitHub | Demo |
| SEED-LLaMA | Making LLaMA SEE and Draw with SEED Tokenizer | ICLR | 2023/10/02 | GitHub | Demo |
| SEED | Planting a SEED of Vision in Large Language Model | arXiv | 2023/07/16 | GitHub | Demo |

b-4: Hybrid Encoding (Pseudo)
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| UniFluid | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv | 2025/03/17 | - | - |
| OmniMamba | OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv | 2025/03/11 | GitHub | - |
| Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | 2025/01/29 | GitHub | Demo |
| Janus | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | 2024/10/17 | GitHub | Demo |

b-5: Hybrid Encoding (Joint)
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| UniToken | UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding | CVPRW | 2025/04/06 | GitHub | - |
| VARGPT-v1.1 | VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | arXiv | 2025/04/03 | GitHub | - |
| ILLUME+ | ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | arXiv | 2025/04/02 | GitHub | - |
| DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | 2025/03/18 | GitHub | - |
| VARGPT | VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | 2025/01/21 | GitHub | - |
| TokenFlow | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | CVPR | 2024/12/04 | GitHub | - |
| MUSE-VL | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv | 2024/11/26 | - | - |

MLLM (AR + Diffusion)

c-1: Pixel Encoding
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| LMFusion | LMFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv | 2024/12/19 | - | - |
| MonoFormer | MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv | 2024/09/24 | GitHub | - |
| Show-o | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ICLR | 2024/08/22 | GitHub | Demo |
| Transfusion | Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR | 2024/08/20 | GitHub | - |

c-2: Hybrid Encoding (Pseudo)
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| BAGEL | Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025/05/20 | GitHub | Demo |
| Mogao | Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation | arXiv | 2025/05/08 | - | - |
| JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | 2024/11/12 | GitHub | Demo |

Any-to-Any Multimodal Models

| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| M2-omni | M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance | arXiv | 2025/02/26 | - | - |
| OmniFlow | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | CVPR | 2024/12/02 | GitHub | - |
| Spider | Spider: Any-to-Many Multimodal LLM | arXiv | 2024/11/14 | GitHub | - |
| MIO | MIO: A Foundation Model on Multimodal Tokens | arXiv | 2024/09/26 | GitHub | - |
| X-VILA | X-VILA: Cross-Modality Alignment for Large Language Model | arXiv | 2024/05/29 | - | - |
| AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024/02/19 | GitHub | - |
| Video-LaVIT | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ICML | 2024/02/05 | GitHub | - |
| Unified-IO 2 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR | 2023/12/28 | GitHub | - |
| NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | ICML | 2023/09/11 | GitHub | - |

Benchmarks for Evaluation

Benchmarks on Understanding Tasks

| Name | Paper | Venue | Date | Code |
| --- | --- | --- | --- | --- |
| General-Bench | On Path to Multimodal Generalist: General-Level and General-Bench | ICML | 2025/05/07 | GitHub |
| MM-Vet v2 | MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | arXiv | 2024/08/01 | GitHub |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2024/04/27 | GitHub |
| oVQA | Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | ICLR | 2024/02/11 | GitHub |
| SEED-Bench-2 | SEED-Bench-2: Benchmarking Multimodal Large Language Models | arXiv | 2023/11/28 | GitHub |
| MMMU | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2023/11/27 | GitHub |
| MM-Vet | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | ICML | 2023/08/04 | GitHub |
| SEED-Bench | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR | 2023/07/30 | GitHub |
| MMBench | MMBench: Is Your Multi-modal Model an All-around Player? | ECCV | 2023/07/12 | GitHub |
| LAMM | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | NeurIPS | 2023/06/11 | GitHub |
| HaluEval | HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | EMNLP | 2023/05/19 | GitHub |
| GQA | GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | CVPR | 2019/02/25 | GitHub |
| VQA | VQA: Visual Question Answering | ICCV | 2015/05/03 | ProjectPage |

Benchmarks on Image Generation Tasks

| Name | Paper | Venue | Date | Code |
| --- | --- | --- | --- | --- |
| CompBench | CompBench: Benchmarking Complex Instruction-guided Image Editing | arXiv | 2025/05/18 | - |
| GEdit-Bench | Step1X-Edit: A Practical Framework for General Image Editing | arXiv | 2025/04/28 | GitHub |
| DreamBench++ | DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | ICLR | 2025/03/09 | GitHub |
| T2I-CompBench++ | T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation | TPAMI | 2025/03/08 | GitHub |
| IE-Bench | IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment | arXiv | 2025/01/17 | - |
| AnyEdit | AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea | CVPR | 2024/11/24 | GitHub |
| I2EBench | I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing | NeurIPS | 2024/08/26 | GitHub |
| ConceptMix | ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty | NeurIPS | 2024/08/26 | GitHub |
| GenAI-Bench | GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation | CVPR | 2024/06/19 | GitHub |
| Commonsense-T2I | Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | COLM | 2024/06/11 | GitHub |
| HQ-Edit | HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing | ICLR | 2024/04/15 | GitHub |
| VQAScore | Evaluating Text-to-Visual Generation with Image-to-Text Generation | ECCV | 2024/04/01 | GitHub |
| FlashEval | FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models | CVPR | 2024/03/25 | GitHub |
| DPG-Bench | ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | arXiv | 2024/03/08 | GitHub |
| Reason-Edit | SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models | CVPR | 2023/12/11 | GitHub |
| Emu Edit | Emu Edit: Precise Image Editing via Recognition and Generation Tasks | CVPR | 2023/11/16 | HuggingFace |
| HEIM | Holistic Evaluation of Text-To-Image Models | NeurIPS | 2023/11/07 | GitHub |
| DSG-1k | Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation | ICLR | 2023/10/27 | GitHub |
| GenEval | GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment | NeurIPS | 2023/10/17 | GitHub |
| EditVal | EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods | arXiv | 2023/10/03 | GitHub |
| T2I-CompBench | T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation | NeurIPS | 2023/07/12 | GitHub |
| DreamSim | DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data | NeurIPS | 2023/06/15 | GitHub |
| MagicBrush | MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing | NeurIPS | 2023/06/16 | GitHub |
| MultiGen-20M | UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild | NeurIPS | 2023/05/18 | GitHub |
| HRS-Bench | HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models | ICCV | 2023/04/11 | GitHub |
| TIFA | TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | ICCV | 2023/03/21 | GitHub |
| EditBench | Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting | CVPR | 2022/12/13 | ProjectPage |
| PartiPrompts | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | TMLR | 2022/06/22 | GitHub |
| DrawBench | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | NeurIPS | 2022/05/23 | ProjectPage |
| PaintSkills | DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models | ICCV | 2022/02/08 | GitHub |

Benchmarks on Interleaved / Compositional / Other Tasks

| Name | Paper | Venue | Date | Code |
| --- | --- | --- | --- | --- |
| VTBench | VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation | arXiv | 2025/05/19 | GitHub |
| UniBench | UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation | arXiv | 2025/05/15 | GitHub |
| OpenING | OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation | CVPR | 2024/11/27 | GitHub |
| ISG | Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment | ICLR | 2024/11/26 | GitHub |
| MMIE | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | ICLR | 2024/10/14 | GitHub |
| InterleavedBench | Holistic Evaluation for Interleaved Text-and-Image Generation | EMNLP | 2024/06/20 | HuggingFace |
| OpenLEAF | OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | MM | 2023/10/01 | - |

Datasets

Multimodal Understanding

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| Infinity-MM | 40M | Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data | arXiv | 2024/10/24 |
| LLaVA-OneVision | 4.8M | LLaVA-OneVision: Easy Visual Task Transfer | TMLR | 2024/08/06 |
| Cambrian-10M | 10M | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | NeurIPS | 2024/06/24 |
| ShareGPT4V | 100K | Sharegpt4v: Improving large multi-modal models with better captions | ECCV | 2023/11/21 |
| CapsFusion-120M | 120M | Capsfusion: Rethinking image-text data at scale | CVPR | 2023/10/31 |
| GRIT | 20M | Kosmos-2: Grounding multimodal large language models to the world | ICLR | 2023/06/26 |
| DataComp | 1.4B | DATACOMP: In search of the next generation of multimodal datasets | NeurIPS | 2023/04/27 |
| Laion-COCO | 600M | Laion coco: 600m synthetic captions from laion2b-en | - | 2022/09/15 |
| COYO | 747M | Coyo-700m: Image-text pair dataset | - | 2022/08/31 |
| Laion | 5.9B | Laion-5b: An open large-scale dataset for training next generation image-text models | NeurIPS | 2022/03/31 |
| Wukong | 100M | Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark | NeurIPS | 2022/02/14 |
| RedCaps | 12M | Redcaps: Web-curated image-text data created by the people, for the people | NeurIPS | 2021/11/22 |

Text-to-Image

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| BLIP3o-60k | 60K | BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025/05/14 |
| TextAtlas5M | 5M | TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation | arXiv | 2025/02/11 |
| EliGen TrainSet | 500K | EliGen: Entity-Level Controlled Image Generation with Regional Attention | arXiv | 2025/01/02 |
| PD12M | 12M | Public domain 12m: A highly aesthetic image-text dataset with novel governance mechanisms | arXiv | 2024/10/30 |
| SFHQ-T2I | 122K | - | - | 2024/10/06 |
| text-to-image-2M | 2M | - | - | 2024/09/13 |
| DenseFusion | 1M | Densefusion-1m: Merging vision experts for comprehensive multimodal perception | NeurIPS | 2024/07/11 |
| Megalith | 10M | - | - | 2024/07/01 |
| PixelProse | 16M | From pixels to prose: A large dataset of dense image captions | arXiv | 2024/06/14 |
| CosmicMan-HQ 1.0 | 6M | Cosmicman: A text-to-image foundation model for humans | CVPR | 2024/04/01 |
| AnyWord-3M | 3M | Anytext: Multilingual visual text generation and editing | ICLR | 2023/11/06 |
| JourneyDB | 4M | JourneyDB: A Benchmark for Generative Image Understanding | NeurIPS | 2023/07/03 |
| RenderedText | 12M | - | - | 2023/06/30 |
| Mario-10M | 10M | Textdiffuser: Diffusion models as text painters | NeurIPS | 2023/05/18 |
| SAM | 11M | Segment Anything | ICCV | 2023/04/05 |
| LAION-Aesthetics | 120M | Laion-5b: An open large-scale dataset for training next generation image-text models | NeurIPS | 2022/08/16 |
| CC-12M | 12M | Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts | CVPR | 2021/02/17 |

Image Editing

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| AnyEdit | 2.5M | Anyedit: Mastering unified high-quality image editing for any idea | CVPR | 2024/11/24 |
| OmniEdit | 1.2M | Omniedit: Building image editing generalist models through specialist supervision | ICLR | 2024/11/11 |
| UltraEdit | 4M | Ultraedit: Instruction-based fine-grained image editing at scale | NeurIPS | 2024/07/07 |
| SEED-Data-Edit | 3.7M | Seed-data-edit technical report: A hybrid dataset for instructional image editing | arXiv | 2024/05/07 |
| HQ-Edit | 197K | Hq-edit: A high-quality dataset for instruction-based image editing | arXiv | 2024/04/15 |
| Magicbrush | 10K | Magicbrush: A manually annotated dataset for instruction-guided image editing | NeurIPS | 2023/06/16 |
| InstructP2P | 313K | Instructpix2pix: Learning to follow image editing instructions | CVPR | 2022/11/17 |

Interleaved Image-Text

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| CoMM | 227K | Comm: A coherent interleaved image-text dataset for multimodal understanding and generation | CVPR | 2024/06/15 |
| OBELICS | 141M | Obelics: An open web-scale filtered dataset of interleaved image-text documents | NeurIPS | 2023/06/21 |
| Multimodal C4 | 101.2M | Multimodal c4: An open, billion-scale corpus of images interleaved with text | NeurIPS | 2023/04/14 |

Other Text-Image-to-Image

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| SynCD | 95K | Generating multi-image synthetic data for text-to-image customization | arXiv | 2025/02/03 |
| X2I-subject-driven | 2.5M | OmniGen: Unified Image Generation | arXiv | 2024/12/14 |
| Subjects200K | 200K | Ominicontrol: Minimal and universal control for diffusion transformer | arXiv | 2024/11/22 |
| MultiGen-20M | 20M | Unicontrol: A unified diffusion model for controllable visual generation in the wild | NeurIPS | 2023/05/18 |
| LAION-Face | 50M | General facial representation learning in a visual-linguistic manner | CVPR | 2021/12/06 |

Applications and Opportunities

| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| UniCTokens | UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens | arXiv | 2025/05/20 | GitHub | - |
| Fair-UMLLM | On Fairness of Unified Multimodal Large Language Model for Image Generation | arXiv | 2025/02/05 | - | - |
| T2I-R1 | T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | arXiv | 2025/01/29 | GitHub | - |

Citation

If you find this repo helpful for your research, please cite our paper:

@article{zhang2025unified,
  title={Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities},
  author={Zhang, Xinjie and Guo, Jintao and Zhao, Shanshan and Fu, Minghao and Duan, Lunhao and Wang, Guo-Hua and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu},
  journal={arXiv preprint arXiv:2505.02567},
  year={2025}
}