
Awesome Unified Multimodal Models

📚 Survey • 🤗 HF Repo

Figure 1: Timeline of Publicly Available and Unavailable Unified Multimodal Models. The models are categorized by their release years, from 2023 to 2025. Models underlined in the diagram represent any-to-any multimodal models, capable of handling inputs or outputs beyond text and image, such as audio, video, and speech. The timeline highlights the rapid growth in this field.

🔥 We Are Hiring!

We are looking for both interns and full-time researchers to join our team, focusing on multimodal understanding, generation, reasoning, AI agents, and unified multimodal models. If you are interested in exploring these exciting areas, please reach out to us at [email protected].

👉 What is This Repo for?

This repository provides a comprehensive collection of resources related to unified multimodal models, featuring:

  • A survey of advances, challenges, and timelines for unified models
  • Categorized lists of diffusion-based, autoregressive (MLLM), and hybrid (AR + diffusion) architectures for unified image–text understanding and generation
  • Benchmarks for evaluating multimodal comprehension, image generation, and interleaved image–text tasks
  • Representative datasets covering multimodal understanding, text-to-image synthesis, image editing, and interleaved interactions

This collection is designed to help researchers and practitioners explore, compare, and build state-of-the-art unified multimodal systems.

Awesome Papers & Datasets

Text-and-Image Unified Models

Figure 2: Classification of Unified Multimodal Understanding and Generation Models. The models are divided into three main categories based on their backbone architecture: Diffusion, MLLM (AR), and MLLM (AR + Diffusion). Each category is further subdivided according to the encoding strategy employed, including Pixel Encoding, Semantic Encoding, Learnable Query Encoding, and Hybrid Encoding. We illustrate the architectural variations within these categories and their corresponding encoder-decoder configurations.

Diffusion

| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| MMaDA | MMaDA: Multimodal Large Diffusion Language Models | arXiv | 2025/05/22 | GitHub | Demo |
| Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | 2024/12/31 | GitHub | - |

MLLM (AR)

b-1: Pixel Encoding
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| Selftok | Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | arXiv | 2025/05/12 | GitHub | - |
| TokLIP | TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation | arXiv | 2025/05/08 | GitHub | - |
| Harmon | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation | arXiv | 2025/03/27 | GitHub | Demo |
| UGen | UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning | arXiv | 2025/03/27 | - | - |
| SynerGen-VL | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | arXiv | 2024/12/12 | - | - |
| Liquid | Liquid: Language Models are Scalable and Unified Multi-modal Generators | arXiv | 2024/12/05 | GitHub | Demo |
| Orthus | Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | 2024/11/28 | GitHub | - |
| MMAR | MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling | arXiv | 2024/10/14 | - | - |
| Emu3 | Emu3: Next-Token Prediction is All You Need | arXiv | 2024/09/27 | GitHub | Demo |
| ANOLE | ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | arXiv | 2024/07/08 | GitHub | - |
| Chameleon | Chameleon: Mixed-Modal Early-Fusion Foundation Models | arXiv | 2024/05/16 | GitHub | - |
| LWM | World Model on Million-Length Video And Language With Blockwise RingAttention | ICLR | 2024/02/13 | GitHub | - |

b-2: Semantic Encoding
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| BLIP3-o | BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025/05/14 | GitHub | - |
| UniTok | UniTok: A Unified Tokenizer for Visual Generation and Understanding | arXiv | 2025/02/27 | GitHub | Demo |
| MetaMorph | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | arXiv | 2024/12/18 | GitHub | - |
| ILLUME | ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | 2024/12/09 | - | - |
| PUMA | PUMA: Empowering Unified MLLM with Multi-granular Visual Generation | arXiv | 2024/10/17 | GitHub | - |
| VILA-U | VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | ICLR | 2024/09/06 | GitHub | Demo |
| Mini-Gemini | Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024/03/27 | GitHub | Demo |
| MM-Interleaved | MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer | arXiv | 2024/01/18 | GitHub | - |
| VL-GPT | VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv | 2023/12/14 | GitHub | - |
| Emu2 | Generative Multimodal Models are In-Context Learners | CVPR | 2023/12/10 | GitHub | Demo |
| DreamLLM | DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023/09/20 | GitHub | - |
| LaVIT | Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | ICLR | 2023/09/09 | GitHub | - |
| Emu | Emu: Generative Pretraining in Multimodality | ICLR | 2023/07/11 | GitHub | Demo |

b-3: Learnable Query Encoding
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| Ming-Lite-Uni | Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction | arXiv | 2025/05/05 | GitHub | - |
| Nexus-Gen | Nexus-Gen: A Unified Model for Image Understanding, Generation, and Editing | arXiv | 2025/04/30 | GitHub | Demo |
| MetaQueries | Transfer between Modalities with MetaQueries | arXiv | 2025/04/08 | - | - |
| SEED-X | SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024/04/22 | GitHub | Demo |
| SEED-LLaMA | Making LLaMA SEE and Draw with SEED Tokenizer | ICLR | 2023/10/02 | GitHub | Demo |
| SEED | Planting a SEED of Vision in Large Language Model | arXiv | 2023/07/16 | GitHub | Demo |

b-4: Hybrid Encoding (Pseudo)
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| UniFluid | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | arXiv | 2025/03/17 | - | - |
| OmniMamba | OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models | arXiv | 2025/03/11 | GitHub | - |
| Janus-Pro | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | arXiv | 2025/01/29 | GitHub | Demo |
| Janus | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | arXiv | 2024/10/17 | GitHub | Demo |

b-5: Hybrid Encoding (Joint)
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| UniToken | UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding | CVPRW | 2025/04/06 | GitHub | - |
| VARGPT-v1.1 | VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | arXiv | 2025/04/03 | GitHub | - |
| ILLUME+ | ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | arXiv | 2025/04/02 | GitHub | - |
| DualToken | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | 2025/03/18 | GitHub | - |
| VARGPT | VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model | arXiv | 2025/01/21 | GitHub | - |
| TokenFlow | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | CVPR | 2024/12/04 | GitHub | - |
| MUSE-VL | MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | arXiv | 2024/11/26 | - | - |

MLLM (AR + Diffusion)

c-1: Pixel Encoding
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| LMFusion | LMFusion: Adapting Pretrained Language Models for Multimodal Generation | arXiv | 2024/12/19 | - | - |
| MonoFormer | MonoFormer: One Transformer for Both Diffusion and Autoregression | arXiv | 2024/09/24 | GitHub | - |
| Show-o | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | ICLR | 2024/08/22 | GitHub | Demo |
| Transfusion | Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR | 2024/08/20 | GitHub | - |

c-2: Hybrid Encoding (Pseudo)
| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| BAGEL | Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025/05/20 | GitHub | Demo |
| Mogao | Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation | arXiv | 2025/05/08 | - | - |
| JanusFlow | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | 2024/11/12 | GitHub | Demo |

Any-to-Any Multimodal Models

| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| M2-omni | M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance | arXiv | 2025/02/26 | - | - |
| OmniFlow | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | CVPR | 2024/12/02 | GitHub | - |
| Spider | Spider: Any-to-Many Multimodal LLM | arXiv | 2024/11/14 | GitHub | - |
| MIO | MIO: A Foundation Model on Multimodal Tokens | arXiv | 2024/09/26 | GitHub | - |
| X-VILA | X-VILA: Cross-Modality Alignment for Large Language Model | arXiv | 2024/05/29 | - | - |
| AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024/02/19 | GitHub | - |
| Video-LaVIT | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ICML | 2024/02/05 | GitHub | - |
| Unified-IO 2 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR | 2023/12/28 | GitHub | - |
| NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | ICML | 2023/09/11 | GitHub | - |

Benchmarks for Evaluation

Benchmarks on Understanding Tasks

| Name | Paper | Venue | Date | Code |
| --- | --- | --- | --- | --- |
| General-Bench | On Path to Multimodal Generalist: General-Level and General-Bench | ICML | 2025/05/07 | GitHub |
| MM-Vet v2 | MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | arXiv | 2024/08/01 | GitHub |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2024/04/27 | GitHub |
| oVQA | Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy | ICLR | 2024/02/11 | GitHub |
| SEED-Bench-2 | SEED-Bench-2: Benchmarking Multimodal Large Language Models | arXiv | 2023/11/28 | GitHub |
| MMMU | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI | CVPR | 2023/11/27 | GitHub |
| MM-Vet | MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | ICML | 2023/08/04 | GitHub |
| SEED-Bench | SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR | 2023/07/30 | GitHub |
| MMBench | MMBench: Is Your Multi-modal Model an All-around Player? | ECCV | 2023/07/12 | GitHub |
| LAMM | LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | NeurIPS | 2023/06/11 | GitHub |
| HaluEval | HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | EMNLP | 2023/05/19 | GitHub |
| GQA | GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | CVPR | 2019/02/25 | GitHub |
| VQA | VQA: Visual Question Answering | ICCV | 2015/05/03 | ProjectPage |

Benchmarks on Image Generation Tasks

| Name | Paper | Venue | Date | Code |
| --- | --- | --- | --- | --- |
| CompBench | CompBench: Benchmarking Complex Instruction-guided Image Editing | arXiv | 2025/05/18 | - |
| GEdit-Bench | Step1X-Edit: A Practical Framework for General Image Editing | arXiv | 2025/04/28 | GitHub |
| DreamBench++ | DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation | ICLR | 2025/03/09 | GitHub |
| T2I-CompBench++ | T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation | TPAMI | 2025/03/08 | GitHub |
| IE-Bench | IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment | arXiv | 2025/01/17 | - |
| AnyEdit | AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea | CVPR | 2024/11/24 | GitHub |
| I2EBench | I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing | NeurIPS | 2024/08/26 | GitHub |
| ConceptMix | ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty | NeurIPS | 2024/08/26 | GitHub |
| GenAI-Bench | GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation | CVPR | 2024/06/19 | GitHub |
| Commonsense-T2I | Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? | COLM | 2024/06/11 | GitHub |
| HQ-Edit | HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing | ICLR | 2024/04/15 | GitHub |
| VQAScore | Evaluating Text-to-Visual Generation with Image-to-Text Generation | ECCV | 2024/04/01 | GitHub |
| FlashEval | FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models | CVPR | 2024/03/25 | GitHub |
| DPG-Bench | ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | arXiv | 2024/03/08 | GitHub |
| Reason-Edit | SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models | CVPR | 2023/12/11 | GitHub |
| Emu Edit | Emu Edit: Precise Image Editing via Recognition and Generation Tasks | CVPR | 2023/11/16 | HuggingFace |
| HEIM | Holistic Evaluation of Text-To-Image Models | NeurIPS | 2023/11/07 | GitHub |
| DSG-1k | Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation | ICLR | 2023/10/27 | GitHub |
| GenEval | GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment | NeurIPS | 2023/10/17 | GitHub |
| EditVal | EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods | arXiv | 2023/10/03 | GitHub |
| T2I-CompBench | T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation | NeurIPS | 2023/07/12 | GitHub |
| DreamSim | DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data | NeurIPS | 2023/06/15 | GitHub |
| MagicBrush | MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing | NeurIPS | 2023/06/16 | GitHub |
| MultiGen-20M | UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild | NeurIPS | 2023/05/18 | GitHub |
| HRS-Bench | HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models | ICCV | 2023/04/11 | GitHub |
| TIFA | TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | ICCV | 2023/03/21 | GitHub |
| EditBench | Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting | CVPR | 2022/12/13 | ProjectPage |
| PartiPrompts | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | TMLR | 2022/06/22 | GitHub |
| DrawBench | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | NeurIPS | 2022/05/23 | ProjectPage |
| PaintSkills | DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models | ICCV | 2022/02/08 | GitHub |

Benchmarks on Interleaved / Compositional / Other Tasks

| Name | Paper | Venue | Date | Code |
| --- | --- | --- | --- | --- |
| VTBench | VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation | arXiv | 2025/05/19 | GitHub |
| UniBench | UniEval: Unified Holistic Evaluation for Unified Multimodal Understanding and Generation | arXiv | 2025/05/15 | GitHub |
| OpenING | OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation | CVPR | 2024/11/27 | GitHub |
| ISG | Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment | ICLR | 2024/11/26 | GitHub |
| MMIE | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | ICLR | 2024/10/14 | GitHub |
| InterleavedBench | Holistic Evaluation for Interleaved Text-and-Image Generation | EMNLP | 2024/06/20 | HuggingFace |
| OpenLEAF | OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation | MM | 2023/10/01 | - |

Datasets

Multimodal Understanding

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| Infinity-MM | 40M | Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data | arXiv | 2024/10/24 |
| LLaVA-OneVision | 4.8M | LLaVA-OneVision: Easy Visual Task Transfer | TMLR | 2024/08/06 |
| Cambrian-10M | 10M | Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | NeurIPS | 2024/06/24 |
| ShareGPT4V | 100K | Sharegpt4v: Improving large multi-modal models with better captions | ECCV | 2023/11/21 |
| CapsFusion-120M | 120M | Capsfusion: Rethinking image-text data at scale | CVPR | 2023/10/31 |
| GRIT | 20M | Kosmos-2: Grounding multimodal large language models to the world | ICLR | 2023/06/26 |
| DataComp | 1.4B | DATACOMP: In search of the next generation of multimodal datasets | NeurIPS | 2023/04/27 |
| Laion-COCO | 600M | Laion coco: 600m synthetic captions from laion2b-en | - | 2022/09/15 |
| COYO | 747M | Coyo-700m: Image-text pair dataset | - | 2022/08/31 |
| Laion | 5.9B | Laion-5b: An open large-scale dataset for training next generation image-text models | NeurIPS | 2022/03/31 |
| Wukong | 100M | Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark | NeurIPS | 2022/02/14 |
| RedCaps | 12M | Redcaps: Web-curated image-text data created by the people, for the people | NeurIPS | 2021/11/22 |

Text-to-Image

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| BLIP3o-60k | 60K | BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset | arXiv | 2025/05/14 |
| TextAtlas5M | 5M | TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation | arXiv | 2025/02/11 |
| EliGen TrainSet | 500K | EliGen: Entity-Level Controlled Image Generation with Regional Attention | arXiv | 2025/01/02 |
| PD12M | 12M | Public domain 12m: A highly aesthetic image-text dataset with novel governance mechanisms | arXiv | 2024/10/30 |
| SFHQ-T2I | 122K | - | - | 2024/10/06 |
| text-to-image-2M | 2M | - | - | 2024/09/13 |
| DenseFusion | 1M | Densefusion-1m: Merging vision experts for comprehensive multimodal perception | NeurIPS | 2024/07/11 |
| Megalith | 10M | - | - | 2024/07/01 |
| PixelProse | 16M | From pixels to prose: A large dataset of dense image captions | arXiv | 2024/06/14 |
| CosmicMan-HQ 1.0 | 6M | Cosmicman: A text-to-image foundation model for humans | CVPR | 2024/04/01 |
| AnyWord-3M | 3M | Anytext: Multilingual visual text generation and editing | ICLR | 2023/11/06 |
| JourneyDB | 4M | JourneyDB: A Benchmark for Generative Image Understanding | NeurIPS | 2023/07/03 |
| RenderedText | 12M | - | - | 2023/06/30 |
| Mario-10M | 10M | Textdiffuser: Diffusion models as text painters | NeurIPS | 2023/05/18 |
| SAM | 11M | Segment Anything | ICCV | 2023/04/05 |
| LAION-Aesthetics | 120M | Laion-5b: An open large-scale dataset for training next generation image-text models | NeurIPS | 2022/08/16 |
| CC-12M | 12M | Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts | CVPR | 2021/02/17 |

Image Editing

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| AnyEdit | 2.5M | Anyedit: Mastering unified high-quality image editing for any idea | CVPR | 2024/11/24 |
| OmniEdit | 1.2M | Omniedit: Building image editing generalist models through specialist supervision | ICLR | 2024/11/11 |
| UltraEdit | 4M | Ultraedit: Instruction-based fine-grained image editing at scale | NeurIPS | 2024/07/07 |
| SEED-Data-Edit | 3.7M | Seed-data-edit technical report: A hybrid dataset for instructional image editing | arXiv | 2024/05/07 |
| HQ-Edit | 197K | Hq-edit: A high-quality dataset for instruction-based image editing | arXiv | 2024/04/15 |
| Magicbrush | 10K | Magicbrush: A manually annotated dataset for instruction-guided image editing | NeurIPS | 2023/06/16 |
| InstructP2P | 313K | Instructpix2pix: Learning to follow image editing instructions | CVPR | 2022/11/17 |

Interleaved Image-Text

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| CoMM | 227K | Comm: A coherent interleaved image-text dataset for multimodal understanding and generation | CVPR | 2024/06/15 |
| OBELICS | 141M | Obelics: An open web-scale filtered dataset of interleaved image-text documents | NeurIPS | 2023/06/21 |
| Multimodal C4 | 101.2M | Multimodal c4: An open, billion-scale corpus of images interleaved with text | NeurIPS | 2023/04/14 |

Other Text-Image-to-Image

| Dataset | Samples | Paper | Venue | Date |
| --- | --- | --- | --- | --- |
| SynCD | 95K | Generating multi-image synthetic data for text-to-image customization | arXiv | 2025/02/03 |
| X2I-subject-driven | 2.5M | OmniGen: Unified Image Generation | arXiv | 2024/12/14 |
| Subjects200K | 200K | Ominicontrol: Minimal and universal control for diffusion transformer | arXiv | 2024/11/22 |
| MultiGen-20M | 20M | Unicontrol: A unified diffusion model for controllable visual generation in the wild | NeurIPS | 2023/05/18 |
| LAION-Face | 50M | General facial representation learning in a visual-linguistic manner | CVPR | 2021/12/06 |

Applications and Opportunities

| Name | Title | Venue | Date | Code | Demo |
| --- | --- | --- | --- | --- | --- |
| UniCTokens | UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens | arXiv | 2025/05/20 | GitHub | - |
| Fair-UMLLM | On Fairness of Unified Multimodal Large Language Model for Image Generation | arXiv | 2025/02/05 | - | - |
| T2I-R1 | T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | arXiv | 2025/01/29 | GitHub | - |

Citation

If you find this repo helpful for your research, please cite our paper:

@article{zhang2025unified,
  title={Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities},
  author={Zhang, Xinjie and Guo, Jintao and Zhao, Shanshan and Fu, Minghao and Duan, Lunhao and Wang, Guo-Hua and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu},
  journal={arXiv preprint arXiv:2505.02567},
  year={2025}
}