A simple command-line tool for text-to-image generation, using OpenAI's CLIP and a BigGAN (see the sketch after this list). The technique was originally created by https://twitter.com/advadnoun
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to tackle any computer task through strong reasoning, self-improvement, and skill curation, in a standardized general environment with minimal requirements.
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
A Comparative Framework for Multimodal Recommender Systems
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"
Automated modeling and machine learning framework FEDOT
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
LLM2CLIP makes a SOTA pretrained CLIP model even stronger.
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
X-VLM: Multi-Grained Vision Language Pre-Training (ICML 2022)
A CLI tool / Python module for generating images from text using guided diffusion and CLIP from OpenAI.
Towards Generalist Biomedical AI
A knowledge base construction engine for richly formatted data
DANCE: a deep learning library and benchmark platform for single-cell analysis
[ICLR 2025] This is the official repository of our paper "MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine"
An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
An open-source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multimodal AI that uses just a decoder to generate both text and images
Attention-based multimodal fusion for sentiment analysis
My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"
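Several of the tools above (the CLIP+BigGAN generator at the top of the list and the CLIP-guided diffusion CLI) share one core technique: optimize a generator's input so that the rendered image's CLIP embedding matches the CLIP embedding of a text prompt. Below is a minimal sketch of that loop, assuming the `clip` and `pytorch-pretrained-biggan` packages; the prompt, learning rate, and step count are illustrative and not taken from any of the repositories above.

```python
import torch
import torch.nn.functional as F

import clip  # OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git
from pytorch_pretrained_biggan import BigGAN  # pip install pytorch-pretrained-biggan

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen pretrained models: CLIP scores image/text similarity, BigGAN renders images.
clip_model, _ = clip.load("ViT-B/32", device=device)
gan = BigGAN.from_pretrained("biggan-deep-256").to(device).eval()

# Encode the prompt once; the text embedding is the fixed optimization target.
tokens = clip.tokenize(["a painting of a lighthouse at night"]).to(device)
with torch.no_grad():
    text_emb = F.normalize(clip_model.encode_text(tokens), dim=-1)

# The only trainable parameters are BigGAN's inputs: a latent vector and
# class logits (softmaxed into a soft one-hot over the 1000 ImageNet classes).
latent = torch.randn(1, 128, device=device, requires_grad=True)
class_logits = torch.zeros(1, 1000, device=device, requires_grad=True)
opt = torch.optim.Adam([latent, class_logits], lr=0.05)

# CLIP's input normalization constants.
mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(200):
    opt.zero_grad()
    img = gan(latent, torch.softmax(class_logits, dim=-1), truncation=1.0)
    img = (img.clamp(-1, 1) + 1) / 2                     # BigGAN outputs in [-1, 1]
    img = F.interpolate(img, size=224, mode="bilinear")  # CLIP ViT-B/32 expects 224x224
    img_emb = F.normalize(clip_model.encode_image((img - mean) / std), dim=-1)
    loss = -(img_emb * text_emb).sum()                   # negative cosine similarity
    loss.backward()
    opt.step()
```

The CLIP-guided diffusion variant applies the same similarity signal to steer a diffusion sampler's denoising steps instead of a GAN's latent vector, but the objective is identical.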