RoboBrain 2.0: Advanced version of RoboBrain. See Better. Think Harder. Do Smarter. 🎉🎉🎉
Project page for "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
UI-Venus is a native UI agent designed to perform precise GUI element grounding and effective navigation using only screenshots as input.
[ACL 2025] The code repository for "Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning" in PyTorch.
Toward Universal Multimodal Embedding
Build a simple, basic multimodal large model from scratch. 🤖
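The blurb above describes the common from-scratch recipe for a basic multimodal LLM: a vision encoder whose features are projected into a language model's embedding space and prepended to the text tokens. Below is a minimal sketch of that pattern only, assuming a ViT-style encoder that returns patch features of shape (B, N, vision_dim) and a Hugging Face-style causal LM; the class and argument names are my own illustration, not the repo's code.

```python
import torch
import torch.nn as nn

class TinyMultimodalLM(nn.Module):
    """LLaVA-style sketch: frozen vision encoder -> linear projector -> causal LM."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder               # e.g. a ViT, kept frozen
        self.projector = nn.Linear(vision_dim, text_dim)   # maps image features into text space
        self.language_model = language_model               # any HF-style causal LM

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
        # Encode the image into patch features: (B, N, vision_dim)
        with torch.no_grad():
            image_feats = self.vision_encoder(pixel_values)
        # Project into the LM embedding space and prepend to the token embeddings
        image_embeds = self.projector(image_feats)                      # (B, N, text_dim)
        text_embeds = self.language_model.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)   # (B, N+T, text_dim)
        return self.language_model(inputs_embeds=inputs_embeds)
```

Training usually updates only the projector at first, then unfreezes the LM for instruction tuning; that staging is a common choice, not something the blurb specifies.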
Official implementation of CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
Multi-Modal Healthcare Assistant
ElaMath is a smart, voice-enabled math assistant that helps students solve and understand math problems from both spoken questions and images. It is powered by the multimodal meta-llama/llama-4-scout-17b-16e-instruct model via the Groq API, combined with Whisper for speech recognition and ElevenLabs/gTTS for natural voice responses.
Elarova: a smart, multimodal research assistant that helps students by combining speech, text, and other input modes for efficient academic research and study support. It is powered by state-of-the-art speech recognition, text-to-speech, and AI models, including meta-llama/llama-4-scout-17b-16e-instruct, behind an easy-to-use Gradio web interface. A sketch of the voice pipeline both of these assistants describe follows below.
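ElaMath and Elarova share the same speech-in, LLM, speech-out pattern: Whisper transcribes the question, the llama-4-scout model answers it via Groq's OpenAI-style chat API, and gTTS speaks the reply. This is a hedged sketch of that wiring using the public `openai-whisper`, `groq`, and `gTTS` packages; the exact glue in either project may differ.

```python
import whisper                      # pip install openai-whisper
from groq import Groq               # pip install groq
from gtts import gTTS               # pip install gTTS

asr = whisper.load_model("base")    # speech-to-text model
client = Groq()                     # reads GROQ_API_KEY from the environment

def answer_spoken_question(audio_path: str, out_mp3: str = "answer.mp3") -> str:
    # 1. Transcribe the student's spoken question.
    question = asr.transcribe(audio_path)["text"]
    # 2. Ask the multimodal model via Groq's chat-completions API.
    reply = client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    # 3. Synthesize the answer as speech.
    gTTS(reply).save(out_mp3)
    return reply
```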
This repo contains an integration of LangChain with the Google Gemini LLM.
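For context, the usual way to wire LangChain to Gemini is the `langchain-google-genai` package; the minimal sketch below shows that pattern, with the model name and prompt as examples rather than anything taken from this repo.

```python
from langchain_google_genai import ChatGoogleGenerativeAI  # pip install langchain-google-genai

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")  # requires GOOGLE_API_KEY to be set
response = llm.invoke("Summarize multimodal large language models in one sentence.")
print(response.content)
```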