InternVLA-M1 is an open-source, spatially grounded foundation model for generalist robot policies.
Demo video: `demo_mini.mp4`
- Modular & Extensible: All core components (model architecture, training data, training strategies, evaluation pipeline) are fully decoupled, enabling independent development, debugging, and extension of each module.
- Dual-System and Dual-Supervision: InternVLA-M1 integrates a language head and an action head under a unified framework, enabling collaborative training with dual supervision (a minimal loss sketch follows the lists below).
- Efficient Training & Fast Convergence: Learns spatial and visual priors from large-scale multimodal pretraining and transfers them via spatial prompt fine-tuning, achieving strong performance (e.g., SOTA-level convergence in ~2.5 epochs without separate action pretraining).
InternVLA-M1 is intended for:
- Users who want to leverage open-source VLMs (e.g., Qwen2.5-VL) for robot control.
- Teams co-training action datasets jointly with multimodal (vision-language) data.
- Researchers exploring alternative VLA architectures and training strategies.
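Below is a minimal sketch of what the dual supervision described above can look like in training code. The names `lang_logits`, `action_pred`, and the weight `lambda_action` are illustrative assumptions, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def dual_supervision_loss(lang_logits, lang_labels, action_pred, action_target,
                          lambda_action=1.0):
    """Illustrative combination of language-head and action-head objectives."""
    # Language head: token-level cross-entropy over the VLM vocabulary.
    lang_loss = F.cross_entropy(
        lang_logits.reshape(-1, lang_logits.size(-1)),
        lang_labels.reshape(-1),
        ignore_index=-100,  # ignore padding / non-supervised tokens
    )
    # Action head: regression on the predicted action chunk.
    action_loss = F.mse_loss(action_pred, action_target)
    return lang_loss + lambda_action * action_loss
```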
| Model | WidowX | Google Robot (VA) | Google Robot (VM) | LIBERO |
|---|---|---|---|---|
|  | 27.1 | 54.8 | 58.8 | 94.2 |
| GR00T | 61.9 | 44.5 | 35.2 | 93.9 |
| InternVLA-M1 | 71.7 | 76.0 | 80.7 | 95.9 |
```bash
# Clone the repo
git clone https://github.com/InternRobotics/InternVLA-M1
cd InternVLA-M1

# Create conda environment
conda create -n internvla-m1 python=3.10 -y
conda activate internvla-m1

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install InternVLA-M1
pip install -e .
```
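To confirm the installation, the main model class used in the demos below should import cleanly (a minimal check):

```python
# Quick sanity check after `pip install -e .`
from InternVLA.model.framework.M1 import InternVLA_M1

print(InternVLA_M1)  # should print the class without import errors
```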
Below are two collapsible examples: InternVLA-M1 chat and action prediction.
InternVLA-M1 Chat Demo (image Q&A / Spatial Grounding)
```python
from InternVLA.model.framework.M1 import InternVLA_M1
from PIL import Image
import requests
from io import BytesIO
import torch


def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    return img


saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)

# Use the raw image link for direct download
image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
image = load_image_from_url(image_url)

question = "Give the bounding box for the apple."
response = internVLA_M1.chat_with_M1(image, question)
print(response)
```
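The chat reply is returned as text. If it contains box coordinates, they can be pulled out with a small parser; the sketch below assumes the reply embeds four integers, which may differ from the model's actual output format:

```python
import re

# Assumes the reply contains four integers such as "[x1, y1, x2, y2]";
# adjust the parsing to whatever format InternVLA-M1 actually returns.
numbers = [int(n) for n in re.findall(r"-?\d+", str(response))]
if len(numbers) >= 4:
    x1, y1, x2, y2 = numbers[:4]
    print("Parsed bounding box:", (x1, y1, x2, y2))
else:
    print("No box coordinates found in:", response)
```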
InternVLA-M1 Action Prediction Demo (two views)
```python
from InternVLA.model.framework.M1 import InternVLA_M1
from PIL import Image
import requests
from io import BytesIO
import torch


def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    return img


saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)

image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
view1 = load_image_from_url(image_url)
view2 = view1.copy()

# Construct input: batch size = 1, two views
batch_images = [[view1, view2]]  # List[List[PIL.Image]]
instructions = ["Pick up the apple and place it on the plate."]

if torch.cuda.is_available():
    internVLA_M1 = internVLA_M1.to("cuda")

pred = internVLA_M1.predict_action(
    batch_images=batch_images,
    instructions=instructions,
    cfg_scale=1.5,
    use_ddim=True,
    num_ddim_steps=10,
)
normalized_actions = pred["normalized_actions"]  # [B, T, action_dim]
print(normalized_actions.shape, type(normalized_actions))
```
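`predict_action` returns normalized actions. Before commanding a robot they typically need to be mapped back to native joint or end-effector ranges; the sketch below assumes a [-1, 1] normalization with per-dimension bounds, and `action_low` / `action_high` are placeholders you would take from your own dataset statistics:

```python
import numpy as np

def unnormalize_actions(normalized_actions, action_low, action_high):
    # Map each action dimension from [-1, 1] back to its native range.
    # If `normalized_actions` is a torch tensor, convert with .cpu().numpy() first.
    a = np.asarray(normalized_actions, dtype=np.float32)
    low = np.asarray(action_low, dtype=np.float32)
    high = np.asarray(action_high, dtype=np.float32)
    return 0.5 * (a + 1.0) * (high - low) + low

# Placeholder 7-DoF bounds; replace with the statistics used during training.
action_low = np.full(7, -1.0, dtype=np.float32)
action_high = np.full(7, 1.0, dtype=np.float32)
actions = unnormalize_actions(normalized_actions, action_low, action_high)
```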
We provide several end-to-end examples for reference:
- Reproduce InternVLA-M1 in SimplerEnv: Example
- Training/deployment on real robots: Example
- Extending InternVLA-M1: Example
We release a series of pretrained models and checkpoints to facilitate reproduction and downstream use.
| Model | Description | Link |
|---|---|---|
| InternVLA-M1 | Main pretrained model | 🤗 Hugging Face |
| InternVLA-M1-Pretrain-RT-1-Bridge | Pretraining on RT-1 Bridge data | 🤗 Hugging Face |
| InternVLA-M1-LIBERO-Long | Fine-tuned on LIBERO long-horizon tasks | 🤗 Hugging Face |
| InternVLA-M1-LIBERO-Goal | Fine-tuned on LIBERO goal-conditioned tasks | 🤗 Hugging Face |
| InternVLA-M1-LIBERO-Spatial | Fine-tuned on LIBERO spatial reasoning tasks | 🤗 Hugging Face |
| InternVLA-M1-LIBERO-Object | Fine-tuned on LIBERO object-centric tasks | 🤗 Hugging Face |
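If the checkpoints are hosted as standard Hugging Face repositories, they can be fetched with `huggingface_hub`; the repo ID below is a placeholder, so substitute the actual repository linked in the table:

```python
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<org>/InternVLA-M1",          # placeholder repo ID
    local_dir="checkpoints/InternVLA-M1",  # where to store the files
)
print("Checkpoint downloaded to:", local_path)
```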
- [x] Release model weights
- [ ] Add multi-task mixed training examples
- [ ] Unify evaluation scripts and metrics
We welcome contributions via Pull Requests or Issues. Please include detailed logs and reproduction steps when reporting bugs.
If you find this useful in your research, please consider citing:
```bibtex
@misc{internvla2024,
  title     = {InternVLA-M1: Latent Spatial Grounding for Instruction-Following Robotic Manipulation},
  author    = {InternVLA-M1 Contributors},
  year      = {2025},
  booktitle = {arXiv},
}
```
- Issues: Submit via GitHub Issues with detailed logs and steps
We thank the open-source community for their inspiring work. This project builds upon and is inspired by the following projects (alphabetical order):
- IPEC-COMMUNITY: Curated OXE / LIBERO style multi-task datasets and formatting examples.
- Isaac-GR00T: Standardized action data loader (GR00T-LeRobot).
- Qwen2.5-VL: Multimodal input/output format, data loader, and pretrained VLM backbone.
- CogACT: Reference for a DiT-style action head design.
- llavavla: Baseline code structure and engineering design references.
- GenManip Simulation Platform: Simulation platform for generalizable pick-and-place based on Isaac Sim.
Notes:
- If any required attribution or license header is missing, please open an issue and we will correct it promptly.
- All third-party resources remain under their original licenses; users should comply with respective terms.
Thanks for using InternVLA-M1! If you find it useful, please consider giving us a ⭐ on GitHub.