UnifoLM-WMA-0: A World-Model-Action (WMA) Framework under UnifoLM Family

Project Page | Models | Dataset

UnifoLM-WMA-0 is Unitree's open-source world-model-action architecture spanning multiple types of robotic embodiments, designed for general-purpose robot learning. Its core component is a world model capable of understanding the physical interactions between robots and their environments. This world model provides two key functions: (a) Simulation Engine: it operates as an interactive simulator to generate synthetic data for robot learning; (b) Policy Enhancement: it connects with an action head and, by predicting future interactions with the world model, further optimizes decision-making performance.

🦾 Real-Robot Demonstrations

Note: the top-right window shows the world model's prediction of future action videos.

🔥 News

  • Sep 22, 2025: 🚀 We released the deployment code for running experiments with Unitree robots.
  • Sep 15, 2025: 🚀 We released the training and inference code along with the model weights of UnifoLM-WMA-0.

📑 Open-Source Plan

  • Training
  • Inference
  • Checkpoints
  • Deployment

βš™οΈ Installation

conda create -n unifolm-wma python==3.10.18
conda activate unifolm-wma

conda install pinocchio=3.2.0 -c conda-forge -y
conda install ffmpeg=7.1.1 -c conda-forge

git clone --recurse-submodules https://github.com/unitreerobotics/unifolm-world-model-action.git

# If you already downloaded the repo:
cd unifolm-world-model-action
git submodule update --init --recursive

pip install -e .

cd external/dlimp
pip install -e .
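
Optionally, run a quick sanity check of the environment. This is only a minimal sketch; it assumes PyTorch is installed as a dependency of the package, and uses the unitree_worldmodel package name from the src/ layout shown in the Codebase Architecture section below:

# sanity_check.py -- quick check that the environment is usable (sketch only).
import torch                 # assumed dependency of the world-model package
import unitree_worldmodel    # core package installed by `pip install -e .`

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("unitree_worldmodel loaded from:", unitree_worldmodel.__file__)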

🧰 Model Checkpoints

| Model | Description | Link |
| --- | --- | --- |
| $\text{UnifoLM-WMA-0}_{Base}$ | Fine-tuned on the Open-X dataset. | HuggingFace |
| $\text{UnifoLM-WMA-0}_{Dual}$ | Fine-tuned on five Unitree open-source datasets in both decision-making and simulation modes. | HuggingFace |

πŸ›’οΈ Dataset

In our experiments, we use the following five open-source datasets:

| Dataset | Robot | Link |
| --- | --- | --- |
| Z1_StackBox | Unitree Z1 | HuggingFace |
| Z1_DualArm_StackBox | Unitree Z1 | HuggingFace |
| Z1_DualArm_StackBox_V2 | Unitree Z1 | HuggingFace |
| Z1_DualArm_Cleanup_Pencils | Unitree Z1 | HuggingFace |
| G1_Pack_Camera | Unitree G1 | HuggingFace |

To train on your own dataset, first make sure the data follows the Hugging Face LeRobot V2.1 dataset format. Assume the dataset's source directory structure is as follows:

source_dir/
    ├── dataset1_name
    ├── dataset2_name
    ├── dataset3_name
    └── ...

Then, convert a dataset to the required format using the command below:

cd prepare_data
python prepare_training_data.py \
    --source_dir /path/to/your/source_dir \
    --target_dir /path/to/save/the/converted/data \
    --dataset_name "dataset1_name" \
    --robot_name "a tag of the robot in the dataset"  # e.g., Unitree Z1 Robot Arm or Unitree G1 Robot with Gripper

The resulting data structure is shown below. (Note: model training only supports input from the main-view camera; if the dataset includes multiple views, remove the corresponding values from the data_dir column in the CSV file, as in the sketch after the directory tree.)

target_dir/
    ├── videos
    │    ├── dataset1_name
    │    │    ├── camera_view_dir
    │    │    │    ├── 0.mp4
    │    │    │    ├── 1.mp4
    │    │    │    └── ...
    │    └── ...
    ├── transitions
    │    ├── dataset1_name
    │    │    ├── meta_data
    │    │    ├── 0.h5
    │    │    ├── 1.h5
    │    │    └── ...
    └── dataset1_name.csv
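
If a dataset contains multiple camera views, a small script along the lines of the sketch below can prune the CSV. It is only a sketch: it assumes the data_dir column values contain the camera-view directory name (adjust MAIN_VIEW to your dataset):

# filter_main_view.py -- keep only main-view rows in a converted dataset CSV (sketch).
# Assumption: the CSV written by prepare_training_data.py has a `data_dir` column
# whose values include the camera-view directory name.
import pandas as pd

CSV_PATH = "/path/to/target_dir/dataset1_name.csv"  # converted dataset CSV
MAIN_VIEW = "camera_view_dir"                       # name of your main-view directory

df = pd.read_csv(CSV_PATH)
kept = df[df["data_dir"].astype(str).str.contains(MAIN_VIEW)]
kept.to_csv(CSV_PATH, index=False)
print(f"Kept {len(kept)} of {len(df)} rows for view '{MAIN_VIEW}'.")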

πŸš΄β€β™‚οΈ Training

A. Our training strategy is outlined as follows:

  • Step 1: Fine-tune a video generation model as the world model using the Open-X dataset;
  • Step 2: Post-train $\text{UnifoLM-WMA}$ in decision-making mode on the downstream task dataset;
  • Step 3: Post-train $\text{UnifoLM-WMA}$ in simulation mode on the downstream task dataset.

Note: If you only require $\text{UnifoLM-WMA}$ to operate in a single mode, you may skip the corresponding step.

B. To train on one or more datasets, follow the steps below:

  • Step 1: The maximum number of DoFs is assumed to be 16; if your robot has more than 16 DoFs, update agent_state_dim and agent_action_dim in configs/train/config.yaml;
  • Step 2: Set up the input shapes for each modality in configs/train/meta.json;
  • Step 3: Configure the training parameters in configs/train/config.yaml. For pretrained_checkpoint, we recommend the $\text{UnifoLM-WMA-0}_{Base}$ checkpoint fine-tuned on the Open-X dataset (a sanity-check sketch for this config follows after Step 5);
    model:
        pretrained_checkpoint: /path/to/pretrained/checkpoint;
        ...
        decision_making_only: True # Train the world model only in decision-making mode. If False, jointly train it in both decision-making and simulation modes.
        ...
    data:
        ...
        train:
            ...
            data_dir: /path/to/training/dataset/directory
        dataset_and_weights: # list the name of each dataset below and make sure the summation of weights is 1.0
            dataset1_name: 0.2
            dataset2_name: 0.2
            dataset3_name: 0.2
            dataset4_name: 0.2
            dataset5_name: 0.2
  • Step 4: Set the experiment_name and save_root variables in scripts/train.sh;
  • Step 5: Launch the training with the command:
bash scripts/train.sh
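
Before launching, a quick sanity check of configs/train/config.yaml can catch a common mistake (dataset weights that do not sum to 1.0). This is only a sketch; it searches for the dataset_and_weights mapping wherever it sits in the parsed config:

# check_train_config.py -- sketch: verify dataset weights in config.yaml sum to 1.0.
import yaml

def find_key(node, key):
    """Depth-first search for `key` inside nested dicts/lists."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

with open("configs/train/config.yaml") as f:
    cfg = yaml.safe_load(f)

weights = find_key(cfg, "dataset_and_weights")
assert weights, "dataset_and_weights not found in config"
total = sum(weights.values())
print(f"{len(weights)} dataset(s), weights sum to {total:.4f}")
assert abs(total - 1.0) < 1e-6, "dataset weights must sum to 1.0"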

🌏 Inference under Interactive Simulation Mode

To run the world model in an interactive simulation mode, follow these steps:

  • Step 1: (Skip this step if you just want to test with the provided examples.) Prepare your own prompts following the format used in examples/world_model_interaction_prompts; a layout-check sketch follows after these steps:
    world_model_interaction_prompts/
      ├── images
      │    ├── dataset1_name
      │    │       ├── 0.png     # Image prompt
      │    │       └── ...
      │    └── ...
      ├── transitions
      │    ├── dataset1_name
      │    │       ├── meta_data # Used for normalization
      │    │       ├── 0.h5      # Robot state and action data; in interaction mode,
      │    │       │             # only used to retrieve the robot state corresponding
      │    │       │             # to the image prompt
      │    │       └── ...
      │    └── ...
      ├── dataset1_name.csv      # File for loading image prompts, text instructions and corresponding robot states
      └── ...
    
  • Step 2: Specify the correct paths for pretrained_checkpoint (e.g., $\text{UnifoLM-WMA-0}_{Dual}$) and data_dir in configs/inference/world_model_interaction.yaml;
  • Step 3: Set the paths for checkpoint, res_dir and prompt_dir in scripts/run_world_model_interaction.sh, and specify all dataset names in datasets=(...). Then launch the inference with the command:
    bash scripts/run_world_model_interaction.sh
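
When preparing your own prompts (Step 1), a minimal sketch such as the one below can verify that the prompt directory matches the expected layout; the file patterns follow the example tree above:

# check_prompt_dir.py -- sketch: verify a prompt directory matches the layout above.
from pathlib import Path

prompt_dir = Path("examples/world_model_interaction_prompts")  # or your own prompt directory

for csv_file in sorted(prompt_dir.glob("*.csv")):
    name = csv_file.stem                                 # e.g. dataset1_name
    images = list((prompt_dir / "images" / name).glob("*.png"))
    trans = prompt_dir / "transitions" / name
    h5s = list(trans.glob("*.h5"))
    meta_ok = (trans / "meta_data").exists()
    print(f"{name}: {len(images)} image prompt(s), {len(h5s)} transition file(s), "
          f"meta_data {'found' if meta_ok else 'MISSING'}")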
    

🧠 Inference and Deployment under Decision-Making Mode

In this setup, inference runs on a server, while a robot client gathers observations from the real robot and sends them to the server to query actions. The process unfolds through the following steps:

Server Setup:

conda activate unifolm-wma
cd unifolm-world-model-action
bash scripts/run_real_eval_server.sh

Client Setup:

  • Step-1: Follow the instructions in unitree_deploy/README.md to create the unitree_deploy conda environment, install the required packages, and launch the controllers or services on the real robot.
  • Step-2: Open a new terminal and establish a tunnel connection from the client to the server:
ssh user_name@remote_server_IP -CNg -L 8000:127.0.0.1:8000
  • Step-3: Run the robot_client.py script to start inference (a conceptual sketch of the client control loop follows after the command):
cd unitree_deploy
python scripts/robot_client.py --robot_type "g1_dex1" --action_horizon 16 --exe_steps 16 --observation_horizon 2 --language_instruction "pack black camera into box" --output_dir ./results --control_freq 15
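
Conceptually, the flags above define an action-chunking loop: the client keeps the last observation_horizon observations, requests an action_horizon-step chunk from the server, executes exe_steps of it at control_freq Hz, and then queries again. The sketch below illustrates only this timing; query_actions, get_observation, and apply_action are hypothetical placeholders, and the actual implementation is unitree_deploy/scripts/robot_client.py:

# Conceptual sketch of the client-side control loop implied by the flags above.
import time
from collections import deque

ACTION_HORIZON = 16        # actions returned per server query
EXE_STEPS = 16             # how many of those actions are executed before re-querying
OBSERVATION_HORIZON = 2    # observations sent with each query
CONTROL_FREQ = 15          # Hz

def get_observation():     # placeholder: read camera images and robot state
    raise NotImplementedError

def query_actions(obs):    # placeholder: send observations to the inference server
    raise NotImplementedError

def apply_action(action):  # placeholder: send one action to the robot controller
    raise NotImplementedError

obs_buffer = deque(maxlen=OBSERVATION_HORIZON)
period = 1.0 / CONTROL_FREQ

while True:
    obs_buffer.append(get_observation())
    chunk = query_actions(list(obs_buffer))     # expected length: ACTION_HORIZON
    for action in chunk[:EXE_STEPS]:
        apply_action(action)
        time.sleep(period)                      # hold the control frequency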

πŸ“ Codebase Architecture

Here's a high-level overview of the project's code structure and core components:

unitree-world-model/
    ├── assets                      # Media assets such as GIFs, images, and demo videos
    ├── configs                     # Configuration files for training and inference
    │    ├── inference
    │    └── train
    ├── examples                    # Example inputs and prompts for running inference
    ├── external                    # External packages
    ├── prepare_data                # Scripts for dataset preprocessing and format conversion
    ├── scripts                     # Main scripts for training, evaluation, and deployment
    ├── src
    │    ├── unitree_worldmodel     # Core Python package for the Unitree world model
    │    │      ├── data            # Dataset loading, transformations, and dataloaders
    │    │      ├── models          # Model architectures and backbone definitions
    │    │      ├── modules         # Custom model modules and components
    │    │      └── utils           # Utility functions and common helpers
    └── unitree_deploy              # Deployment code

πŸ™ Acknowledgement

Much of the code is inherited from DynamiCrafter, Diffusion Policy, ACT, and HPT.

πŸ“ Citation

@misc{unifolm-wma-0,
  author       = {Unitree},
  title        = {UnifoLM-WMA-0: A World-Model-Action (WMA) Framework under UnifoLM Family},
  year         = {2025},
}
