LUMOS: Language-Conditioned Imitation Learning with World Models

Abstract

We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates the policy-induced distribution shift from which most offline imitation learning methods suffer. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image- and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms comparable prior learning-based methods on chained multi-task evaluations. To the best of our knowledge, we are the first to learn language-conditioned continuous visuomotor control for a real-world robot within an offline world model.

Figure 1: Overview of language-conditioned multi-task imitation learning via LUMOS.

Approach

Training consists of two phases:

1. A world model is learned from the unlabeled play dataset.

2. An actor-critic agent is trained within this learned world model to acquire a goal-conditioned policy by guiding imagined sequences of latent model states to match the latent trajectory of the expert demonstrations.

During inference, the language-conditioned policy, trained entirely in the latent space of the world model, successfully transfers to the real environment.
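
To make these two phases concrete, the following is a minimal Python sketch of the training loop, assuming a Dreamer-style world-model interface. Every name here (WorldModel, ActorCritic, imagine, latent_reward) is an illustrative placeholder rather than the authors' code, and the reward is a stand-in for the paper's latent-matching objective.

```python
# Illustrative two-phase training skeleton (assumed interface, not the
# authors' implementation).
import numpy as np


class WorldModel:
    """Phase 1: image encoder + RSSM dynamics + decoder, fit on play data."""

    def fit(self, play_windows):
        ...  # optimize reconstruction/dynamics losses on unlabeled play

    def encode(self, observations):
        # Toy stand-in: map a (T, ...) observation window to (T, D) latents.
        return np.zeros((len(observations), 32))

    def imagine(self, actor, start_latent, horizon):
        # Roll the policy out purely in latent space (on-policy imagination).
        return np.zeros((horizon, 32))


class ActorCritic:
    """Phase 2: goal-conditioned policy and critic trained in imagination."""

    def update(self, latents, rewards):
        ...  # e.g. actor-critic updates on imagined rollouts


def latent_reward(imagined, expert):
    # Placeholder intrinsic reward: negative latent distance to the expert
    # trajectory; the paper's exact reward form may differ.
    return -np.linalg.norm(imagined - expert, axis=-1)


def train(play_windows, horizon=16):
    wm = WorldModel()
    wm.fit(play_windows)                               # Phase 1
    agent = ActorCritic()
    for window in play_windows:                        # Phase 2
        expert = wm.encode(window)                     # expert latent trajectory
        horizon = min(horizon, len(expert))
        imagined = wm.imagine(agent, expert[0], horizon)
        agent.update(imagined, latent_reward(imagined, expert[:horizon]))
```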

Figure 2: LUMOS learns a language-guided general-purpose policy within the latent space of a world model. (1) The world model, comprising an image encoder, a Recurrent State-Space Model (RSSM) for dynamics, and an image decoder, transforms play-dataset experience into a predictive model that enables behavior learning in the latent state space. (2) The goal-conditioned policy samples latent trajectories and uses either a language annotation or the final latent state as the goal, while plan recognition and proposal networks are trained to identify and organize behaviors in a latent plan space. The action decoder is intrinsically rewarded for matching the expert's latent trajectory. (3) During inference, the policy acts on the latent state inferred by the world model from the current observation and is guided by the user's language command.
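
The goal-selection step in (2) can be sketched as follows. This is our reading of the caption with placeholder names, not the released code: windows that carry one of the sparse language annotations use the instruction as the goal, and all other windows fall back to hindsight relabeling with their final latent state.

```python
# A sketch (assumed, not the authors' released code) of goal selection with
# hindsight relabeling.
import numpy as np

def relabel_goal(window_latents, lang_embedding=None):
    """Return (goal, modality) for one sampled play-data window."""
    if lang_embedding is not None:        # <1% of windows carry annotations
        return lang_embedding, "language"
    return window_latents[-1], "image"    # hindsight: final state is the goal

# Toy usage: a (T, D) window of latent states without an annotation.
latents = np.zeros((16, 32))
goal, modality = relabel_goal(latents)
print(modality, goal.shape)               # -> image (32,)
```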

Experiments

We test LUMOS on real-world robotic tasks using a Franka Emika Panda arm in a 3D tabletop environment. The goal is to evaluate how well the robot understands and executes object manipulation tasks based on language instructions.

We collected a three-hour dataset by teleoperating the robot with a VR controller, guiding it through interactions with objects like a stove, a bowl, a cabinet, a carrot, and an eggplant. The robot's movements were recorded using RGB cameras, and only a small fraction of the data (less than 1%) was annotated with language descriptions.

Figure 3: Teleoperating the robot with a VR controller to collect the play dataset; observations are recorded with RGB cameras for real-world training.

We evaluated LUMOS’ ability to complete five-step instruction sequences, measuring how well it followed language commands on multi-stage long-horizon manipulation tasks.
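
A chained evaluation of this kind can be expressed as a short loop. The sketch below assumes a hypothetical `env`/`policy` interface and a per-task step budget, since the exact protocol details are not spelled out here.

```python
# Sketch of five-step chained evaluation (interface and step budget are
# assumptions, not the benchmark's exact protocol).
def evaluate_chain(policy, env, instructions, max_steps=360):
    """Return how many of the chained instructions were completed in a row."""
    completed = 0
    obs = env.reset()
    for instruction in instructions:          # e.g. five language commands
        done = False
        for _ in range(max_steps):
            action = policy(obs, instruction)
            obs, done = env.step(action)      # done = task-success signal
            if done:
                break
        if not done:                          # chain stops at first failure
            break
        completed += 1
    return completed                          # value in 0..len(instructions)
```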



The robot successfully follows instructions and completes multi-step tasks, showing adaptability and error recovery. Performance varies with task complexity, indicating that more training data could enhance robustness, bringing us closer to more advanced, language-guided robotic assistants.

Dataset

Our real-world play dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. If you use the data in an academic context, please consider citing our paper.

Figure 4: Visualization of the real-world environment observations across three time steps. At each time step, the top row displays the RGB image captured by the static camera, while the bottom row presents the RGB image from the gripper-mounted camera.
Download LUMOS Dataset
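
As a rough illustration of that format, the snippet below iterates over per-timestep observation pairs. The key names, image resolution, and 7-DoF action dimension are assumptions made for the example; the released files may be organized differently.

```python
# A hypothetical loading sketch: we assume each timestep provides a static
# and a gripper RGB image (as in Figure 4) plus the recorded action.
import numpy as np

def iter_timesteps(episode):
    """episode: dict of arrays keyed by the (assumed) names below."""
    for t in range(len(episode["actions"])):
        yield {
            "rgb_static": episode["rgb_static"][t],    # e.g. (H, W, 3) uint8
            "rgb_gripper": episode["rgb_gripper"][t],  # e.g. (H, W, 3) uint8
            "action": episode["actions"][t],
        }

toy = {
    "rgb_static": np.zeros((2, 64, 64, 3), np.uint8),
    "rgb_gripper": np.zeros((2, 64, 64, 3), np.uint8),
    "actions": np.zeros((2, 7), np.float32),           # assumed 7-DoF action
}
for step in iter_timesteps(toy):
    print(step["rgb_static"].shape, step["action"].shape)
```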

Code

The source code for this project is available on our GitHub repository for academic usage and is released under the MIT license.

Publications

LUMOS: Language-Conditioned Imitation Learning with World Models
Iman Nematollahi, Branton DeMoss, Akshay L Chandra, Nick Hawes, Wolfram Burgard, Ingmar Posner
PDF

BibTeX citation

@inproceedings{nematollahi25icra,
  author    = {Iman Nematollahi and Branton DeMoss and Akshay L Chandra and Nick Hawes and Wolfram Burgard and Ingmar Posner},
  title     = {LUMOS: Language-Conditioned Imitation Learning with World Models},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2025},
  address   = {Atlanta, USA},
  url       = {http://ais.informatik.uni-freiburg.de/publications/papers/nematollahi25icra.pdf}
}