Less is More 🍋: Scalable Visual Navigation from Limited Data

ETH Zurich · Stanford University · UC Berkeley
Model and Dataset will be released soon.

Abstract

Imitation learning provides a powerful framework for goal-conditioned visual navigation in mobile robots, enabling obstacle avoidance while respecting human preferences and social norms. However, its effectiveness depends critically on the quality and diversity of training data. In this work, we show how classical geometric planners can be leveraged to generate synthetic trajectories that complement costly human demonstrations. We train Less is More (LiMo), a transformer-based visual navigation policy that predicts goal-conditioned SE(2) trajectories from a single RGB observation, and find that augmenting limited expert demonstrations with planner-generated supervision yields substantial performance gains. Through ablations and complementary qualitative and quantitative analyses, we characterize how dataset scale and diversity affect planning performance. We demonstrate real-robot deployment and argue that robust visual navigation is enabled not by simply collecting more demonstrations, but by strategically curating diverse, high-quality datasets. Our results suggest that scalable, embodiment-specific geometric supervision is a practical path toward data-efficient visual navigation.

Dataset Curation

Using a geometric planner enables the automatic, scalable generation of diverse expert demonstrations. In addition to the real-world path walked by the robot during dataset collection, we sample 10 random goals in front of the robot and use a geometric path planner to annotate feasible paths on co-registered elevation maps, as sketched below.
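A minimal sketch of this annotation loop, assuming goals are sampled in a forward-facing sector of the robot frame. The planner interface plan_path and its straight-line body below are illustrative stand-ins for the actual geometric planner, which searches over the co-registered elevation map:

import numpy as np

def sample_goals(num_goals=10, min_dist=1.0, max_dist=5.0, fov_deg=90.0, rng=None):
    """Sample random (x, y, theta) goals in a forward sector of the robot frame."""
    rng = rng or np.random.default_rng()
    r = rng.uniform(min_dist, max_dist, num_goals)                      # distance ahead
    phi = rng.uniform(-np.deg2rad(fov_deg / 2), np.deg2rad(fov_deg / 2), num_goals)
    theta = rng.uniform(-np.pi, np.pi, num_goals)                       # goal heading
    return np.stack([r * np.cos(phi), r * np.sin(phi), theta], axis=1)

def plan_path(elevation_map, start, goal, num_waypoints=10):
    """Stand-in for the geometric planner: straight-line SE(2) interpolation.
    The real planner plans over the co-registered elevation map instead."""
    alphas = np.linspace(0.0, 1.0, num_waypoints)
    return np.array([start + a * (goal - start) for a in alphas])

# Annotate one frame: each sampled goal yields one (image, goal, path) training tuple.
elevation_map = np.zeros((128, 128))   # placeholder co-registered elevation map
start = np.zeros(3)                    # robot frame origin (x, y, theta)
demos = [(goal, plan_path(elevation_map, start, goal)) for goal in sample_goals()]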

Policy Architecture

LiMo takes a single RGB image and a robot-centric goal pose (x, y, θ) as input. Image features are extracted with a frozen DINOv2 encoder and combined with learned positional embeddings. A transformer decoder, conditioned on the goal embedding, predicts a sequence of waypoint embeddings, which are linearly projected to N robot-centric waypoints (x, y, θ), forming the output trajectory.

Schematic overview of the policy architecture
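The following PyTorch sketch illustrates this layout. All module sizes and the goal-conditioning scheme (adding the goal embedding to learned waypoint queries) are assumptions, and the frozen DINOv2 encoder is represented only by its patch-feature output:

import torch
import torch.nn as nn

class LiMoSketch(nn.Module):
    """Illustrative re-implementation of the described layout (sizes are assumptions)."""
    def __init__(self, feat_dim=384, d_model=256, num_waypoints=10, num_patches=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # DINOv2 patch features -> d_model
        self.pos_emb = nn.Parameter(torch.randn(1, num_patches, d_model) * 0.02)
        self.goal_mlp = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                      nn.Linear(d_model, d_model))
        self.queries = nn.Parameter(torch.randn(1, num_waypoints, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 3)               # waypoint embedding -> (x, y, theta)

    def forward(self, patch_feats, goal):
        # patch_feats: (B, num_patches, feat_dim) from the frozen encoder; goal: (B, 3)
        memory = self.proj(patch_feats) + self.pos_emb
        tgt = self.queries.expand(goal.shape[0], -1, -1) + self.goal_mlp(goal).unsqueeze(1)
        return self.head(self.decoder(tgt, memory))     # (B, num_waypoints, 3)

# One image's patch features and one goal pose yield a 10-waypoint trajectory.
policy = LiMoSketch()
traj = policy(torch.randn(1, 256, 384), torch.tensor([[2.0, 0.5, 0.0]]))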

Qualitative Performance and Embodied Behavior

LiMo demonstrates strong geometric understanding and embodiment-aware navigation behavior, planning feasible trajectories through complex environments, staircases, and rough natural terrain. By training on scalable, automatically generated geometric demonstrations, LiMo learns both the underlying scene geometry and the robot's embodiment, producing trajectories that are geometrically feasible and aligned with ANYmal's specific locomotion capabilities and physical constraints.

The videos below show the predictions of LiMo on GrandTour missions.

Deployment

We deploy LiMo in closed loop on an ANYmal D quadruped robot in real-world environments unseen during training. The policy runs on an NVIDIA Jetson Orin onboard the robot at 6 Hz, using purely vision-based inputs to generate collision-free local trajectories. A simple lookahead path follower node tracks the predicted waypoints.
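A minimal sketch of such a lookahead follower, assuming waypoints in the robot frame and a simple proportional heading controller. The lookahead distance, gains, and velocity-command interface are illustrative, not the deployed controller:

import numpy as np

def lookahead_command(waypoints, lookahead=0.8, v_max=0.6, k_yaw=1.5):
    """Track the first predicted waypoint at least `lookahead` meters ahead.
    waypoints: (N, 3) array of robot-frame (x, y, theta) from the policy."""
    dists = np.linalg.norm(waypoints[:, :2], axis=1)
    idx = int(np.argmax(dists >= lookahead)) if np.any(dists >= lookahead) else len(waypoints) - 1
    target = waypoints[idx]
    heading_err = np.arctan2(target[1], target[0])   # bearing to target in robot frame
    v = v_max * max(0.0, np.cos(heading_err))        # slow down when turning hard
    return v, k_yaw * heading_err                    # (forward velocity, yaw rate)

# At 6 Hz: query the policy, then convert its predicted waypoints into a velocity command.
v, omega = lookahead_command(np.array([[0.5, 0.1, 0.0], [1.0, 0.3, 0.1], [1.6, 0.6, 0.2]]))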

BibTeX

@misc{inglin2026morescalablevisualnavigation,
      title={Less Is More: Scalable Visual Navigation from Limited Data},
      author={Yves Inglin and Jonas Frey and Changan Chen and Marco Hutter},
      year={2026},
      eprint={2601.17815},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2601.17815},
}