VLD: Visual Language Goal Distance for Reinforcement Learning Navigation

Lazar Milikic, Manthan Patel, Jonas Frey
ETH Zurich,  EPFL,  Stanford University,  UC Berkeley
arXiv Preprint, 2025

Example of a successful Gibson navigation episode driven purely by VLD distance predictions.

Abstract

Training end-to-end policies from image data to directly predict navigation actions for robotic systems has proven inherently difficult. Existing approaches often suffer from either the sim-to-real gap during policy transfer or a limited amount of training data with action labels. To address this problem, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning.

Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor.

At deployment, the policy consumes VLD predictions, inheriting semantic goal information—“where to go”—from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing a scalable path toward reliable, multimodal navigation policies.

VLD training and deployment overview

Training (Stage A): we separately train a temporal Vision-Language Distance (VLD) function on diverse real-world and synthetic video datasets, and an RL navigation policy in simulation using geometric distance-to-goal signals with injected noise to mimic real predictor uncertainty. Deployment (Stage B): the trained RL policy consumes predictions from the learned VLD model—specified by either image or text goals—to navigate in simulated and real-world environments.

Method: Vision-Language Goal Distance

VLD learns a temporal distance function that maps a current egocentric observation and a goal specification (image and/or text) to the expected number of steps required to reach the goal under an optimal policy.
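Concretely (notation ours, not taken verbatim from the paper), the predictor d_\theta approximates the remaining number of steps to the goal:

  d_\theta(o_t, g) \approx \mathbb{E}\!\left[ T_g \mid o_t \right],
  \qquad T_g = \min\{\, k \ge 0 : o_{t+k} \text{ satisfies } g \,\}

so that actions which reduce d_\theta correspond to progress toward the goal.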

Architecture

VLD architecture

RGB observations are encoded using a frozen DINOv2-small backbone, while text goals are encoded with a CLIP text encoder and projected into the same token space. Image and text tokens are concatenated into a joint goal memory, enabling:

  • image-only goals,
  • text-only goals, and
  • multimodal image + text conditioning.

A Transformer decoder attends from observation queries to goal tokens. From the CLS output, lightweight MLP heads predict both the temporal distance and a calibrated confidence score.
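A minimal PyTorch-style sketch of this head follows; module names, widths, and fusion details are our assumptions (DINOv2-small tokens of width 384, CLIP text tokens of width 512), and the frozen encoders are assumed to run upstream.

# Illustrative sketch only: layer counts, widths, and fusion details are assumptions.
import torch
import torch.nn as nn

class VLDHead(nn.Module):
    def __init__(self, obs_dim=384, text_dim=512, d_model=384, n_layers=4, n_heads=6):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)    # DINOv2-small tokens -> shared space
        self.text_proj = nn.Linear(text_dim, d_model)  # CLIP text tokens -> shared space
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.dist_head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, 1))
        self.conf_head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, 1))

    def forward(self, obs_tokens, goal_img_tokens=None, goal_text_tokens=None):
        # Observation queries (plus a CLS token) cross-attend to the joint goal memory.
        b = obs_tokens.shape[0]
        queries = torch.cat([self.cls.expand(b, -1, -1), self.obs_proj(obs_tokens)], dim=1)
        memory = []
        if goal_img_tokens is not None:                  # image-only or image+text goals
            memory.append(self.obs_proj(goal_img_tokens))
        if goal_text_tokens is not None:                 # text-only or image+text goals
            memory.append(self.text_proj(goal_text_tokens))
        out = self.decoder(queries, torch.cat(memory, dim=1))
        cls_out = out[:, 0]
        distance = self.dist_head(cls_out).squeeze(-1)                   # temporal distance
        confidence = torch.sigmoid(self.conf_head(cls_out)).squeeze(-1)  # calibrated confidence
        return distance, confidence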

Training Objective

To handle inherent ambiguity in temporal distance, VLD is trained with an inlier-outlier Gaussian mixture negative log-likelihood. The model learns to predict both distance and a calibrated confidence, downweighting ambiguous long-horizon pairs while retaining strong supervision near the goal. Negative mining across trajectories encourages near-maximal distances for unrelated scenes, improving robustness to off-goal and out-of-distribution targets.
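A sketch of how such an inlier-outlier mixture NLL could look, with the predicted confidence acting as the inlier weight; the sigma values and exact parameterization are our assumptions, not the paper's objective.

import math
import torch

def inlier_outlier_nll(pred_dist, pred_conf, target_dist, sigma_in=5.0, sigma_out=100.0):
    # pred_conf in (0, 1) acts as the mixture weight: confident pairs are explained by a
    # narrow inlier Gaussian, ambiguous long-horizon pairs by a wide outlier Gaussian.
    def log_normal(x, mu, sigma):
        return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
    log_in = torch.log(pred_conf + 1e-6) + log_normal(target_dist, pred_dist, sigma_in)
    log_out = torch.log1p(-pred_conf + 1e-6) + log_normal(target_dist, pred_dist, sigma_out)
    # Negative log of the two-component mixture, computed stably with logsumexp.
    return -torch.logsumexp(torch.stack([log_in, log_out]), dim=0).mean()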

Evaluating Distance: Ordinal Consistency

Instead of demanding exact step counts, we evaluate VLD by its ordering of distances: distances should decrease as the agent approaches the goal and increase when moving away. This is measured with Kendall's τ, a scale-invariant rank correlation coefficient.
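Along a single trajectory this can be computed directly with SciPy (a sketch; variable names are ours):

import numpy as np
from scipy.stats import kendalltau

def ordinal_consistency(pred_distances, true_steps_to_goal):
    # Kendall's tau between predicted distances and ground-truth remaining steps
    # along one trajectory; tau = 1 means the ranking is perfectly consistent.
    tau, _ = kendalltau(np.asarray(pred_distances), np.asarray(true_steps_to_goal))
    return tau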

Habitat (HM3D & Gibson)

On synthetic Habitat environments, VLD trained on Habitat alone (VLD (synthetic)) or combined with real-world data (VLD (all)) substantially outperforms the ViNT and VIP variants across 20/50/100-step horizons. VLD maintains strong ordinal consistency even for long-range predictions where the baselines degrade.

Table 1 — Ordinal consistency on Habitat (HM3D & Gibson). Kendall's τ for different temporal horizons (20/50/100 steps). Higher is better.

Model               HM3D (↑)              Gibson (↑)
                    20     50     100     20     50     100
ViNT                0.67   0.56   0.48    0.78   0.65   0.64
VIP                 0.55   0.45   0.38    0.65   0.51   0.50
VLD variants
VLD (synthetic)     0.81   0.71   0.62    0.84   0.74   0.71
VLD (real-world)    0.10   0.07   0.07    0.12   0.10   0.09
VLD (all)           0.82   0.70   0.61    0.84   0.73   0.71

Real-World Datasets

On in-the-wild and embodiment trajectories, VLD trained on synthetic + real (VLD (all)) data generalizes best, while synthetic-only models rank second. Real-only training (VLD (real-world)) often collapses toward near-constant predictions, highlighting the importance of structured simulator data.

Table 2 — Ordinal consistency on real-world trajectories. Kendall's τ on in-the-wild YouTube-style videos and robot embodiment datasets.

Model               In-the-wild (↑)    Embodiment (↑)
                    50     100         50     100
ViNT                0.40   0.29        0.48   0.37
VIP                 0.32   0.23        0.46   0.39
VLD variants
VLD (synthetic)     0.44   0.31        0.58   0.48
VLD (real-world)    0.23   0.18        0.16   0.14
VLD (all)           0.69   0.61        0.73   0.63

Text and Multimodal Goals

With bootstrapped text descriptions for object goals, VLD supports text-only and image + text goal conditioning:

  • Image+text conditioning achieves the best τ, though only marginally, suggesting the text description is often ignored in favour of the image goal.
  • Image-only goals perform nearly identically to image+text goals.
  • Text-only goals are noisier but still yield non-trivial ordinal structure, comparable to the image-based baselines.

Table 3 — Ordinal consistency for text-specified goals (HM3D). VLD supports image, text-only, and multimodal (image+text) goal specifications.

Model                  20 (↑)   50 (↑)   100 (↑)
ViNT                   0.67     0.56     0.48
VIP                    0.55     0.45     0.38
VLD (goal modalities)
VLD (image)            0.81     0.70     0.61
VLD (text)             0.49     0.49     0.44
VLD (image+text)       0.81     0.71     0.62
Videos

Qualitative Distance Prediction Demos

Rollouts showing VLD distance predictions (and baselines) across synthetic Habitat environments, text-goal settings, and real-world videos. Use the carousels below to browse multiple trajectories per category.

Habitat — image-goal distance predictions

Habitat — text and multimodal goals

Real-world — embodiment datasets

Real-world — in-the-wild walking videos

Navigation Policy and Results

RL with Noisy Distance Signals

VLD is plugged into a point-goal navigation setup where RL policies are trained entirely in Gibson simulation with privileged geometric distances. At deployment, the distance channel is swapped with VLD predictions.

Policy Network

The policy receives:

  • lightweight exteroception (obstacle awareness),
  • proprioception and previous action, and
  • distance-to-goal plus confidence.

Inputs are encoded with small MLPs and fed into an LSTM core, followed by policy and value heads trained with PPO.
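A compact sketch of such a policy network, with input dimensions and hidden sizes chosen for illustration (the paper's exact sizes may differ):

# Illustrative sketch: input and hidden sizes are assumptions.
import torch
import torch.nn as nn

class NavPolicy(nn.Module):
    def __init__(self, extero_dim=16, proprio_dim=8, hidden=128, n_actions=4):
        super().__init__()
        self.extero_enc = nn.Sequential(nn.Linear(extero_dim, 64), nn.ELU())   # obstacle awareness
        self.proprio_enc = nn.Sequential(nn.Linear(proprio_dim, 32), nn.ELU()) # proprioception + prev action
        self.goal_enc = nn.Sequential(nn.Linear(2, 32), nn.ELU())              # distance-to-goal + confidence
        self.lstm = nn.LSTM(64 + 32 + 32, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits (PPO actor)
        self.value_head = nn.Linear(hidden, 1)           # state value (PPO critic)

    def forward(self, extero, proprio, dist_conf, hidden_state=None):
        x = torch.cat([self.extero_enc(extero),
                       self.proprio_enc(proprio),
                       self.goal_enc(dist_conf)], dim=-1)
        out, hidden_state = self.lstm(x.unsqueeze(1), hidden_state)  # one step of the recurrent core
        h = out[:, -1]
        return self.policy_head(h), self.value_head(h), hidden_state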

Geometric Overlap Noise

To avoid brittle policies, the simulator's geometric distance is perturbed with a learned geometric-overlap noise model conditioned on:

  • projection success ratio (visual overlap),
  • relative rotation and translation.

The noise model outputs distributions over distance and confidence bins, producing samples that mimic VLD's error patterns and improve robustness to noisy distance inputs.
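One way such a noise model could be realized is sketched below; the conditioning features follow the list above, but the bin structure and network are our assumptions rather than the paper's exact model.

# Sketch under stated assumptions: bin counts and network layout are illustrative.
import torch
import torch.nn as nn

class GeoNoiseModel(nn.Module):
    def __init__(self, n_dist_bins=32, n_conf_bins=16):
        super().__init__()
        self.n_dist_bins, self.n_conf_bins = n_dist_bins, n_conf_bins
        # Conditioned on overlap ratio, relative rotation, relative translation (one scalar each).
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ELU(),
                                 nn.Linear(64, n_dist_bins + n_conf_bins))

    def sample(self, overlap, rel_rot, rel_trans):
        logits = self.net(torch.stack([overlap, rel_rot, rel_trans], dim=-1))
        dist_logits, conf_logits = logits.split([self.n_dist_bins, self.n_conf_bins], dim=-1)
        # Sample noisy-distance and confidence bins; these replace the clean geometric
        # distance fed to the policy during training.
        dist_bin = torch.distributions.Categorical(logits=dist_logits).sample()
        conf_bin = torch.distributions.Categorical(logits=conf_logits).sample()
        return dist_bin, conf_bin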

Gibson Navigation Performance

Policies trained with ground-truth distances nearly solve Gibson point-goal navigation. When the distance input is replaced by VLD at test time, the policy retains a strong success rate while operating only on a single scalar goal signal plus lightweight perception.

Success rate (SR) and success weighted by path length (SPL) for point-goal navigation in Gibson. “Swap” indicates that the privileged geometric distance input is replaced by the indicated distance predictor only at deployment time.

Policy configuration                             SR ↑     SPL ↑
Privileged training (ground-truth distance)
GT distance (no noise)                           0.9577   0.6103
GeoNoise                                         0.9091   0.5547
GeoNoise + confidence                            0.8994   0.5809
Trained directly on VLD
Policy trained end-to-end on VLD                 0.5664   0.4227
Swap: replace distance channel at deployment
VLD + (GeoNoise)                                 0.7314   0.3995
VLD + (GeoNoise + confidence)                    0.6821   0.3860
ViNT + (GeoNoise)                                0.6046   0.2555
VIP + (GeoNoise)                                 0.2787   0.1148
External image-based baseline (reported)
OVRL-V2                                          0.8200   0.5870
Videos

RL policy rollouts: successes and failures

Example Gibson episodes executed by the policy when driven by VLD distances. Successful rollouts show efficient progress and goal alignment, while failure cases highlight common failure modes such as near-goal misses, wrong-goal attraction, and timeouts.

Successful VLD-driven trajectories

Failure modes

BibTeX

@misc{milikic2025vldvisuallanguagegoal,
      title={VLD: Visual Language Goal Distance for Reinforcement Learning Navigation}, 
      author={Lazar Milikic and Manthan Patel and Jonas Frey},
      year={2025},
      eprint={2512.07976},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.07976}, 
}