Abstract
Training end-to-end policies that map image observations directly to navigation actions for robotic systems has proven inherently difficult. Existing approaches typically suffer either from the sim-to-real gap during policy transfer or from the limited amount of training data with action labels. To address these challenges, we introduce Vision-Language Distance (VLD) learning, a scalable framework for goal-conditioned navigation that decouples perception learning from policy learning.
Instead of relying on raw sensory inputs during policy training, we first train a self-supervised distance-to-goal predictor on internet-scale video data. This predictor generalizes across both image- and text-based goals, providing a distance signal that can be minimized by a reinforcement learning (RL) policy. The RL policy can be trained entirely in simulation using privileged geometric distance signals, with injected noise to mimic the uncertainty of the trained distance predictor.
At deployment, the policy consumes VLD predictions, inheriting semantic goal information—“where to go”—from large-scale visual training while retaining the robust low-level navigation behaviors learned in simulation. We propose using ordinal consistency to assess distance functions directly and demonstrate that VLD outperforms prior temporal distance approaches such as ViNT and VIP. Experiments show that our decoupled design achieves competitive navigation performance in simulation while supporting flexible goal modalities, providing a scalable path toward reliable, multimodal navigation policies.
Training (Stage A): we separately train a temporal Vision-Language Distance (VLD) function on diverse real-world and synthetic video datasets, and an RL navigation policy in simulation using geometric distance-to-goal signals with injected noise to mimic real predictor uncertainty. Deployment (Stage B): the trained RL policy consumes predictions from the learned VLD model—specified by either image or text goals—to navigate in simulated and real-world environments.
Method: Vision-Language Goal Distance
VLD learns a temporal distance function that maps a current egocentric observation and a goal specification (image and/or text) to the expected number of steps required to reach the goal under an optimal policy.
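Written out, a temporal distance of this kind can be formalized roughly as follows; the notation below is introduced here for exposition and may differ from the paper's exact definition:

```latex
% o_t: current egocentric observation, g: image and/or text goal,
% \pi^*: an optimal goal-reaching policy, T_g: first step at which g is satisfied.
d_{\mathrm{VLD}}(o_t, g) \;\approx\; \mathbb{E}_{\pi^*}\!\left[\, T_g - t \mid o_t, g \,\right],
\qquad T_g = \min\{\, t' \ge t : o_{t'} \text{ reaches } g \,\}
```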
Architecture
RGB observations are encoded using a frozen DINOv2-small backbone, while text goals are encoded with a CLIP text encoder and projected into the same token space. Image and text tokens are concatenated into a joint goal memory, enabling:
- image-only goals,
- text-only goals, and
- multimodal image + text conditioning.
A Transformer decoder attends from observation queries to goal tokens. From the CLS output, lightweight MLP heads predict both the temporal distance and a calibrated confidence score.
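A minimal PyTorch sketch of this architecture is shown below. It assumes pre-extracted DINOv2-small patch tokens and CLIP text tokens as inputs; the token dimensions, layer counts, and head sizes are illustrative placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VLDHead(nn.Module):
    """Sketch of the VLD head: observation queries attend to a joint image/text
    goal memory via a Transformer decoder; MLP heads read out temporal distance
    and confidence from a CLS-like query token."""

    def __init__(self, img_dim=384, txt_dim=512, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.obs_proj = nn.Linear(img_dim, d_model)       # frozen DINOv2-small tokens -> model dim
        self.goal_img_proj = nn.Linear(img_dim, d_model)
        self.goal_txt_proj = nn.Linear(txt_dim, d_model)  # CLIP text tokens -> same token space
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.dist_head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))
        self.conf_head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, obs_tokens, goal_img_tokens=None, goal_txt_tokens=None):
        # obs_tokens: (B, N_obs, img_dim) DINOv2 patch tokens of the current view.
        B = obs_tokens.shape[0]
        queries = torch.cat([self.cls.expand(B, -1, -1), self.obs_proj(obs_tokens)], dim=1)

        # Build the joint goal memory from whichever modalities are provided.
        mem = []
        if goal_img_tokens is not None:
            mem.append(self.goal_img_proj(goal_img_tokens))
        if goal_txt_tokens is not None:
            mem.append(self.goal_txt_proj(goal_txt_tokens))
        memory = torch.cat(mem, dim=1)

        out = self.decoder(queries, memory)              # observation queries attend to goal tokens
        cls_out = out[:, 0]                              # CLS-like token output
        distance = self.dist_head(cls_out).squeeze(-1)   # predicted temporal distance (steps)
        confidence = torch.sigmoid(self.conf_head(cls_out)).squeeze(-1)  # in [0, 1]
        return distance, confidence
```

This supports image-only, text-only, and image+text conditioning by simply changing which goal tokens populate the memory.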
Training Objective
To handle inherent ambiguity in temporal distance, VLD is trained with an inlier-outlier Gaussian mixture negative log-likelihood. The model learns to predict both distance and a calibrated confidence, downweighting ambiguous long-horizon pairs while retaining strong supervision near the goal. Negative mining across trajectories encourages near-maximal distances for unrelated scenes, improving robustness to off-goal and out-of-distribution targets.
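The sketch below shows one plausible instantiation of such an inlier-outlier Gaussian mixture negative log-likelihood, where the predicted confidence acts as the inlier mixture weight and the outlier component is a broad Gaussian. The specific variances and the outlier parameterization are assumptions for illustration, not the paper's exact loss.

```python
import torch

def mixture_nll(pred_dist, confidence, target_dist,
                sigma_inlier=5.0, sigma_outlier=100.0):
    """Illustrative inlier-outlier Gaussian mixture NLL.

    pred_dist:   (B,) predicted temporal distance in steps
    confidence:  (B,) predicted inlier probability in (0, 1)
    target_dist: (B,) ground-truth number of steps to the goal
    """
    inlier = torch.distributions.Normal(pred_dist, sigma_inlier)
    outlier = torch.distributions.Normal(pred_dist, sigma_outlier)  # broad component downweights ambiguous pairs

    # -log[ c * N(d; mu, s_in) + (1 - c) * N(d; mu, s_out) ], computed stably.
    log_probs = torch.stack([
        torch.log(confidence.clamp_min(1e-6)) + inlier.log_prob(target_dist),
        torch.log((1.0 - confidence).clamp_min(1e-6)) + outlier.log_prob(target_dist),
    ], dim=0)
    return -torch.logsumexp(log_probs, dim=0).mean()
```

Negative pairs mined from other trajectories can be trained with near-maximal target distances under the same objective.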
Evaluating Distance: Ordinal Consistency
Instead of demanding exact step counts, we evaluate VLD by its ordering of distances: distances should decrease as the agent approaches the goal and increase when moving away. This is measured with Kendall's τ, a scale-invariant rank correlation coefficient.
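As a concrete example, ordinal consistency for a single rollout can be scored by rank-correlating predicted distances with the ground-truth remaining step counts; a minimal sketch using scipy (aggregation across trajectories and horizons follows the evaluation protocol described here):

```python
import numpy as np
from scipy.stats import kendalltau

def ordinal_consistency(pred_distances, steps_to_goal):
    """Kendall's tau between predicted distances and ground-truth remaining steps
    along one trajectory. +1 = perfectly ordered, 0 = uncorrelated, -1 = reversed."""
    tau, _ = kendalltau(pred_distances, steps_to_goal)
    return tau

# Example: predictions that mostly decrease toward the goal score a high tau.
pred = np.array([42.0, 37.5, 35.1, 28.0, 30.2, 18.4, 9.7, 3.1])
gt = np.arange(len(pred) - 1, -1, -1)   # 7, 6, ..., 0 remaining steps
print(ordinal_consistency(pred, gt))    # close to 1.0
```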
Habitat (HM3D & Gibson)
On synthetic Habitat environments, VLD trained on Habitat (VLD (synthetic)) alone or combined with real-world data (VLD (all)) substantially outperforms ViNT and VIP variants across 20/50/100-step horizons. VLD maintains strong ordinal consistency even for long-range predictions where baselines degrade.
Table 1 — Ordinal consistency on Habitat (HM3D & Gibson). Kendall's τ for different temporal horizons (20/50/100 steps). Higher is better.
| Model | HM3D 20 (↑) | HM3D 50 (↑) | HM3D 100 (↑) | Gibson 20 (↑) | Gibson 50 (↑) | Gibson 100 (↑) |
|---|---|---|---|---|---|---|
| ViNT | 0.67 | 0.56 | 0.48 | 0.78 | 0.65 | 0.64 |
| VIP | 0.55 | 0.45 | 0.38 | 0.65 | 0.51 | 0.50 |
| VLD variants | | | | | | |
| VLD (synthetic) | 0.81 | 0.71 | 0.62 | 0.84 | 0.74 | 0.71 |
| VLD (real-world) | 0.10 | 0.07 | 0.07 | 0.12 | 0.10 | 0.09 |
| VLD (all) | 0.82 | 0.70 | 0.61 | 0.84 | 0.73 | 0.71 |
Real-World Datasets
On in-the-wild and embodiment trajectories, VLD trained on synthetic + real (VLD (all)) data generalizes best, while synthetic-only models rank second. Real-only training (VLD (real-world)) often collapses toward near-constant predictions, highlighting the importance of structured simulator data.
Table 2 — Ordinal consistency on real-world trajectories. Kendall's τ on in-the-wild YouTube-style videos and robot embodiment datasets.
| Model | In-the-wild 50 (↑) | In-the-wild 100 (↑) | Embodiment 50 (↑) | Embodiment 100 (↑) |
|---|---|---|---|---|
| ViNT | 0.40 | 0.29 | 0.48 | 0.37 |
| VIP | 0.32 | 0.23 | 0.46 | 0.39 |
| VLD variants | | | | |
| VLD (synthetic) | 0.44 | 0.31 | 0.58 | 0.48 |
| VLD (real-world) | 0.23 | 0.18 | 0.16 | 0.14 |
| VLD (all) | 0.69 | 0.61 | 0.73 | 0.63 |
Text and Multimodal Goals
With bootstrapped text descriptions for object goals, VLD supports text-only and image + text goal conditioning:
- Image+text achieves the best τ (the text description is often ignored in favour of the image goal).
- Image-only goals very closely match image+text goals.
- Text-only goals are noisier but still yield non-trivial ordinal structure, comparable to the image-based baselines.
Table 3 — Ordinal consistency for text-specified goals (HM3D). VLD supports image, text-only, and multimodal (image+text) goal specifications.
| Model | 20 (↑) | 50 (↑) | 100 (↑) |
|---|---|---|---|
| ViNT | 0.67 | 0.56 | 0.48 |
| VIP | 0.55 | 0.45 | 0.38 |
| VLD (goal modalities) | | | |
| VLD (image) | 0.81 | 0.70 | 0.61 |
| VLD (text) | 0.49 | 0.49 | 0.44 |
| VLD (image+text) | 0.81 | 0.71 | 0.62 |
Qualitative Distance Prediction Demos
Rollouts showing VLD distance predictions (and baselines) across synthetic Habitat environments, text-goal settings, and real-world videos. Use the carousels below to browse multiple trajectories per category.
Habitat — image-goal distance predictions
Video carousel with 13 Habitat image-goal rollouts comparing VLD distance predictions against ViNT and VIP.
Habitat — text and multimodal goals
Video carousel with 7 Habitat rollouts using text-only and multimodal (image+text) goal specifications.
Real-world — embodiment datasets
Video carousel with 20 image-goal rollouts from robot embodiment datasets comparing VLD against ViNT and VIP.
Real-world — in-the-wild walking videos
Video carousel with 15 image-goal rollouts from in-the-wild walking videos comparing VLD against ViNT and VIP.
Navigation Policy and Results
RL with Noisy Distance Signals
VLD is plugged into a point-goal navigation setup in which RL policies are trained entirely in Gibson simulation with privileged geometric distances. At deployment, the privileged distance channel is replaced by VLD predictions.
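Conceptually, the swap only changes where the scalar distance (and confidence) fed to the policy comes from. A minimal sketch, with `simulator`, `vld_model`, `noise_model`, and the observation layout as assumed placeholder interfaces:

```python
def get_goal_signal(obs, goal, mode, simulator=None, noise_model=None, vld_model=None):
    """Return the (distance, confidence) pair consumed by the policy.

    mode='train':  privileged geodesic distance from the simulator, corrupted by
                   a learned noise model (placeholder interfaces, see below).
    mode='deploy': predictions from the trained VLD model for an image/text goal.
    """
    if mode == "train":
        gt = simulator.geodesic_distance(obs["agent_pose"], goal["position"])
        return noise_model.sample(obs["overlap_features"], gt)
    return vld_model(obs["rgb"], goal)  # (distance, confidence)
```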
Policy Network
The policy receives:
- lightweight exteroception (obstacle awareness),
- proprioception and previous action, and
- distance-to-goal plus confidence.
Inputs are encoded with small MLPs and fed into an LSTM core, followed by policy and value heads trained with PPO.
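A sketch of this policy network in PyTorch; the input sizes and exact observation layout are illustrative assumptions, and the PPO update itself is not shown.

```python
import torch
import torch.nn as nn

class NavPolicy(nn.Module):
    """Sketch: per-modality MLP encoders -> LSTM core -> policy and value heads."""

    def __init__(self, extero_dim=16, proprio_dim=8, n_actions=4, hidden=256):
        super().__init__()
        self.extero_enc = nn.Sequential(nn.Linear(extero_dim, 64), nn.ReLU())
        self.proprio_enc = nn.Sequential(nn.Linear(proprio_dim + n_actions, 64), nn.ReLU())
        self.goal_enc = nn.Sequential(nn.Linear(2, 32), nn.ReLU())  # [distance-to-goal, confidence]
        self.lstm = nn.LSTM(64 + 64 + 32, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits, trained with PPO
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, extero, proprio_prev_action, dist_conf, hx=None):
        # All inputs are (B, T, feature_dim); dist_conf stacks [distance, confidence].
        x = torch.cat([
            self.extero_enc(extero),
            self.proprio_enc(proprio_prev_action),
            self.goal_enc(dist_conf),
        ], dim=-1)
        h, hx = self.lstm(x, hx)
        return self.policy_head(h), self.value_head(h), hx
```

At deployment, `dist_conf` is filled with the VLD prediction and confidence instead of the (noisy) simulator distance.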
Geometric Overlap Noise
To avoid brittle policies, the simulator's geometric distance is perturbed with a learned geometric-overlap noise model conditioned on:
- projection success ratio (visual overlap),
- relative rotation and translation.
The noise model outputs distributions over distance and confidence bins, producing samples that mimic VLD's error patterns and improve robustness to noisy distance inputs.
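A hedged sketch of how such a noise model could be applied during training: the network shape and bin layout below are illustrative assumptions, but the flow follows the description above, conditioning on the projection success ratio and relative pose, sampling distance and confidence bins, and using the samples to corrupt the privileged geodesic distance.

```python
import torch
import torch.nn as nn

class GeoOverlapNoise(nn.Module):
    """Sketch: predict distributions over distance-error and confidence bins
    from geometric-overlap features, then sample noisy (distance, confidence)."""

    def __init__(self, n_dist_bins=21, n_conf_bins=10, max_err=10.0):
        super().__init__()
        # Features: [projection success ratio, relative rotation, relative translation].
        self.net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
        self.dist_logits = nn.Linear(64, n_dist_bins)
        self.conf_logits = nn.Linear(64, n_conf_bins)
        self.register_buffer("err_bins", torch.linspace(-max_err, max_err, n_dist_bins))  # additive error (steps)
        self.register_buffer("conf_bins", torch.linspace(0.0, 1.0, n_conf_bins))

    def sample(self, overlap_feats, gt_distance):
        h = self.net(overlap_feats)
        err_idx = torch.distributions.Categorical(logits=self.dist_logits(h)).sample()
        conf_idx = torch.distributions.Categorical(logits=self.conf_logits(h)).sample()
        noisy_distance = (gt_distance + self.err_bins[err_idx]).clamp_min(0.0)
        confidence = self.conf_bins[conf_idx]
        return noisy_distance, confidence
```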
Gibson Navigation Performance
Policies trained with ground-truth distances nearly solve Gibson point-goal navigation. When the distance input is replaced by VLD at test time, the policy retains a strong success rate while operating only on a single scalar goal signal plus lightweight perception.
Table 4 — Success rate (SR) and success weighted by path length (SPL) for point-goal navigation in Gibson. “Swap” indicates that the privileged geometric distance input is replaced by the indicated distance predictor only at deployment time.
| Policy configuration | SR ↑ | SPL ↑ |
|---|---|---|
| Privileged training (ground-truth distance) | ||
| GT distance (no noise) | 0.9577 | 0.6103 |
| GeoNoise | 0.9091 | 0.5547 |
| GeoNoise + confidence | 0.8994 | 0.5809 |
| Trained directly on VLD | ||
| Policy trained end-to-end on VLD | 0.5664 | 0.4227 |
| Swap: replace distance channel at deployment | ||
| VLD + (GeoNoise) | 0.7314 | 0.3995 |
| VLD + (GeoNoise + confidence) | 0.6821 | 0.3860 |
| ViNT + (GeoNoise) | 0.6046 | 0.2555 |
| VIP + (GeoNoise) | 0.2787 | 0.1148 |
| External image-based baseline (reported) | ||
| OVRL-V2 | 0.8200 | 0.5870 |
RL policy rollouts: successes and failures
Example Gibson episodes executed by the policy when driven by VLD distances. Successful rollouts show efficient progress and goal alignment, while failure cases highlight common failure modes such as near-goal misses, wrong-goal attraction, and timeouts.
Successful VLD-driven trajectories
Video carousel with 28 successful VLD-driven Gibson episodes.
Failure modes
Failure Example 1 - Near match failure: Most of the goal view is captured, but the agent stops too close to the counter.
Failure Example 2 - Timeout failure: Agent spends too much time exploring the same areas (adding memory of visited places should help).
Failure Example 3 - Near match failure: Agent stops too soon when it detects the couch from the goal view.
Failure Example 4 - Near match failure: Agent gets close but fails to satisfy the success condition.
Failure Example 5 - Near match failure: Agent stops too soon when it detects some objects from the goal view.
Failure Example 6 - Timeout failure: Agent gets stuck in a local pattern (adding memory of visited places should help).
Failure Example 7 - Goal image ambiguously defined: Agent is attracted to a visually similar area (wall).
Failure Example 8 - Timeout failure: Agent gets stuck in a local pattern (adding memory of visited places should help).
Failure Example 9 - Near match failure: Agent stops at the first sight of furniture from the goal image but too far from the goal.
Failure Example 10 - Goal image ambiguously defined: Agent is attracted to a visually similar area (wall).
Failure Example 11 - Near match failure: Agent enters the room from the goal image through a different entry point and fails to get close enough to the goal point.
Failure Example 12 - Timeout failure: Agent fails to fully explore the region around the goal.
Failure Example 13 - Goal image ambiguously defined: Agent is attracted to a visually similar area (wall).
Failure Example 14 - Near match failure: Agent sees the bed from the goal image but stops too far away.
Failure Example 15 - Goal image ambiguously defined: The goal is an all-black image.
Failure Example 16 - Timeout failure: VLD steers the agent away from the correct direction, and it then gets stuck.
Failure Example 17 - Timeout failure: Agent gets stuck in a local pattern (adding memory of visited places should help).
Failure Example 18 - Timeout failure: VLD steers the agent away from the correct direction, and it then gets stuck.
Failure Example 19 - Near match failure: Agent detects some objects from the goal view but stops too far away.
Failure Example 20 - Near match failure: Agent sees some objects from the goal view from a different angle and stops too far away.
BibTeX
@misc{milikic2025vldvisuallanguagegoal,
title={VLD: Visual Language Goal Distance for Reinforcement Learning Navigation},
author={Lazar Milikic and Manthan Patel and Jonas Frey},
year={2025},
eprint={2512.07976},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2512.07976},
}