NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

Tim Windecker1,2, Jonas Frey1, Moritz Reuss2, Manthan Patel1, Richard Schwarzkopf3, Cesar Cadena1, Rudolf Lioutikov2, Marco Hutter1
1Robotic Systems Lab, ETH Zurich, Zurich, Switzerland, 2Intuitive Robots Lab, KIT, Karlsruhe, Germany, 3FZI Research Center for Information Technology, Karlsruhe, Germany
NaviTrace Overview

NaviTrace is a novel VQA benchmark for VLMs that evaluates models on their embodiment-specific understanding of navigation across challenging real-world scenarios.

Abstract

Vision–language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models’ navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate seven state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation.

Video

Data

NaviTrace contains 1000 scenarios with more than 3000 traces across four embodiments (human, legged robot, wheeled robot, bicycle). Each scenario captures realistic navigation challenges with an image, language instruction, ground-truth traces, embodiment types, and task categories describing the main difficulties of the navigation task. We divide the dataset evenly into a validation split for fine-tuning applications and a test split with hidden annotations for a fair evaluation on our public leaderboard.
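
For illustration, a scenario record could be represented as follows when loading the dataset. The field names and types in this sketch are assumptions, not the benchmark's actual schema.

```python
# Illustrative sketch of a NaviTrace scenario record; field names are assumptions.
from dataclasses import dataclass

@dataclass
class Scenario:
    image_path: str                                       # RGB image of the scene
    instruction: str                                      # natural-language navigation instruction
    embodiment: str                                       # "human", "legged robot", "wheeled robot", or "bicycle"
    task_categories: list[str]                            # main difficulties of the navigation task
    ground_truth_traces: list[list[tuple[float, float]]]  # expert 2D traces in image space (hidden for the test split)

def load_split(records: list[dict]) -> list[Scenario]:
    """Convert raw annotation records into Scenario objects."""
    return [Scenario(**r) for r in records]
```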

Data Diversity

Left: Geographic distribution of image sources, with the inner circle denoting countries and the outer circle specifying cities or regions. Right: Distribution of scenarios by setting (urban vs. rural), environment type (natural vs. structured), lighting, and weather.


Score Function

Formula

We design a score function that balances three factors: (i) how closely the path follows the ground truth, measured by the Dynamic Time Warping error (DTW), (ii) whether the prediction reaches the intended goal, measured by the Final Displacement Error (FDE), and (iii) whether it avoids unsafe or irrelevant regions, captured by a penalty based on embodiment-specific semantic costs derived from Mask2Former segmentations.

Formally, a trace is a sequence of points $T=[(x_1, y_1), \dots, (x_n, y_n)]$ in image space. We compare it against a set $\mathcal{G}$ of ground-truth traces $T'=[(x'_1, y'_1), \dots, (x'_m, y'_m)]$ that describe equally good solutions and select the one with the lowest error:

$$\textrm{Score}(T, \mathcal{G}) = \underset{T' \in \mathcal{G}}{\min} \left[ \textrm{DTW}(T, T') + \textrm{FDE}(T, T') \right] + \textrm{Penalty}(T)$$
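
A minimal sketch of this computation is shown below, assuming unweighted terms and a per-pixel cost map obtained from the embodiment-specific semantic costs; any normalization or weighting used in the official implementation is not reflected here.

```python
# Sketch of the trace score: DTW + FDE over the best-matching ground truth,
# plus a semantic penalty sampled from an embodiment-specific cost map.
import numpy as np

def dtw(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dynamic Time Warping distance between two traces of shape (n, 2) and (m, 2)."""
    n, m = len(pred), len(gt)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final Displacement Error: distance between the two endpoints."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def penalty(pred: np.ndarray, cost_map: np.ndarray) -> float:
    """Sum of per-pixel semantic costs sampled under the predicted trace."""
    xs = np.clip(pred[:, 0].round().astype(int), 0, cost_map.shape[1] - 1)
    ys = np.clip(pred[:, 1].round().astype(int), 0, cost_map.shape[0] - 1)
    return float(cost_map[ys, xs].sum())

def score(pred: np.ndarray, ground_truths: list[np.ndarray], cost_map: np.ndarray) -> float:
    """Lower is better: best-matching ground truth plus the semantic penalty."""
    return min(dtw(pred, gt) + fde(pred, gt) for gt in ground_truths) + penalty(pred, cost_map)
```

The best of several equally valid expert traces determines the path and endpoint error, while the penalty discourages traversing regions that are unsafe or irrelevant for the given embodiment.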

Alignment with Human Judgement

Experiment to show the score's alignment with human preference

We show that the score function aligns with human preference by computing the correlation between the ranking induced by the score and a ranking derived from pairwise human preference judgements.
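
A sketch of this check is given below. Using Spearman's rank correlation and aggregating pairwise judgements by win count are assumptions of this illustration, not necessarily the exact procedure used for the benchmark.

```python
# Sketch: rank predictions by the score function, rank them again from human
# pairwise preferences, and correlate the two rankings.
import numpy as np
from scipy.stats import spearmanr

def ranking_from_pairwise(num_items: int, pairwise_wins: list[tuple[int, int]]) -> np.ndarray:
    """Turn pairwise human judgements (winner_idx, loser_idx) into a ranking by win count."""
    wins = np.zeros(num_items)
    for winner, _loser in pairwise_wins:
        wins[winner] += 1
    return (-wins).argsort().argsort()  # rank 0 = most preferred

def alignment(scores: np.ndarray, pairwise_wins: list[tuple[int, int]]) -> float:
    """Correlation between the score-induced ranking and the human ranking."""
    score_rank = scores.argsort().argsort()  # lower score = better rank
    human_rank = ranking_from_pairwise(len(scores), pairwise_wins)
    rho, _ = spearmanr(score_rank, human_rank)
    return float(rho)
```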


Results


Left: Ranking of VLMs, the uninformed baseline Straight Forward, and human expert performance, split by embodiment. Note that a lower score is better. Right: Performance per task category for the same models.

Example predictions

Example predictions by the models Gemini 2.5 Pro, GPT-5, and o3.

Example reasoning

Example of o3's reasoning with the prediction in pink on the left and the steps on the right. The model reasons correctly but is unable to predict a corresponding trace.


Key Findings

  1. Large human performance gap. Across all four embodiments and task categories, VLM scores are substantially worse than both human and oracle-like baselines, highlighting significant room for improvement.
  2. Goal localization is the dominant failure mode. When models predict only the goal location and we connect it to the start with a straight line (see the sketch after this list), scores are similar to full-trace predictions. Yet even with the correct goal, path shaping lags behind human performance.
  3. Embodiment robustness. Aggregate performance differences across Human, Legged Robot, Wheeled Robot, and Bicycle embodiments are small, suggesting general limitations in spatial grounding rather than embodiment-specific blind spots.
  4. Score function alignment with human preference. Our semantic-aware trace score, which augments the DTW distance with an endpoint error and embodiment-conditioned penalties based on automated semantics, correlates more strongly with human preference than DTW alone. Using manual segmentation yields an additional but modest gain.
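
The goal-only baseline referenced in finding 2 can be sketched as follows; the choice of start point and the number of interpolated points are assumptions of this illustration. The resulting trace can be evaluated with the same score function as a full prediction.

```python
# Sketch of the goal-only baseline: keep the predicted goal point and replace
# the rest of the trace with a straight line from an assumed start point.
import numpy as np

def straight_line_trace(start: np.ndarray, goal: np.ndarray, num_points: int = 20) -> np.ndarray:
    """Linearly interpolate a 2D image-space trace from start to goal."""
    t = np.linspace(0.0, 1.0, num_points)[:, None]
    return (1.0 - t) * start + t * goal
```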


BibTeX

Not published yet