NaviTrace: Evaluating Embodied Navigation of Vision-Language Models

Tim Windecker1,2, Jonas Frey1, Moritz Reuss2, Manthan Patel1, Richard Schwarzkopf3, Cesar Cadena1, Rudolf Lioutikov2, Marco Hutter1
1Robotic Systems Lab, ETH Zurich, Zurich, Switzerland, 2Intuitive Robots Lab, KIT, Karlsruhe, Germany, 3FZI Research Center for Information Technology, Karlsruhe, Germany
NaviTrace Overview

NaviTrace is a novel VQA benchmark for VLMs that evaluates models on their embodiment-specific understanding of navigation across challenging real-world scenarios.

Abstract

Vision–language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models’ navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate seven state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation.

Video

Data

NaviTrace contains 1000 scenarios with more than 3000 traces across four embodiments (human, legged robot, wheeled robot, bicycle). Each scenario captures realistic navigation challenges with an image, language instruction, ground-truth traces, embodiment types, and task categories describing the main difficulties of the navigation task. We divide the dataset evenly into a validation split for fine-tuning applications and a test split with hidden annotations for a fair evaluation on our public leaderboard.
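
For illustration, a scenario record could be represented as follows when loading the dataset. The field names and types in this sketch are assumptions, not the benchmark's actual schema.

```python
# Illustrative sketch of a NaviTrace scenario record; field names are assumptions.
from dataclasses import dataclass

@dataclass
class Scenario:
    image_path: str                                       # RGB image of the scene
    instruction: str                                      # natural-language navigation instruction
    embodiment: str                                       # "human", "legged robot", "wheeled robot", or "bicycle"
    task_categories: list[str]                            # main difficulties of the navigation task
    ground_truth_traces: list[list[tuple[float, float]]]  # expert 2D traces in image space (hidden for the test split)

def load_split(records: list[dict]) -> list[Scenario]:
    """Convert raw annotation records into Scenario objects."""
    return [Scenario(**r) for r in records]
```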

Data Diversity

Left: Geographic distribution of image sources, with the inner circle denoting countries and the outer circle specifying cities or regions. Right: Distribution of scenarios by setting (urban vs. rural), environment type (natural vs. structured), lighting, and weather.


Score Function

Formula

We design a score function that balances three factors: (i) how closely the path follows the ground truth, measured by the Dynamic Time Warping error (DTW), (ii) whether the prediction reaches the intended goal, measured by the Final Displacement Error (FDE), and (iii) whether it avoids unsafe or irrelevant regions, captured by a penalty based on embodiment-specific semantic costs derived from Mask2Former segmentations.

Formally, a trace is a sequence of points $T=[(x_1, y_1), \dots, (x_n, y_n)]$ in image space. We compare it against a set $\mathcal{G}$ of ground-truth traces $T'=[(x'_1, y'_1), \dots, (x'_m, y'_m)]$ that describe equally good solutions and select the one with the lowest error:

$$\textrm{Score}(T, \mathcal{G}) = \underset{T' \in \mathcal{G}}{\min} \left[ \textrm{DTW}(T, T') + \textrm{FDE}(T, T') \right] + \textrm{Penalty}(T)$$
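
A minimal sketch of this computation is shown below, assuming unweighted terms and a per-pixel cost map obtained from the embodiment-specific semantic costs; any normalization or weighting used in the official implementation is not reflected here.

```python
# Sketch of the trace score: DTW + FDE over the best-matching ground truth,
# plus a semantic penalty sampled from an embodiment-specific cost map.
import numpy as np

def dtw(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dynamic Time Warping distance between two traces of shape (n, 2) and (m, 2)."""
    n, m = len(pred), len(gt)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final Displacement Error: distance between the two endpoints."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def penalty(pred: np.ndarray, cost_map: np.ndarray) -> float:
    """Sum of per-pixel semantic costs sampled under the predicted trace."""
    xs = np.clip(pred[:, 0].round().astype(int), 0, cost_map.shape[1] - 1)
    ys = np.clip(pred[:, 1].round().astype(int), 0, cost_map.shape[0] - 1)
    return float(cost_map[ys, xs].sum())

def score(pred: np.ndarray, ground_truths: list[np.ndarray], cost_map: np.ndarray) -> float:
    """Lower is better: best-matching ground truth plus the semantic penalty."""
    return min(dtw(pred, gt) + fde(pred, gt) for gt in ground_truths) + penalty(pred, cost_map)
```

The best of several equally valid expert traces determines the path and endpoint error, while the penalty discourages traversing regions that are unsafe or irrelevant for the given embodiment.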

Alignment with Human Judgement

Experiment to show the score's alignment with human preference

We show that the score function aligns with human preference by computing the correlation between the ranking induced by the score and a ranking derived from pairwise human preference judgements.
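
A sketch of this check is given below. Using Spearman's rank correlation and aggregating pairwise judgements by win count are assumptions of this illustration, not necessarily the exact procedure used for the benchmark.

```python
# Sketch: rank predictions by the score function, rank them again from human
# pairwise preferences, and correlate the two rankings.
import numpy as np
from scipy.stats import spearmanr

def ranking_from_pairwise(num_items: int, pairwise_wins: list[tuple[int, int]]) -> np.ndarray:
    """Turn pairwise human judgements (winner_idx, loser_idx) into a ranking by win count."""
    wins = np.zeros(num_items)
    for winner, _loser in pairwise_wins:
        wins[winner] += 1
    return (-wins).argsort().argsort()  # rank 0 = most preferred

def alignment(scores: np.ndarray, pairwise_wins: list[tuple[int, int]]) -> float:
    """Correlation between the score-induced ranking and the human ranking."""
    score_rank = scores.argsort().argsort()  # lower score = better rank
    human_rank = ranking_from_pairwise(len(scores), pairwise_wins)
    rho, _ = spearmanr(score_rank, human_rank)
    return float(rho)
```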


Results


Left: Ranking of VLMs, the uninformed baseline Straight Forward, and human expert performance, split by embodiment. Note that a lower score is better. Right: Performance per task category for the same models.

Example predictions

Example predictions by the models Gemini 2.5 Pro, GPT-5, and o3.

Example reasoning

Example of o3's reasoning with the prediction in pink on the left and the steps on the right. The model reasons correctly but is unable to predict a corresponding trace.


Key Findings

  1. Large human performance gap. Across all four embodiments and task categories, VLM scores are substantially worse than both human and oracle-like baselines, highlighting significant room for improvement.
  2. Goal localization is the dominant failure mode. When models predict only the goal location and we connect it to the start with a straight line (see the sketch after this list), scores are similar to full-trace predictions. Yet even with the correct goal, path shaping lags behind human performance.
  3. Embodiment robustness. Aggregate performance differences across Human, Legged Robot, Wheeled Robot, and Bicycle embodiments are small, suggesting general limitations in spatial grounding rather than embodiment-specific blind spots.
  4. Score function alignment with human preference. Our semantic-aware trace score, which augments the DTW distance with an endpoint error and embodiment-conditioned penalties based on automated semantics, correlates more strongly with human preference than DTW alone. Using manual segmentation yields an additional but modest gain.
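
The goal-only baseline referenced in finding 2 can be sketched as follows; the choice of start point and the number of interpolated points are assumptions of this illustration. The resulting trace can be evaluated with the same score function as a full prediction.

```python
# Sketch of the goal-only baseline: keep the predicted goal point and replace
# the rest of the trace with a straight line from an assumed start point.
import numpy as np

def straight_line_trace(start: np.ndarray, goal: np.ndarray, num_points: int = 20) -> np.ndarray:
    """Linearly interpolate a 2D image-space trace from start to goal."""
    t = np.linspace(0.0, 1.0, num_points)[:, None]
    return (1.0 - t) * start + t * goal
```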


BibTeX

Not published yet