NaviTrace is a novel VQA benchmark for evaluating vision–language models (VLMs) on their embodiment-specific understanding of navigation across diverse real-world scenarios. Given a natural-language instruction and an embodiment type (human, legged robot, wheeled robot, bicycle), a model must output a 2D navigation path in image space, which we call a trace.
The benchmark includes 1000 scenarios with more than 3000 expert traces, divided evenly into a validation split with public annotations and a test split with hidden annotations.
The dataset is available on Hugging Face. We provide ready-to-use evaluation scripts for API-based model inference and scoring, along with a leaderboard where users can compute scores on the test split and optionally submit their models for public comparison.
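As a rough sketch, the dataset could be loaded with the Hugging Face `datasets` library as shown below; note that the repository id `leggedrobotics/navitrace` and the split names are placeholders and may differ from the actual release.

```python
# Minimal sketch: loading the NaviTrace splits with the Hugging Face `datasets` library.
# NOTE: the repository id "leggedrobotics/navitrace" is a placeholder and may differ
# from the actual dataset name on the Hub.
from datasets import load_dataset

# The validation split ships with annotations; test annotations are hidden for the leaderboard.
val_split = load_dataset("leggedrobotics/navitrace", split="validation")
test_split = load_dataset("leggedrobotics/navitrace", split="test")

print(len(val_split), "validation scenarios")
print(val_split[0].keys())  # inspect the available fields of one scenario
```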
Vision–language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models’ navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation path in image space, which we call a trace. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation.
NaviTrace contains 1000 scenarios with more than 3000 traces across four embodiments (human, legged robot, wheeled robot, bicycle). Each scenario captures realistic navigation challenges with an image, language instruction, ground-truth traces, embodiment types, and task categories describing the main difficulties of the navigation task. We divide the dataset evenly into a validation split for fine-tuning applications and a test split with hidden annotations for a fair evaluation on our public leaderboard.
Left: Geographic distribution of image sources, with the inner circle denoting countries and the outer circle specifying cities or regions. Right: Distribution of scenarios by setting (urban vs. rural), environment type (natural vs. structured), lighting, and weather.
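To illustrate the structure of a single scenario, the dataclass below mirrors the fields described above; the exact field names and types in the released dataset may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Illustrative container for one NaviTrace scenario. The field names and types are
# assumptions that mirror the description above, not the exact schema of the release.
@dataclass
class Scenario:
    image: str                  # path to (or identifier of) the scene image
    instruction: str            # natural-language navigation instruction
    embodiments: List[str]      # e.g. ["human", "legged robot", "wheeled robot", "bicycle"]
    task_categories: List[str]  # main difficulties of the navigation task
    # One or more expert traces per embodiment; each trace is a list of (x, y) pixels.
    ground_truth_traces: Dict[str, List[List[Tuple[float, float]]]]
```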
We design a score function that balances three factors: (i) how closely the path follows the ground truth, measured by the Dynamic Time Warping (DTW) error; (ii) whether the prediction reaches the intended goal, measured by the Final Displacement Error (FDE); and (iii) whether it avoids unsafe or irrelevant regions, via a penalty based on embodiment-specific semantic costs derived from a Mask2Former model.
Formally, a trace is a sequence of points $T=[(x_1, y_1), \dots, (x_n, y_n)]$ in image space. We compare it against multiple ground-truth traces $T'=[(x'_1, y'_1), \dots, (x'_m, y'_m)] \in \mathcal{G}$, each describing an equally good solution, and select the ground truth yielding the lowest error:
$$\textrm{Score}(T, \mathcal{G}) = \underset{T' \in \mathcal{G}}{\min} \left[ \textrm{DTW}(T, T') + \textrm{FDE}(T, T') \right] + \textrm{Penalty}(T)$$

To make the score easier to interpret, we rescale it so that the performance of simply drawing a vertical line through the image center (the Straight Forward baseline) maps to $0$ and a perfect prediction maps to $100$:
$$\widehat{\textrm{Score}}(T, \mathcal{G}) = \frac{3234.75 - \textrm{Score}(T, \mathcal{G})}{3234.75} \cdot 100$$
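For concreteness, the following sketch shows one way the score could be computed, assuming the standard dynamic-programming formulation of DTW over Euclidean point distances and treating the embodiment-specific semantic penalty as a precomputed input (the Mask2Former-based per-pixel costs are not reproduced here); the actual evaluation scripts may normalize or weight the terms differently.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

def _dist(a: Point, b: Point) -> float:
    """Euclidean distance between two image-space points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def dtw(t: Sequence[Point], t_ref: Sequence[Point]) -> float:
    """Standard O(n*m) Dynamic Time Warping distance over Euclidean point distances."""
    n, m = len(t), len(t_ref)
    inf = float("inf")
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = _dist(t[i - 1], t_ref[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]

def fde(t: Sequence[Point], t_ref: Sequence[Point]) -> float:
    """Final Displacement Error: distance between the two endpoints."""
    return _dist(t[-1], t_ref[-1])

def score(trace: Sequence[Point],
          ground_truths: Sequence[Sequence[Point]],
          penalty: float = 0.0) -> float:
    """Raw score: lowest DTW + FDE over all ground-truth traces, plus the
    embodiment-specific semantic penalty (assumed to be precomputed here)."""
    best = min(dtw(trace, gt) + fde(trace, gt) for gt in ground_truths)
    return best + penalty

def normalized_score(trace: Sequence[Point],
                     ground_truths: Sequence[Sequence[Point]],
                     penalty: float = 0.0) -> float:
    """Rescale so the Straight Forward baseline maps to 0 and a perfect trace to 100."""
    baseline = 3234.75
    return (baseline - score(trace, ground_truths, penalty)) / baseline * 100.0
```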
We show that the score function aligns with human preference by computing the correlation between the ranking induced by the score and a pairwise ranking created by a human annotator.
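One way to quantify this alignment is a rank correlation between the two rankings; the snippet below uses Kendall's τ as an illustrative choice, which is an assumption and not necessarily the measure used in the paper.

```python
from scipy.stats import kendalltau

# Hypothetical example: positions of the same five candidate traces in the
# score-based ranking and in a ranking derived from human pairwise comparisons.
score_ranking = [1, 2, 3, 4, 5]
human_ranking = [1, 3, 2, 4, 5]

tau, p_value = kendalltau(score_ranking, human_ranking)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```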
@article{Windecker2025NaviTrace,
  author        = {Tim Windecker and Manthan Patel and Moritz Reuss and Richard Schwarzkopf and Cesar Cadena and Rudolf Lioutikov and Marco Hutter and Jonas Frey},
  title         = {NaviTrace: Evaluating Embodied Navigation of Vision-Language Models},
  year          = {2025},
  month         = {October},
  journal       = {arXiv preprint arXiv:2510.26909},
  note          = {Preprint; awaiting peer review and journal submission},
  url           = {https://arxiv.org/abs/2510.26909},
  eprint        = {2510.26909},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
}