Vision–language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models’ navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate seven state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation.
NaviTrace contains 1000 scenarios with more than 3000 traces across four embodiments (human, legged robot, wheeled robot, bicycle). Each scenario captures realistic navigation challenges with an image, language instruction, ground-truth traces, embodiment types, and task categories describing the main difficulties of the navigation task. We divide the dataset evenly into a validation split for fine-tuning applications and a test split with hidden annotations for a fair evaluation on our public leaderboard.
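For illustration, a single scenario could be represented as a record like the one below. The field names and example values are hypothetical: they only reflect the components described above (image, instruction, embodiment, task category, ground-truth traces, split) and are not the actual dataset schema.

```python
# Hypothetical sketch of one NaviTrace scenario; field names and values are
# illustrative only and may differ from the released dataset format.
scenario = {
    "image": "scenario_0001.jpg",            # RGB image of the scene
    "instruction": "Follow the path to the bench and stop in front of it.",
    "embodiment": "wheeled robot",           # human, legged robot, wheeled robot, or bicycle
    "task_category": "narrow passage",       # main difficulty of the navigation task (example label)
    "ground_truth_traces": [                 # multiple equally valid expert traces
        [(412, 590), (430, 512), (455, 431), (470, 380)],  # (x, y) points in image space
    ],
    "split": "validation",                   # "validation" (public) or "test" (hidden annotations)
}
```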
Left: Geographic distribution of image sources, with the inner circle denoting countries and the outer circle specifying cities or regions. Right: Distribution of scenarios by setting (urban vs. rural), environment type (natural vs. structured), lighting, and weather.
We design a score function that balances three factors: (i) how closely the predicted path follows the ground truth, measured by the Dynamic Time Warping (DTW) error; (ii) whether the prediction reaches the intended goal, measured by the Final Displacement Error (FDE); and (iii) whether it avoids unsafe or irrelevant regions, via a penalty based on embodiment-specific semantic costs obtained from a Mask2Former model.
Formally, a trace is a sequence of points $T=[(x_1, y_1), \dots, (x_n, y_n)]$ in image space. We compare it against a set $\mathcal{G}$ of ground-truth traces $T'=[(x'_1, y'_1), \dots, (x'_m, y'_m)]$, each describing an equally good solution, and select the one with the lowest error:
$$\textrm{Score}(T, \mathcal{G}) = \underset{T' \in \mathcal{G}}{\min} \, \textrm{DTW}(T, T') + \textrm{FDE}(T, T') + \textrm{Penalty}(T)$$

We show that the score function aligns with human preference by calculating the correlation between the score ranking and a pairwise ranking created by a human.
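The sketch below shows one way this score could be computed, assuming traces are arrays of (x, y) pixel coordinates and that an embodiment-specific per-pixel cost map (e.g., derived from Mask2Former semantics) is already available. Function names, the lack of term weighting or normalization, and the cost-map lookup are simplifications, not the benchmark's official implementation.

```python
# Minimal, illustrative implementation of the trace score (lower is better).
import numpy as np


def dtw_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dynamic Time Warping distance between two point sequences of shape (n, 2) and (m, 2)."""
    n, m = len(pred), len(gt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(pred[i - 1] - gt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def final_displacement_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Euclidean distance between the predicted and ground-truth endpoints."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))


def semantic_penalty(pred: np.ndarray, cost_map: np.ndarray) -> float:
    """Sum of embodiment-specific per-pixel costs along the predicted trace."""
    h, w = cost_map.shape
    xs = np.clip(pred[:, 0].astype(int), 0, w - 1)
    ys = np.clip(pred[:, 1].astype(int), 0, h - 1)
    return float(cost_map[ys, xs].sum())


def trace_score(pred, ground_truths, cost_map) -> float:
    """Score(T, G): best match over all equally valid ground-truth traces plus the semantic penalty."""
    pred = np.asarray(pred, dtype=float)
    best = min(
        dtw_distance(pred, np.asarray(gt, dtype=float))
        + final_displacement_error(pred, np.asarray(gt, dtype=float))
        for gt in ground_truths
    )
    return best + semantic_penalty(pred, cost_map)
```

Since the penalty depends only on the predicted trace, adding it outside the minimum over ground-truth traces is equivalent to the formula above.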
Left: Ranking of VLMs, the uninformed baseline Straight Forward, and human expert performance, split by embodiment. Note that a lower score is better. Right: Performance per task category for the same models.
Example predictions by the models Gemini 2.5 Pro, GPT-5, and o3.
Example of o3's reasoning, with the prediction (pink) shown on the left and the reasoning steps on the right. The model reasons correctly but is unable to predict a corresponding trace.
Not published yet