WildOS: Open-Vocabulary Object Search in the Wild

1Jet Propulsion Laboratory (JPL), NASA      2Robotics Systems Lab, ETH Zurich
3Field AI (Work done while at JPL)      4Stanford University      5University of California, Berkeley

Video teaser summarizing the WildOS system. See the visualization legend below.

Abstract

Autonomous navigation in complex, unstructured outdoor environments requires robots to operate over long ranges without prior maps and with limited depth sensing. In such settings, relying solely on geometric frontiers for exploration is often insufficient; the ability to reason semantically about where to go and what is safe to traverse is crucial for robust, efficient exploration. This work presents WildOS, a unified system for long-range, open-vocabulary object search that combines safe geometric exploration with semantic visual reasoning. WildOS builds a sparse navigation graph to maintain spatial memory, while a foundation-model-based vision module, ExploRFM, scores the frontier nodes of the graph. ExploRFM simultaneously predicts traversability, visual frontiers, and object similarity in image space, enabling real-time, onboard semantic navigation. The resulting vision-scored graph lets the robot explore semantically meaningful directions while ensuring geometric safety. Furthermore, we introduce a particle-filter-based method for coarse localization of the open-vocabulary target query that estimates candidate goal positions beyond the robot's immediate depth horizon, enabling effective planning toward distant goals. Extensive closed-loop field experiments across diverse off-road and urban terrains demonstrate that WildOS enables robust navigation, significantly outperforming purely geometric and purely vision-based baselines in both efficiency and autonomy. Our results highlight the potential of vision foundation models to drive open-world robotic behaviors that are both semantically informed and geometrically grounded.

method figure
(a) WildOS enables autonomous semantic navigation in diverse unstructured outdoor environments. (b, c) Due to the limited range of geometric sensing, robots can only reliably perceive nearby regions within a depth horizon (blue), leading to myopic exploration and (e) difficulty localizing distant targets (e.g., a “house”) beyond sensing range. (b, c) Conventional exploration (dashed path) relies on geometric frontiers (blue dots) at the boundary between known and unknown space, which ignores long-range semantic and traversability cues. WildOS (green path) augments geometric exploration with long-range visual reasoning using a vision foundation model, defining a visual horizon (red) that extends beyond depth sensing and predicts visual traversability, visual frontiers (red dots), and open-vocabulary object similarity in image space. (d) During deployment, a sparse navigation graph is built from geometry and frontier nodes are scored using vision, while a particle-filter-based goal localization module (yellow particles) estimates candidate goal locations beyond the depth horizon, enabling safe, efficient planning toward distant semantic goals.

Method Overview

method figure
Method Overview consisting of five main components: 1) WildOS incrementally builds a sparse navigation graph from geometric sensing to maintain persistent spatial memory and identify geometric frontier nodes for safe exploration. 2) To reason beyond the limited depth horizon, a learned vision-language module, ExploRFM, processes the current image and text query to predict visual traversability, visual frontiers, and open-vocabulary object similarity over a long-range visual horizon. 3) Object detections from multiple viewpoints are fused by a probabilistic goal triangulation module to estimate a coarse 3D target location beyond direct sensor range. 4) Geometric frontier nodes are then projected into the image and scored using the visual-semantic cues and the current goal estimate, producing a semantically scored navigation graph. 5) Finally, a hierarchical planner selects and executes actions by planning over the scored graph and generating locally safe motions toward intermediate goals.
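Steps 2 and 4 above can be illustrated with a minimal sketch: frontier nodes are projected into the camera image with a pinhole model and scored by sampling the predicted heatmaps. The function names, the nearest-pixel sampling, and the fixed similarity weight are illustrative assumptions, not the released implementation.

```python
import numpy as np

def project_points(nodes_cam, K):
    """Pinhole projection of 3D points (camera frame, z forward) to pixel coordinates."""
    z = nodes_cam[:, 2]
    uv = (K @ (nodes_cam.T / z)).T[:, :2]
    return uv, z

def score_frontier_nodes(nodes_cam, K, traversability, similarity, w_sim=0.7):
    """Score frontier nodes by sampling image-space heatmaps at their projections.

    nodes_cam: (N, 3) frontier node positions in the camera frame.
    traversability, similarity: (H, W) maps in [0, 1] (ExploRFM-style outputs).
    Nodes behind the camera or projecting outside the image score 0.
    """
    H, W = traversability.shape
    uv, z = project_points(nodes_cam, K)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    scores = np.zeros(len(nodes_cam))
    ui, vi = u[valid], v[valid]
    scores[valid] = (1 - w_sim) * traversability[vi, ui] + w_sim * similarity[vi, ui]
    return scores
```

In the full system these per-node scores would then be attached to the navigation graph for the hierarchical planner to consume.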

Frontier Annotations on GrandTour

Visualizing a subset of the annotated visual frontiers on GrandTour dataset images.
Red regions (visual frontiers) in the image indicate candidate locations for further exploration.

Example predictions of Visual Traversability and Frontiers from ExploRFM in varied terrains.

Confidence maps from the network are thresholded. Visual Frontiers are shown in the Jet colormap (red indicates higher frontier confidence); Visual Traversability is shown in inverse Jet (blue indicates higher traversability confidence).
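The threshold-and-overlay step described above can be sketched as follows. This is a simplified illustration: a single fixed overlay color and blend weight stand in for the Jet colormap used in the actual figures.

```python
import numpy as np

def threshold_overlay(image, conf, thresh=0.5, color=(255, 0, 0), alpha=0.5):
    """Overlay thresholded confidence regions on an RGB image.

    image: (H, W, 3) uint8; conf: (H, W) confidence map in [0, 1].
    Pixels with conf >= thresh are alpha-blended with `color`
    (red here, matching the visual-frontier convention above).
    """
    out = image.astype(float).copy()
    mask = conf >= thresh
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, dtype=float)
    return out.astype(np.uint8)
```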

Open-Vocabulary Object Search

Q1: Does the complete WildOS system enable successful end-to-end object search from language queries?

Searching for "Orange Flag" - Third-person view and RViz visualization
The robot grounds arbitrary language queries into visual similarity maps, estimates coarse 3D goal locations through triangulation, and autonomously navigates toward them.
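The triangulation idea can be sketched as a minimal 2D bearing-only particle filter (the class name, noise parameters, and resampling scheme here are illustrative assumptions, not the deployed module): each detection contributes a bearing ray from the robot pose, and particles consistent with rays from multiple viewpoints survive resampling, yielding a coarse goal estimate beyond the depth horizon.

```python
import numpy as np

rng = np.random.default_rng(0)

class GoalParticleFilter:
    """Bearing-only particle filter for coarse 2D goal localization (sketch)."""

    def __init__(self, n=2000, area=50.0):
        # Uniform prior over a square search area around the robot.
        self.p = rng.uniform(-area, area, size=(n, 2))

    def update(self, robot_xy, bearing, sigma=0.1):
        # Angular error between each particle's direction and the detection bearing,
        # wrapped to [-pi, pi].
        d = self.p - robot_xy
        ang = np.arctan2(d[:, 1], d[:, 0])
        err = np.angle(np.exp(1j * (ang - bearing)))
        w = np.exp(-0.5 * (err / sigma) ** 2) + 1e-12
        w /= w.sum()
        # Resample in proportion to the weights, with small jitter to avoid collapse.
        idx = rng.choice(len(self.p), size=len(self.p), p=w)
        self.p = self.p[idx] + rng.normal(0.0, 0.2, self.p.shape)

    def estimate(self):
        return self.p.mean(axis=0)
```

With two viewpoints observing the same target, the surviving particles cluster near the ray intersection, which serves as the coarse navigation goal.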

Key Insight

WildOS successfully integrates language grounding, vision-based localization, and geometric planning to enable end-to-end open-vocabulary object search operating in real time on a deployed robot platform.

Fence Approach: Vision-Guided Navigation

Q2: Does integrating vision-based scoring with the navigation graph improve navigation performance compared to pure-geometry approaches?

WildOS: Third-person view and RViz visualization
Both vision-guided approaches (LRN and WildOS) immediately identify the corridor between the fences and plan paths toward it. Vanilla GraphNav proceeds straight toward the fence until it appears in the local map, forcing a late detour.

Key Insight

WildOS achieves lower average distance and time, with notably smaller variance, than the baselines. Vision-based scoring enables the robot to plan efficient routes around obstructions rather than heading straight toward blocked directions, mirroring human-like reasoning that prefers feasible detours over direct but blocked paths.

Dead End Recovery: Spatial Memory

Q3: Does the navigation graph improve robustness and memory compared to purely vision-based navigation?

WildOS: Third-person view and RViz visualization showing successful recovery
WildOS consistently completes the mission successfully in all runs, while LRN requires human intervention every time. WildOS initially chooses the shorter path, reaches the dead-end, and upon clearing the local frontier nodes, reroutes using the navigation graph.

Key Insight

Persistent spatial memory is essential for long-horizon autonomy. By maintaining a structured representation of previously explored regions and deferred frontiers, WildOS can recover from dead-ends and replan effectively, whereas memoryless vision-only strategies remain prone to oscillation and repeated failure.
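The recovery behavior above can be illustrated with a small sketch (the toy graph and helper name are hypothetical): exhausted frontier nodes are removed from the open set, and a shortest-path search over the remaining navigation graph yields the reroute.

```python
import heapq

def route_to_nearest_frontier(adj, start, frontiers):
    """Dijkstra over a sparse navigation graph; returns the path to the
    nearest node still in `frontiers`, or None if none is reachable."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u in frontiers:
            path = [u]
            while path[-1] != start:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return None

# Toy graph: a short corridor ending at frontier 'd' (which turns out to be
# a dead end) and a longer corridor leading to an open frontier 'f'.
adj = {
    "s": [("a", 1.0), ("b", 5.0)],
    "a": [("s", 1.0), ("d", 1.0)],
    "d": [("a", 1.0)],
    "b": [("s", 5.0), ("f", 1.0)],
    "f": [("b", 1.0)],
}
frontiers = {"d", "f"}
first = route_to_nearest_frontier(adj, "s", frontiers)    # shorter path chosen first
frontiers.discard("d")                                    # frontier cleared at the dead end
reroute = route_to_nearest_frontier(adj, "s", frontiers)  # reroute via spatial memory
```

Because the graph persists, clearing the dead-end frontier immediately exposes the next-best deferred frontier, whereas a memoryless policy has nothing to fall back on.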

Urban Navigation: Cross-Terrain Generalization

Q4: Does WildOS generalize effectively across diverse outdoor terrains?

Urban Environment - Run 1
Successful autonomous navigation runs in urban environments.

Key Insight

WildOS exhibits strong generalization across diverse terrains — from off-road unstructured environments to urban settings — enabled by foundation-model features. The system adapts seamlessly without requiring retraining or environment-specific tuning, highlighting the potential of vision foundation models to drive open-world robotic behaviors.

Visualization Legend

Visual Outputs: Frontier and Traversability maps from ExploRFM are thresholded and shown in Jet and Inverse Jet colormaps respectively. When the full model visualization is shown (outputs from all three cameras), we overlay the full heatmaps (without thresholding) on the image using a Jet colormap for both.

Navigation Graph: Edges are shown in red, free nodes in green, and frontier nodes in blue.

Scored Graph: Frontier nodes are surrounded by a score ring indicating the node's score in the goal direction (colored according to the Jet colormap).

Goal & Planning: The triangulated goal is shown as a cyan sphere, with projected particles in white. The planned high-level path is shown in green.

This color scheme is followed throughout the visualizations unless stated otherwise.

BibTeX

        @article{YourPaperKey2024,
          title={WildOS: Open-Vocabulary Object Search in the Wild},
          author={Shah, Hardik and Tevere, Erica and Atha, Deegan and Kaufmann, Marcel and Khattak, Shehryar and Patel, Manthan and Hutter, Marco and Frey, Jonas and Spieler, Patrick},
          journal={Conference/Journal Name},
          year={2026},
          url={https://leggedrobotics.github.io/wildos/}
        }