Panoptic Scene Understanding and Object Tracking for Autonomous Operations in Dynamic
Construction Environments
Lorenzo Terenzi†, Julian Nubert†, §, Pol Eyschen†, Pascal Roth†, Simin Fei†, Edo Jelavic†, and Marco Hutter†
Abstract
Construction sites are inherently complex and dynamic environments, posing significant challenges for autonomous systems due to their unstructured nature and the presence of dynamic actors such as workers and machinery. This project introduces a comprehensive panoptic scene understanding system tailored for such environments. By integrating 2D panoptic segmentation with 3D LiDAR mapping, our system generates detailed, real-time environmental representations that combine semantic and geometric data. This is further enhanced by a Kalman Filter-based tracking mechanism for dynamic object detection, allowing the system to effectively distinguish between static and dynamic elements in the scene.
We develop a fine-tuning method that adapts large pre-trained panoptic segmentation models, specifically Mask2Former and DETR architectures, for construction site applications using a limited number of domain-specific samples. Furthermore, we propose a novel dynamic panoptic mapping technique that fuses image-based segmentation with LiDAR data to create detailed maps essential for autonomous decision-making in construction sites. As a case study, we demonstrate the system's application in enabling real-time, reactive path planning using an online RRT* planner in dynamic scenarios.
In addition to the system, we release a first-of-its-kind dataset containing 502 hand-labeled images with panoptic annotations from construction sites, covering 35 semantic categories, along with code for model fine-tuning. This dataset and codebase are made publicly available to support further research in robotics and autonomous systems.
Approach
Our approach integrates 2D camera images and 3D LiDAR data to create a dynamic panoptic mapping system capable of operating in real time in complex construction environments. The system identifies both static "stuff" (e.g., terrain, buildings) and dynamic "things" (e.g., workers, machinery) and incorporates this information into a comprehensive panoptic map used for safe navigation planning.
Panoptic Segmentation
We utilize state-of-the-art transformer-based architectures, specifically Mask2Former and DETR, which we fine-tune for construction environments. Because labeled construction site data are scarce, we develop a fine-tuning method that adapts these large pre-trained models with only a limited number of domain-specific samples, improving their recognition of construction-specific objects and scenes.
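As an illustration, the sketch below shows one way such fine-tuning can be set up with the Hugging Face transformers implementation of Mask2Former; the three-class label map, the synthetic batch, and the hyperparameters are placeholders rather than our actual training configuration.

```python
import torch
from transformers import Mask2FormerForUniversalSegmentation

# Hypothetical subset of the construction-site label map (the released dataset has 35 classes).
id2label = {0: "gravel-pile", 1: "bucket", 2: "worker"}

# Load COCO-panoptic weights and replace the classification head for the new label set.
model = Mask2FormerForUniversalSegmentation.from_pretrained(
    "facebook/mask2former-swin-tiny-coco-panoptic",
    id2label=id2label,
    ignore_mismatched_sizes=True,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()

# One dummy training step with synthetic labels to show the interface; a real run
# iterates over batches prepared by Mask2FormerImageProcessor from the dataset.
pixel_values = torch.randn(1, 3, 384, 384)
masks = torch.zeros(2, 384, 384)
masks[0, :192, :] = 1.0          # instance 0 covers the top half of the image
masks[1, 192:, :] = 1.0          # instance 1 covers the bottom half
outputs = model(
    pixel_values=pixel_values,
    mask_labels=[masks],                  # per-image list of binary instance masks
    class_labels=[torch.tensor([0, 2])],  # per-image list of instance class ids
)
outputs.loss.backward()
optimizer.step()
```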
Dynamic Mapping and Tracking
Our dynamic mapping system processes LiDAR scans to distinguish between static and dynamic elements in the environment. We project the panoptic segmentation results from the camera images onto the 3D LiDAR point cloud, labeling each point accordingly. Dynamic objects are detected and tracked using clustering algorithms and Kalman filters, allowing the system to maintain awareness of moving objects even when they leave the camera's field of view.
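A minimal sketch of the projection step is shown below, assuming a pinhole camera model; the intrinsics, extrinsics, and panoptic ID image come from calibration and the segmentation module, and the function name is illustrative.

```python
import numpy as np

def label_points(points_lidar, panoptic_ids, K, T_cam_lidar):
    """Assign each LiDAR point the panoptic ID of the pixel it projects onto.

    points_lidar: (N, 3) points in the LiDAR frame.
    panoptic_ids: (H, W) integer image of panoptic segment IDs.
    K: (3, 3) camera intrinsics.  T_cam_lidar: (4, 4) LiDAR-to-camera transform.
    """
    H, W = panoptic_ids.shape
    labels = np.full(len(points_lidar), -1, dtype=int)    # -1 = outside the camera view

    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    in_front = pts_cam[:, 2] > 0.1                        # discard points behind the camera
    uvw = (K @ pts_cam[in_front].T).T                     # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    in_image = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    idx = np.flatnonzero(in_front)[in_image]
    labels[idx] = panoptic_ids[v[in_image], u[in_image]]
    return labels
```

The labeled dynamic points are then clustered and their centroids tracked with Kalman filters (e.g., a constant-velocity motion model), whose prediction step keeps estimates alive while an object is outside the camera's field of view.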
Semantic 2D Mapping
We build a semantic 2D map using the GridMap library, maintaining separate layers for static and dynamic elements. This map is continuously updated with new data from the segmentation and tracking modules. The merged semantic map provides occupancy and cost information crucial for navigation planning, distinguishing between traversable and non-traversable areas and assigning traversal costs based on terrain types.
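The on-robot implementation builds on the GridMap library; the numpy sketch below is only a simplified stand-in that illustrates how a static semantic layer and a dynamic occupancy layer could be merged into one cost layer, with illustrative class-to-cost assignments.

```python
import numpy as np

# Illustrative traversal costs for "stuff" classes and a set of hard obstacles.
TERRAIN_COST = {"gravel": 0.2, "sand": 0.4, "asphalt": 0.1}
NON_TRAVERSABLE = {"building", "container", "stone"}

def merge_cost_map(static_classes, dynamic_occupancy, class_names):
    """static_classes: (H, W) map of semantic class indices (static layer).
    dynamic_occupancy: (H, W) boolean map of cells occupied by tracked dynamic objects.
    class_names: list mapping class index -> class name."""
    cost = np.zeros(static_classes.shape, dtype=float)
    for idx, name in enumerate(class_names):
        mask = static_classes == idx
        if name in NON_TRAVERSABLE:
            cost[mask] = np.inf                  # hard obstacle
        else:
            cost[mask] = TERRAIN_COST.get(name, 0.5)  # default cost for unknown terrain
    cost[dynamic_occupancy] = np.inf             # dynamic objects block traversal
    return cost
```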
Results
Panoptic Segmentation Performance
Our fine-tuned Mask2Former models significantly outperform the DETR baseline on construction-site images. We report panoptic quality (PQ), segmentation quality (SQ), recognition quality (RQ), and inference throughput (images per second):
| Model | PQ | SQ | RQ | Images/s |
|---|---|---|---|---|
| DETR | 0.41 | 0.62 | 0.52 | 12 |
| Mask2Former Swin-Tiny | 0.68 | 0.79 | 0.81 | 8 |
| Mask2Former Swin-Base | 0.63 | 0.77 | 0.73 | 3 |
| Mask2Former Swin-Large | 0.68 | 0.79 | 0.80 | 1 |
Autonomous Navigation
We deployed the system on a Menzi Muck M545 excavator equipped with an RGB camera and a LiDAR sensor. The system successfully enabled real-time navigation in dynamic construction environments. The panoptic segmentation output informed an online RRT* planner, allowing for reactive path planning in the presence of dynamic obstacles such as moving machinery and workers.
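The sketch below outlines the reactive replanning loop at a high level; the planner call and the map, pose, and path interfaces (plan_rrt_star, path_blocked, get_cost_map, get_pose, send_path) are placeholders for the corresponding system components, not our actual API.

```python
import time

def navigation_loop(plan_rrt_star, path_blocked, get_cost_map, get_pose, send_path,
                    goal, rate_hz=5.0):
    """Replan with the online RRT* planner whenever the current path is invalidated."""
    path = None
    while True:
        cost_map = get_cost_map()                       # merged static + dynamic layers
        pose = get_pose()
        if path is None or path_blocked(path, cost_map):
            path = plan_rrt_star(pose, goal, cost_map)  # query the planner on the cost map
            send_path(path)                             # hand the new path to the controller
        time.sleep(1.0 / rate_hz)
```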
Dataset
To support further research in robotics and autonomous systems, we release a first-of-its-kind dataset of 502 hand-labeled images with comprehensive panoptic annotations from construction sites. It covers 35 carefully selected semantic categories relevant to construction environments, including specialized labels such as "bucket", "gripper", "self-arm", "container", "stone", and "gravel-pile".
Dataset Features
- Number of images: 502 hand-labeled images
- Semantic categories: 35 categories, including both "stuff" and "things"
- Annotations: Panoptic segmentation masks with detailed object boundaries
- Data diversity: Images captured from multiple locations and under varied conditions
Acknowledgements
This work is supported by the NCCR Digital Fabrication & Robotics, the SNF project No. 188596, and the Max Planck ETH Center for Learning Systems.
†All authors are with the Robotic Systems Lab, ETH Zürich.
§The author is with the MPI for Intelligent Systems, Stuttgart, Germany.
Corresponding author: Lorenzo Terenzi