Accepted to ECCV 2026

Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes

A data-driven feed-forward network that reconstructs complete, consistent, metric-scale indoor 3D scenes from sparse, unordered panoramic images.

Xi Li¹, Linyuan Li¹, Yan Wu¹, Tong Rao¹, Kai Zhang¹, Xinchen Hui¹, Cihui Pan¹,²

¹ Realsee, China; ² Quanzhou University of Information Engineering, China



Argus takes a sparse, unordered set of panoramic images and reconstructs a complete, geometrically consistent, and metric-scale 3D indoor scene in a single feed-forward pass.

Abstract

Metric feed-forward 3D reconstruction for panoramic data remains under-explored due to the lack of large-scale panoramic RGB-D training data. We present Realsee3D, a hybrid dataset of 10K indoor scenes (1K real, 9K synthetic) with 299K panoramic viewpoints and precise metric annotations, and Argus, a feed-forward network trained on it for metric panoramic 3D reconstruction. In the sparse unordered capture setting of Realsee3D, a poorly chosen coordinate anchor can cause global pose drift.

Argus addresses this with a learned covisibility module that selects the geometrically optimal reference view to anchor the metric world frame. To further improve multi-task learning, we decompose the bidirectional pixel-to-world mapping into interpretable sub-steps with per-step supervision and cross-coordinate joint constraints, reinforcing geometric consistency across prediction branches. On the Realsee3D benchmark, Argus achieves state-of-the-art metric performance in camera pose estimation, depth estimation, and point cloud reconstruction.

Interactive Reconstruction

From a sparse set of panoramic views, Argus reconstructs a complete, metric-scale 3D point cloud in a single feed-forward pass.

Method

Argus first extracts patch tokens from each panoramic view with a DINOv2 backbone. A lightweight Covisibility Transformer predicts a global covisibility score per view and selects the highest-scoring view as the reference frame, which receives a dedicated reference camera token. A deeper Geometry Transformer then aggregates multi-view geometric features. From these features, an MLP head regresses camera poses from the camera tokens, and independent DPT heads predict the dense outputs: depth and cross-coordinate-frame point maps.
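The reference-selection step can be illustrated with a minimal sketch. It is not the Argus implementation: the network predicts per-view scores with a transformer, whereas this example assumes a hypothetical pairwise overlap matrix `pairwise_covis` and averages it into a per-view score before taking the argmax. The function name `select_reference_view` is likewise illustrative.

```python
from typing import Tuple

import numpy as np


def select_reference_view(pairwise_covis: np.ndarray) -> Tuple[int, np.ndarray]:
    """Pick the view with the highest global covisibility score.

    pairwise_covis[i, j] is a hypothetical overlap estimate in [0, 1]
    between views i and j. It stands in for the learned scores, which
    the network predicts directly.
    """
    covis = np.asarray(pairwise_covis, dtype=float)
    n = covis.shape[0]
    if covis.ndim != 2 or covis.shape != (n, n) or n < 2:
        raise ValueError("expected a square matrix with at least two views")

    # Ignore self-overlap, then average each view's overlap with the others.
    off_diag = covis.copy()
    np.fill_diagonal(off_diag, 0.0)
    scores = off_diag.sum(axis=1) / (n - 1)

    # The highest-scoring view anchors the world frame.
    return int(np.argmax(scores)), scores


# Three views: view 1 overlaps well with both of the others.
covis = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.5],
    [0.1, 0.5, 1.0],
])
ref_index, view_scores = select_reference_view(covis)
```

In this toy example view 1 is selected, since it is the best connected of the three. Anchoring the world frame at a well-connected view is the intuition behind the covisibility module described above.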

The core idea is to explicitly factorize the bidirectional pixel-to-world transform into several interpretable intermediate geometric representations, supervising each sub-step individually and enforcing joint pose constraints across coordinate frames. This lowers optimization difficulty and strengthens multi-task synergy. The model outputs native metric scale without post-hoc alignment.
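To make the idea of a factorized pixel-to-world mapping concrete, the sketch below chains three standard steps for an equirectangular panorama: pixel to unit ray, ray to camera-frame point (scaled by metric range), and camera frame to world frame. The specific sub-steps supervised in Argus are not spelled out here, so the decomposition, the axis convention (y up, z forward), and all function names are assumptions made for illustration.

```python
import numpy as np


def pixel_to_ray(u, v, width, height):
    """Step 1: equirectangular pixel -> unit ray (assumed y-up convention)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    return np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )


def ray_to_pixel(ray, width, height):
    """Inverse of step 1: direction vector -> equirectangular pixel."""
    ray = ray / np.linalg.norm(ray, axis=-1, keepdims=True)
    lon = np.arctan2(ray[..., 0], ray[..., 2])
    lat = np.arcsin(np.clip(ray[..., 1], -1.0, 1.0))
    u = (lon + np.pi) / (2.0 * np.pi) * width - 0.5
    v = (np.pi / 2.0 - lat) / np.pi * height - 0.5
    return u, v


def pixel_to_world(u, v, depth, R, t, width, height):
    """Forward chain: pixel -> ray -> camera point -> world point."""
    ray = pixel_to_ray(u, v, width, height)
    p_cam = np.asarray(depth, dtype=float)[..., None] * ray  # step 2: metric range along the ray
    return p_cam @ R.T + t                                   # step 3: camera-to-world pose


def world_to_pixel(p_world, R, t, width, height):
    """Backward chain: world point -> camera point -> (pixel, range)."""
    p_cam = (p_world - t) @ R  # row-vector form of R^T (p - t)
    depth = np.linalg.norm(p_cam, axis=-1)
    u, v = ray_to_pixel(p_cam, width, height)
    return u, v, depth


# Round trip for two pixels of a 1024x512 panorama with an assumed pose.
width, height = 1024, 512
u = np.array([100.0, 700.0])
v = np.array([50.0, 400.0])
depth = np.array([2.5, 4.0])  # metres along each ray
theta = np.deg2rad(30.0)
R = np.array([
    [np.cos(theta), 0.0, np.sin(theta)],
    [0.0, 1.0, 0.0],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
t = np.array([1.0, 0.2, -3.0])

p_world = pixel_to_world(u, v, depth, R, t, width, height)
u_back, v_back, depth_back = world_to_pixel(p_world, R, t, width, height)
```

Because each intermediate quantity (ray, camera-frame point, world-frame point) is explicit, each one can in principle receive its own loss, and the forward and backward chains can be checked against each other, which is the kind of cross-coordinate consistency the paragraph above describes.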

Network overview. A Covisibility Transformer selects the reference view, a reference-based Geometry Transformer aggregates multi-view features, and multiple heads predict the factorized geometric representations.

Geometric factorization. The bidirectional pixel↔world transform of a single view is decomposed into explicit, individually supervised geometric steps.

Realsee3D Dataset

Real-world scenes: 1,000 (Galois-P4 LiDAR)
Synthetic scenes: 9,000 (from real floorplans)
Total rooms: 95,962 (9,483 real + 86,479 synthetic)
Total viewpoints: 299,073 (24,263 real + 274,810 synthetic)

Realsee3D combines LiDAR-captured real scenes with large-scale synthetic scenes, providing sparse unordered panoramic image sets that reflect practical acquisition.


Results


Qualitative comparison. Argus reconstructs more accurate metric geometry and sharper structural boundaries than prior methods.

Covisibility learning. Argus selects reference views that are more central and better connected within the scene, anchoring a stable metric coordinate system.


Zero-shot reconstruction. Argus generalizes to panoramic datasets it was never trained on.

Citation

@misc{li2026argusmetricpanoramic3d,
      title={Argus: Metric Panoramic 3D Reconstruction for Indoor Scenes}, 
      author={Xi Li and Linyuan Li and Yan Wu and Tong Rao and Kai Zhang and Xinchen Hui and Cihui Pan},
      year={2026},
      eprint={2606.30047},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.30047}, 
}