Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view or video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose key idea is to disentangle global coherence from fine detail. A low-resolution (LR) stream operates on aggressively downsampled frames with alternating frame and global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution (HR) stream processes the original images per frame to preserve sharp boundaries and small structures. A lightweight adapter fuses the two streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and keeps inference cost practical. DAGE delivers sharp depth and pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
Thanks to the dual-stream design, we restrict the computationally heavy global attention to the LR stream, which carries far fewer image tokens, alleviating the quadratic scaling bottleneck of global transformers such as VGGT and Pi3. This reduces runtime by 2× at 540p and by 28× at 2K, enabling our model to process thousands of frames while keeping runtime largely insensitive to HR input size.
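To see why the savings grow with resolution, consider a simple cost model (our own illustration, not taken from the paper): one layer of global self-attention over a clip scales with the square of the total token count S·N, where S is the number of frames and N the tokens per frame. Downsampling the inputs to the LR stream shrinks N, so the attention cost shrinks quadratically. The patch size (14) and the LR working resolution (518 px) below are assumed values for illustration only.

```python
def global_attention_pairs(num_frames: int, tokens_per_frame: int) -> int:
    """Pairwise token interactions for one global self-attention layer:
    all S*N tokens attend to all S*N tokens."""
    total = num_frames * tokens_per_frame
    return total * total

def tokens_per_frame(height: int, width: int, patch: int = 14) -> int:
    """ViT-style patch tokenization (patch size 14 is an assumption)."""
    return (height // patch) * (width // patch)

# Hypothetical setup: a 100-frame clip, HR inputs at 2K (2048x1080) vs.
# an LR stream running at an assumed 518x518 working resolution.
frames = 100
hr_tokens = tokens_per_frame(1080, 2048)
lr_tokens = tokens_per_frame(518, 518)

hr_cost = global_attention_pairs(frames, hr_tokens)
lr_cost = global_attention_pairs(frames, lr_tokens)

# The ratio grows with the fourth power of the linear downsampling factor,
# which is why restricting global attention to the LR stream pays off most
# at high input resolutions.
print(f"global-attention cost ratio (HR vs. LR): {hr_cost / lr_cost:.1f}x")
```

Under these assumed numbers the LR stream's global attention is roughly two orders of magnitude cheaper than running it at 2K; the actual speedups reported above also include the per-frame HR pathway, so they differ from this isolated-attention estimate.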
We introduced DAGE, a dual-stream visual geometry transformer. A low-resolution stream efficiently estimates cameras and enforces cross-view consistency, while a high-resolution stream preserves sharp details; a lightweight adapter fuses the two. This decouples resolution from sequence length, supporting 2K inputs and long videos at practical cost. Empirically, DAGE yields sharper pointmaps and outperforms prior video geometry methods. It matches the 3D reconstruction and pose accuracy of state-of-the-art models [95, 102] while running significantly faster.
Limitations. Performance can drop under extremely low overlap or rapid non-rigid motion; the HR path is memory-intensive at very high resolutions; and the current method does not recover dynamic motion.
@inproceedings{ngo2026dage,
title={DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation},
author={Ngo, Tuan Duc and Huang, Jiahui and Oh, Seoung Wug and Blackburn-Matzen, Kevin and Kalogerakis, Evangelos and Gan, Chuang and Lee, Joon-Young},
booktitle={CVPR},
year={2026}
}
We thank Sangbeom Lim, Yihong Sun, Long Mai, Duc Hai Pham, and Minh-Quan Viet Bui for their valuable discussions.
Evangelos Kalogerakis has received funding from the European Research Council (ERC) under the European Union's Horizon Europe research and innovation programme (Grant agreement No. 101124742).