arXiv 2026

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

Tuan Duc Ngo¹, Chuang Gan¹, Evangelos Kalogerakis¹,²

¹ University of Massachusetts Amherst
² Technical University of Crete

Amodal 3D Reconstruction from a Single Image

TL;DR: VolFill recovers the complete 3D scene geometry from a single RGB image, including occluded surfaces hidden behind the visible geometry.

Abstract

Reconstructing the complete geometry of a scene from a single RGB image remains challenging, especially when inferring hidden structures where visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method uses a hybrid 3D VAE to compress sparse truncated unsigned distance function (TUDF) grids into a compact latent space, paired with a latent Diffusion Transformer (DiT) that denoises this representation to recover the complete scene. We condition the generation on geometry foundation models, leveraging their spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.

Contributions

Volumetric Flow Matching

VolFill is a generative framework that uses volumetric flow matching to recover complete, scene-level 3D geometry (including occluded surfaces) directly from a single RGB image.

Hybrid 3D VAE

A hybrid 3D VAE compresses high-resolution TUDF grids into a compact latent space, enabling efficient yet high-fidelity reconstruction of complex amodal structures.

Dual-Conditioning Strategy

A dual-conditioning strategy leverages geometry foundation models, fusing high-level image tokens with explicit visible geometry to guide robust amodal reasoning.

Task

Single-view amodal 3D scene reconstruction. Given one RGB image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ of an indoor scene, we seek to recover the complete scene geometry $\mathcal{S}$ within the camera frustum, including surfaces behind the visible foreground.

Method

We represent the scene as a Truncated Unsigned Distance Function (TUDF) $V \in \mathbb{R}^{N \times N \times N}$ at resolution $N = 256$, where each voxel stores the distance to the nearest physical surface $\mathcal{S}$ clipped at a maximum value $\tau$: $V(\mathbf{p}) = \min(\mathrm{dist}(\mathbf{p}, \mathcal{S}), \tau)$.
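The TUDF definition above can be sketched in a few lines. This is a toy illustration rather than the authors' pipeline: it assumes the surface is given as a set of sample points, uses a brute-force nearest-point search, and runs on a tiny grid (`n = 8`) in plain Python. The function name `tudf_grid` and the unit-cube bounds are hypothetical choices made for this example.

```python
import math

def tudf_grid(surface_points, n, tau, lo=0.0, hi=1.0):
    """Return an n x n x n nested list V with V[i][j][k] = min(dist(p, S), tau).

    p is the centre of voxel (i, j, k) inside the cube [lo, hi]^3, and the
    surface S is approximated by the given sample points (brute-force search).
    """
    step = (hi - lo) / n
    centres = [lo + (i + 0.5) * step for i in range(n)]
    grid = []
    for x in centres:
        plane = []
        for y in centres:
            row = []
            for z in centres:
                d = min(math.dist((x, y, z), s) for s in surface_points)
                row.append(min(d, tau))  # clip at the truncation value tau
            plane.append(row)
        grid.append(plane)
    return grid

# Toy surface (assumption for the demo): the plane x = 0.5 on a 9 x 9 lattice.
plane_points = [(0.5, a / 8, b / 8) for a in range(9) for b in range(9)]
V = tudf_grid(plane_points, n=8, tau=0.2)

# Far voxels saturate at tau; voxels on either side of the plane store the
# same unsigned distance, so the field carries no inside/outside sign.
print(V[0][0][0], V[3][4][4], V[4][4][4])
```

Because the distance is unsigned, the representation does not require watertight geometry, which suits open indoor surfaces such as walls and floors. A practical implementation at $N = 256$ would use an array library and a spatial index instead of nested loops.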

Rather than treating the prediction of $V$ as a deterministic regression problem, we formulate amodal reconstruction as estimating the conditional distribution $P(V \mid \mathbf{I})$. Two components realize this: a hybrid 3D VAE that compresses $V$ into a compact latent space, and a latent diffusion transformer (DiT) trained with flow matching and conditioned on geometric priors from a frozen foundation model.
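The page does not state which probability path the flow-matching objective uses. As a minimal sketch, the code below assumes the common linear (rectified-flow style) path $\mathbf{z}_t = (1 - t)\,\mathbf{z}_0 + t\,\mathbf{z}_1$ between Gaussian noise $\mathbf{z}_0$ and a data latent $\mathbf{z}_1$, with target velocity $\mathbf{z}_1 - \mathbf{z}_0$ and fixed-step Euler sampling. All names are hypothetical, and `velocity_fn` is a stand-in for the conditioned DiT.

```python
import random

def interpolate(z0, z1, t):
    """Point on the straight path from noise z0 (t=0) to data latent z1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def target_velocity(z0, z1):
    """Regression target for the velocity network along the linear path."""
    return [b - a for a, b in zip(z0, z1)]

def flow_matching_loss(velocity_fn, z0, z1, t, cond):
    """Mean squared error between predicted and target velocity at time t."""
    pred = velocity_fn(interpolate(z0, z1, t), t, cond)
    target = target_velocity(z0, z1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_sample(velocity_fn, z0, cond, steps=10):
    """Integrate dz/dt = v(z, t, cond) from t=0 to t=1 with fixed Euler steps."""
    z, dt = list(z0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(z, i * dt, cond)
        z = [a + dt * b for a, b in zip(z, v)]
    return z

# Toy check: an oracle velocity field that always points at a known latent.
# Here `cond` simply carries that latent; in a real model it would hold the
# image tokens and visible-geometry conditioning instead.
rng = random.Random(0)
z_data = [0.5, -1.0, 2.0]
z_noise = [rng.gauss(0.0, 1.0) for _ in z_data]

def oracle(z, t, cond):
    # Exact velocity for the linear path: (z1 - z_t) / (1 - t) = z1 - z0.
    return [(b - a) / (1.0 - t) for a, b in zip(z, cond)]

z_out = euler_sample(oracle, z_noise, cond=z_data, steps=10)
loss = flow_matching_loss(oracle, z_noise, z_data, t=0.3, cond=z_data)
```

With the oracle field, `euler_sample` lands on `z_data` up to floating-point error and the loss is numerically zero. A trained network only approximates this field, so the number of integration steps becomes a quality and speed trade-off.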

Hybrid 3D VAE

[Figure: Hybrid 3D VAE architecture]

Latent DiT with Flow Matching

[Figure: Latent DiT and conditioning architecture]

① Global geometric prior

Frozen MoGe2 features serve as image tokens fed into the DiT via cross-attention, providing rich monocular geometric priors without fine-tuning the foundation model.

② Visible-latent inpainting

The MoGe2 visible pointmap is converted to a visible-only TUDF and encoded into a latent $\mathbf{z}_{\text{vis}}$, then added into the noisy latent through a zero-initialized projection $\tilde{\mathbf{z}}_t = \mathbf{z}_t + \mathrm{MLP}_0(\mathbf{z}_{\text{vis}})$, anchoring generation to the observed geometry.
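The effect of the zero-initialized projection can be illustrated with a small sketch. It substitutes a single linear layer for $\mathrm{MLP}_0$ (a simplification; the same property holds for an MLP whose last layer starts at zero), and the class and function names are hypothetical.

```python
class ZeroInitLinear:
    """Linear map y = W x + b whose weights and bias all start at zero."""

    def __init__(self, dim_in, dim_out):
        self.weight = [[0.0] * dim_in for _ in range(dim_out)]
        self.bias = [0.0] * dim_out

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

def inject_visible(z_t, z_vis, proj):
    """Per-token additive conditioning: z_t + proj(z_vis)."""
    return [[a + b for a, b in zip(tok, proj(vis))]
            for tok, vis in zip(z_t, z_vis)]

# Two latent tokens with three channels each (toy sizes for the demo).
z_t = [[0.1, -0.2, 0.3], [1.0, 0.0, -1.0]]
z_vis = [[0.5, 0.5, 0.5], [2.0, -1.0, 0.0]]
proj = ZeroInitLinear(3, 3)

# At initialization the projection outputs zeros, so the input is unchanged.
z_tilde_init = inject_visible(z_t, z_vis, proj)

# Pretend training moved one weight: only the affected channel shifts.
proj.weight[0][0] = 1.0
z_tilde_trained = inject_visible(z_t, z_vis, proj)
```

Starting from an identity mapping means the conditioning branch cannot disturb the noisy latent at the first training step; its influence grows only as gradients move the projection away from zero.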

Training Details

Datasets

We train on synthetic 3D-FRONT (96k image–TUDF pairs) and real-world ScanNet++ (46k samples), following the data splits of LaRI and NOVA3R.

Training stages

Both stages use AdamW with mixed-precision training on two A6000 / L40S GPUs.

Qualitative Results

Comparison with amodal baselines

Qualitative comparison against MoGe2, LaRI, and NOVA3R on SCRREAM and NRGB-D
Figure 4 (paper). VolFill synthesizes sharp, high-fidelity amodal geometry, whereas LaRI produces layered artifacts (red circles) and holes, and NOVA3R yields noisy, unstructured point scatters (green circles).

Comparison with pixel-aligned approaches

Qualitative comparison with pixel-aligned approaches MoGe2 and DepthAnything3
Figure 8 (paper). Pixel-aligned methods (MoGe2, DepthAnything3) recover only visible surfaces and leave significant holes behind occluders, whereas VolFill reconstructs complete, physically plausible amodal geometry that closely matches the ground truth.

Point cloud and mesh quality

Point cloud and mesh comparison against LaRI and NOVA3R
Figure 9 (paper). Side-by-side point clouds and meshes for each method. The structured TUDF grid lets VolFill extract clean, topologically consistent meshes directly via isosurfacing, while LaRI and NOVA3R rely on Poisson reconstruction over unstructured points and produce fragmented, artifact-heavy meshes.

Quantitative Results

We report Chamfer Distance (CD ↓, $\times 10^2$), one-way coverage $\text{APD}_\gamma$ (↑) on the visible and occluded subsets, and F-score $\text{FS}_\gamma$ (↑) on the complete geometry. Thresholds are $\gamma \in \{0.10, 0.02\}$ on SCRREAM and $\gamma \in \{0.10, 0.05\}$ on NRGB-D. Fréchet Point Cloud Distance (FPD ↓) is computed with a pretrained Uni3D feature extractor. Predictions and ground truth are aligned with a similarity transform (scale + rotation + translation) before scoring.
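For readers unfamiliar with these metrics, the sketch below computes a symmetric Chamfer distance and a thresholded F-score between two point sets. It follows common textbook definitions. The exact conventions used for the tables (squared versus unsquared distances, averaging, point sampling, and the scaling factor) are not given on this page, so treat those details as assumptions. APD is omitted because its precise definition is not stated here.

```python
import math

def nearest_distances(src, dst):
    """For each point in src, Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean of the two one-way mean distances."""
    d_pg = nearest_distances(pred, gt)
    d_gp = nearest_distances(gt, pred)
    return 0.5 * (sum(d_pg) / len(d_pg) + sum(d_gp) / len(d_gp))

def f_score(pred, gt, gamma):
    """Harmonic mean of precision and recall at distance threshold gamma."""
    precision = sum(d < gamma for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d < gamma for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Toy point sets: three predictions near the ground truth, one far off.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.05), (0.0, 1.0, 0.05), (1.0, 1.0, 0.5)]

cd = chamfer_distance(pred, gt)            # about 0.1525
fs_010 = f_score(pred, gt, gamma=0.10)     # 0.75
fs_002 = f_score(pred, gt, gamma=0.02)     # 0.25
```

The toy example shows why a strict threshold is informative: three of the four predicted points count as correct at $\gamma = 0.10$, but only one survives at $\gamma = 0.02$.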

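Table 1: 3D Reconstruction on SCRREAM

| Group | Method | Visible CD ↓ | Visible APD$_{0.1}$ ↑ | Visible APD$_{0.02}$ ↑ | Occluded CD ↓ | Occluded APD$_{0.1}$ ↑ | Occluded APD$_{0.02}$ ↑ | Complete CD ↓ | Complete FS$_{0.1}$ ↑ | Complete FS$_{0.02}$ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Object-level generative models† | TripoSG† | 13.75 | 44.74 | 10.08 | 10.73 | 55.34 | 13.37 | 11.10 | 53.56 | 13.30 |
| Object-level generative models† | TRELLIS† | 11.63 | 54.05 | 15.00 | 9.65 | 60.70 | 17.52 | 10.72 | 56.69 | 15.59 |
| Visible-surface baselines | VGGT | 3.04 | 96.98 | 45.62 | 17.40 | 44.60 | 10.04 | 6.77 | 80.45 | 36.77 |
| Visible-surface baselines | DepthAnything3 | 2.37 | 98.98 | 54.34 | 16.16 | 48.31 | 10.76 | 6.16 | 82.86 | 43.10 |
| Visible-surface baselines | DepthPro | 3.85 | 95.10 | 33.30 | 15.31 | 48.66 | 9.71 | 6.90 | 80.36 | 28.44 |
| Visible-surface baselines | MoGe2 | 2.28 | 99.38 | 57.42 | 15.29 | 50.09 | 10.30 | 5.74 | 83.96 | 45.84 |
| Scene-level amodal reconstruction | LaRI | 3.43 | 95.48 | 41.61 | 5.23 | 85.25 | 29.31 | 5.19 | 85.43 | 30.85 |
| Scene-level amodal reconstruction | NOVA3R | 3.20 | 96.77 | 43.75 | 3.56 | 94.86 | 41.21 | 3.43 | 95.26 | 45.12 |
| Scene-level amodal reconstruction | VolFill (ours) | 2.84 | 96.10 | 56.73 | 3.46 | 92.70 | 55.53 | 3.03 | 95.03 | 54.83 |

† Trained on object-centric datasets; not designed for full scene reconstruction. All numbers are scaled by $\times 10^2$.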

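Table 2: 3D Reconstruction on NRGB-D

| Group | Method | Visible CD ↓ | Visible APD$_{0.1}$ ↑ | Visible APD$_{0.05}$ ↑ | Occluded CD ↓ | Occluded APD$_{0.1}$ ↑ | Occluded APD$_{0.05}$ ↑ | Complete CD ↓ | Complete FS$_{0.1}$ ↑ | Complete FS$_{0.05}$ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Visible-surface baselines | VGGT | 3.25 | 96.51 | 81.63 | 21.38 | 42.15 | 26.35 | 9.44 | 73.85 | 58.91 |
| Visible-surface baselines | DepthAnything3 | 2.69 | 98.68 | 89.24 | 21.65 | 43.34 | 28.15 | 8.88 | 74.87 | 63.52 |
| Visible-surface baselines | DepthPro | 4.38 | 92.56 | 67.41 | 22.19 | 41.69 | 24.64 | 10.03 | 71.30 | 51.60 |
| Visible-surface baselines | MoGe2 | 2.62 | 98.46 | 89.55 | 21.12 | 43.31 | 28.43 | 9.31 | 75.35 | 61.93 |
| Scene-level amodal reconstruction | LaRI | 4.09 | 93.77 | 73.51 | 13.04 | 64.53 | 45.82 | 9.79 | 68.37 | 43.47 |
| Scene-level amodal reconstruction | NOVA3R | 5.05 | 89.63 | 64.44 | 8.31 | 80.50 | 58.55 | 8.19 | 73.99 | 49.92 |
| Scene-level amodal reconstruction | VolFill (ours) | 4.52 | 90.06 | 69.22 | 6.97 | 81.41 | 62.48 | 6.44 | 84.35 | 63.92 |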

Table 3: Metric Geometry Evaluation

Evaluation on the metric-scale frame (no Sim(3) alignment). Distributional similarity is measured by Fréchet Point Cloud Distance using a pretrained Uni3D feature extractor.

| Method | CD ↓ | FS$_{0.02}$ ↑ | FPD ↓ |
|---|---|---|---|
| LaRI | 11.43 | 11.47 | - |
| NOVA3R | 6.56 | 22.23 | 16.50 |
| VolFill (ours) | 3.88 | 44.90 | 4.86 |

Ablations

Conditioning strategy

Relying only on the visible latent (① in the figure) fails to recover unobserved structure; image-only conditioning with DINOv2 (②) or MoGe2 (③) hallucinates rough layouts but loses fine detail. The dual setup (④), which combines MoGe2 cross-attention with additive injection of the visible latent, yields the sharpest, most complete geometry, in agreement with the numbers in Table 5 below.

Conditioning ablation

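Table 5: Conditioning ablation.

| Config. | $F_{\text{GFM}}$ | $\mathbf{z}_{\text{vis}}$ | CD ↓ | FS$_{0.02}$ ↑ |
|---|---|---|---|---|
| Single | ✗ | Add | 6.80 | 31.72 |
| Single | MoGe2 | ✗ | 3.81 | 46.80 |
| Dual | MoGe2 | Concat | 3.66 | 48.94 |
| Dual | MoGe2 | Add | 3.03 | 54.83 |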

Foundation model

Replacing MoGe2 with the semantic-only DINOv2 backbone sharply degrades reconstruction quality (CD 5.46 vs. 3.81; FS$_{0.02}$ 35.98 vs. 46.80): semantic tokens alone lack the 3D spatial anchors the DiT needs. VGGT improves over DINOv2 but still lags MoGe2, indicating that structural consistency from a geometry-trained backbone is a stronger prior than general semantic reasoning for amodal completion.

Table 6: Foundation model ablation.

| Conditioning | CD ↓ | FS$_{0.02}$ ↑ |
|---|---|---|
| DINOv2 | 5.46 | 35.98 |
| VGGT | 4.32 | 42.30 |
| MoGe2 | 3.81 | 46.80 |

Limitations

BibTeX

@article{ngo2026volfill,
  title   = {VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching},
  author  = {Ngo, Tuan Duc and Gan, Chuang and Kalogerakis, Evangelos},
  journal = {arXiv preprint},
  year    = {2026}
}

References

  1. Wang et al., MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details, NeurIPS 2025.
  2. Lin et al., Depth Anything 3: Recovering the Visual Space from Any Views, ICLR 2026.
  3. Bochkovskii et al., Depth Pro: Sharp Monocular Metric Depth in Less Than a Second, ICLR 2025.
  4. Wang et al., VGGT: Visual Geometry Grounded Transformer, CVPR 2025.
  5. Li et al., LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning, arXiv 2025.
  6. Chen et al., NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction, ICLR 2026.
  7. Li et al., TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models, arXiv 2025.
  8. Xiang et al., Structured 3D Latents for Scalable and Versatile 3D Generation (TRELLIS), CVPR 2025.
  9. Jung et al., SCRREAM: SCan, Register, REnder And Map: A Framework for Annotating Accurate and Dense 3D Indoor Scenes with a Benchmark, NeurIPS 2024.
  10. Azinović et al., Neural RGB-D Surface Reconstruction (NRGB-D), CVPR 2022.
  11. Zhou et al., Uni3D: Exploring Unified 3D Representation at Scale, ICLR 2024.