VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

Amodal 3D Reconstruction from a Single Image

TL;DR VolFill recovers the complete 3D scene geometry from a single RGB image — including occluded surfaces hidden behind the visible geometry.

Abstract

Reconstructing the complete geometry of a scene from a single RGB image remains challenging — especially when inferring hidden structures where visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method utilizes a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene. We condition the generation on geometry foundation models, leveraging rich spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.

Contributions

Volumetric Flow Matching

VolFill is a generative framework that uses volumetric flow matching to recover complete, scene-level 3D geometry — including occluded surfaces — directly from a single RGB image.

Hybrid 3D VAE

A hybrid 3D VAE compresses high-resolution TUDF grids into a compact latent space, enabling efficient yet high-fidelity reconstruction of complex amodal structures.

Dual-Conditioning Strategy

A dual-conditioning strategy leverages geometry foundation models, fusing high-level image tokens with explicit visible geometry to guide robust amodal reasoning.

Task

Single-view amodal 3D scene reconstruction. Given one RGB image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ of an indoor scene, we seek to recover the complete scene geometry $\mathcal{S}$ within the camera frustum, including surfaces behind the visible foreground.

Method

We represent the scene as a Truncated Unsigned Distance Function (TUDF) $V \in \mathbb{R}^{N \times N \times N}$ at resolution $N = 256$, where each voxel stores the distance to the nearest physical surface $\mathcal{S}$ clipped at a maximum value $\tau$: $V(\mathbf{p}) = \min(\mathrm{dist}(\mathbf{p}, \mathcal{S}), \tau)$.
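As a minimal sketch of this definition, the snippet below builds a small TUDF grid by brute force from a sampled surface point set. The function name `tudf_grid`, the unit-cube voxel layout, and the toy values of `n` and `tau` are illustrative assumptions, not the authors' pipeline; at $N = 256$ a spatial index or GPU kernel would be needed instead of this exhaustive search.

```python
import math

def tudf_grid(surface_pts, n, tau):
    """Return an n*n*n nested list V with V[i][j][k] = min(dist(p, S), tau).

    surface_pts: list of (x, y, z) samples of the surface S inside [0, 1]^3.
    Voxel centers sit at ((i + 0.5) / n, (j + 0.5) / n, (k + 0.5) / n).
    Brute force, O(n^3 * |S|): only meant for tiny illustrative grids.
    """
    def center(i):
        return (i + 0.5) / n

    grid = []
    for i in range(n):
        plane = []
        for j in range(n):
            row = []
            for k in range(n):
                p = (center(i), center(j), center(k))
                d = min(math.dist(p, s) for s in surface_pts)
                row.append(min(d, tau))  # truncate at tau
            plane.append(row)
        grid.append(plane)
    return grid

# Toy surface: the plane x = 0.5, sampled on a regular 17 x 17 lattice.
plane_pts = [(0.5, a / 16, b / 16) for a in range(17) for b in range(17)]
V = tudf_grid(plane_pts, n=8, tau=0.2)
```

Because the distance is unsigned, voxels on either side of the plane receive the same value, and every voxel farther than `tau` from the surface saturates at `tau`. That saturation is what makes the grid sparse in practice: only the near-surface band carries information.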

Rather than predicting $V$ by deterministic regression, we formulate amodal reconstruction as estimating the conditional distribution $P(V \mid \mathbf{I})$. Two components realize this: a hybrid 3D VAE that compresses $V$ into a compact latent space, and a latent diffusion transformer trained with flow matching and conditioned on geometric priors from a frozen foundation model.

Hybrid 3D VAE

Encoder $\mathcal{E}$. Compresses the TUDF into a compact latent via sparse 3D convolutions. Because only voxels near surfaces carry meaningful values, the grid is naturally sparse and tolerates aggressive downsampling.
Decoder $\mathcal{D}$. Uses a hybrid dense-to-sparse design: dense convolutions at low resolution, an occupancy head at the middle stage that sparsifies the feature map by predicting which voxels are surface-adjacent, then sparse convolutions to reach full resolution — keeping memory and compute proportional to surface area, not scene volume.
Training objective. $\mathcal{L}_{\text{VAE}} = \mathcal{L}_1(\hat V, V_{\text{gt}}) + \lambda_{\text{bce}} \cdot \mathrm{BCE}(\hat O, O_{\text{gt}}) + \lambda_{\text{dice}} \cdot \mathrm{Dice}(\hat O, O_{\text{gt}}) + \lambda_{\text{kl}} \cdot \mathcal{L}_{\text{KL}}$ — distance regression on active voxels, occupancy supervision, and KL regularization.
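The training objective above can be sketched term by term as follows. This is a plain-Python illustration on flattened voxel lists, not the authors' code: the helper names, the averaging conventions, the use of ground-truth occupancy as the "active" mask, and the default weights `lam_bce`, `lam_dice`, `lam_kl` are all assumptions, since the page does not state them.

```python
import math

def l1_active(v_pred, v_gt, active):
    """Mean absolute TUDF error over active (surface-adjacent) voxels only."""
    errs = [abs(p - g) for p, g, a in zip(v_pred, v_gt, active) if a]
    return sum(errs) / max(len(errs), 1)

def bce(o_pred, o_gt, eps=1e-7):
    """Binary cross-entropy between occupancy probabilities and 0/1 labels."""
    total = 0.0
    for p, g in zip(o_pred, o_gt):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(o_gt)

def dice(o_pred, o_gt, eps=1e-7):
    """Soft Dice loss: 1 - 2 * overlap / (predicted mass + ground-truth mass)."""
    inter = sum(p * g for p, g in zip(o_pred, o_gt))
    return 1.0 - (2.0 * inter + eps) / (sum(o_pred) + sum(o_gt) + eps)

def kl_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)), averaged over latent dimensions."""
    terms = [0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, logvar)]
    return sum(terms) / len(terms)

def vae_loss(v_pred, v_gt, o_pred, o_gt, mu, logvar,
             lam_bce=1.0, lam_dice=1.0, lam_kl=1e-4):  # placeholder weights
    """L1 on active voxels + weighted BCE + weighted Dice + weighted KL."""
    return (l1_active(v_pred, v_gt, o_gt)
            + lam_bce * bce(o_pred, o_gt)
            + lam_dice * dice(o_pred, o_gt)
            + lam_kl * kl_standard_normal(mu, logvar))

# Four voxels, two of them surface-adjacent.
v_gt, o_gt = [0.0, 0.2, 0.05, 0.2], [1, 0, 1, 0]
perfect = vae_loss(v_gt, v_gt, [1.0, 0.0, 1.0, 0.0], o_gt, [0.0], [0.0])
worse = vae_loss([0.1, 0.2, 0.1, 0.2], v_gt, [0.6, 0.4, 0.6, 0.4], o_gt, [0.5], [0.0])
```

Pairing BCE with Dice is a common choice when occupied voxels are a small fraction of the grid, because Dice is normalized by the occupied mass rather than by the total voxel count.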

Latent DiT with Flow Matching

Latent DiT and conditioning architecture

① Global geometric prior

Frozen MoGe2 features serve as image tokens fed into the DiT via cross-attention, providing rich monocular geometric priors without fine-tuning the foundation model.

② Visible-latent inpainting

The MoGe2 visible pointmap is converted to a visible-only TUDF and encoded into a latent $\mathbf{z}_{\text{vis}}$, then added into the noisy latent through a zero-initialized projection $\tilde{\mathbf{z}}_t = \mathbf{z}_t + \mathrm{MLP}_0(\mathbf{z}_{\text{vis}})$, anchoring generation to the observed geometry.
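The effect of the zero-initialized projection can be illustrated with a small sketch. It assumes a single linear layer in place of $\mathrm{MLP}_0$ and tiny hand-written token vectors; the class `ZeroInitLinear` and the function `inject_visible` are hypothetical names introduced here for illustration.

```python
class ZeroInitLinear:
    """Linear map y = W x + b whose weights and bias start at exactly zero."""

    def __init__(self, dim_in, dim_out):
        self.weight = [[0.0] * dim_in for _ in range(dim_out)]
        self.bias = [0.0] * dim_out

    def __call__(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

def inject_visible(z_t, z_vis, proj):
    """Per-token additive conditioning: z_tilde = z_t + proj(z_vis)."""
    return [[a + d for a, d in zip(tok, proj(vis))]
            for tok, vis in zip(z_t, z_vis)]

proj = ZeroInitLinear(dim_in=3, dim_out=2)
z_t = [[0.3, -1.2], [0.9, 0.1]]               # two noisy latent tokens
z_vis = [[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]]   # matching visible-latent tokens

# At initialization the projection outputs zeros, so the input is unchanged.
z_tilde_init = inject_visible(z_t, z_vis, proj)

# Stand-in for a gradient update: once a weight moves, z_vis starts to matter.
proj.weight[0][0] = 0.1
z_tilde_trained = inject_visible(z_t, z_vis, proj)
```

Starting from an exact identity means the conditioning path cannot perturb the denoiser's inputs at the beginning of training; its influence grows only as the projection weights are learned.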

Backbone. A latent Diffusion Transformer $\Phi$ that operates in the VAE bottleneck space, with the two conditioning paths above.
Flow matching. A rectified-flow objective with a linear noise-to-data path $\mathbf{z}_t = (1 - t)\,\boldsymbol{\epsilon} + t\,\mathbf{z}_1$ and velocity target $\mathbf{u}^\star = \mathbf{z}_1 - \boldsymbol{\epsilon}$; the DiT is trained to regress the velocity.
Inference. Sample $\mathbf{z}_0 \sim \mathcal{N}(0, I)$ and integrate the learned ODE with Euler steps + classifier-free guidance to a clean latent $\mathbf{z}_1$, then decode through $\mathcal{D}$ to obtain the final TUDF $V_{\text{pred}}$.
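The flow-matching path, velocity target, and Euler sampler described above can be sketched on a toy problem. Everything here is illustrative: `oracle_velocity` stands in for the trained DiT $\Phi$ by returning the exact velocity for a data distribution concentrated on a single point, and the guidance rule $v = v_{\text{uncond}} + w\,(v_{\text{cond}} - v_{\text{uncond}})$ is the common classifier-free guidance form, which the page does not spell out.

```python
import random

def interpolate(eps, z1, t):
    """Linear noise-to-data path: z_t = (1 - t) * eps + t * z1."""
    return [(1.0 - t) * e + t * z for e, z in zip(eps, z1)]

def velocity_target(eps, z1):
    """Rectified-flow regression target u* = z1 - eps (constant along the path)."""
    return [z - e for e, z in zip(eps, z1)]

def oracle_velocity(z, t, target):
    """Stand-in for the trained network: exact velocity for point-mass data."""
    return [(g - zi) / (1.0 - t) for zi, g in zip(z, target)]

def sample(z0, v_cond, v_uncond, steps=50, guidance=1.0):
    """Euler integration from t = 0 to t = 1 with classifier-free guidance."""
    z, dt = list(z0), 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the oracle never divides by 0
        vc, vu = v_cond(z, t), v_uncond(z, t)
        v = [u + guidance * (c - u) for c, u in zip(vc, vu)]  # CFG combination
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(2)]             # z_0 ~ N(0, I)
cond = lambda z, t: oracle_velocity(z, t, [1.0, -2.0])   # conditional branch
uncond = lambda z, t: oracle_velocity(z, t, [0.0, 0.0])  # unconditional branch

z1_hat = sample(z0, cond, uncond, steps=50, guidance=1.0)
z1_guided = sample(z0, cond, uncond, steps=50, guidance=2.0)
```

With `guidance=1.0` the sampler follows the conditional field alone and lands on the conditional target. With `guidance=2.0` it extrapolates away from the unconditional target, which in this toy setup moves the endpoint to twice the conditional target. In the real model each Euler step needs one network evaluation per branch, which is the usual cost of classifier-free guidance.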

Training Details

Datasets

We train on synthetic 3D-FRONT (96k image–TUDF pairs) and real-world ScanNet++ (46k samples), following the data splits of LaRI and NOVA3R.

Training stages

Both stages use AdamW with mixed-precision training on two A6000 / L40S GPUs:

Stage 1 — Hybrid 3D VAE. Compresses the TUDF grids into the compact latent space.
Stage 2 — Latent DiT. Generates the latent via flow matching, with classifier-free guidance at inference.

Qualitative Results

Comparison with amodal baselines

**Figure 4 (paper).** Qualitative comparison against MoGe2, LaRI, and NOVA3R on SCRREAM and NRGB-D. VolFill synthesizes sharp, high-fidelity amodal geometry, whereas LaRI produces layered artifacts (red circles) and holes, and NOVA3R yields noisy, unstructured point scatters (green circles).

Comparison with pixel-aligned approaches

Point cloud and mesh quality

**Figure 9 (paper).** Point cloud and mesh comparison against LaRI and NOVA3R, shown side by side for each method. The structured TUDF grid lets VolFill extract clean, topologically consistent meshes directly via isosurfacing, while LaRI and NOVA3R rely on Poisson reconstruction over unstructured points and produce fragmented, artifact-heavy meshes.

Quantitative Results

We report Chamfer Distance (CD ↓, ×10²), one-way coverage APD_γ (↑) on the visible and occluded subsets, and F-score FS_γ (↑) on the complete geometry. Thresholds γ ∈ {0.10, 0.02} are used on SCRREAM and γ ∈ {0.10, 0.05} on NRGB-D. Fréchet Point Cloud Distance (FPD ↓) is computed with a pretrained Uni3D feature extractor. Unless noted otherwise, predictions and GT are aligned with a similarity transform (Sim(3): scale + rotation + translation) before scoring. Best and second-best are highlighted.
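For readers unfamiliar with these metrics, the sketch below computes a Chamfer distance and a thresholded F-score on two small point sets by brute force. The conventions are assumptions: Chamfer distance is taken as the average of the two one-way mean nearest-neighbour distances (unsquared), and the F-score as the harmonic mean of precision and recall at threshold `gamma`. The paper's exact conventions, and the precise definition of APD_γ, are not given on this page, so APD is not implemented here.

```python
import math

def nn_dists(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: average of the two one-way mean distances."""
    d_pg, d_gp = nn_dists(pred, gt), nn_dists(gt, pred)
    return 0.5 * (sum(d_pg) / len(d_pg) + sum(d_gp) / len(d_gp))

def f_score(pred, gt, gamma):
    """F-score at threshold gamma: harmonic mean of precision and recall."""
    precision = sum(d < gamma for d in nn_dists(pred, gt)) / len(pred)
    recall = sum(d < gamma for d in nn_dists(gt, pred)) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Ground truth: four corners of a unit square. The prediction is slightly
# offset in z and misses one corner, mimicking an unrecovered occluded region.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.01), (0.0, 1.0, 0.01)]

cd = chamfer_distance(pred, gt)
fs = f_score(pred, gt, gamma=0.02)
```

In this example every predicted point is accurate (precision 1) but one ground-truth point is uncovered (recall 3/4), so the F-score drops to 6/7 even though the visible part is nearly perfect. This is the same effect that separates the visible-surface baselines from the amodal methods on the occluded and complete columns.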

Table 1 — 3D Reconstruction on SCRREAM

Method	Visible CD ↓	Visible APD_0.1 ↑	Visible APD_0.02 ↑	Occluded CD ↓	Occluded APD_0.1 ↑	Occluded APD_0.02 ↑	Complete CD ↓	Complete FS_0.1 ↑	Complete FS_0.02 ↑
Object-level generative models^†
TripoSG^†	13.75	44.74	10.08	10.73	55.34	13.37	11.10	53.56	13.30
TRELLIS^†	11.63	54.05	15.00	9.65	60.70	17.52	10.72	56.69	15.59
Visible-surface baselines
VGGT	3.04	96.98	45.62	17.40	44.60	10.04	6.77	80.45	36.77
DepthAnything3	2.37	98.98	54.34	16.16	48.31	10.76	6.16	82.86	43.10
DepthPro	3.85	95.10	33.30	15.31	48.66	9.71	6.90	80.36	28.44
MoGe2	2.28	99.38	57.42	15.29	50.09	10.30	5.74	83.96	45.84
Scene-level amodal reconstruction
LaRI	3.43	95.48	41.61	5.23	85.25	29.31	5.19	85.43	30.85
NOVA3R	3.20	96.77	43.75	3.56	94.86	41.21	3.43	95.26	45.12
VolFill (ours)	2.84	96.10	56.73	3.46	92.70	55.53	3.03	95.03	54.83

^† Trained on object-centric datasets; not designed for full scene reconstruction. All numbers scaled by ×10².

Table 2 — 3D Reconstruction on NRGB-D

Method	Visible CD ↓	Visible APD_0.1 ↑	Visible APD_0.05 ↑	Occluded CD ↓	Occluded APD_0.1 ↑	Occluded APD_0.05 ↑	Complete CD ↓	Complete FS_0.1 ↑	Complete FS_0.05 ↑
Visible-surface baselines
VGGT	3.25	96.51	81.63	21.38	42.15	26.35	9.44	73.85	58.91
DepthAnything3	2.69	98.68	89.24	21.65	43.34	28.15	8.88	74.87	63.52
DepthPro	4.38	92.56	67.41	22.19	41.69	24.64	10.03	71.30	51.60
MoGe2	2.62	98.46	89.55	21.12	43.31	28.43	9.31	75.35	61.93
Scene-level amodal reconstruction
LaRI	4.09	93.77	73.51	13.04	64.53	45.82	9.79	68.37	43.47
NOVA3R	5.05	89.63	64.44	8.31	80.50	58.55	8.19	73.99	49.92
VolFill (ours)	4.52	90.06	69.22	6.97	81.41	62.48	6.44	84.35	63.92

Table 3 — Metric Geometry Evaluation

Evaluation on the metric-scale frame (no Sim(3) alignment). Distributional similarity is measured by Fréchet Point Cloud Distance using a pretrained Uni3D feature extractor.

Method	CD ↓	FS_0.02 ↑	FPD ↓
LaRI	11.43	11.47	—
NOVA3R	6.56	22.23	16.50
VolFill (ours)	3.88	44.90	4.86

Ablations

Conditioning strategy

Relying only on the visible latent (① in the figure) fails to recover unobserved structure; image-only conditioning with DINOv2 (②) or MoGe2 (③) hallucinates rough layouts but loses fine detail. The dual setup (④) — MoGe2 cross-attention plus additive injection of the visible latent — yields the sharpest, most complete geometry, with quantitative agreement in Table 5 below.

Table 5 — Conditioning ablation.

Config.	F_GFM	z_vis	CD ↓	FS_0.02 ↑
Single	✗	Add	6.80	31.72
Single	MoGe2	✗	3.81	46.80
Dual	MoGe2	Concat	3.66	48.94
Dual	MoGe2	Add	3.03	54.83

Foundation model

Replacing MoGe2 with the semantic-only DINOv2 backbone degrades reconstruction sharply (CD 3.81 → 5.46, FS_0.02 46.80 → 35.98): semantic tokens alone lack the 3D spatial anchors the DiT needs. VGGT improves over DINOv2 but still lags MoGe2. Together, these results indicate that structural consistency from a geometry-trained backbone is a stronger prior than general semantic reasoning for amodal completion.

Table 6 — Foundation model ablation.

Conditioning	CD ↓	FS_0.02 ↑
DINOv2	5.46	35.98
VGGT	4.32	42.30
MoGe2	3.81	46.80

Limitations

Residual holes in unobserved regions. While VolFill produces sharper geometry with fewer artifacts than NOVA3R and LaRI, small holes can occasionally persist behind heavily occluded surfaces — synthesizing geometry where no visual evidence exists remains fundamentally underdetermined. Scaling model capacity and data diversity is a natural path to closing the remaining topological gaps.
Iterative inference cost. As a generative framework, VolFill runs a 50-step Euler ODE solver with classifier-free guidance, which takes ~1.4 s on an RTX 4090 — still slower than single-pass regression baselines. Step distillation and consistency-style training are promising directions for faster sampling.

BibTeX

@article{ngo2026volfill,
  title   = {VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching},
  author  = {Ngo, Tuan Duc and Gan, Chuang and Kalogerakis, Evangelos},
  journal = {arXiv preprint},
  year    = {2026}
}

References

Wang et al., MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details, NeurIPS 2025.
Lin et al., Depth Anything 3: Recovering the Visual Space from Any Views, ICLR 2026.
Bochkovskii et al., Depth Pro: Sharp Monocular Metric Depth in Less Than a Second, ICLR 2025.
Wang et al., VGGT: Visual Geometry Grounded Transformer, CVPR 2025.
Li et al., LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning, arXiv 2025.
Chen et al., NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction, ICLR 2026.
Li et al., TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models, arXiv 2025.
Xiang et al., Structured 3D Latents for Scalable and Versatile 3D Generation (TRELLIS), CVPR 2025.
Jung et al., SCRREAM: SCan, Register, REnder And Map — A Framework for Annotating Accurate and Dense 3D Indoor Scenes with a Benchmark, NeurIPS 2024.
Azinović et al., Neural RGB-D Surface Reconstruction (NRGB-D), CVPR 2022.
Zhou et al., Uni3D: Exploring Unified 3D Representation at Scale, ICLR 2024.