GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

¹Technical University of Munich, ²Computing Systems Lab, Huawei Technologies, Switzerland

Given casually captured smartphone videos, GenRecon produces complete, PBR-ready mesh reconstructions of indoor environments.

Abstract

We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially-localized, overlapping chunks that together tile the scene, scaling generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models -- we use Trellis.2 as an example -- which we generalize to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model, independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view consistent generated geometry. This enables lifting the strong object-level prior of Trellis.2 to multi-view, scene-scale generation, producing faithful, editable PBR mesh reconstructions of indoor environments. As a result, we obtain high-fidelity results that outperform cutting-edge reconstruction methods by 16%.

GenRecon fuses a 3D generative prior into multi-view reconstruction to reconstruct complete, editable 3D scene meshes from a sparse set of RGB images.

Method Overview

GenRecon Pipeline

Spatially Grounded 3D Conditioning Pathway. Standard 3D generative models take a single image and lack explicit pose control, making them unsuitable for multi-view scene reconstruction. We address this by lifting features from each input view into a shared 3D voxel grid via camera projection and aggregating them into a unified, view-order-invariant representation. Every conditioning signal is therefore tied to a precise 3D location and directly guides pose-consistent generation.
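To make the lifting step concrete, the following is a minimal sketch, not the GenRecon implementation. It assumes a pinhole camera with world-to-camera pose (R, t), nearest-neighbor feature lookup, and a plain mean over the views that see each voxel; none of these choices is specified above. The names project and lift_features and the layout of the view dictionaries are hypothetical. Occlusion is ignored.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project(point: Vec3, R: Sequence[Sequence[float]], t: Vec3,
            fx: float, fy: float, cx: float, cy: float
            ) -> Optional[Tuple[float, float, float]]:
    """Pinhole projection of a world point; returns (u, v, depth) or None."""
    # World -> camera coordinates: x_cam = R @ x_world + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 1e-8:
        return None  # point lies behind the camera
    return fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy, xc[2]


def lift_features(voxel_centers: Sequence[Vec3],
                  views: Sequence[Dict]) -> List[Optional[List[float]]]:
    """Average, per voxel, the image features of all views that see it.

    Each view is a dict with keys "R", "t", "intrinsics" (fx, fy, cx, cy)
    and "features" (a [row][col][channel] nested list). This layout is an
    assumption made for illustration. Voxels seen by no view map to None.
    """
    grid: List[Optional[List[float]]] = []
    for p in voxel_centers:
        total: Optional[List[float]] = None
        count = 0
        for view in views:
            hit = project(p, view["R"], view["t"], *view["intrinsics"])
            if hit is None:
                continue
            u, v, _depth = hit
            col, row = int(round(u)), int(round(v))  # nearest-neighbor lookup
            fmap = view["features"]
            if 0 <= row < len(fmap) and 0 <= col < len(fmap[0]):
                f = fmap[row][col]
                total = list(f) if total is None else [a + b for a, b in zip(total, f)]
                count += 1
        # The mean is a symmetric function of the views, so the result
        # does not depend on the order in which views are supplied.
        grid.append(None if total is None else [x / count for x in total])
    return grid
```

Because the aggregation is a symmetric function of the per-view samples, permuting the input views leaves the grid unchanged (up to floating-point rounding), which is the view-order invariance described above. A learned or visibility-weighted aggregation could replace the mean without changing this property.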

Scene Reconstruction Pipeline. At inference time, we first recover camera poses from the input images and partition the scene into overlapping 3D chunks. We then construct a global 3D conditioning grid from all views and generate all chunks jointly in a shared latent space, enforcing consistency across chunk boundaries throughout the generative process. The resulting fused latent is finally decoded into a complete, high-quality scene mesh with PBR materials.
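The chunking step can be illustrated with a small sketch, under stated assumptions: chunks are axis-aligned cubes of a fixed voxel size with a fixed overlap, and overlapping predictions are reconciled by a plain average. The averaging is only a stand-in for the joint generation in a shared latent space described above, whose exact mechanism is not given here. The names chunk_starts, partition, and fuse_1d are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def chunk_starts(extent: int, size: int, overlap: int) -> List[int]:
    """Start indices of overlapping chunks covering [0, extent) on one axis."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if extent <= size:
        return [0]  # a single chunk already covers the axis
    stride = size - overlap
    starts = list(range(0, extent - size, stride))
    starts.append(extent - size)  # last chunk sits flush with the boundary
    return starts


def partition(grid_shape: Sequence[int], size: int,
              overlap: int) -> List[Tuple[int, ...]]:
    """Origins of all chunks that tile a voxel grid of the given shape."""
    axes = [chunk_starts(n, size, overlap) for n in grid_shape]
    return list(product(*axes))


def fuse_1d(chunks: Sequence[Sequence[float]], starts: Sequence[int],
            extent: int) -> List[float]:
    """Average overlapping per-chunk values into one global signal (1D case)."""
    total = [0.0] * extent
    weight = [0] * extent
    for values, s in zip(chunks, starts):
        for i, v in enumerate(values):
            if s + i < extent:  # clip chunks that extend past the grid
                total[s + i] += v
                weight[s + i] += 1
    # Cells covered by no chunk keep the value 0.0.
    return [t / w if w else 0.0 for t, w in zip(total, weight)]
```

For example, a 6-voxel axis with chunk size 4 and overlap 2 yields chunk starts 0 and 2, and fusing the constant chunks [1, 1, 1, 1] and [3, 3, 3, 3] gives [1, 1, 2, 2, 3, 3]: the overlap region takes the average, so neighboring chunks agree at their shared boundary.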

Results

Relighting

Given a sparse set of RGB images, GenRecon reconstructs a high-fidelity PBR mesh enabling realistic relighting and editing in standard rendering pipelines.

Comparisons

Methods compared against the scan: Ours, 2DGS, DA3, FineRecon, MonoSDF, Murre.

While the baselines produce noisy or over-smoothed surfaces in challenging areas and remain incomplete in occluded and unobserved regions, our approach yields complete, high-fidelity reconstructions from 8 input images.

Related Links

Our method builds on the object-level 3D generation model Trellis.2.

The concurrent work Pixal3D adopts a closely related conditioning strategy for 3D object-level generation.

BibTeX

@article{schmid2026genreconbridginggenerativepriors,
  author={Schmid, Katharina and von L{\"u}tzow, Nicolas and Hladk\'y, Jozef and Dai, Angela and Nie{\ss}ner, Matthias},
  title={GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction}, 
  year={2026},
  eprint={2605.23888},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.23888},
}