ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing

Abstract

In the field of 3D content generation, single image scene reconstruction methods still struggle to simultaneously ensure the quality of individual assets and the coherence of the overall scene in complex environments, while texture editing techniques often fail to maintain both local continuity and multi-view consistency. In this paper, we propose a novel system ZeroScene, which leverages the prior knowledge of large vision models to accomplish both single image-to-3D scene reconstruction and texture editing in a zero-shot manner. ZeroScene extracts object-level 2D segmentation and depth information from the input image to infer spatial relationships within the scene. It then jointly optimizes a 3D point-cloud loss and a 2D projection loss to update object poses for precise scene alignment, ultimately constructing a coherent and complete 3D scene that encompasses both foreground and background. Moreover, ZeroScene supports texture editing of objects in the scene. By imposing constraints on the diffusion model and introducing a mask-guided progressive image generation strategy, we effectively maintain texture consistency across multiple viewpoints and further enhance the realism of rendered results through Physically Based Rendering (PBR) material estimation. Experimental results demonstrate that our framework not only ensures the geometric and appearance accuracy of generated assets, but also faithfully reconstructs scene layouts and produces highly detailed textures that closely align with text prompts. Leveraging generative artificial intelligence, ZeroScene can transform 2D images into diverse 3D worlds in a variety of styles, showing broad application potential in virtual content creation, such as digital twins and immersive game production, while also effectively supporting "real-to-sim" transfer in robotics through the generation of highly realistic and trainable simulation environments.
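To make the pose-alignment step above concrete, the following minimal PyTorch sketch shows one way a joint 3D point-cloud loss and a 2D projection loss could be combined to refine an object's pose. The similarity-transform parameterization, the helper names (chamfer_3d, project, pose_loss), the per-point 2D target pixels, and the loss weighting are illustrative assumptions, not the exact formulation used in ZeroScene.

```python
# Minimal sketch (not the authors' code) of jointly optimizing a 3D point-cloud
# alignment term and a 2D reprojection term for one object's pose.
import torch


def chamfer_3d(src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between an (N,3) and an (M,3) point set."""
    d = torch.cdist(src, tgt)                       # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()


def project(points: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Pinhole projection of (N,3) camera-space points with intrinsics K (3,3)."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)


def pose_loss(obj_pts, scene_pts, obj_px, K, scale, R, t, w2d=0.1):
    """3D alignment loss plus 2D reprojection loss for a single object.

    obj_pts:   (N,3) points sampled from the generated asset (object frame)
    scene_pts: (M,3) points lifted from the input image's depth for this object
    obj_px:    (N,2) pixel locations the object's points should project to
    scale, R, t: similarity transform being optimized (scalar, (3,3), (3,))
    """
    aligned = scale * obj_pts @ R.T + t             # object frame -> camera frame
    loss_3d = chamfer_3d(aligned, scene_pts)        # agree with the depth point cloud
    loss_2d = (project(aligned, K) - obj_px).norm(dim=1).mean()  # agree with the image
    return loss_3d + w2d * loss_2d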

3D Scene Generation

Overview of 3D Scene Generation. We decouple the foreground and background of a given image. The assembly of foreground objects is achieved through three steps: instance segmentation and generation, scene point cloud extraction, and layout optimization. For the background environment, we fit planes to the colored scene point cloud. Finally, the foreground and background are integrated to construct a complete 3D scene that is multi-view consistent and spatially coherent.
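As a rough illustration of the background plane-fitting step, the self-contained NumPy sketch below recovers a dominant plane from a 3D point cloud with a plain RANSAC loop; the thresholds, iteration count, and the idea of coloring the fitted plane from inlier points are assumptions rather than the exact procedure used here.

```python
# Illustrative RANSAC plane fit (an assumption, not the paper's implementation).
import numpy as np


def fit_plane_ransac(points: np.ndarray, n_iters: int = 1000,
                     threshold: float = 0.02, rng_seed: int = 0):
    """Return (normal, d, inlier_mask) for the plane n.x + d = 0 with most inliers.

    points: (N, 3) array of 3D positions; per-point colors can be carried
    alongside and averaged over the inliers to texture the fitted plane.
    """
    rng = np.random.default_rng(rng_seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = (np.array([0.0, 0.0, 1.0]), 0.0)
    for _ in range(n_iters):
        # Sample 3 points and form a candidate plane from them.
        idx = rng.choice(len(points), size=3, replace=False)
        p0, p1, p2 = points[idx]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-8:              # degenerate (nearly collinear) sample
            continue
        n = n / norm
        d = -np.dot(n, p0)
        # Keep the candidate with the most points within `threshold` of the plane.
        inliers = np.abs(points @ n + d) < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane[0], best_plane[1], best_inliers
```

Multiple background planes (for example a floor and walls) could be recovered by removing the inliers of each fitted plane and re-running the same loop on the remaining points.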

Qualitative comparisons between ZeroScene and state-of-the-art single-image scene generation methods

More scenes (including backgrounds) generated by ZeroScene

Texture Editing

Overview of Texture Editing. We utilize generated images for texture synthesis to enable editing. Given a mesh, we render its geometry-aware conditions, which are then injected into a diffusion model along with a text prompt. After obtaining a single-view image aligned with the geometric structure, a mask-guided progressive image generation strategy is employed to synthesize a sequence of RGB images with multi-view consistency. The resulting image set is preprocessed with lighting elimination and super-resolution, after which the texture is synthesized via a back-projection module. Finally, PBR material estimation is incorporated to enhance rendering realism.
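To illustrate the mask-guided progressive generation strategy, the sketch below completes a sequence of views by inpainting only the regions that each new viewpoint exposes. It substitutes an off-the-shelf Stable Diffusion inpainting pipeline (stabilityai/stable-diffusion-2-inpainting) for the conditioned diffusion model described above, and it assumes the partial renders, per-view masks, and the back-projection step are supplied by the surrounding system.

```python
# Hedged sketch of a mask-guided progressive view-completion loop; not the
# ZeroScene implementation, just the masking idea with a generic inpainter.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline


def progressive_views(prompt: str,
                      partial_renders: list[Image.Image],
                      new_region_masks: list[Image.Image]) -> list[Image.Image]:
    """Complete each view by inpainting only the not-yet-textured regions.

    partial_renders[i]:  the mesh rendered from view i with the texture
                         accumulated from views 0..i-1 (untextured areas arbitrary).
    new_region_masks[i]: white where view i exposes untextured surface.
    """
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    outputs = []
    for render, mask in zip(partial_renders, new_region_masks):
        # Only the masked pixels are regenerated, so texture already
        # back-projected from earlier views is preserved in this view.
        image = pipe(prompt=prompt, image=render, mask_image=mask).images[0]
        outputs.append(image)
        # (In the full system, `image` would be back-projected into the texture
        # before rendering the next view's partial image.)
    return outputs
```

Because only the masked pixels are regenerated in each view, content already committed to the texture from earlier views is left untouched, which is what keeps the synthesized image sequence mutually consistent.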

Qualitative comparisons between ZeroScene and state-of-the-art text-to-texture generation methods

Additional qualitative comparison results of texture editing