A single image becomes a layered 3D world, across objects, scenes, and dynamic content.
Object
From a single object photo, we predict every layer of the object's geometry, from the visible surface to the occluded surfaces behind it.
Scene
The same predictor scales to full indoor and outdoor scenes: multi-room interiors, facades, and gardens each get a pixel-aligned layered point cloud.
Dynamic
For short video clips we predict per-frame layered point clouds that stay 3D-consistent as the subject moves.
Training-free textured mesh
Lift our layered point cloud into a textured mesh by plugging it into off-the-shelf mesh generators — no fine-tuning needed.
3D scene editing
Add, replace, or remove objects in a scene; the layered geometry keeps every edit consistent across novel views.
Geometry-guided video
Pair our predicted geometry with a video diffusion model to render temporally consistent fly-throughs that obey 3D structure.
Abstract
Single-view to 3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned to the input. We introduce World Tracing, a generative pixel-aligned geometry representation that produces 3D points faithfully reproducing the input image, while containing complete geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation as a world-tracing diffusion transformer (WT-DiT) that treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching using a mixed noise schedule to balance reconstruction vs. generation capability. As a result, World Tracing demonstrates strong performance on both visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. Furthermore, because it preserves 2D-to-3D correspondence, it directly enables text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
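For readers unfamiliar with flow matching, the generic conditional objective is written out below as a point of reference. This is the standard linear-interpolation form, not a transcription of WT-DiT's exact loss. The symbols are illustrative notation rather than the paper's: $x_0$ is the clean stack of geometry layers, $\epsilon$ is Gaussian noise, $c$ is the input image, $v_\theta$ is the network, and the time convention ($t=0$ clean, $t=1$ pure noise) is an assumption.

```latex
\begin{align}
x_t &= (1 - t)\,x_0 + t\,\epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \sim p(t) \\
\mathcal{L}(\theta) &= \mathbb{E}_{x_0,\,\epsilon,\,t}
  \bigl\| v_\theta(x_t, t, c) - (\epsilon - x_0) \bigr\|_2^2
\end{align}
```

Under this reading, a noise schedule amounts to a choice of the sampling distribution $p(t)$: samples with $t$ near 0 are almost clean, while samples with $t$ near 1 are almost pure noise. How the mixed schedule is actually defined is not stated on this page.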
Interactive examples
World Tracing produces pixel-aligned layered geometry from a single image or a monocular video. For each object we additionally lift our point cloud into a training-free textured mesh. Click any thumbnail to load a different example; click the Rerun button below each viewer to open the full layered recording.
Multilayer depth predictions
World Tracing predicts an ordered stack of six camera-space depth layers per pixel. The animation cycles through the layers using the turbo colormap: Layer 0 is the visible surface, and deeper layers progressively populate occluded geometry behind near objects. Each card pairs the input image (left) with the full multilayer depth animation (right).
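To make the per-pixel layer stack concrete, here is a minimal NumPy sketch that lifts a stack of depth layers into camera-space points and checks the front-to-back ordering. It rests on assumptions that this page does not confirm: a pinhole camera with intrinsics `fx, fy, cx, cy`, depth stored as z-depth, and missing layers marked with NaN. The function names are hypothetical.

```python
import numpy as np

def unproject_layers(depth, fx, fy, cx, cy):
    """Lift a (K, H, W) stack of z-depth layers to (K, H, W, 3) camera-space points.

    Assumes a pinhole camera and NaN for missing layers; both conventions
    are illustrative, not taken from the paper.
    """
    K, H, W = depth.shape
    # Pixel-centre coordinates, shared by every layer of a given pixel.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    x = (u - cx) / fx * depth  # (H, W) broadcasts against (K, H, W)
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def is_front_to_back(depth):
    """True if valid depths never decrease from layer k to layer k + 1."""
    diff = np.diff(depth, axis=0)
    return bool(np.all(diff[~np.isnan(diff)] >= 0))

# Toy example: six layers on a 2x2 image, layer k at depth 1 + 0.5 * k.
depth = np.stack([np.full((2, 2), 1.0 + 0.5 * k) for k in range(6)])
depth[3:, 0, 0] = np.nan  # pixel (0, 0) has only three intersections
points = unproject_layers(depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
```

Because every layer of a pixel lies on the same camera ray, the 2D-to-3D correspondence is kept for occluded points as well as for the visible surface.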
Training-free textured-mesh generation
For each example we lift our layered point cloud into a textured mesh without any fine-tuning. The video below walks through the full pipeline: input image → first depth layer → all layers → textured mesh. Click any thumbnail to swap the object.
3D scene editing
Our layered geometry treats every object as a fully realised 3D asset. We can insert, replace, or remove objects in a scene point cloud, and the edit propagates consistently through novel views.
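The sketch below illustrates why edits made directly in 3D stay consistent: once points are removed or added in the point cloud, every later view is derived from the same edited geometry. It assumes the layered points have been flattened to an `(N, 3)` array and that each point carries an integer object label. The labels and the function names are hypothetical, and how objects are actually segmented is not described on this page.

```python
import numpy as np

def remove_object(points, labels, object_id):
    """Drop every point whose label equals object_id."""
    keep = labels != object_id
    return points[keep], labels[keep]

def insert_object(points, labels, new_points, new_id, translation=(0.0, 0.0, 0.0)):
    """Append new_points, shifted by translation, under the label new_id."""
    placed = np.asarray(new_points, dtype=float) + np.asarray(translation, dtype=float)
    new_labels = np.full(len(placed), new_id, dtype=labels.dtype)
    return np.concatenate([points, placed]), np.concatenate([labels, new_labels])

def replace_object(points, labels, object_id, new_points, new_id,
                   translation=(0.0, 0.0, 0.0)):
    """Remove one object, then insert another in its place."""
    points, labels = remove_object(points, labels, object_id)
    return insert_object(points, labels, new_points, new_id, translation)

# Toy scene: two points of object 0, two of object 1, one of object 2.
scene = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0],
                  [0.0, 0.0, 2.0], [0.1, 0.0, 2.2],
                  [0.5, 0.5, 3.0]])
ids = np.array([0, 0, 1, 1, 2])
chair = np.array([[0.0, 0.0, 0.0], [0.0, 0.1, 0.0]])  # object-local coordinates
edited, edited_ids = replace_object(scene, ids, object_id=1, new_points=chair,
                                    new_id=7, translation=(0.0, 0.0, 2.0))
```

Because occluded layers are part of the point cloud, removing a foreground object in this way leaves whatever geometry was predicted behind it, rather than a hole.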
Geometry-guided novel-view video
Pair our predicted geometry with a video diffusion model (Wan2.2 VACE) to produce temporally consistent novel-view fly-through videos that obey the underlying 3D structure.
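One plausible way to turn predicted geometry into a per-frame guidance signal is to render the point cloud from each camera along the fly-through path. The sketch below produces a z-buffered depth map for a single novel camera. It is an assumption-laden illustration: the page does not say what signal is passed to Wan2.2 VACE, and the pinhole model, the `render_depth` name, and the pose convention (`p_new = R @ p + t`) are all hypothetical.

```python
import numpy as np

def render_depth(points, R, t, fx, fy, cx, cy, height, width):
    """Z-buffered depth map of (N, 3) points seen from a new pinhole camera.

    R (3x3) and t (3,) map source-camera coordinates into the new camera.
    Pixels that no point projects to are left at +inf.
    """
    p = points @ R.T + t
    z = p[:, 2]
    front = z > 0  # also drops NaN points, since NaN > 0 is False
    p, z = p[front], z[front]
    u = np.floor(p[:, 0] / z * fx + cx).astype(int)
    v = np.floor(p[:, 1] / z * fy + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf)
    # Keep the nearest point per pixel, even when several points share a pixel.
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    return depth

# A visible point and an occluded point behind it on the same ray.
pts = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0]])
same_view = render_depth(pts, np.eye(3), np.zeros(3), 2.0, 2.0, 1.0, 1.0, 2, 2)
moved_back = render_depth(pts, np.eye(3), np.array([0.0, 0.0, 1.0]),
                          2.0, 2.0, 1.0, 1.0, 2, 2)
```

Calling such a renderer once per camera pose gives a sequence of depth maps that all come from the same 3D points, so the guidance is geometrically consistent across frames by construction.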
BibTeX
@misc{zhang2026worldtracing,
  title         = {World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible},
  author        = {Hao Zhang and Mohamed El Banani and Jen-Hao Cheng and Paul Zhang
                   and Yi Hua and Ben Mildenhall and Christoph Lassner
                   and Narendra Ahuja and Gengshan Yang},
  year          = {2026},
  eprint        = {TODO},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
Acknowledgements
We thank Fei-Fei Li, Justin Johnson, Bardienus Duisterhof, Justin Cui, and Zixuan Huang for valuable discussions and feedback throughout this project. The multi-view human capture sequences in the dynamic gallery are sourced from the DNA-Rendering benchmark; the in-the-wild animal clips come from DAVIS and Consistent4D. The interactive 3D viewer is powered by Three.js and <model-viewer>; exported sessions open in Rerun. Page layout inspired by World Labs.