World Tracing

Generative Pixel-Aligned Geometry Beyond the Visible

Hao Zhang (1,2) · Mohamed El Banani (1) · Jen-Hao Cheng (1) · Paul Zhang (1)
Yi Hua (1) · Ben Mildenhall (1) · Christoph Lassner (1) · Narendra Ahuja (2) · Gengshan Yang (1)

(1) World Labs    (2) University of Illinois Urbana-Champaign

A single image becomes a layered 3D world, across objects, scenes, and dynamic content.

Pixel-aligned multilayer geometry

An ordered stack of 3D points per pixel: visible surfaces and the geometry hidden behind them.

Object

From a single object photo, we predict every layer of geometry along each pixel's ray: the visible surface and the occluded surfaces of the object behind it.


Scene

The same predictor scales to full indoor and outdoor scenes: multi-room interiors, facades, and gardens each get an ordered stack of 3D points per pixel.


Dynamic

For short video clips we predict per-frame layered point clouds that stay 3D-consistent as the subject moves.

Applications enabled by World Tracing

The layered geometry plugs into existing 3D and video models, with no retraining needed.

Training-free textured mesh

Lift our layered point cloud into a textured mesh by plugging it into off-the-shelf mesh generators — no fine-tuning needed.


3D scene editing

Add, replace or remove objects in a scene — the layered geometry keeps every edit consistent across novel views.


Geometry-guided video

Pair our predicted geometry with a video diffusion model to render temporally consistent fly-throughs that obey 3D structure.


Abstract

Single-view-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that produces 3D points faithfully reproducing the input image, while containing complete geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation as a world-tracing diffusion transformer (WT-DiT) that treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching using a mixed noise schedule to balance reconstruction and generation capability. As a result, World Tracing demonstrates strong performance on both visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. Furthermore, because it preserves 2D-to-3D correspondence, it directly enables text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
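
To make the "ordered stack" concrete, the following is a minimal sketch of what the layers along one pixel ray look like for a toy scene of analytic spheres. It is not the WT-DiT predictor, which generates these layers from an image. The helper names (`ray_sphere_hits`, `trace_layers`), the sphere scene, and the choice to count every surface crossing (entry and exit) as a layer are assumptions made for illustration.

```python
import numpy as np

def ray_sphere_hits(origin, direction, center, radius):
    """Return the positive ray parameters t at which a unit-direction ray meets a sphere."""
    oc = origin - center
    b = np.dot(oc, direction)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - c
    if disc < 0:
        return []  # the ray misses this sphere
    root = np.sqrt(disc)
    return [t for t in (-b - root, -b + root) if t > 1e-9]

def trace_layers(direction, spheres, max_layers=6):
    """Ordered stack of camera-space hit points along one pixel ray, front to back.

    Assumption: every surface crossing counts as one layer, and the stack is
    truncated to max_layers entries.
    """
    origin = np.zeros(3)  # the camera sits at the origin of camera space
    direction = direction / np.linalg.norm(direction)
    ts = []
    for center, radius in spheres:
        ts.extend(ray_sphere_hits(origin, direction, np.asarray(center, float), radius))
    ts = sorted(ts)[:max_layers]  # front-to-back ordering
    return np.array([origin + t * direction for t in ts]).reshape(-1, 3)

# Toy scene: a large sphere hiding a smaller one behind it.
spheres = [((0.0, 0.0, 3.0), 1.0), ((0.0, 0.0, 6.0), 0.5)]
layers = trace_layers(np.array([0.0, 0.0, 1.0]), spheres)
# layers[0] is the visible surface; the remaining rows are occluded surfaces.
```

A depth estimator would return only `layers[0]` for this pixel; the remaining rows are the geometry "beyond the visible" that the representation is designed to carry.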

Interactive examples

World Tracing produces pixel-aligned layered geometry from a single image or a monocular video. For each object, we additionally lift our point cloud into a textured mesh without any training. Each example's full layered recording can be opened in Rerun.

Multilayer depth predictions

World Tracing predicts an ordered stack of six camera-space depth layers per pixel. The animation cycles through the layers using the turbo colormap: Layer 0 is the visible surface, and deeper layers progressively populate occluded geometry behind near objects. Each card pairs the input image (left) with the full multilayer depth animation (right).

Examples (each pairs an input image with its multilayer depth animation):
3D-FRONT bedroom (3D-FRONT), two examples
Modern living room (Generated)
Master bedroom (Generated)
Modern kitchen (Generated)
Dining room (Generated)
Luxury bathroom (Generated)
Art studio (Generated)
Rustic cabin (Generated)
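
The per-pixel depth layers in this section can be converted into a layered point cloud by unprojecting each layer through the camera. The sketch below assumes a pinhole camera, z-depth measured along the optical axis, pixel centres at half-integer coordinates, and NaN marking rays that have no surface at a given layer. The page does not state any of these conventions, and `unproject_layers` is a hypothetical helper name.

```python
import numpy as np

def unproject_layers(depth, fx, fy, cx, cy):
    """Unproject stacked depth layers into camera-space points.

    depth: (L, H, W) array of z-depths; NaN or non-positive entries are treated
           as "no surface at this layer" (an assumed convention).
    Returns points of shape (L, H, W, 3) and a boolean valid mask of shape (L, H, W).
    """
    num_layers, height, width = depth.shape
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    x_over_z = (u + 0.5 - cx) / fx  # ray slope in x for each pixel centre
    y_over_z = (v + 0.5 - cy) / fy  # ray slope in y for each pixel centre
    z = np.nan_to_num(depth, nan=0.0, posinf=0.0, neginf=0.0)
    valid = z > 0
    points = np.stack([x_over_z[None] * z, y_over_z[None] * z, z], axis=-1)
    return points, valid

# Two layers over a 2x2 image; one pixel has nothing behind its visible surface.
depth = np.full((2, 2, 2), 2.0)
depth[1, 0, 0] = np.nan
points, valid = unproject_layers(depth, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
cloud = points[valid]  # flat (N, 3) point cloud containing every valid layer
```

Because every layer shares the same pixel grid, `points[k, v, u]` stays tied to input pixel `(u, v)` for every layer `k`, which is the 2D-to-3D correspondence the abstract refers to.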

Training-free textured-mesh generation

For each example we lift our layered point cloud into a textured mesh without any fine-tuning. The accompanying video walks through the full pipeline: input image → first depth layer → all layers → textured mesh.

3D scene editing

Our layered geometry treats every object as a fully realised 3D asset. We can insert, replace, or remove objects in a scene point cloud, and the edit propagates consistently through novel views.
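
As a rough illustration of what insert, replace, and remove mean at the point-cloud level, the sketch below edits a flat `(N, 3)` point cloud using a hand-specified axis-aligned box. This is an assumption for illustration only: the abstract describes the editing as text-driven, and the page does not say how objects are selected. The function names are hypothetical.

```python
import numpy as np

def remove_in_box(points, box_min, box_max):
    """Remove: drop every point inside the axis-aligned box [box_min, box_max]."""
    box_min, box_max = np.asarray(box_min, float), np.asarray(box_max, float)
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[~inside]

def insert_object(points, object_points, translation):
    """Insert: append an object's points, shifted by a translation."""
    shifted = object_points + np.asarray(translation, float)
    return np.concatenate([points, shifted], axis=0)

def replace_in_box(points, box_min, box_max, object_points):
    """Replace: clear the box, then place the new object centred in it."""
    kept = remove_in_box(points, box_min, box_max)
    center = (np.asarray(box_min, float) + np.asarray(box_max, float)) / 2
    centred = object_points - object_points.mean(axis=0)
    return insert_object(kept, centred, center)

# A point at z=2 occludes a background point at z=5 along the same ray.
scene = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 2.0], [0.0, 0.0, 5.0]])
removed = remove_in_box(scene, [-1, -1, 1.5], [1, 1, 2.5])
new_object = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
replaced = replace_in_box(scene, [-1, -1, 1.5], [1, 1, 2.5], new_object)
```

In this toy case the background point at z=5 is still present after the removal, which shows why layered geometry helps: the surface that the removed object was hiding already exists in the point cloud.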

Geometry-guided novel-view video

Pair our predicted geometry with a video diffusion model (Wan2.2 VACE) to produce temporally consistent novel-view fly-through videos that obey the underlying 3D structure.
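
One plausible way to turn the predicted geometry into a per-frame guidance signal is to render the point cloud from each camera pose along the fly-through. The sketch below does this with a simple z-buffer point splat that produces a depth map. Whether depth maps are the conditioning actually passed to the video model is an assumption; the page does not specify it, and no Wan2.2 VACE API is used here. `render_depth` is a hypothetical helper name.

```python
import numpy as np

def render_depth(points, rotation, translation, fx, fy, cx, cy, height, width):
    """Render a depth map of a point cloud from a new camera pose.

    points: (N, 3) in the source camera frame; rotation (3, 3) and translation (3,)
    map them into the novel camera frame. Pixels that receive no point stay at np.inf.
    """
    cam = points @ rotation.T + translation
    z = cam[:, 2]
    front = z > 1e-6  # keep only points in front of the novel camera
    cam, z = cam[front], z[front]
    u = np.floor(fx * cam[:, 0] / z + cx).astype(int)
    v = np.floor(fy * cam[:, 1] / z + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf)
    # Z-buffer: keep the nearest depth among all points that land in the same pixel.
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    return depth

cloud = np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 5.0], [1.0, 0.0, 2.0]])
identity = np.eye(3)
source_view = render_depth(cloud, identity, np.zeros(3), 2.0, 2.0, 2.0, 2.0, 4, 4)
# Novel view: the camera is shifted one unit along +x, so points shift by -1 in x.
novel_view = render_depth(cloud, identity, np.array([-1.0, 0.0, 0.0]), 2.0, 2.0, 2.0, 2.0, 4, 4)
```

Rendering one such frame per pose along a camera path gives a sequence of geometry frames that agree with each other by construction, since they all come from the same 3D points.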

BibTeX

@misc{zhang2026worldtracing,
  title         = {World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible},
  author        = {Hao Zhang and Mohamed El Banani and Jen-Hao Cheng and Paul Zhang
                   and Yi Hua and Ben Mildenhall and Christoph Lassner
                   and Narendra Ahuja and Gengshan Yang},
  year          = {2026},
  eprint        = {TODO},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}

Acknowledgements

We thank Fei-Fei Li, Justin Johnson, Bardienus Duisterhof, Justin Cui, and Zixuan Huang for valuable discussions and feedback throughout this project. The multi-view human capture sequences in the dynamic gallery are sourced from the DNA-Rendering benchmark; the in-the-wild animal clips come from DAVIS and Consistent4D. The interactive 3D viewer is powered by Three.js and <model-viewer>; exported sessions open in Rerun. Page layout inspired by World Labs.