World Tracing — Generative Pixel-Aligned Geometry Beyond the Visible

01 What World Tracing can do

A single image becomes a layered 3D world, across objects, scenes, and dynamic content.

Pixel-aligned multilayer geometry

01

Object

From a single object photo, we predict every layer of geometry along each pixel's ray: the visible surface and the occluded surfaces behind it.


02

Scene

The same predictor scales to full indoor and outdoor scenes: multi-room interiors, facades, and gardens each get a layered stack of 3D points per pixel.


03

Dynamic

For short video clips we predict per-frame layered point clouds that stay 3D-consistent as the subject moves.


Applications enabled by World Tracing

04

Training-free textured mesh

Lift our layered point cloud into a textured mesh by plugging it into off-the-shelf mesh generators — no fine-tuning needed.


05

3D scene editing

Add, replace, or remove objects in a scene; the layered geometry keeps every edit consistent across novel views.


06

Geometry-guided video

Pair our predicted geometry with a video diffusion model to render temporally consistent fly-throughs that obey 3D structure.


02 The paper in one paragraph

Abstract

Single-view to 3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned to the input. We introduce World Tracing, a generative pixel-aligned geometry representation that produces 3D points faithfully reproducing the input image, while containing complete geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation as a world-tracing diffusion transformer (WT-DiT) that treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching using a mixed noise schedule to balance reconstruction vs. generation capability. As a result, World Tracing demonstrates strong performance on both visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. Furthermore, because it preserves 2D-to-3D correspondence, it directly enables text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
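To make the training objective in the abstract more concrete, here is a minimal NumPy sketch of a linear-path flow-matching pair: a clean layered point map is blended with Gaussian noise, and the regression target is the velocity between them. This is a sketch under stated assumptions, not the authors' implementation. The tensor layout (batch, layers, xyz, height, width), the convention that t = 1 is pure noise, and the function names are all assumptions. The mixed noise schedule is not specified on this page, so `sample_timesteps` stands in for it with a purely illustrative two-component mixture.

```python
import numpy as np

def sample_timesteps(rng, n, p_noisy=0.5):
    """Draw n timesteps in [0, 1] from a two-component mixture.

    Hypothetical stand-in for a "mixed" schedule: the real schedule is not
    described on this page.
    """
    uniform_t = rng.uniform(0.0, 1.0, size=n)
    # Logit-normal component skewed towards t = 1 (mostly noise).
    noisy_t = 1.0 / (1.0 + np.exp(-rng.normal(1.5, 1.0, size=n)))
    use_noisy = rng.uniform(size=n) < p_noisy
    return np.where(use_noisy, noisy_t, uniform_t)

def flow_matching_pair(x_clean, t, rng):
    """Return the noisy input x_t and the velocity target for a linear path."""
    noise = rng.standard_normal(x_clean.shape)
    # Reshape t from (B,) to (B, 1, 1, ...) so it broadcasts over each sample.
    t = t.reshape(-1, *([1] * (x_clean.ndim - 1)))
    x_t = (1.0 - t) * x_clean + t * noise
    velocity = noise - x_clean  # d x_t / d t along the linear path
    return x_t, velocity

rng = np.random.default_rng(0)
# Toy batch: 2 images, 6 layers, xyz channels, 8x8 pixels (assumed layout).
x_clean = rng.standard_normal((2, 6, 3, 8, 8))
t = sample_timesteps(rng, n=2)
x_t, velocity = flow_matching_pair(x_clean, t, rng)
```

A denoiser would then be trained to regress `velocity` from `x_t`, `t`, and the input image, for example with a mean-squared error.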

03 Explore the results

Interactive examples

World Tracing produces pixel-aligned layered geometry from a single image or a monocular video. For each object, we additionally lift our point cloud into a textured mesh without any training. Click any thumbnail to load a different example; click the Rerun button below each viewer to open the full layered recording.

04 Layered output

Multilayer depth predictions

World Tracing predicts an ordered stack of six camera-space depth layers per pixel. The animation cycles through the layers using the turbo colormap: Layer 0 is the visible surface, and deeper layers progressively populate occluded geometry behind near objects. Each card pairs the input image (left) with the full multilayer depth animation (right).
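The link between a stack of depth layers and a pixel-aligned point cloud can be sketched as a standard pinhole unprojection: every layer of a pixel lies on that pixel's camera ray, so all layers reproject to the same image location. The sketch below is illustrative only. It assumes z-depth, a pinhole camera with known intrinsics `K`, pixel centres at half-integer coordinates, and that a non-finite or non-positive depth marks a ray with no surface at that layer; none of these conventions are confirmed by this page, and the function names are hypothetical.

```python
import numpy as np

def unproject_layers(depth, K):
    """Lift (L, H, W) z-depth layers to (L, H, W, 3) camera-space points."""
    _, H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Pixel-centre coordinates, each of shape (H, W).
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    x = (u - cx) / fx * depth  # broadcasts over the layer axis
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def valid_mask(depth):
    """Assumed convention: non-finite or non-positive depth means no surface."""
    return np.isfinite(depth) & (depth > 0)

# Toy example: six layers on a 4x5 image, each 0.5 units deeper than the last.
K = np.array([[100.0, 0.0, 2.5],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.stack([np.full((4, 5), 1.0 + 0.5 * k) for k in range(6)])
depth[3:, 0, 0] = np.nan  # this ray hits only three surfaces

points = unproject_layers(depth, K)   # (6, 4, 5, 3), still indexed by pixel
cloud = points[valid_mask(depth)]     # (N, 3) flat point cloud
```

Keeping `points` indexed by layer and pixel preserves the 2D-to-3D correspondence; flattening to `cloud` is only needed for viewers or tools that expect an unordered point set.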

Examples (each card: input image and multilayer depth animation):

3D-FRONT bedroom (3D-FRONT)
3D-FRONT bedroom (3D-FRONT)
Modern living room (Generated)
Master bedroom (Generated)
Modern kitchen (Generated)
Dining room (Generated)
Luxury bathroom (Generated)
Art studio (Generated)
Rustic cabin (Generated)

05 Applications

Training-free textured-mesh generation

For each example we lift our layered point cloud into a textured mesh without any fine-tuning. The video below walks through the full pipeline: input image → first depth layer → all layers → textured mesh. Click any thumbnail to swap the object.

06 Applications

3D scene editing

Our layered geometry treats every object as a fully realised 3D asset. We can insert, replace, or remove objects in a scene point cloud, and the edit propagates consistently through novel views.
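At the point-cloud level, the insert, replace, and remove operations can be sketched with plain array operations. The example below is a minimal illustration, assuming the scene and the object are both `(N, 3)` arrays in the same camera-space units and that the region to edit is given as an axis-aligned box. The function names are hypothetical, and how the actual system selects objects (for instance from a text prompt) is not covered here.

```python
import numpy as np

def remove_in_box(points, box_min, box_max):
    """Drop every point inside the axis-aligned box [box_min, box_max]."""
    inside = np.all((points >= box_min) & (points <= box_max), axis=1)
    return points[~inside]

def insert_object(scene, obj, rotation, translation):
    """Place an object point cloud into the scene with a rigid transform."""
    placed = obj @ rotation.T + translation
    return np.concatenate([scene, placed], axis=0)

def replace_in_box(scene, obj, box_min, box_max, rotation, translation):
    """Replace = remove the old object's region, then insert the new object."""
    cleared = remove_in_box(scene, box_min, box_max)
    return insert_object(cleared, obj, rotation, translation)

# Toy scene: a 3x3x3 grid of points; the new object is a single point.
grid = np.arange(3.0)
scene = np.stack(np.meshgrid(grid, grid, grid), axis=-1).reshape(-1, 3)
obj = np.zeros((1, 3))

edited = replace_in_box(
    scene, obj,
    box_min=np.array([0.5, 0.5, 0.5]),
    box_max=np.array([1.5, 1.5, 1.5]),
    rotation=np.eye(3),
    translation=np.array([1.0, 1.0, 1.0]),
)
```

Because the edit happens in 3D rather than in the image, any novel view rendered from `edited` sees the same change.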

07 Applications

Geometry-guided novel-view video

Pair our predicted geometry with a video diffusion model (Wan2.2 VACE) to produce temporally consistent novel-view fly-through videos that obey the underlying 3D structure.
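One common way to turn a point cloud into a per-frame geometric signal is to project it into each novel camera and keep the nearest point per pixel. The sketch below shows that z-buffered projection in NumPy. It is an assumption-laden illustration: this page does not describe how the geometry is actually passed to Wan2.2 VACE, and the pinhole model, the world-to-camera convention (`R`, `t`), and the function name are all hypothetical.

```python
import numpy as np

def render_depth(points, K, R, t, hw):
    """Z-buffered depth map of (N, 3) points seen from camera pose (R, t)."""
    H, W = hw
    cam = points @ R.T + t              # scene frame -> novel camera frame
    z = cam[:, 2]
    front = z > 1e-6                    # discard points behind the camera
    cam, z = cam[front], z[front]
    u = np.floor(cam[:, 0] / z * K[0, 0] + K[0, 2]).astype(int)
    v = np.floor(cam[:, 1] / z * K[1, 1] + K[1, 2]).astype(int)
    in_view = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth = np.full((H, W), np.inf)     # inf marks pixels with no geometry
    # Unbuffered minimum: the nearest point wins when several share a pixel.
    np.minimum.at(depth, (v[in_view], u[in_view]), z[in_view])
    return depth

K = np.array([[10.0, 0.0, 2.0],
              [0.0, 10.0, 2.0],
              [0.0, 0.0, 1.0]])
# Two points on the same ray plus one behind the camera.
pts = np.array([[0.0, 0.0, 2.0],
                [0.0, 0.0, 3.0],
                [0.0, 0.0, -1.0]])
depth_map = render_depth(pts, K, np.eye(3), np.zeros(3), (4, 4))
```

With only a visible-surface point cloud, a moved camera exposes holes where occluded geometry should be; the deeper layers are what would fill those pixels in a rendering like this.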

08 Cite this work

BibTeX

@misc{zhang2026worldtracinggenerativepixelaligned,
  title         = {World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible},
  author        = {Hao Zhang and Mohamed El Banani and Jen-Hao Cheng and Paul Zhang
                   and Yi Hua and Ben Mildenhall and Christoph Lassner
                   and Narendra Ahuja and Gengshan Yang},
  year          = {2026},
  eprint        = {2606.13652},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.13652}
}

09 Thanks

Acknowledgements

We thank Fei-Fei Li, Justin Johnson, Bardienus Duisterhof, Justin Cui, and Zixuan Huang for valuable discussions and feedback throughout this project. The multi-view human capture sequences in the dynamic gallery are sourced from the DNA-Rendering benchmark; the in-the-wild animal clips come from DAVIS and Consistent4D. The interactive 3D viewer is powered by Three.js and <model-viewer>; exported sessions open in Rerun. Page layout inspired by World Labs.