Have you ever wondered how TRELLIS 2 can generate a 3D model from a flat photo? More amazingly, why does it work so much better than other tools?

Today I'll explain TRELLIS 2's core technology in plain terms. Don't worry, I promise not to throw incomprehensible research papers at you. I'll just use analogies and real examples to walk through concepts like O-Voxel, SLAT, and two-stage generation.

First, A Question: Why Is 3D Generation So Hard?

Imagine turning a photo into a 3D sculpture. The traditional approach goes like this:

Guess the shape: Infer the object's depth and shape from the photo. (The problem: photos are flat, so how do you guess the back?)

Build the framework: Construct a skeleton out of points and lines, like building a sculpture frame out of wire.

Apply the clay: Fill in textures and details on the frame, like pressing clay onto the wire.

Sounds reasonable, but there are big problems. It's slow: each step requires precise calculations and can take hours or even days. The quality is poor: the back can't be guessed accurately, so the result often breaks down. And it's inflexible: hollow structures and transparent materials don't work.

TRELLIS 2's breakthrough is that it takes a completely different approach.

Core Innovation One: O-Voxel (Omnipotent Voxel)

What Is a Voxel? Start With Pixels

You know that images are made of pixels. A 1920x1080 image is just over 2 million small squares pieced together, each with its own color.

A Voxel is the 3D version of a pixel: 3D space is divided into countless small cubes, each with its own properties.

The Fatal Problem With Traditional Voxels

Early 3D generation also used Voxels, but had a critical flaw: far too much data.

Imagine a 1024x1024x1024 3D grid. That's over 1 billion small cubes! If each one stores color and material, memory runs out almost immediately.
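To put a number on "too much data," here is a quick back-of-the-envelope calculation. The 8 bytes per Voxel is purely my illustrative assumption (roughly a color plus a few material channels), not a figure from TRELLIS 2.

```python
# Back-of-the-envelope cost of a dense voxel grid.
# BYTES_PER_VOXEL is an illustrative assumption, not a TRELLIS 2 figure.
BYTES_PER_VOXEL = 8


def dense_grid_bytes(resolution: int, bytes_per_voxel: int = BYTES_PER_VOXEL) -> int:
    """Memory needed if every cell of a resolution^3 grid is stored."""
    return resolution ** 3 * bytes_per_voxel


for res in (256, 512, 1024):
    voxels = res ** 3
    gib = dense_grid_bytes(res) / 2 ** 30
    print(f"{res}^3 = {voxels:,} voxels -> {gib:.3f} GiB")
```

Every time the resolution doubles, the cost goes up eightfold, which is why dense grids stop being practical so quickly.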

So everyone compromised: either lower the resolution (and lose detail) or store only the surface (and lose the ability to handle hollow structures).

That's why, with earlier 3D generation models, even something as simple as a cup came out as a solid block or with a messy interior.

O-Voxel: Sparse and Smart

TRELLIS 2's O-Voxel has two key features:

Feature One: Sparse Storage (Store only useful parts)

Think of an image that's mostly white: you don't need to store every white pixel, just the colored parts. O-Voxel works the same way. It only stores the space actually occupied by the object and skips the air. A grid that started as 1 billion cubes might only need to store a few hundred thousand, which is thousands of times smaller!
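Here is a small sketch of the idea, using a thin hollow sphere on a 64³ grid. It is a toy illustration in NumPy, not TRELLIS 2 code. The function name, radius, and shell thickness are all made up for the example, and the dense grid is only built temporarily to find the shell.

```python
import numpy as np


def hollow_sphere_voxels(resolution: int, thickness: float = 1.0) -> np.ndarray:
    """Return integer (x, y, z) coordinates of voxels on a thin spherical shell."""
    axis = np.arange(resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    center = (resolution - 1) / 2
    radius = resolution * 0.4
    dist = np.sqrt((x - center) ** 2 + (y - center) ** 2 + (z - center) ** 2)
    # Keep only cells whose distance from the center is close to the radius.
    occupied = np.abs(dist - radius) <= thickness / 2
    return np.argwhere(occupied)


RES = 64
coords = hollow_sphere_voxels(RES)  # the sparse representation: a coordinate list
dense_cells = RES ** 3

print(f"dense cells:  {dense_cells:,}")
print(f"stored cells: {len(coords):,}")
print(f"saving:       {dense_cells / len(coords):.1f}x")
```

Only the shell ends up in the coordinate list. The air outside and the cavity inside are simply never stored, and the saving grows with resolution because surfaces scale with the square of the edge length while volume scales with the cube.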

Feature Two: Omnipotent Attributes (Not just placeholders)

Traditional Voxels only mark "something is here," like placeholders. Each O-Voxel cube comes with a complete information package: geometric info (what the shape is at this spot), surface properties (color, roughness, metallic, and transparency, i.e. complete PBR materials), and topology info (how it relates to the surrounding cubes, which is what supports hollow shapes, thin walls, and non-manifold geometry).
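To make the "information package" concrete, here is one way such a record could look in code. Treat it as a mental model only: the class, the field names, and the layout are my own hypothetical choices, and the real O-Voxel format is not spelled out in this article.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


@dataclass(frozen=True)
class VoxelAttributes:
    """Hypothetical per-voxel payload; field names are illustrative only."""
    surface_offset: Tuple[float, float, float]  # where the surface sits inside the cell
    base_color: Tuple[float, float, float]      # RGB in [0, 1]
    metallic: float
    roughness: float
    opacity: float


# Sparse grid: only occupied cells get an entry, keyed by integer coordinates.
SparseGrid = Dict[Coord, VoxelAttributes]

grid: SparseGrid = {
    (10, 4, 7): VoxelAttributes((0.5, 0.5, 0.2), (0.75, 0.75, 0.78), 0.9, 0.3, 1.0),
}


def is_empty(grid: SparseGrid, cell: Coord) -> bool:
    """A cell with no entry is air, which is how cavities stay hollow."""
    return cell not in grid
```

The point of the sketch is the contrast: a plain occupancy grid would store a single bit per cell, while here each occupied cell carries its own geometry and material values.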

Analogy: a traditional Voxel is like a plain building block that just fills a slot. An O-Voxel is like smart modular furniture: each piece has its own function and can be combined flexibly.

Actual Effect: Why Can It Handle Hollow Structures?

Why can't traditional Voxels do hollow shapes? Because each cube can only say "object present" or "object absent." It can't express "this is a thin wall, that is a cavity."

O-Voxel can. It marks each region's properties precisely: this block is the railing body (metal), that block is the cavity in the middle (air). It can even capture that the railing is only 2mm thick.

The result? In a generated railing, every bar is crystal clear and truly hollow, not a solid block.

Core Innovation Two: SLAT (Structured Latent Representation)

Problem: O-Voxel Is Still Too Big

Even with sparse storage, hundreds of thousands of smart cubes are still a massive amount of data for an AI model. Train on them directly? The compute can't keep up.

This is where SLAT (Structured LATent) comes in: a super-efficient compression system.

What Is SLAT? Imagine A Super Packing Machine

You have a pile of LEGO bricks to mail to a distant friend. You wouldn't pack them one by one, right? You would sort and pack (group bricks of the same type together), compress the space (squeeze out the extra air and keep only the core structure), and label positions (record where each pack sits in the original model).

SLAT does exactly this! It compresses a 1024³ O-Voxel grid into about 9600 latent tokens using 16x spatial downsampling, with almost no loss of visual quality.
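The numbers are easier to appreciate with a little arithmetic. The sketch below reads the 16x as a per-axis spatial downsampling factor, which is how the SC-VAE section later in this article describes it. The constants come from this article; the variable names are mine.

```python
RESOLUTION = 1024      # voxel grid edge length quoted in this article
DOWNSAMPLE = 16        # per-axis spatial downsampling factor
ACTIVE_TOKENS = 9600   # approximate latent token count quoted in this article

latent_edge = RESOLUTION // DOWNSAMPLE   # 64 cells along each axis
latent_cells = latent_edge ** 3          # 262,144 cells if the latent grid were dense
voxels_per_token = DOWNSAMPLE ** 3       # 4,096 fine voxels covered by one latent cell
occupancy = ACTIVE_TOKENS / latent_cells

print(f"latent grid: {latent_edge}^3 = {latent_cells:,} cells")
print(f"each latent cell covers {voxels_per_token:,} fine voxels")
print(f"about {occupancy:.1%} of latent cells carry a token")
```

So the saving comes from two places at once: each latent cell stands in for a 16x16x16 block of Voxels, and only the small fraction of cells that touch the object get a token at all.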

Why Is It Called Structured?

The key is preserving spatial relationships. Traditional compression scrambles the order, like dumping puzzle pieces into a bag. SLAT ensures two things. First, Voxels that were originally adjacent stay close together in the compressed latent space. Second, the compressed structure is still a 3D grid, so it can be decoded directly back into a Mesh, Gaussians, or a NeRF.
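The "neighbors stay neighbors" property can be shown with nothing more than integer division on coordinates. This is only the coordinate bookkeeping. The real encoder is a learned network, and `downsample_coords` is a hypothetical helper name of mine.

```python
import numpy as np


def downsample_coords(coords: np.ndarray, factor: int) -> np.ndarray:
    """Map fine voxel coordinates to the unique coarse cells that contain them."""
    return np.unique(coords // factor, axis=0)


fine = np.array([
    [0, 0, 0],
    [1, 0, 0],      # right next to the first voxel
    [15, 15, 15],   # still inside the same 16x16x16 block
    [16, 0, 0],     # one step over the block boundary
    [40, 40, 40],   # far away
])

coarse = downsample_coords(fine, 16)
print(coarse)
# [[0 0 0]
#  [1 0 0]
#  [2 2 2]]
```

The first three Voxels land in the same coarse cell, the fourth lands in the cell right beside it, and the distant one stays distant. The compressed result is still a set of 3D grid positions, not a shuffled bag.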

Analogy: traditional compression is like dumping puzzle pieces into a bag (position info lost). SLAT is like laying the puzzle sections flat while keeping track of where each area goes (position info preserved).

This is why models generated by TRELLIS 2 have such clean topology: spatial relationships stay clear from start to finish.

Two-Stage Generation: Build Skeleton First, Fill Details Second

TRELLIS 2's generation process has two stages.

Stage One: Sparse Structure Generation - Quick Outline

Goal: Quickly determine the object's rough outline.

Given an input image, the model first generates a sparse Voxel grid that marks where the object is and where there is only air.

Analogy: like a sculptor first sketching the rough outline, ignoring details and just settling posture and proportion.

Why this step? Because generating a complete high-resolution model in one go is too difficult. Settling the skeleton first avoids major revisions later. (Imagine getting halfway through a sculpture and finding the posture is wrong: all that work is wasted.)

Actual time: About 5 seconds (512³ resolution)

Stage Two: SLAT Generation - Fill Details

Goal: Fill in geometric details and surface materials on top of the skeleton.

This step does the heavy lifting. It uses the stage one sparse structure as a constraint, generates the SLAT representation (filling in each Voxel's complete attributes), and uses a visual foundation model (like DINOv2) to infer the parts you can't see (the backside, the internal structure).

Analogy: the sculptor carving details into the outline, then coloring and polishing the surface.

Actual time: About 12 seconds (1024³ resolution, H100 GPU)

Why Not Do It in One Step?

The two-step approach has three benefits. It's efficient: stage one is rough but fast and locks in the main direction, while stage two is fine-grained but guided. The quality is good: with the skeleton as a constraint, you won't get a monster with a big head and tiny legs. And it's flexible: you can rerun only stage two to change the materials while keeping the shape.
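The flexibility point is easiest to see as control flow. The sketch below is a toy with stand-in functions, not the TRELLIS 2 API. `generate_structure` and `generate_details` are hypothetical names, and the "models" inside them are trivial placeholders.

```python
from typing import Dict, List, Set, Tuple

Coord = Tuple[int, int, int]
Material = Dict[str, float]


def generate_structure(silhouette: List[List[int]]) -> Set[Coord]:
    """Stage one stand-in: turn a 2D mask into a one-voxel-deep occupancy set."""
    return {
        (x, y, 0)
        for y, row in enumerate(silhouette)
        for x, filled in enumerate(row)
        if filled
    }


def generate_details(structure: Set[Coord], material: Material) -> Dict[Coord, Material]:
    """Stage two stand-in: attach attributes only where stage one marked occupancy."""
    return {coord: dict(material) for coord in structure}


mask = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

structure = generate_structure(mask)  # run stage one once
steel = generate_details(structure, {"metallic": 0.9, "roughness": 0.3})
gold = generate_details(structure, {"metallic": 1.0, "roughness": 0.2})

print(len(structure), "occupied cells shared by both variants")
```

Both variants share exactly the same occupied cells. Only the attributes attached to them differ, which is the "change materials but keep shape" idea in miniature.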

Actual Example: Generating A Hollow Metal Ball

Let's walk through the complete process:

Input: You upload a photo of a hollow metal sculpture (a single view, so only one side is visible).

Stage One: Sparse structure generation (about 5 seconds). The model analyzes the photo and generates a 512³ sparse Voxel grid. It marks the ball's outer outline, identifies the hollow parts (this is empty), and infers the invisible backside (using symmetry and common sense). The output is a skeleton grid marking the hundreds of thousands of Voxel positions where the object exists.

Stage Two: SLAT generation (about 12 seconds, 1024³ resolution). The model fills in details on top of the skeleton. Assign PBR materials: Base Color (the metal's silver-gray, with slight discoloration from oxidation), Metallic (0.9: highly metallic, but not a perfect mirror), Roughness (0.3: slightly rough, with a matte feel), Opacity (1.0: opaque, but the hollow parts are empty Voxels, so you can see straight through them). Refine geometry: the hollow edges are smooth, not jagged, the sphere's curvature is uniform, and the surface has slight bumps that simulate hand-forged marks. Infer the backside: complete the invisible parts based on symmetry, so the hollow patterns continue coherently around the back.

Result: A ready-to-use asset. The Mesh has clean topology and only 50k faces (a traditional method might produce 500k faces and still be messy). The PBR materials import directly into Unreal Engine with realistic lighting. The hollow structure is handled properly and isn't filled in as a solid ball. Total time: 17 seconds (H100 GPU).

Why Is TRELLIS 2 So Fast?

Sparse Compression VAE: A traditional VAE compresses its whole input. TRELLIS 2's SC-VAE is optimized specifically for sparse 3D data: it only processes Voxels that contain the object and skips the air. It applies 16x spatial downsampling with almost no loss of visual quality.

Flow-Matching: TRELLIS 2 doesn't use traditional Diffusion. It uses Flow-Matching, a more efficient generation method that converges faster. Analogy: Diffusion is like an image slowly clearing out of a cloud of noise (like a photo developing), while Flow-Matching follows a direct path to the target (like GPS navigation).
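The GPS analogy can be sketched in a few lines. In flow matching, sampling means integrating a velocity field from noise (t = 0) to data (t = 1). In a real model that velocity comes from a trained network. Here I cheat and plug in the ideal straight-line velocity toward a known target, so this is a toy, not TRELLIS 2's sampler, and every name in it is my own.

```python
import numpy as np


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


target = np.array([2.0, -1.0, 0.5])  # stand-in for "the data" we want to reach


def ideal_velocity(x, t):
    """Velocity of the straight path to `target`: remaining distance over remaining time."""
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
noise = rng.standard_normal(3)  # start from random noise

result = euler_sample(noise, ideal_velocity, steps=4)
print("reached target:", np.allclose(result, target))
```

Because the path is a straight line, even a handful of coarse steps lands on the target. A learned velocity field is never this perfect, but straighter paths are the intuition behind needing fewer sampling steps.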

Native 3D VAE: The VAE is trained directly in 3D space, with no 2D intermediary. Traditional methods first project 3D to 2D (multi-view) and then synthesize back to 3D, which loses information. TRELLIS 2 stays in 3D throughout, so fidelity is higher.

Where Do the PBR Materials Come From?

How does the AI know this is metal?

The secret: a visual foundation model plus physical priors. TRELLIS 2 integrates DINOv2, a powerful visual understanding model. It extracts high-level semantic features from the input image (this is metal, this is fabric, this is plastic), combines them with physical priors (metal generally has high Metallic and low Roughness, fabric has low Metallic and high Roughness), and generates the corresponding PBR properties.
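If you want a picture of what a "physical prior" means, imagine a table of plausible ranges per material type. To be clear, this is my simplification for illustration. In the actual model such tendencies would be learned from data rather than looked up, and the ranges below are made-up example values.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]

# Illustrative ranges only; not values taken from TRELLIS 2.
MATERIAL_PRIORS: Dict[str, Dict[str, Range]] = {
    "metal":   {"metallic": (0.8, 1.0), "roughness": (0.1, 0.5)},
    "fabric":  {"metallic": (0.0, 0.1), "roughness": (0.7, 1.0)},
    "plastic": {"metallic": (0.0, 0.1), "roughness": (0.2, 0.6)},
}


def clamp(value: float, bounds: Range) -> float:
    """Restrict a value to the closed interval given by bounds."""
    low, high = bounds
    return min(max(value, low), high)


def apply_prior(label: str, metallic: float, roughness: float) -> Tuple[float, float]:
    """Pull a raw (metallic, roughness) guess into the plausible range for its class."""
    prior = MATERIAL_PRIORS[label]
    return clamp(metallic, prior["metallic"]), clamp(roughness, prior["roughness"])


# A shaky raw guess for something recognized as metal gets nudged into range.
print(apply_prior("metal", metallic=0.5, roughness=0.9))  # (0.8, 0.5)
```

The semantic label does the recognizing, and the prior keeps the numbers physically sensible for that kind of material.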

Analogy: like an experienced materials expert who can tell brushed stainless steel from matte aluminum at a glance.

Why does this matter? Traditional methods generate materials baked into textures, with the lighting information fixed. Change the lighting environment, and the metal stops reflecting or the plastic turns into a mirror.

TRELLIS 2's PBR materials are physically based parameters, so they look realistic under any lighting. Drag the asset into a game engine or renderer, and the lighting just works.
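Here is why parameters beat baked textures. In the common metallic-roughness workflow, a renderer derives the surface's reflectance from Base Color and Metallic at render time, then adjusts it for the viewing angle. The sketch uses the widely used 0.04 reflectance convention for non-metals and Schlick's Fresnel approximation. It shows how engines typically consume such values, not code from TRELLIS 2.

```python
import numpy as np

DIELECTRIC_F0 = 0.04  # common convention for non-metal reflectance at normal incidence


def specular_f0(base_color, metallic: float) -> np.ndarray:
    """Blend between dielectric reflectance and the base color by metalness."""
    base = np.asarray(base_color, dtype=float)
    return DIELECTRIC_F0 * (1.0 - metallic) + base * metallic


def schlick_fresnel(f0: np.ndarray, cos_theta: float) -> np.ndarray:
    """Schlick's approximation: reflectance rises toward 1 at grazing angles."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5


silver_gray = (0.75, 0.75, 0.78)
f0_metal = specular_f0(silver_gray, metallic=0.9)
f0_plastic = specular_f0(silver_gray, metallic=0.0)

print("metal   F0:", f0_metal)
print("plastic F0:", f0_plastic)
print("metal at a grazing angle:", schlick_fresnel(f0_metal, cos_theta=0.1))
```

The same base color gives a strongly reflective surface at Metallic 0.9 and a weakly reflective one at Metallic 0.0. Since the renderer recomputes this for every light and viewing angle, the material responds correctly when the lighting changes, which a baked texture cannot do.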

Technical Innovation Translates To User Value

Technical Innovation	User Value
O-Voxel	Handles hollow shapes, thin walls, and complex topology; no longer limited to solid lumps
SLAT Compression	16x spatial downsampling, fast generation, low VRAM usage
Two-Stage Generation	Stable quality; no monsters with big heads and tiny legs
Complete PBR	Imports into game engines with no adjustment; lighting works out of the box
Sparse VAE	1024³ high resolution still generates fast (17s)
Native 3D	Clean Mesh topology; saves post-optimization time

Some Interesting Details

Why is it called SLAT? SLAT stands for Structured LATent. The emphasis is on structured: after compression it still preserves spatial relationships, rather than collapsing into a one-dimensional vector. This is the biggest difference from traditional compression methods.

Where does the training data come from? TRELLIS 2 was trained on 500k 3D assets, covering game models, 3D-scanned data, artworks, and CAD engineering models. The dataset is open source, so you can use it to train your own model too!

Why does it need 24GB of VRAM? The 4B-parameter model itself needs 16GB, and processing the 1024³ Voxel grid on top of that can push the peak to 24GB. Tip: the 2B-parameter version only needs 12GB. Quality is reduced, but it's still stronger than most competitors.
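The 16GB figure is easy to sanity-check. It matches 4 bytes per parameter, i.e. 32-bit weights. That precision is my assumption; the article doesn't say which one is used. The helper below only counts weights, not activations or the Voxel grid.

```python
def weights_gb(params_billion: float, bytes_per_param: int) -> float:
    """Decimal gigabytes needed just to hold the model weights."""
    return params_billion * 1e9 * bytes_per_param / 1e9


print(f"4B params at 4 bytes each: {weights_gb(4, 4):.0f} GB")
print(f"4B params at 2 bytes each: {weights_gb(4, 2):.0f} GB")
print(f"2B params at 4 bytes each: {weights_gb(2, 4):.0f} GB")
```

Under the same assumption the 2B version's weights take about 8GB, which leaves the rest of its quoted 12GB for working memory during generation.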

Limitations

TRELLIS 2 also has weaknesses:

May add too much detail to minimalist 2D illustrations: If you input super minimalist line art, the model might hallucinate too many details, because it's trained mainly on realistic models. Solution: Adjust the parameters or use the first-generation TRELLIS editing features.

Single-view inference still involves guesswork: TRELLIS 2 is smart, but when it infers a 360° model from a single image, the backside always includes some hallucination. Solution: Provide multi-view input (if available).

Hardware barrier: Not everyone has 24GB of VRAM, and cloud usage costs money. Solution: Use the 2B version or a cloud platform (like Hugging Face Spaces).

Summary

From a technical perspective: O-Voxel solves traditional Voxels' storage and flexibility problems, SLAT's efficient compression makes large-scale generation possible, two-stage generation balances speed and quality, and complete PBR turns the output from a toy into a production tool.

From a user perspective: it's fast (seconds to minutes, not hours), it's good (clean Mesh, realistic materials, rich details), and it saves work (generated assets are directly usable, with no extensive post-processing).

This is why more and more game studios, indie developers, and designers are choosing TRELLIS 2: it has turned AI 3D generation from a flashy tech demo into a practical tool.

Want to learn more? Try TRELLIS 2 Free | View Official Paper | GitHub Open Source Code

Coming up next: a practical tutorial on how to use TRELLIS 2 + Blender to make game assets, covering the complete workflow from image to engine!

The Magic Behind TRELLIS 2: From SLAT to O-Voxel
