Automated scan understanding for film production. Millions of unstructured points go in; organized and individually segmented objects come out.
The pipeline starts from a dense COLMAP reconstruction built from 283 overlapping photographs. The interactive viewer below shows a reduced 800,000-point sample of the reconstruction alongside every camera position and its source image. This is the raw data before any detection, segmentation, or labeling takes place.
Film production teams capture thousands of volumetric scans. Each is a dense, unstructured point cloud with whatever label the operator typed at capture. No object boundaries, no metadata. The pipeline handles rooms, people, and single objects. Rooms and environments are the hardest case: dozens of overlapping objects seen from changing camera angles, all merged into one scan.
My thesis (SegSplat) showed that a pretrained SONATA encoder with a trainable decoder could generalize from RGB-D scans to Gaussian splats. But production needed open-vocabulary labels beyond fixed dataset categories and rich metadata to make assets searchable, and SONATA's training data required a commercial license from the dataset owners. 2D vision-language models solved all three problems.
These captures are not ordered video sequences. Someone walks through a room taking shots from different positions, so there is no frame-to-frame continuity, and traditional sequential roto tools do not work. The pipeline has to find every instance of each object across all views and figure out which detections are the same physical thing. It runs on a single GPU, so models are loaded and unloaded between stages to fit within VRAM. Qwen3-VL stays loaded through detection, verification, and labeling, while Grounding DINO, the embedding model, and SAM3 each load only when needed.
Qwen3-VL, the vision-language model at the core of this system, starts by looking at up to 100 evenly sampled frames and describing what it sees. Room type, style, notable objects, sub-rooms. An open floor plan gets broken down into its kitchen, living, and dining areas. This produces a searchable scene summary that travels with the scan.
Qwen3-VL then goes through the capture in segments of 30 frames, sampling every 10th frame, and lists every object it can find. It focuses on furniture, appliances, decor, plants, lighting, and textiles, ignoring things like objects seen through windows and film equipment. Each segment can identify up to 25 unique objects, each with a simple label and a detailed description. Duplicates across segments are removed automatically.
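The cross-segment deduplication can be sketched as a simple fuzzy-label merge. This is an illustrative version, not the pipeline's actual schema or matching rule: the `label`/`description` keys and the similarity threshold are assumptions.

```python
from difflib import SequenceMatcher

def dedupe_objects(segments, sim_threshold=0.85):
    """Merge per-segment object lists, dropping near-duplicate labels.

    `segments` is a list of per-segment lists of dicts with "label" and
    "description" keys (illustrative names, not the real schema).
    """
    unique = []
    for seg in segments:
        for obj in seg:
            label = obj["label"].strip().lower()
            if any(SequenceMatcher(None, label, u["label"].strip().lower()).ratio()
                   >= sim_threshold for u in unique):
                continue  # near-duplicate of an object from an earlier segment
            unique.append(obj)
    return unique

segments = [
    [{"label": "Leather Sofa", "description": "brown three-seater"}],
    [{"label": "leather sofa", "description": "brown couch"},
     {"label": "floor lamp", "description": "brass arc lamp"}],
]
print([o["label"] for o in dedupe_objects(segments)])
# → ['Leather Sofa', 'floor lamp']
```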
Grounding DINO, an open-vocabulary object detector, then takes the per-object descriptions and searches every frame for bounding boxes. Running Qwen3-VL on every frame would take over 20 minutes. Grounding DINO does it in under a minute.
Grounding DINO casts a wide net, so not every detection is correct. Each one is cropped from its bounding box, still linked to the original frame, and sent back to Qwen3-VL with a strict yes/no prompt to filter out false positives.
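The crop step itself is plain array slicing. A minimal sketch, assuming an H×W×3 frame and pixel-space boxes; the small context margin is an illustrative choice, not the pipeline's exact padding:

```python
import numpy as np

def crop_detection(frame, box, pad=0.1):
    """Crop a detection from its frame with a small context margin.

    `frame` is an H x W x 3 array; `box` is (x0, y0, x1, y1) in pixels.
    The 10% padding fraction is an assumption.
    """
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = box
    dx, dy = pad * (x1 - x0), pad * (y1 - y0)
    x0, x1 = max(0, int(x0 - dx)), min(w, int(x1 + dx))
    y0, y1 = max(0, int(y0 - dy)), min(h, int(y1 + dy))
    return frame[y0:y1, x0:x1]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
crop = crop_detection(frame, (100, 50, 300, 250))
print(crop.shape)  # (240, 240, 3)
```

Each crop then goes to the VL model with a strict yes/no prompt; anything answered "no" is dropped before the embedding stage.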
Qwen3-VL-Embedding generates two vectors per detection, one from the crop and one from the full frame. The frame embedding captures surrounding context, so two armchairs that always appear near the same bookshelf score higher than two in different parts of the room.
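Combining the two embeddings into one similarity score can be sketched as a weighted sum of cosine similarities. The 70/30 weighting is an illustrative choice, not the pipeline's tuned value:

```python
import numpy as np

def pairwise_similarity(crop_emb, frame_emb, w_crop=0.7):
    """Blend object-appearance and scene-context similarity.

    Both inputs are L2-normalized (N, D) arrays, so a dot product is
    cosine similarity. The weighting is an assumption.
    """
    sim_crop = crop_emb @ crop_emb.T     # what the object looks like
    sim_frame = frame_emb @ frame_emb.T  # what surrounds it in the frame
    return w_crop * sim_crop + (1 - w_crop) * sim_frame

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
crops = l2_normalize(rng.normal(size=(3, 8)))
frames = l2_normalize(rng.normal(size=(3, 8)))
sim = pairwise_similarity(crops, frames)
print(np.allclose(np.diag(sim), 1.0))  # → True
```

The frame term is what lets two detections of the same armchair, seen beside the same bookshelf, outscore two visually similar armchairs in different corners of the room.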
Agglomerative clustering groups those embeddings by cosine similarity to form object instances. Close-scoring pairs get a Qwen3-VL confirmation before merging. Three separate VL queries per detection then determine the main object name, sub-objects, and supports. Only labels appearing in at least half the frames survive the vote.
SAM3, a text-prompted segmentation model, takes each confirmed detection and generates pixel-level masks. The area outside the bounding box is blacked out to focus on the target. Separate masks for the main object, sub-objects, and supports are generated and combined into one mask per detection.
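The blacking-out and mask-merging steps reduce to array operations. A minimal sketch; the function and argument names are illustrative, not SAM3's actual API:

```python
import numpy as np

def focus_and_merge(image, box, main_mask, part_masks):
    """Black out everything outside the detection box, then union the
    per-part masks into one mask.

    `image` is H x W x 3, `box` is (x0, y0, x1, y1) in pixels, and the
    masks are boolean H x W arrays (main object, sub-objects, supports).
    """
    x0, y0, x1, y1 = box
    focused = np.zeros_like(image)
    focused[y0:y1, x0:x1] = image[y0:y1, x0:x1]  # the input the segmenter sees

    merged = main_mask.copy()
    for m in part_masks:  # sub-object and support masks
        merged |= m
    return focused, merged
```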
Using COLMAP camera positions, each camera's mask is projected onto the existing point cloud. Masks that are fragmented or clipped by image edges are filtered out first. A point is kept if enough cameras agree it belongs to the object: at least 50% of the cameras that see it, with a minimum of three. A coarse pass on sparse points defines a bounding box per object, then a fine pass votes on the dense points within it.
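Once visibility and mask membership are known per point and camera, the voting rule is a few lines of numpy. The array names are illustrative; the 50%/minimum-of-three thresholds come from the rule above:

```python
import numpy as np

def vote_points(sees, in_mask, min_ratio=0.5, min_cameras=3):
    """Keep a point if enough of the cameras that see it agree it is the object.

    `sees[p, c]` is True when camera c observes point p; `in_mask[p, c]` is
    True when point p projects inside camera c's object mask.
    """
    votes = (sees & in_mask).sum(axis=1)  # cameras voting "object"
    observers = sees.sum(axis=1)          # cameras that see the point at all
    required = np.maximum(min_cameras, np.ceil(min_ratio * observers))
    return votes >= required
```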
The raw point selections are cleaned up. Objects of the same type that overlap spatially are merged. DBSCAN filters out stray points that survived the mask voting, keeping the cluster closest to the detection cameras. Statistical outlier removal cleans up the remaining geometry, and bounding boxes are recomputed for the final output.
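The stray-point filter can be sketched as DBSCAN followed by picking the cluster nearest the detection cameras. The `eps`/`min_samples` values are illustrative and assume coordinates in meters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def keep_nearest_cluster(points, camera_centers, eps=0.1, min_samples=10):
    """Cluster an object's points and keep only the cluster whose centroid
    is closest to the mean position of the cameras that detected it.

    `points` is (P, 3), `camera_centers` is (C, 3). Parameter values are
    assumptions, not the pipeline's tuned settings.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    cam_mean = camera_centers.mean(axis=0)
    best, best_dist = None, np.inf
    for cluster_id in set(labels) - {-1}:  # -1 is DBSCAN's noise label
        centroid = points[labels == cluster_id].mean(axis=0)
        d = np.linalg.norm(centroid - cam_mean)
        if d < best_dist:
            best, best_dist = cluster_id, d
    return points[labels == best]
```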
A living room processed by the pipeline. Every object has been identified, segmented, and labeled with bounding boxes computed in 3D space. Drag to orbit, scroll to zoom, right-click to pan.
Every object in the scan gets its own entry. What started as millions of unlabeled points becomes structured, searchable, and individually segmented.