Automated scan understanding for film production. Millions of unstructured points go in; organized and individually segmented objects come out.
The pipeline starts from a dense COLMAP reconstruction built from 283 overlapping photographs. The interactive viewer below shows a reduced 800,000-point sample of the reconstruction alongside every camera position and its source image. This is the raw data before any detection, segmentation, or labeling takes place.
Film production teams capture thousands of volumetric scans. Each is a dense, unstructured point cloud with whatever label the operator typed at capture. No object boundaries, no metadata. The pipeline handles rooms, people, and single objects. Rooms and environments are the hardest case: dozens of overlapping objects seen from changing camera angles, all merged into one scan.
My thesis (SegSplat) showed that a pretrained SONATA encoder with a trainable decoder could generalize from RGB-D scans to Gaussian splats. But production needed open-vocabulary labels beyond fixed dataset categories and rich metadata to make assets searchable, and SONATA's training data required a commercial license from the dataset owners. 2D vision-language models solved all three problems.
These captures are not ordered video sequences. Someone walks through a room taking shots from different positions, so there is no frame-to-frame continuity, and traditional sequential roto tools do not work. The pipeline has to find every instance of each object across all views and figure out which detections are the same physical thing. It runs on a single GPU, so models are loaded and unloaded between stages to fit within VRAM. Qwen3-VL stays loaded through detection, verification, and labeling, while Grounding DINO, the embedding model, and SAM3 each load only when needed.
Qwen3-VL, the vision-language model at the core of this system, starts by looking at up to 100 evenly sampled frames and describing what it sees. Room type, style, notable objects, sub-rooms. An open floor plan gets broken down into its kitchen, living, and dining areas. This produces a searchable scene summary that travels with the scan.
Qwen3-VL then goes through the capture in segments of 30 frames, sampling every 10th frame, and lists every object it can find. It focuses on furniture, appliances, decor, plants, lighting, and textiles, ignoring things like objects seen through windows and film equipment. Each segment can identify up to 25 unique objects, each with a simple label and a detailed description. Duplicates across segments are removed automatically.
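The cross-segment deduplication can be sketched as a simple fuzzy-label merge. This is an illustrative version, not the pipeline's actual schema or matching rule: the `label`/`description` keys and the similarity threshold are assumptions.

```python
from difflib import SequenceMatcher

def dedupe_objects(segments, sim_threshold=0.85):
    """Merge per-segment object lists, dropping near-duplicate labels.

    `segments` is a list of per-segment lists of dicts with "label" and
    "description" keys (illustrative names, not the real schema).
    """
    unique = []
    for seg in segments:
        for obj in seg:
            label = obj["label"].strip().lower()
            if any(SequenceMatcher(None, label, u["label"].strip().lower()).ratio()
                   >= sim_threshold for u in unique):
                continue  # near-duplicate of an object from an earlier segment
            unique.append(obj)
    return unique

segments = [
    [{"label": "Leather Sofa", "description": "brown three-seater"}],
    [{"label": "leather sofa", "description": "brown couch"},
     {"label": "floor lamp", "description": "brass arc lamp"}],
]
print([o["label"] for o in dedupe_objects(segments)])
# → ['Leather Sofa', 'floor lamp']
```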
Grounding DINO, an open-vocabulary object detector, then takes the per-object descriptions and searches every frame for bounding boxes. Running Qwen3-VL on every frame would take over 20 minutes. Grounding DINO does it in under a minute.
Grounding DINO casts a wide net, so not every detection is correct. Each one is cropped from its bounding box, still linked to the original frame, and sent back to Qwen3-VL with a strict yes/no prompt to filter out false positives.
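The crop step itself is plain array slicing. A minimal sketch, assuming an H×W×3 frame and pixel-space boxes; the small context margin is an illustrative choice, not the pipeline's exact padding:

```python
import numpy as np

def crop_detection(frame, box, pad=0.1):
    """Crop a detection from its frame with a small context margin.

    `frame` is an H x W x 3 array; `box` is (x0, y0, x1, y1) in pixels.
    The 10% padding fraction is an assumption.
    """
    h, w = frame.shape[:2]
    x0, y0, x1, y1 = box
    dx, dy = pad * (x1 - x0), pad * (y1 - y0)
    x0, x1 = max(0, int(x0 - dx)), min(w, int(x1 + dx))
    y0, y1 = max(0, int(y0 - dy)), min(h, int(y1 + dy))
    return frame[y0:y1, x0:x1]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
crop = crop_detection(frame, (100, 50, 300, 250))
print(crop.shape)  # (240, 240, 3)
```

Each crop then goes to the VL model with a strict yes/no prompt; anything answered "no" is dropped before the embedding stage.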
Qwen3-VL-Embedding generates two vectors per detection, one from the crop and one from the full frame. The frame embedding captures surrounding context, so two armchairs that always appear near the same bookshelf score higher than two in different parts of the room.
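Combining the two embeddings into one similarity score can be sketched as a weighted sum of cosine similarities. The 70/30 weighting is an illustrative choice, not the pipeline's tuned value:

```python
import numpy as np

def pairwise_similarity(crop_emb, frame_emb, w_crop=0.7):
    """Blend object-appearance and scene-context similarity.

    Both inputs are L2-normalized (N, D) arrays, so a dot product is
    cosine similarity. The weighting is an assumption.
    """
    sim_crop = crop_emb @ crop_emb.T     # what the object looks like
    sim_frame = frame_emb @ frame_emb.T  # what surrounds it in the frame
    return w_crop * sim_crop + (1 - w_crop) * sim_frame

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
crops = l2_normalize(rng.normal(size=(3, 8)))
frames = l2_normalize(rng.normal(size=(3, 8)))
sim = pairwise_similarity(crops, frames)
print(np.allclose(np.diag(sim), 1.0))  # → True
```

The frame term is what lets two detections of the same armchair, seen beside the same bookshelf, outscore two visually similar armchairs in different corners of the room.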
Agglomerative clustering groups those embeddings by cosine similarity to form object instances. Close-scoring pairs get a Qwen3-VL confirmation before merging. Three separate VL queries per detection then determine the main object name, sub-objects, and supports. Only labels appearing in at least half the frames survive the vote.
SAM3, a text-prompted segmentation model, takes each confirmed detection and generates pixel-level masks. The area outside the bounding box is blacked out to focus on the target. Separate masks for the main object, sub-objects, and supports are generated and combined into one mask per detection.
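The blacking-out and mask-merging steps reduce to array operations. A minimal sketch; the function and argument names are illustrative, not SAM3's actual API:

```python
import numpy as np

def focus_and_merge(image, box, main_mask, part_masks):
    """Black out everything outside the detection box, then union the
    per-part masks into one mask.

    `image` is H x W x 3, `box` is (x0, y0, x1, y1) in pixels, and the
    masks are boolean H x W arrays (main object, sub-objects, supports).
    """
    x0, y0, x1, y1 = box
    focused = np.zeros_like(image)
    focused[y0:y1, x0:x1] = image[y0:y1, x0:x1]  # the input the segmenter sees

    merged = main_mask.copy()
    for m in part_masks:  # sub-object and support masks
        merged |= m
    return focused, merged
```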
Using COLMAP camera positions, each camera's mask is projected onto the existing point cloud. Masks that are fragmented or clipped by image edges are filtered out first. A point is kept if enough cameras agree it belongs to the object: at least 50% of the cameras that see it, with a minimum of three. A coarse pass on sparse points defines a bounding box per object, then a fine pass votes on the dense points within it.
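Once visibility and mask membership are known per point and camera, the voting rule is a few lines of numpy. The array names are illustrative; the 50%/minimum-of-three thresholds come from the rule above:

```python
import numpy as np

def vote_points(sees, in_mask, min_ratio=0.5, min_cameras=3):
    """Keep a point if enough of the cameras that see it agree it is the object.

    `sees[p, c]` is True when camera c observes point p; `in_mask[p, c]` is
    True when point p projects inside camera c's object mask.
    """
    votes = (sees & in_mask).sum(axis=1)  # cameras voting "object"
    observers = sees.sum(axis=1)          # cameras that see the point at all
    required = np.maximum(min_cameras, np.ceil(min_ratio * observers))
    return votes >= required
```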
The raw point selections are cleaned up. Objects of the same type that overlap spatially are merged. DBSCAN filters out stray points that survived the mask voting, keeping the cluster closest to the detection cameras. Statistical outlier removal cleans up the remaining geometry, and bounding boxes are recomputed for the final output.
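The stray-point filter can be sketched as DBSCAN followed by picking the cluster nearest the detection cameras. The `eps`/`min_samples` values are illustrative and assume coordinates in meters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def keep_nearest_cluster(points, camera_centers, eps=0.1, min_samples=10):
    """Cluster an object's points and keep only the cluster whose centroid
    is closest to the mean position of the cameras that detected it.

    `points` is (P, 3), `camera_centers` is (C, 3). Parameter values are
    assumptions, not the pipeline's tuned settings.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    cam_mean = camera_centers.mean(axis=0)
    best, best_dist = None, np.inf
    for cluster_id in set(labels) - {-1}:  # -1 is DBSCAN's noise label
        centroid = points[labels == cluster_id].mean(axis=0)
        d = np.linalg.norm(centroid - cam_mean)
        if d < best_dist:
            best, best_dist = cluster_id, d
    return points[labels == best]
```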
A living room processed by the pipeline. Every object has been identified, segmented, and labeled with bounding boxes computed in 3D space. Drag to orbit, scroll to zoom, right-click to pan.
Every object in the scan gets its own entry. What started as millions of unlabeled points becomes structured, searchable, and individually segmented.