3D instance segmentation for Gaussian splats. Trained on RGB-D scans, applied to photogrammetric captures with zero fine-tuning.
Neither scene below was included in training. The models learned from RGB-D sensor scans and were applied to these photogrammetric Gaussian splat captures without any fine-tuning.
A living room scan where each object is identified and assigned a distinct colour.
A single chair extracted from the scene and exported as an independent Gaussian splat with all original visual properties preserved.
SegSplat labels every point in a Gaussian splat with a category and an instance ID, so "chair #2" can be extracted, edited, or replaced independently. The key result is cross-domain transfer: with a frozen Sonata encoder from Meta and a PointGroup decoder trained under heavy augmentation, weights trained only on clean RGB-D scans segment noisy photogrammetric splats with zero fine-tuning.
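Once every point carries an instance ID, pulling out a single object reduces to a filter over the per-point attributes. The sketch below is illustrative only: the attribute names (`position`, `opacity`, `scale`), the dictionary-of-lists layout, and the `extract_instance` helper are assumptions made for the example, not SegSplat's actual export format or API.

```python
from typing import Dict, List

# Hypothetical splat layout: one list per attribute, one entry per Gaussian.
Splat = Dict[str, list]


def extract_instance(splat: Splat, instance_ids: List[int], target_id: int) -> Splat:
    """Return a new splat holding only the Gaussians labelled target_id."""
    if any(len(values) != len(instance_ids) for values in splat.values()):
        raise ValueError("every attribute needs exactly one entry per Gaussian")
    # Indices of the Gaussians that belong to the requested instance.
    keep = [i for i, inst in enumerate(instance_ids) if inst == target_id]
    # Copy every attribute unchanged, so visual properties are preserved.
    return {name: [values[i] for i in keep] for name, values in splat.items()}


# Toy scene: two Gaussians belong to instance 2, one to instance 5.
scene = {
    "position": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "opacity": [0.9, 0.8, 0.7],
    "scale": [(0.01, 0.01, 0.01), (0.02, 0.02, 0.02), (0.03, 0.03, 0.03)],
}
instance_ids = [2, 2, 5]

chair = extract_instance(scene, instance_ids, target_id=2)
print(chair["opacity"])  # [0.9, 0.8]
```

Because the filter copies attributes rather than recomputing them, the extracted object keeps the appearance it had in the full scene, and the original scene is left untouched.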
The architecture pairs Meta's Sonata encoder with a PointGroup decoder. Sonata is a self-supervised transformer pre-trained on 140,000 scenes across 32 GPUs. It stays frozen here, so only 13% of the model is trainable and the whole thing fits on a single A100 in Colab. PointGroup then predicts a category for each point and an offset toward its parent object's centre. Points of the same category whose shifted positions land close together are clustered into one instance.
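The grouping step can be illustrated with a small, self-contained sketch. It assumes the per-point categories and offsets have already been predicted, shifts each point by its offset, and then grows clusters over same-category points that lie within a radius of each other. The function name `cluster_instances`, the `radius` parameter, and the brute-force neighbour search are choices made for this example; it is a simplified stand-in for PointGroup-style clustering, not the project's implementation, and it omits the later stages of the original pipeline.

```python
from collections import deque
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def cluster_instances(
    points: Sequence[Point],
    offsets: Sequence[Point],
    categories: Sequence[str],
    radius: float,
) -> List[int]:
    """Assign an instance ID to every point by clustering offset-shifted positions."""
    # Move each point toward its predicted object centre.
    shifted = [tuple(p + o for p, o in zip(pt, off)) for pt, off in zip(points, offsets)]
    instance_ids = [-1] * len(points)
    next_id = 0
    for seed in range(len(points)):
        if instance_ids[seed] != -1:
            continue  # already absorbed into an earlier cluster
        instance_ids[seed] = next_id
        queue = deque([seed])
        # Breadth-first growth: O(n^2) brute force, fine for a toy example.
        while queue:
            current = queue.popleft()
            for other in range(len(points)):
                if (
                    instance_ids[other] == -1
                    and categories[other] == categories[current]
                    and dist(shifted[current], shifted[other]) <= radius
                ):
                    instance_ids[other] = next_id
                    queue.append(other)
        next_id += 1
    return instance_ids


# Two chairs (two points each) and one table point sitting between them.
points = [(0.0, 0.0, 0.0), (0.4, 0.0, 0.0), (1.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.2, 0.0, 0.0)]
offsets = [(0.2, 0.0, 0.0), (-0.2, 0.0, 0.0), (0.2, 0.0, 0.0), (-0.2, 0.0, 0.0), (0.0, 0.0, 0.0)]
categories = ["chair", "chair", "chair", "chair", "table"]

print(cluster_instances(points, offsets, categories, radius=0.05))  # [0, 0, 1, 1, 2]
```

With all offsets set to zero and the same radius, every point in this toy scene ends up in its own cluster, which is why the offset prediction matters: it pulls points belonging to one object together before grouping. The `radius` value is the knob that has to match how tightly the shifted points actually concentrate.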
Sonata's transformer features need a much wider clustering radius than PointGroup's original U-Net features, and heavy augmentation was what let the frozen encoder generalise from clean RGB-D scans to the noisy distributions of Gaussian splats.
No Gaussian splat dataset comes with object-level instance labels, so SegSplat trains on ScanNet's clean RGB-D scans and transfers to the noisy, uneven point distributions of photogrammetric captures. Freestanding objects segment cleanly, while objects that touch or overlap (such as chairs tucked under tables) are the hard cases.