Computer Vision Engineer / Sr. 3D Generalist
ML Programming Project · Bournemouth NCCA · 2024

Video Segmentation

A two-model pipeline that segments humans from video and outputs production-ready EXR files for compositing in Nuke.

Results

Output

Training vs validation predictions.

Training and validation results

Original frame, SegNet mask (green), and YOLO bounding box (red).

Mask overlay comparison
About

What It Does

The pipeline separates humans from the background frame by frame. YOLOv8 locates each person with a bounding box, and a custom SegNet then produces a per-pixel mask inside that box.
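The two-stage flow can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function name `segment_in_boxes`, the `(x1, y1, x2, y2)` box format, and the `segment_crop` callable (a stand-in for the trained SegNet) are all assumptions, and the detector is assumed to have already produced the boxes.

```python
import numpy as np


def segment_in_boxes(frame, boxes, segment_crop):
    """Run a segmenter only inside detected person boxes.

    frame: H x W x 3 array.
    boxes: iterable of (x1, y1, x2, y2) pixel coordinates, x2/y2 exclusive
        (assumed format; a real detector may need converting).
    segment_crop: hypothetical callable mapping an h x w x 3 crop to an
        h x w array of foreground probabilities in [0, 1].
    """
    height, width = frame.shape[:2]
    mask = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        # Clamp to the frame so partially off-screen detections are safe.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        if x2 <= x1 or y2 <= y1:
            continue  # box lies entirely outside the frame
        crop_mask = segment_crop(frame[y1:y2, x1:x2])
        # Where boxes overlap, keep the strongest prediction per pixel.
        mask[y1:y2, x1:x2] = np.maximum(mask[y1:y2, x1:x2], crop_mask)
    return mask
```

Pixels outside every box stay at zero, so the segmenter never gets the chance to predict there.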

SegNet is an encoder-decoder architecture designed for pixel-level segmentation. The encoder compresses the image into a compact feature representation. The decoder maps those features back to full resolution, classifying each pixel as foreground or background. Both the segmentation mask and the bounding box are written into separate channels of a multichannel EXR file, ready for compositing.
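As a rough sketch of how the two outputs might be laid out before writing, the snippet below rasterises the boxes into a matte and pairs it with the mask under separate channel names. The names `person_mask.A` and `person_bbox.A` and the helper `build_exr_channels` are illustrative assumptions, not the project's actual layout. The file write itself is left out, since it depends on the EXR bindings in use.

```python
import numpy as np


def build_exr_channels(mask, boxes):
    """Return a dict of channel name -> float32 image for one frame.

    mask: H x W array of foreground values in [0, 1].
    boxes: iterable of (x1, y1, x2, y2) pixel coordinates, x2/y2 exclusive
        (assumed format).
    """
    height, width = mask.shape
    box_matte = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        # Clamp to the frame and skip boxes that fall fully outside it.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(width, int(x2)), min(height, int(y2))
        if x2 <= x1 or y2 <= y1:
            continue
        box_matte[y1:y2, x1:x2] = 1.0
    # Hypothetical channel names; "layer.channel" naming lets Nuke show
    # each one as its own layer.
    return {
        "person_mask.A": np.asarray(mask, dtype=np.float32),
        "person_bbox.A": box_matte,
    }
```

Keeping the box in its own channel means a compositor can use it or ignore it without touching the mask.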

Approach

Key Decisions

U-Net was tested first but produced edges too rough for production use. SegNet improved mask quality but sometimes predicted foreground in empty areas of the frame. Adding YOLOv8 to constrain where segmentation runs solved this, since pixels outside a detected person's box are never segmented. The bounding box also doubles as a garbage mask for compositing.

Both models were trained from scratch on the Supervisely Person Segmentation dataset.

Built With
SegNet, YOLOv8, PyTorch, OpenEXR, Nuke, Python