A two-model pipeline that segments humans from video and outputs production-ready EXR files for compositing in Nuke.
Training vs validation predictions.
Original frame, SegNet mask (green), and YOLO bounding box (red).
The pipeline separates humans from the background frame by frame. YOLOv8 locates each person with a bounding box, and a custom SegNet then produces a per-pixel mask inside that box.
SegNet is an encoder-decoder architecture designed for pixel-level segmentation. The encoder compresses the image into a compact feature representation. The decoder maps those features back to full resolution, classifying each pixel as foreground or background. Both the segmentation mask and the bounding box are written into separate channels of a multichannel EXR file, ready for compositing.
U-Net was tested first but produced edges too rough for production use. SegNet improved mask quality but would sometimes predict in empty areas of the frame. Adding YOLOv8 to constrain where segmentation runs solved this. The bounding box also doubles as a garbage mask for compositing.
Both models were trained from scratch on the Supervisely Person Segmentation dataset.