Video object segmentation (VOS) is a computer vision task that aims to identify the pixels belonging to an object of interest throughout a video sequence. This thesis explores different curriculum learning strategies for a deep neural network trained to solve this task.

Curriculum learning is a methodology in which the training data are not presented to the model in random order but are instead organized in a meaningful way: simple concepts are presented first, and the examples gradually become more complex. Four different curriculum strategies are explored: schedule sampling, frame skipping, variations of the temporal and spatial recurrence, and loss penalization by the object’s area.

This work focuses on RVOS, a recurrent neural architecture originally tested on the DAVIS and YouTube-VOS datasets for one-shot video object segmentation, and applies it to the car class of the KITTI-MOTS dataset. Although this architecture is a fast solution for the VOS task, the model struggles with the KITTI-MOTS dataset, whose videos are more crowded and challenging.

For the schedule sampling curriculum, both the classic and inverse implementations are evaluated. The results show that the inverse schedule sampling strategies, rather than the classic (forward) approach, improve the model’s performance. The different frame skipping schemes are also beneficial, but only when training with the ground-truth masks instead of the predicted ones. Lastly, the curricula that vary the temporal and spatial recurrence and the one that penalizes the loss by the object’s area both resulted in poor model performance.
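
As an illustration only, the choice between ground-truth and predicted masks under a schedule sampling curriculum can be sketched as follows. The function names, the linear shape of the schedule, and the reading of the inverse variant as the classic schedule run in the opposite direction are all assumptions made for this sketch; they are not taken from the RVOS code or from the schedules actually used in this work.

```python
import random


def ground_truth_probability(epoch, total_epochs, inverse=False):
    """Hypothetical linear schedule (an assumption, not the thesis's schedule).

    Returns the probability of feeding the ground-truth mask, rather than
    the model's own prediction, to the recurrent step for the next frame.
    """
    if total_epochs < 2:
        raise ValueError("total_epochs must be at least 2")
    progress = epoch / (total_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    p_classic = 1.0 - progress             # classic: start with ground truth, end with predictions
    # Assumed inverse variant: the same schedule run in the opposite direction.
    return 1.0 - p_classic if inverse else p_classic


def pick_previous_mask(gt_mask, predicted_mask, p_ground_truth, rng=random):
    """Sample which mask is passed on to the next frame."""
    return gt_mask if rng.random() < p_ground_truth else predicted_mask
```

In a training loop, `ground_truth_probability` would be evaluated once per epoch and `pick_previous_mask` once per frame, so that the mix of ground-truth and predicted masks shifts gradually over training.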

These results show that curriculum learning strategies can greatly affect the performance of recurrent neural networks. Moreover, the results for the inverse schedule sampling and frame skipping strategies invite further exploration of these schemes to exploit their benefits.