Image and Video Object Segmentation in Low Supervision Scenarios

Bellver M. Image and Video Object Segmentation in Low Supervision Scenarios. Torres J, Giró-i-Nieto X. Computer Architectures. [Barcelona]: Universitat Politècnica de Catalunya; 2021.

Abstract

Image and video segmentation are central tasks in computer vision. However, deep learning solutions for segmentation typically rely on pixel-level annotations, which are very costly to collect. Likewise, some segmentation systems require human interaction at inference time, which demands effort from the end user. In this thesis, we examine diverse supervision scenarios for image and video object segmentation. We distinguish between supervision when learning the model, i.e., which type of annotation is used during training, and supervision at inference, namely which kind of human input is required when running the system. Our target is models that require low levels of supervision.

In the first part of the thesis, we present a novel recurrent architecture for video object segmentation that is end-to-end trainable in a fully supervised setup and that does not require any post-processing step, i.e., the output of the model directly solves the addressed task. The second part of the thesis aims at lowering the annotation cost, in terms of labeling time, needed to train image segmentation models. We explore semi-supervised pipelines and show results when only a very limited annotation budget is available. The third part of the dissertation attempts to reduce the supervision required by semi-automatic systems at inference time. In particular, we focus on semi-supervised video object segmentation, which typically requires generating a binary mask for each instance to be tracked. In contrast, we present a model for language-guided video object segmentation, which identifies the object to segment from a natural language expression. We study current benchmarks, propose a novel categorization of referring expressions for video, and identify the main challenges posed by the video task.

Evaluation committee: Zeynep Akata (University of Tübingen), Francesc Moreno-Noguer (UPC IRI-CSIC) and Yannis Kalantidis (Naver Labs Europe).

Report on tdx.cat and GDrive
Video on YouTube.
Tweet 1 and Tweet 2 by @DocXavi.

Projects

Language and Vision

Image Processing Group
