Herrera-Palacio A, Ventura C, Giró-i-Nieto X. Video Object Linguistic Grounding. In ACM Multimedia Workshop on Multimodal Understanding and Learning for Embodied Applications (MULEA). Nice, France: ACM; In Press.

Abstract

The goal of this work is to segment, in a video sequence, the objects that are mentioned in a linguistic description of the scene. We have adapted an existing deep neural network that achieves state-of-the-art performance in semi-supervised video object segmentation by adding a linguistic branch that generates an attention map over the video frames, making the segmentation of the objects temporally consistent along the sequence.
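
To make the idea of a language-driven attention map concrete, the sketch below shows one generic way such a map could be computed. It is an illustration under stated assumptions, not the architecture used in this work: the abstract does not specify how the linguistic branch is built, so the dot-product scoring, the softmax over spatial locations, and the names `attention_map` and `apply_attention` are all hypothetical.

```python
import math


def attention_map(frame_features, text_embedding):
    """Return an H x W map of weights that sum to 1 for one frame.

    frame_features: H x W grid (list of rows) of C-dimensional feature vectors.
    text_embedding: C-dimensional vector encoding the linguistic description.
    Scoring by dot product is an assumption made for this sketch.
    """
    scores = [
        [sum(f * t for f, t in zip(cell, text_embedding)) for cell in row]
        for row in frame_features
    ]
    # Subtract the largest score before exponentiating, for numerical stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def apply_attention(frame_features, attention):
    """Scale each location's feature vector by its attention weight."""
    return [
        [[weight * value for value in cell]
         for cell, weight in zip(feature_row, weight_row)]
        for feature_row, weight_row in zip(frame_features, attention)
    ]


# Toy 2 x 2 frame with 2-dimensional features (made-up values).
frame = [[[1.0, 0.0], [0.0, 1.0]],
         [[0.0, 0.0], [3.0, 0.0]]]
text = [1.0, 0.0]

attn = attention_map(frame, text)
weighted = apply_attention(frame, attn)
```

In this toy setting the location whose features align best with the text vector receives the largest weight. Reusing the same text embedding for every frame of a sequence is one plausible way for such a map to keep attention on the same objects over time, though whether the model described here works this way is an assumption.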