Herrera-Palacio A, Ventura C, Giró-i-Nieto X. Video Object Linguistic Grounding. In ACM Multimedia Workshop on Multimodal Understanding and Learning for Embodied Applications (MULEA). Nice, France: ACM; 2019.


The goal of this work is to segment, in a video sequence, the objects mentioned in a linguistic description of the scene. We adapted an existing deep neural network that achieves state-of-the-art performance in semi-supervised video object segmentation by adding a linguistic branch that generates an attention map over the video frames, making the segmentation of the objects temporally consistent along the sequence.
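
To illustrate what an attention map over video frames can look like, here is a minimal sketch, not the network described in the paper. It assumes that each frame has already been encoded as a grid of feature vectors and that the description has been encoded as a single vector of the same dimension. The function name `attention_map`, the toy `video`, and the `query` vector are all hypothetical, chosen only for this example. The sketch scores every spatial location against the sentence vector and normalizes the scores with a softmax.

```python
import math

def attention_map(frame, query):
    """Score each spatial cell of one frame against a sentence vector.

    frame: H x W grid (nested lists) of C-dimensional feature vectors.
    query: C-dimensional sentence vector (hypothetical embedding).
    Returns an H x W grid of non-negative weights that sum to 1.
    """
    # Dot product between each cell's features and the sentence vector.
    scores = [[sum(f * q for f, q in zip(cell, query)) for cell in row]
              for row in frame]
    # Softmax over all spatial positions, shifted by the max for stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

# Toy video (assumed data): 2 frames, each a 2x2 grid of 2-dim features.
video = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.5, 0.5]]],
    [[[0.0, 1.0], [1.0, 0.0]], [[0.5, 0.5], [0.0, 0.0]]],
]
query = [2.0, 0.0]  # hypothetical sentence embedding

# The same query is applied to every frame of the sequence.
maps = [attention_map(frame, query) for frame in video]
```

Because one sentence vector is reused for every frame, the highest weight follows the matching feature as it moves between frames. In a real model the frame features and the sentence vector would come from learned encoders, and the map would feed the segmentation decoder.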




Xavier Giro-i-Nieto and Amanda Duarte in ACM Multimedia 2019