Video Object Linguistic Grounding. In ACM Multimedia Workshop on Multimodal Understanding and Learning for Embodied Applications (MULEA). Nice, France: ACM; 2019.
Abstract
The goal of this work is to segment, in a video sequence, the objects mentioned in a linguistic description of the scene. We adapted an existing deep neural network that achieves state-of-the-art performance in semi-supervised video object segmentation, adding a linguistic branch that generates an attention map over the video frames and makes the segmentation of the objects temporally consistent along the sequence.
- Paper in ACM Digital Library and UPCommons.
- ACM Multimedia 2019 Workshop on Multimodal Understanding and Learning for Embodied Applications
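As a rough illustration of the linguistic-branch idea described in the abstract, the sketch below shows one way a referring expression could be encoded and turned into a spatial attention map that modulates per-frame backbone features. This is not the authors' released code; the module names, dimensions, encoder choice (LSTM), and dot-product fusion scheme are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's implementation) of a linguistic
# branch that produces a spatial attention map over segmentation features.
import torch
import torch.nn as nn


class LinguisticAttentionBranch(nn.Module):
    """Encodes a tokenized referring expression and returns per-frame
    visual features modulated by a language-driven attention map."""

    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256, feat_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.project = nn.Linear(hidden_dim, feat_dim)

    def forward(self, tokens, visual_feats):
        # tokens: (B, T) word indices; visual_feats: (B, C, H, W) backbone features
        emb = self.embedding(tokens)                  # (B, T, E)
        _, (h, _) = self.encoder(emb)                 # h: (1, B, hidden)
        lang = self.project(h.squeeze(0))             # (B, C) sentence vector
        # Similarity between the sentence vector and every spatial location
        attn = torch.einsum("bc,bchw->bhw", lang, visual_feats)
        attn = torch.sigmoid(attn).unsqueeze(1)       # (B, 1, H, W) attention map
        return visual_feats * attn                    # language-modulated features


if __name__ == "__main__":
    branch = LinguisticAttentionBranch()
    tokens = torch.randint(0, 10000, (2, 8))          # two dummy phrases, 8 tokens each
    feats = torch.randn(2, 256, 32, 32)               # dummy per-frame backbone features
    print(branch(tokens, feats).shape)                # torch.Size([2, 256, 32, 32])
```

Applying the same sentence vector to the features of every frame is one simple way such a branch could encourage temporally consistent segmentations, since all frames are attended with respect to the same linguistic query.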