Oriol B, Luque J, Diego F, Giró-i-Nieto X. Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos. In CVPR 2020 Workshop on Egocentric Perception, Interaction and Computing. Seattle, WA, USA: arXiv; 2020.