Cross-modal Deep Learning between Vision, Language, Audio and Speech

Type: European
Start: Oct 2018
End: Sep 2020
Responsible: Xavier Giro-i-Nieto
URL: Doctorate INPhINIT Fellowships

INPhINIT "la Caixa" Fellowship Programme 2018


Deep neural networks have driven the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language, vision, audio and speech. Image captioning, lip reading and video sonorization are some of the first applications of a new and exciting field of research that exploits the generalisation properties of deep learning. This project aims at exploring deep neural network architectures capable of projecting any multimedia signal into a joint embedding space based on its shared semantics. This way, an image of a dog would be projected to a representation similar to that of an audio snippet of a dog barking, the natural language text “a dog barking”, or a speech excerpt of a human reading this sentence aloud.
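As a rough illustration of the joint embedding idea (a toy sketch, not the project's actual architecture), two hypothetical per-modality "encoders" can project feature vectors of different dimensionalities into one shared space, where semantic closeness is measured with cosine similarity. All names, dimensions and the linear encoders below are illustrative assumptions:

```python
import math
import random

random.seed(0)

EMBED_DIM = 4  # dimensionality of the shared embedding space (assumed)

def linear_encoder(in_dim, out_dim):
    """Return a toy linear projection with small random weights."""
    w = [[random.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    def encode(x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return encode

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical input feature sizes for each modality.
image_encoder = linear_encoder(8, EMBED_DIM)  # e.g. visual features
audio_encoder = linear_encoder(6, EMBED_DIM)  # e.g. spectrogram features

image_feat = [random.random() for _ in range(8)]
audio_feat = [random.random() for _ in range(6)]

z_img = image_encoder(image_feat)
z_aud = audio_encoder(audio_feat)

print(cosine(z_img, z_aud))  # similarity of the two signals in the joint space
```

In a real system the encoders would be deep networks trained so that semantically matching pairs (the dog photo and the barking sound) end up close in this space.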

The joint multimedia embedding space will be used to address two applications that share a common machine-learning core:

a) Cross-modal retrieval: 

Given a data sample characterized by one or more modalities, retrieve the most similar sample in another modality from a catalogue. This would be the case, for example, of a visual search problem where a picture of a food dish is matched with the recipe followed to cook it, or vice versa.
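Once every catalogue item lives in the joint space, retrieval reduces to a nearest-neighbour search. A minimal sketch with hand-made toy embeddings (all values and item names below are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_emb, catalogue):
    """Return the id of the catalogue entry closest to the query.

    catalogue: list of (item_id, embedding) pairs, e.g. recipe embeddings.
    """
    return max(catalogue, key=lambda item: cosine(query_emb, item[1]))[0]

# Toy joint-space embeddings: a food-photo query against a recipe catalogue.
recipes = [
    ("paella",   [0.9, 0.1, 0.0]),
    ("omelette", [0.1, 0.8, 0.2]),
    ("gazpacho", [0.0, 0.2, 0.9]),
]
photo_embedding = [0.85, 0.15, 0.05]  # embedding of the dish picture

print(retrieve(photo_embedding, recipes))  # → paella
```

The same function works in the opposite direction (recipe query against photo embeddings), which is what makes the shared space a common core for both directions of the search.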

b) Cross-modal generation: 

Given a data sample characterized by one or more modalities, generate a new signal in another modality that matches it. This would be the case of a lip-reading application, where a video stream of a person speaking is used to generate a text or speech signal matching the spoken words. Another option would be generating an image or sketch based on an oral description of it. Generative models will rely on training schemes such as Variational Autoencoders (VAEs) and/or Generative Adversarial Networks (GANs).
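To make the VAE option concrete, the core sampling step is the reparameterization trick: a latent code is drawn as z = mu + sigma * eps with eps ~ N(0, 1), which keeps sampling differentiable with respect to the encoder outputs. The sketch below is illustrative only; in the project the encoder and decoder would be deep networks conditioned on the source modality:

```python
import math
import random

random.seed(42)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, 1); sigma = exp(0.5 * log_var).
    # This is the standard VAE trick that lets gradients flow through
    # mu and log_var while z remains a random sample.
    return [m + math.exp(0.5 * lv) * random.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

# Toy "encoder" output for a conditioning signal (e.g. a speech embedding).
mu = [0.5, -0.2]
log_var = [-1.0, -1.0]

z = reparameterize(mu, log_var)
print(z)  # latent code a decoder would render into the target modality
```

A GAN-based alternative would instead pit a conditional generator against a discriminator; both schemes produce the new signal from a code tied to the source modality.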