Interest in image synthesis has grown rapidly in recent years. A few years ago, a very powerful tool for this task was introduced: Generative Adversarial Networks (GANs). Since their high performance in generating realistic images has been demonstrated, many researchers are now focusing on cross-modal learning.

Taking advantage of the large amount of information that can be extracted from speech (such as identity, gender, or emotional state), in this work we explore its potential to generate face images of a speaker by conditioning a GAN on his/her voice. We propose the enhancement and evaluation of a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or a one-hot encoding).

This project focuses on the enhancement of a previous model proposed by Francisco Roldan. Following an in-depth analysis of the strengths and weaknesses of the former project, we present a novel dataset collected for this work, consisting of high-quality videos of ten YouTubers with notable expressiveness in both the speech and visual signals. In addition, unlike in the preliminary project, four different evaluation techniques are proposed to assess the results.