Abstract

This thesis explores different deep learning approaches to predicting emotions in videos. Working with videos involves a huge amount of data, including visual frames and acoustic samples. The first step of the project is to extract features that represent each video as a small set of arrays. This is done using pre-trained models based on convolutional neural networks, the state of the art in visual recognition. First, visual features are extracted using 3D convolutions, and acoustic features are extracted using VGG19, a pre-trained convolutional model for images fine-tuned to accept audio inputs. These features are then fed into a recurrent model capable of exploiting the temporal information. Emotions are measured in terms of valence and arousal, with values in the range [-1, 1]. Additionally, the same techniques are used to attempt to predict fear scenes. Consequently, this thesis deals with both regression and classification problems. Several architectures and different parameter settings have been tested in order to achieve the best performance. Finally, the results will be submitted to the MediaEval 2017 Challenge and compared to state-of-the-art solutions.