This bachelor's thesis explores different deep learning techniques to solve the Visual Question-Answering (VQA) task, whose aim is to answer questions about images. We study different Convolutional Neural Networks (CNNs) to extract visual representations from images: Kernelized-CNN (KCNN), VGG-16, and Residual Networks (ResNet). We also analyze the impact of using pre-computed word embeddings trained on large datasets (GloVe embeddings). Moreover, we examine different techniques for joining representations from different modalities. This work was submitted to the second edition of the Visual Question Answering Challenge, where it obtained an accuracy of 43.48%.
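To make the pipeline concrete, the following is a minimal sketch of the multimodal setup the abstract describes: a CNN-derived image feature vector joined with a GloVe-based question encoding. All dimensions, names, and the fusion strategies shown (concatenation and element-wise sum) are illustrative assumptions, not the thesis's exact configuration.

```python
import numpy as np

# Assumed dimensions for illustration: pooled CNN features (2048-d,
# as in ResNet) and GloVe word vectors (300-d). Not the thesis's
# actual hyperparameters.
IMG_DIM, TXT_DIM = 2048, 300

def encode_question(tokens, glove):
    """Encode a question by averaging its pre-trained GloVe vectors."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(TXT_DIM)

def fuse(img_feat, txt_feat, mode="concat"):
    """Join visual and textual representations into one joint vector."""
    if mode == "concat":
        # Simple concatenation of the two modalities
        return np.concatenate([img_feat, txt_feat])
    if mode == "sum":
        # Element-wise sum after a crude dimension match (truncation);
        # a real model would use a learned projection instead
        return img_feat[:TXT_DIM] + txt_feat
    raise ValueError(f"unknown fusion mode: {mode}")

# Toy usage with random data standing in for real CNN/GloVe features
rng = np.random.default_rng(0)
glove = {w: rng.normal(size=TXT_DIM) for w in ("what", "color", "is")}
img = rng.normal(size=IMG_DIM)
q = encode_question(["what", "color", "is", "this"], glove)
joint = fuse(img, q)
print(joint.shape)  # (2348,)
```

In practice the joint vector would feed a classifier over candidate answers; concatenation is the simplest of the joining techniques the thesis compares.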

[Project page]