Visual Question Answering 2.0. 2017.
Abstract
This bachelor's thesis explores different deep learning techniques to solve the Visual Question-Answering (VQA) task, whose aim is to answer questions about images. We study different Convolutional Neural Networks (CNNs) to extract visual representations from images: Kernelized-CNN (KCNN), VGG-16 and Residual Networks (ResNet). We also analyze the impact of using pre-computed word embeddings trained on large datasets (GloVe embeddings). Moreover, we examine different techniques for joining representations from the different modalities. This work was submitted to the second edition of the Visual Question Answering Challenge, where it obtained an accuracy of 43.48%.
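To make the described pipeline concrete, the following is a minimal sketch of a VQA fusion model in Keras: a pre-extracted CNN image feature vector and a GloVe-initialised LSTM question encoder are projected into a joint space and combined before a softmax over candidate answers. All dimensions, the vocabulary size, the number of answer classes, and the element-wise-product fusion are illustrative assumptions, not the thesis' actual configuration.

```python
# Illustrative VQA fusion sketch (assumptions: 2048-d pooled CNN features,
# 300-d GloVe embeddings, answers treated as a classification problem).
import numpy as np
from tensorflow.keras import layers, Model

IMG_DIM = 2048        # size of the pooled CNN feature vector (assumed)
VOCAB_SIZE = 10000    # question vocabulary size (assumed)
EMBED_DIM = 300       # GloVe embedding dimension
MAX_Q_LEN = 20        # questions padded/truncated to this length (assumed)
NUM_ANSWERS = 1000    # most frequent answers used as output classes (assumed)

# Image branch: project the CNN feature vector into a joint space.
image_in = layers.Input(shape=(IMG_DIM,), name="image_features")
image_proj = layers.Dense(512, activation="tanh")(image_in)

# Question branch: GloVe-initialised embedding layer followed by an LSTM encoder.
# Here the embedding matrix is random; in practice it would be loaded from GloVe.
glove_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")
question_in = layers.Input(shape=(MAX_Q_LEN,), dtype="int32", name="question_tokens")
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM,
                            weights=[glove_matrix], trainable=False)(question_in)
question_enc = layers.LSTM(512)(embedded)

# Fusion: element-wise product is one common way of joining the modalities;
# concatenation or bilinear pooling are alternatives.
joint = layers.Multiply()([image_proj, question_enc])
joint = layers.Dense(1024, activation="relu")(joint)
answer = layers.Dense(NUM_ANSWERS, activation="softmax", name="answer")(joint)

model = Model(inputs=[image_in, question_in], outputs=answer)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```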