Open-Ended Visual Question-Answering | Image Processing Group

Masuda-Mora I. Open-Ended Visual Question-Answering. Pascual-deLaPuente S, Giró-i-Nieto X. 2016.


Abstract

Advisors: Santiago Pascual de la Puente and Xavier Giró-i-Nieto

Studies: Bachelor degree in Science and Telecommunication Technologies Engineering at Telecom BCN-ETSETB from the Technical University of Catalonia (UPC)

Grade: A with honors (10/10.0)

This thesis studies methods to solve Visual Question-Answering (VQA) tasks with a Deep Learning framework. As a preliminary step, we explore Long Short-Term Memory (LSTM) networks used in Natural Language Processing (NLP) to tackle text-based Question-Answering. We then modify the previous model to accept an image as an input in addition to the question. For this purpose, we explore the VGG-16 and K-CNN convolutional neural networks to extract visual features from the image. These are merged with the word embedding or with a sentence embedding of the question to predict the answer. This work was successfully submitted to the Visual Question Answering Challenge 2016, where it achieved an accuracy of 53.62% on the test dataset. The developed software follows best programming practices and Python code style, providing a consistent baseline in Keras for different configurations. The source code and models are publicly available at https://github.com/imatge-upc/vqa-2016-cvprw.
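
The merge step described in the abstract (visual features combined with a question embedding to predict an answer) can be illustrated with a toy sketch. This is not the thesis code and does not use Keras, an LSTM, VGG-16, or K-CNN. As assumptions made only for illustration, the question is embedded by averaging random word vectors, the image features are random numbers standing in for CNN output, the fusion is plain concatenation followed by a single untrained linear layer, and all names (`embed_question`, `predict_answer`, the vocabulary, and the answer list) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed_question(tokens, embeddings, dim):
    """Average the word vectors of a question.

    Assumption: a simple stand-in for an LSTM sentence embedding.
    """
    vec = [0.0] * dim
    known = [embeddings[t] for t in tokens if t in embeddings]
    for word_vec in known:
        for i, x in enumerate(word_vec):
            vec[i] += x
    n = max(len(known), 1)  # avoid division by zero for unknown-only questions
    return [x / n for x in vec]


def predict_answer(image_features, question_vec, weights, answers):
    """Concatenate both modalities, apply one linear layer, pick the top answer."""
    fused = image_features + question_vec  # list concatenation acts as the merge
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs


# Toy, randomly initialised values (hypothetical; nothing here is trained).
rng = random.Random(0)
WORD_DIM, IMG_DIM = 4, 6
vocab = ["what", "color", "is", "the", "cat"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(WORD_DIM)] for w in vocab}
answers = ["black", "white", "yes", "no"]
weights = [[rng.uniform(-1, 1) for _ in range(IMG_DIM + WORD_DIM)] for _ in answers]

image_features = [rng.uniform(0, 1) for _ in range(IMG_DIM)]  # stand-in for CNN features
question_vec = embed_question("what color is the cat".split(), embeddings, WORD_DIM)
answer, probs = predict_answer(image_features, question_vec, weights, answers)
print(answer, [round(p, 3) for p in probs])
```

Because the weights are random, the predicted answer is meaningless. The sketch only shows the data flow: two feature vectors in, one probability distribution over a fixed answer set out.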

Download the report from UPCommons or arXiv.
Project page on GitHub.


Demos and Resources

UPC at CVPRW Visual Question Answering Challenge 2016

Software

Projects

	Deep learning
	Language and Vision
