When humans look at an image, we perform a sequential extraction of information in order to understand its content. We first fix our gaze on the most salient part of the image and, from the information extracted there, guide our gaze towards another point, until we have analyzed all the relevant information. This is our natural and instinctive behaviour for gathering information from our surroundings. Traditionally in computer vision, images have been analysed at the local scale with a sliding window, often at different scales. This approach analyses the different parts of the image independently, without modelling any correlation among them. By introducing a hierarchical partition of the image, we can more easily exploit the correlation between regions through a top-down scan that first takes a global view of the image and then sequentially focuses on the local parts that contain the relevant information (e.g. objects or faces). Moreover, if we train a deep architecture that does not reward regions observed independently, as traditional object proposals do, but instead rewards successful long-term searches that connect the different regions observed, we can achieve a sequential detection of objects that carries richer information than a set of independent fixations.


The goal of this ongoing research is to perform an efficient detection of objects in images. To be efficient, the key idea is to focus on those parts of the image that contain the richest information and zoom in on them, guiding a hierarchical search for objects. An intelligent agent capable of deciding where to focus attention in the image is trained with deep reinforcement learning. This RL agent first looks at the whole image and decides which partition of a quadtree is the most promising for finding objects of a certain category. The agent is trained with deep Q-learning, using an architecture similar to the one proposed by DeepMind [1].
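The hierarchical decision step can be sketched in a few lines. The snippet below is an illustrative assumption, not the actual implementation: it splits a region into its four quadtree quadrants, and picks among the zoom actions (plus a terminal action) with the epsilon-greedy rule standard in deep Q-learning. The function names and the action layout are choices made for this sketch.

```python
import numpy as np

def quadtree_children(box):
    """Split a region (x1, y1, x2, y2) into its four quadtree quadrants."""
    x1, y1, x2, y2 = box
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2
    return [(x1, y1, mx, my), (mx, y1, x2, my),   # top-left, top-right
            (x1, my, mx, y2), (mx, my, x2, y2)]   # bottom-left, bottom-right

def select_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over the 4 zoom actions + 1 terminal action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: best Q-value
```

At test time epsilon is set to 0, so the agent always zooms into the quadrant with the highest predicted Q-value, or stops if the terminal action dominates.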


This work is based on the key idea that, with reinforcement learning, we can perform a sequential search that rewards short search sequences obtaining the highest long-term reward, measured as the intersection over union (IoU) between predicted and ground-truth bounding boxes.
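The reward described above relies on the standard intersection-over-union measure. A minimal implementation for axis-aligned boxes, assuming the (x1, y1, x2, y2) corner convention used here for illustration, is:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (empty if boxes are disjoint).
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

IoU is 1 for a perfect match and 0 for disjoint boxes, so differences in IoU along the search give a natural step-wise reward signal.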


The input of the network is a convolutional descriptor of the region observed at the current step, together with a history vector that encodes the previous steps of the search. This idea was also used in [2]. The main difference with that approach is that we use a fixed hierarchical partition to guide our sequential search. Furthermore, to be efficient, sharing convolutional features is a key aspect of our pipeline. Convolutional features from VGG-16 [3] are extracted once from the full-resolution image, and the descriptors for each subpartition are then cropped from this feature map.
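Feature sharing amounts to slicing a precomputed feature map instead of re-running the network per region. A sketch, assuming features from a VGG-16 convolutional layer with an effective stride of 16 (the stride constant and function name are assumptions of this sketch, not taken from the paper):

```python
import numpy as np

STRIDE = 16  # assumed downsampling factor of the chosen VGG-16 conv layer

def crop_region_descriptor(feature_map, box):
    """Slice the window of a precomputed (C, H, W) feature map that
    corresponds to an image-space box (x1, y1, x2, y2)."""
    c, h, w = feature_map.shape
    # Project image coordinates onto the feature-map grid.
    x1, y1, x2, y2 = (int(round(v / STRIDE)) for v in box)
    x1, y1 = max(0, x1), max(0, y1)
    x2 = min(w, max(x2, x1 + 1))  # keep at least one column/row
    y2 = min(h, max(y2, y1 + 1))
    return feature_map[:, y1:y2, x1:x2]
```

The crop can then be pooled or resized to the fixed input size the Q-network expects, so the expensive convolutional pass over the full image is paid only once per search.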



[1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.


[2] Caicedo, J. C., & Lazebnik, S. (2015). Active object localization with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2488-2496).

[3] Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).