Student: Michele Compri

Advisors: Begüm Demir (University of Trento) and Xavier Giro-i-Nieto (UPC)

Recent advances in satellite technology have led to a rapid growth of remote sensing (RS) image archives, from which retrieving useful information is challenging. Thus, one important research area in RS is content-based image retrieval (CBIR). The performance of a CBIR system depends on the capability of the RS image features to model the content of the images, as well as on the retrieval algorithm that assesses the similarity among the features. Existing CBIR systems in the RS literature assume that each image is categorized by a single label, namely the land-cover class associated with the most significant content of the image. However, RS images usually have complex content, i.e., each image typically contains several regions related to multiple land-cover classes. Thus, available CBIR systems are not capable of accurately characterizing and exploiting the high-level semantic content of RS images for retrieval problems.
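As an illustration of the retrieval setting described above, the following minimal sketch ranks archive images by the cosine similarity of their feature vectors to a query, and measures how well two multi-label annotations agree. The feature values, image names, label sets, and the helper names `cosine_similarity`, `retrieve`, and `label_overlap` are hypothetical and chosen only for this example; the abstract does not specify the similarity measure or the evaluation metric actually used.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def retrieve(query, archive, top_k=3):
    """Return the names of the top_k archive images most similar to the query."""
    scored = [(cosine_similarity(query, feats), name)
              for name, feats in archive.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]


def label_overlap(labels_a, labels_b):
    """Jaccard overlap between two label sets (assumed relevance measure)."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical toy archive: 3-dimensional features and multi-label annotations.
archive_features = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.1, 0.9, 0.2],
    "img_c": [0.7, 0.3, 0.1],
}
archive_labels = {
    "img_a": {"trees", "buildings"},
    "img_b": {"water"},
    "img_c": {"trees", "water"},
}

query_features = [1.0, 0.0, 0.0]
ranking = retrieve(query_features, archive_features, top_k=2)
```

With single-label annotations, a retrieved image is either relevant or not; with multiple labels, a graded measure such as the overlap above can reward images that share only part of the query's content.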

To overcome these problems and to effectively characterize the high-level semantic content of RS images, we investigate the effectiveness of different deep learning architectures in the framework of multi-label RS image retrieval problems. This is achieved with a two-step strategy. In the first step, a Convolutional Neural Network (CNN) pre-trained for image classification on the ImageNet dataset is used off-the-shelf as a feature extractor. In particular, three popular architectures are explored: 1) VGG16; 2) Inception V3; and 3) ResNet50. VGG16 is a CNN with 16 weight layers: 13 convolutional layers of stacked 3x3 filters with intermediate max pooling layers, followed by 3 fully connected layers at the end. Inception V3 is an improved version of the earlier GoogLeNet; it contains more layers but fewer parameters than VGG16, which it achieves by removing fully connected layers and applying global average pooling after the last convolutional layer. ResNet50 is even deeper thanks to the introduction of residual connections, which allow data to flow by skipping the convolutional blocks. In the second step of our research, we modify these three off-the-shelf models by fine-tuning their parameters with a subset of RS images and their multi-label information. Experiments carried out on an archive of aerial images show that fine-tuning CNN architectures with images annotated with multiple labels significantly improves the retrieval accuracy with respect to standard CBIR methods. We find that fine-tuning with a multi-class approach achieves better results than considering each label as an independent class. Due to space constraints, detailed information on each step of the proposed method will be given in the full version of the paper.
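The abstract does not state which loss functions were used for fine-tuning. One plausible reading of the comparison above, offered here only as an assumption, is a softmax cross-entropy over all labels (with the multi-hot target normalized to sum to one) versus an independent sigmoid binary cross-entropy per label. The sketch below implements both in plain Python; the function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, targets):
    """Cross-entropy against a multi-hot target normalized to sum to 1.

    Assumed interpretation of the "multi-class approach": labels compete
    through a single softmax. Requires at least one positive label.
    """
    n_positive = sum(targets)
    probs = softmax(logits)
    return -sum((t / n_positive) * math.log(p)
                for t, p in zip(targets, probs) if t)


def binary_cross_entropy(logits, targets):
    """Mean sigmoid cross-entropy, treating each label independently.

    Uses the stable form max(z, 0) - z*t + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, t in zip(logits, targets):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical network outputs for one image with labels 0 and 1 present.
logits = [2.0, 1.0, -1.0]
targets = [1, 1, 0]

loss_multiclass = softmax_cross_entropy(logits, targets)
loss_independent = binary_cross_entropy(logits, targets)
```

The two losses differ in how labels interact: the softmax couples all labels so that raising one score lowers the others, while the per-label sigmoid scores each land-cover class on its own.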

[source code on GitHub]