Automatic sentiment and emotion understanding of general visual content has recently garnered much research attention. However, the large visual variance associated with\ high-level affective concepts presents a challenge when designing systems with high-performance requirements. One\ popular approach to bridge the {\textquotedblleft}affective gap{\textquotedblright} between\ low-level visual features and affective semantics consists of\ using Adjective Noun Pair (ANP) semantic constructs for\ concepts, e.g. {\textquotedblleft}beautiful landscape{\textquotedblright} or {\textquotedblleft}scary face{\textquotedblright} which\ act as a mid-level representation that can be recognized by\ visual classifers while still carrying an affective bias. In\ this work, we formulate the ANP detection task in images\ over a continuous space defined over an embedding that\ captures the inter-concept relationships between ANPs. We\ show how the compact representations obtained from the\ embeddeding extend the discrete concepts in the ontology\ and can be used for improved visual sentiment and emotion\ prediction, as well as new applications such as zero-shot\ ANP detection.

Recurrent Neural Networks (RNNs) continue to show\ outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often face challenges like slow inference, vanishing gradients and difficulty in capturing long term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline RNN models.

Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often face challenges like slow inference, vanishing gradients and dificulty in capturing long term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph. This network can be encouraged to perform fewer state updates through a novel loss term. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline models.

The increasing availability of affect-rich multimedia resources has bolstered interest in understanding sentiment and emotions in and from visual content. Adjective-noun pairs (ANP) are a popular mid-level semantic construct for capturing affect via visually detectable concepts such as {\textquoteleft}{\textquoteleft}cute dog" or {\textquoteleft}{\textquoteleft}beautiful landscape". Current state-of-the-art methods approach ANP prediction by considering each of these compound concepts as individual tokens, ignoring the underlying relationships in ANPs. This work aims at disentangling the contributions of the {\textquoteleft}adjectives{\textquoteright} and {\textquoteleft}nouns{\textquoteright} in the visual prediction of ANPs. Two specialised classifiers, one trained for detecting adjectives and another for nouns, are fused to predict 553 different ANPs. The resulting ANP prediction model is more interpretable as it allows us to study contributions of the adjective and noun components.

Recurrent Neural Networks (RNNs) continue to show \ outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often face challenges like slow inference, vanishing gradients and difficulty in capturing long term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model which extends existing RNN models by learning to skip state updates and shortens the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of the baseline RNN models.

Advisors: V{\'\i}ctor Campos (UPC), Brendan Jou (Columbia University), Xavier Gir{\'o}-i-Nieto (UPC) and Shih-Fu Chang (Columbia University)

One of the main problems in visual affective computing is overcoming the affective gap between low-level visual features and the emotional content of the image. One rising method to capture visual affection is through the use of Adjective-Noun Pairs (ANP), a mid-level affect representation. This thesis addresses two challenges related to ANPs: representing ANPs in a structured ontology and improving ANP detectability. The first part develops two techniques to exploit relations between adjectives and nouns for automatic ANP clustering. The second part introduces and analyzes a novel deep neural network for ANP prediction. Based on the hypothesis of a different contribution of the adjective and the noun depending of the ANP, the novel network fuses the feature representations of adjectives and nouns from two independently trained convolutional neural networks.

Computers are acquiring increasing ability to detect high level visual content such as objects in images, but often lack an affective comprehension of this content. Affective computing is useful for behavioral sciences, with applications in brand monitoring or advertisement effect. The main problem of the visual task of mapping affect or emotions to images is overcoming the affective gap between low-level features and the image emotional content.

One rising method to capture visual affections is through the use of Adjective-Noun Pair (ANP). ANPs were introduced as a mid-level affect representation to overcome the affective gap by combining nouns, which define the object content, and adjectives, which add a strong emotional bias, yielding concepts such as {\textquotedblleft}happy dog{\textquotedblright} or {\textquotedblleft}misty morning{\textquotedblright}.

Current state of the art methods approach ANP prediction by training visual classifiers on these pairs. In this work, we hypothesize that the visual contribution between nouns and adjectives differ between ANPs. We propose a feature-based intermediate representation for ANP prediction using specialized convolutional networks for adjectives and nouns separately. By fusing a representation from nouns and adjectives, the network learns how much the nouns and adjectives contribute to each ANP, which a single tower network does not allow.

The specialized noun and adjective networks follow an AlexNet-styled architecture. These networks are fused into an intermediate feature representation, and ANPs are then learned from it using a fully-connected network. We investigate noun and adjective contributions with two kinds of fusions. First fusion uses the output of the softmax layer: these are class-probability features, so all dimensions have class-correspondence to adjectives and nouns. Second fusion uses the fc7 layer output: these features contain visual information, allowing interpretation of adjective and noun visual relevance. For the feature contributions of each ANP, we compute a deep Taylor decomposition [1].

For experiments, we use a subset of 1,200 ANPs from the tag-based English-MVSO [2] dataset. ANPs are composed by the combination of 350 adjective and 617 nouns. With identical settings to the adjective and noun networks, an ANP classification network is trained end-to-end as the baseline. Using the fc7 features, we improve over the baseline in both top-1 and top-5 accuracy. Also, we observe adjective and nouns contribute differently between ANPs; e.g. for the ANP {\textquotedblleft}pregnant woman{\textquotedblright}, the adjective contributes the most, while for {\textquotedblleft}cute cat{\textquotedblright} the predominant contribution is in the noun. Using the probability features we find other insights, as nouns or adjectives co-occurring together, e.g. for {\textquotedblleft}happy halloween{\textquotedblright} the contribution is also high of the nouns {\textquotedblleft}blood{\textquotedblright} and {\textquotedblleft}cat{\textquotedblright}, and of the adjectives {\textquotedblleft}haunted{\textquotedblright} and {\textquotedblleft}dark{\textquotedblright}.\

Based on experiment results, we confirm our hypothesis of adjective and nouns contributing differently to ANP concepts. Furthermore, our architecture proves to outperform traditional methods by giving insights on the role of adjectives and nouns on the prediction.

[1] Montavon, Gr{\'e}goire, et al. "Deep Taylor Decomposition of Neural Networks." ICML Workshop on Visualization for Deep Learning, 2016.

[2] Jou, Brendan, et al. "Visual affect around the world: A large-scale multilingual visual sentiment ontology." ACMM, 2015.

This thesis addresses the problem of visual object retrieval, where a user formulates a query to an image database by providing one or multiple examples of an object of interest. The presented techniques aim both at finding those images in the database that contain the object as well as locating the object in the image and segmenting it from the background.

Every considered image, both the ones used as queries and the ones contained in the target database, is represented as a Binary Partition Tree (BPT), the hierarchy of regions previously proposed by Salembier and Garrido (2000). This data structure offers multiple opportunities and challenges when applied to the object retrieval problem.

One application of BPTs appears during the formulation of the query, when the user must interactively segment the query object from the background. Firstly, the BPT can assist in adjusting an initial marker, such as a scribble or bounding box, to the object contours. Secondly, BPT can also define a navigation path for the user to adjust an initial selection to the appropriate scale.

The hierarchical structure of the BPT is also exploited to extract a new type of visual words named Hierarchical Bag of Regions (HBoR). Each region defined in the BPT is char- acterized with a feature vector that combines a soft quantization on a visual codebook with an ecient bottom-up computation through the BPT. These features allow the definition of a novel feature space, the Parts Space, where each object is located according to the parts that compose it.

HBoR features have been applied to two scenarios for object retrieval, both of them solved by considering the decomposition of the objects in parts. In the first scenario, the query is formulated with a single object exemplar which is to be matched with each BPT in the target database. The matching problem is solved in two stages: an initial top-down one that assumes that the hierarchy from the query is respected in the target BPT, and a second bottom-up one that relaxes this condition and considers region merges which are not in the target BPT.

The second scenario where HBoR features are applied considers a query composed of several visual objects, such as a person, a bottle or a logo. In this case, the provided exemplars are considered as a training set to build a model of the query concept. This model is composed of two levels, a first one where each part is modelled and detected separately, and a second one\ that characterises the combinations of parts that describe the complete object. The analysis process exploits the hierarchical nature of the BPT by using a novel classifier that drives an efficient top-down analysis of the target BPTs.\ \

Full text at tdx.cat