Abstract

Computers are becoming increasingly capable of detecting high-level visual content, such as objects in images, but they often lack an affective understanding of that content. Affective computing is valuable for the behavioral sciences, with applications in brand monitoring and advertisement effectiveness. The main challenge in mapping affect or emotions to images is bridging the affective gap between low-level visual features and the emotional content of an image.


One emerging approach to capturing visual affect is the use of Adjective-Noun Pairs (ANPs). ANPs were introduced as a mid-level affect representation to bridge the affective gap by combining nouns, which define the object content, with adjectives, which add a strong emotional bias, yielding concepts such as “happy dog” or “misty morning”.
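
As a toy illustration only (the actual ANP vocabulary comes from the MVSO ontology [2], not from exhaustive pairing), an ANP label is simply an adjective paired with a noun:

    # Toy illustration: an ANP is an adjective-noun string. The words below are
    # examples from the text; real ontologies keep only pairs that actually
    # occur in user tags, not every combination.
    adjectives = ["happy", "misty", "cute"]
    nouns = ["dog", "morning", "cat"]

    anps = [f"{adj} {noun}" for adj in adjectives for noun in nouns]
    print(anps[:3])  # ['happy dog', 'happy morning', 'happy cat']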


Current state-of-the-art methods approach ANP prediction by training visual classifiers directly on these pairs. In this work, we hypothesize that the visual contributions of nouns and adjectives differ across ANPs. We propose a feature-based intermediate representation for ANP prediction that uses specialized convolutional networks for adjectives and for nouns. By fusing the noun and adjective representations, the network learns how much each contributes to a given ANP, which a single-tower network does not allow; a sketch of this two-tower design follows.
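
A minimal sketch of the two-tower design, assuming PyTorch and torchvision; the class name and head dimensions are illustrative assumptions rather than the authors' implementation, and in practice each tower would first be trained on adjective or noun classification:

    # Minimal two-tower sketch (assumed PyTorch; hypothetical names and sizes).
    import torch
    import torch.nn as nn
    from torchvision.models import alexnet

    class TwoTowerANP(nn.Module):
        def __init__(self, num_anps=1200):
            super().__init__()
            # One AlexNet-style tower per task; in the real setup these would
            # be pre-trained on adjective and noun classification, respectively.
            self.adj_tower = alexnet(weights=None)
            self.noun_tower = alexnet(weights=None)
            # Drop each tower's final classifier layer to expose the 4096-d
            # fc7 activations instead of class scores.
            self.adj_tower.classifier[-1] = nn.Identity()
            self.noun_tower.classifier[-1] = nn.Identity()
            # Fully-connected head that learns ANPs from the fused features.
            self.anp_head = nn.Sequential(
                nn.Linear(2 * 4096, 4096),
                nn.ReLU(inplace=True),
                nn.Linear(4096, num_anps),
            )

        def forward(self, x):
            fused = torch.cat([self.adj_tower(x), self.noun_tower(x)], dim=1)
            return self.anp_head(fused)

    model = TwoTowerANP()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1200])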


The specialized noun and adjective networks follow an AlexNet-style architecture. These networks are fused into an intermediate feature representation, from which ANPs are learned with a fully-connected network. We investigate noun and adjective contributions with two kinds of fusion. The first fuses the outputs of the softmax layers: these are class-probability features, so every dimension corresponds to a specific adjective or noun class. The second fuses the outputs of the fc7 layers: these features retain visual information, allowing the visual relevance of adjectives and nouns to be interpreted. To quantify the feature contributions to each ANP, we compute a deep Taylor decomposition [1]. Both fusion variants are sketched below.
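
A schematic of the two fusion variants under the same assumptions as above; the feature sizes follow the 350-adjective / 617-noun vocabulary, and random tensors stand in for the tower outputs:

    # Sketch of the two fusion variants (assumed PyTorch; random tensors stand
    # in for the adjective/noun tower outputs on a batch of 8 images).
    import torch
    import torch.nn as nn

    NUM_ADJ, NUM_NOUN, NUM_ANP = 350, 617, 1200

    # (a) Softmax fusion: every dimension of the fused feature corresponds to
    # one adjective or noun class, which makes contributions interpretable.
    adj_probs = torch.softmax(torch.randn(8, NUM_ADJ), dim=1)
    noun_probs = torch.softmax(torch.randn(8, NUM_NOUN), dim=1)
    prob_features = torch.cat([adj_probs, noun_probs], dim=1)   # (8, 967)
    prob_head = nn.Linear(NUM_ADJ + NUM_NOUN, NUM_ANP)

    # (b) fc7 fusion: 4096-d activations per tower keep richer visual
    # information but have no direct class correspondence.
    fc7_features = torch.cat([torch.randn(8, 4096), torch.randn(8, 4096)], dim=1)
    fc7_head = nn.Sequential(nn.Linear(2 * 4096, 4096), nn.ReLU(),
                             nn.Linear(4096, NUM_ANP))

    print(prob_head(prob_features).shape, fc7_head(fc7_features).shape)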


For the experiments, we use a subset of 1,200 ANPs from the tag-based English-MVSO dataset [2]. These ANPs are combinations of 350 adjectives and 617 nouns. With settings identical to those of the adjective and noun networks, an ANP classification network trained end-to-end serves as the baseline. Using the fc7 features, we improve over the baseline in both top-1 and top-5 accuracy. We also observe that adjectives and nouns contribute differently across ANPs: for the ANP “pregnant woman” the adjective contributes the most, while for “cute cat” the predominant contribution comes from the noun. The probability features yield further insights, such as co-occurring nouns and adjectives: for “happy halloween”, the nouns “blood” and “cat” and the adjectives “haunted” and “dark” also contribute strongly.
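
For reference, a generic computation of the top-k accuracy reported above (a standard metric, not the authors' evaluation code):

    # Generic top-1 / top-5 accuracy (standard metric; not the authors' code).
    import torch

    def topk_accuracy(logits, targets, k):
        """Fraction of samples whose true label is among the k highest scores."""
        topk = logits.topk(k, dim=1).indices              # (N, k)
        hits = (topk == targets.unsqueeze(1)).any(dim=1)  # (N,)
        return hits.float().mean().item()

    logits = torch.randn(16, 1200)            # fake ANP scores for 16 images
    targets = torch.randint(0, 1200, (16,))   # fake ground-truth ANP labels
    print(topk_accuracy(logits, targets, 1), topk_accuracy(logits, targets, 5))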


Based on the experimental results, we confirm our hypothesis that adjectives and nouns contribute differently to ANP concepts. Furthermore, our architecture outperforms the traditional end-to-end approach while providing insight into the roles that adjectives and nouns play in the prediction.


[1] Montavon, Grégoire, et al. "Deep Taylor Decomposition of Neural Networks." ICML Workshop on Visualization for Deep Learning, 2016.


[2] Jou, Brendan, et al. "Visual affect around the world: A large-scale multilingual visual sentiment ontology." ACM Multimedia, 2015.