Abstract

Significant progress has been made recently on challenging tasks in automatic sign language understanding, such as sign language recognition, translation and production. However, most works have focused on datasets with relatively few samples, short recordings, and a limited vocabulary and signing space. Moreover, they have neglected the less complex task of sign language video classification, whose spoken-language analogue, text classification, has been widely addressed. For this reason, in this work we introduce the novel task of sign language topic detection. We base our experiments on How2Sign, a large-scale video dataset spanning multiple semantic domains. The contributions of this thesis are twofold. First, we present the first study of sign language topic detection in continuous sign language videos, providing baseline models for this task. Second, we compare different visual features and deep learning architectures that are commonly employed in the sign language understanding literature. We implement our modelling pipelines in Fairseq, a PyTorch library that is extensively used in the spoken language community. Modular, extensible code for running our experiments is provided along with this thesis.