Speech2Signs: Spoken to Sign Language Translation using Neural Networks

Type: Other
Start: Nov 2017
End: Nov 2018
Responsible: Xavier Giro-i-Nieto
URL: Caffe2 Research Awards 2017


Hearing impairment is the most common communication disorder, affecting about 360 million people worldwide according to the World Health Organization. For many of these individuals, American Sign Language (ASL) is their primary means of communication. Speech2Signs aims to remove the barriers that deaf people encounter when watching online video by automatically generating a puppet interpreter that translates the speech signal into American Sign Language. Tools that automatically generate textual captions from video already exist, but captions have limitations. Firstly, most pre-lingually deaf people prefer sign language to captions, as it is richer and more natural for them. For example, captions make it very hard to track who is speaking in a scene with multiple people. Secondly, some users have language disorders that prevent them from understanding captions, yet they can communicate in sign language.

Automating speech-to-sign-language translation would solve one of the two communication flows of a video relay service (VRS). These existing services provide an online human interpreter for communication between individuals, for example in settings such as emergency rooms, where patients may need to communicate quickly with medical personnel. In this project, we will not focus on the opposite communication direction, from sign to spoken language, which may be addressed in future calls based on the outcomes of the present one.

This project has been awarded one of the five Caffe2 Research Awards 2017 granted by Facebook.


Duarte A. Cross-modal Neural Machine Translation for Sign Language Translation. In: Torres J, Giró-i-Nieto X. ACM Multimedia Doctoral Consortium. Nice, France: ACM; In Press.
Duarte A, Roldán F, Tubau M, Escur J, Pascual-deLaPuente S, Salvador A, Mohedano E, McGuinness K, Torres J, Giró-i-Nieto X. Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks. In: ICASSP. Brighton, UK: IEEE; 2019.
Tubau M. Wav2Pix: Enhancement and Evaluation of a Speech-conditioned Image Generator. Duarte A, Giró-i-Nieto X. 2019.
Surís D, Duarte A, Salvador A, Torres J, Giró-i-Nieto X. Cross-modal Embeddings for Video and Audio Retrieval. In: ECCV 2018 Women in Computer Vision Workshop. Munich, Germany: Springer; 2018.
Duarte A, Camli G, Torres J, Giró-i-Nieto X. Towards Speech to Sign Language Translation. In: ECCV 2018 Workshop on Shortcomings in Vision and Language; 2018.
Moreno D, Costa-jussà MR, Giró-i-Nieto X. English to ASL Translator for Speech2Signs. 2018.
Roca S. Block-based Speech-to-Speech Translation. Duarte A, Giró-i-Nieto X. 2018.
Roldán F. Speech-conditioned Face Generation with Deep Adversarial Networks. Pascual-deLaPuente S, Salvador A, McGuinness K, Giró-i-Nieto X. 2018.
Escur J. Exploring Automatic Speech Recognition with TensorFlow. Costa-jussà MR, Giró-i-Nieto X. 2018.