Speech2Signs: Spoken to Sign Language Translation using Neural Networks

Type: Other
Start: Nov 2017
End: Nov 2018
Responsible: Xavier Giro-i-Nieto
URL: Caffe2 Research Awards 2017


Although recent advancements such as the Internet, smartphones and social networks have enabled people to communicate instantly and share knowledge at a global scale, the Deaf community still has very limited access to large parts of the digital world. According to the World Health Organization, hearing impairment is the most common disorder, affecting more than 360 million people worldwide, and for many of these individuals sign language is their primary means of communication.

For most deaf individuals, watching online videos is a challenging task. While some streaming and broadcast services provide accessibility options such as captions, these are available for only part of the catalog and often in a limited number of languages. When captions are not available, volunteers or relatives may create them and distribute them through third-party platforms. Moreover, a large portion of online videos does not come from streaming or broadcast services but is generated by amateur users. As reported by the company statistics, an average of 400 hours of video are uploaded every day on a video-sharing website. These users do not typically create any accessibility metadata: their videos are informal, addressed to a small audience and produced in a very short time. The huge and growing volume of such online videos calls for automatic methods capable of adapting this content across modalities to make it more accessible to everybody.

Speech2Signs aims to remove these difficulties and communication barriers by making the audio track of online videos available to deaf and hard-of-hearing people through automatically generated, video-based speech-to-sign-language translation.

This project has been awarded one of the five Caffe2 Research Awards 2017 granted by Facebook.


Pérez-Granero P. 2D to 3D body pose estimation for sign language with Deep Learning. Advisors: McGuinness K, Giró-i-Nieto X. 2020.
Duarte A. Cross-modal Neural Sign Language Translation. Advisors: Torres J, Giró-i-Nieto X. In: Proceedings of the 27th ACM International Conference on Multimedia - Doctoral Symposium. Nice, France: ACM; 2019.
Duarte A, Roldán F, Tubau M, Escur J, Pascual-deLaPuente S, Salvador A, Mohedano E, McGuinness K, Torres J, Giró-i-Nieto X. Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks. In: ICASSP. Brighton, UK: IEEE; 2019.
Tubau M. Wav2Pix: Enhancement and Evaluation of a Speech-conditioned Image Generator. Advisors: Duarte A, Giró-i-Nieto X. 2019.
Surís D, Duarte A, Salvador A, Torres J, Giró-i-Nieto X. Cross-modal Embeddings for Video and Audio Retrieval. In: ECCV 2018 Women in Computer Vision Workshop. Munich, Germany: Springer; 2018.
Duarte A, Camli G, Torres J, Giró-i-Nieto X. Towards Speech to Sign Language Translation. In: ECCV 2018 Workshop on Shortcomings in Vision and Language. 2018.
Moreno D, Costa-jussà MR, Giró-i-Nieto X. English to ASL Translator for Speech2Signs. 2018.
Roca S. Block-based Speech-to-Speech Translation. Advisors: Duarte A, Giró-i-Nieto X. 2018.
Roldán F. Speech-conditioned Face Generation with Deep Adversarial Networks. Advisors: Pascual-deLaPuente S, Salvador A, McGuinness K, Giró-i-Nieto X. 2018.
Escur J. Exploring Automatic Speech Recognition with TensorFlow. Advisors: Costa-jussà MR, Giró-i-Nieto X. 2018.