Multimodal Speech Recognition

Type: National
Start: Sep 2018
End: Aug 2021
Responsible: Ferran Marqués
URL: Multimodal Speech Recognition


Speech recognition consists of generating the sequence of words that matches what is said in a speech recording. In recent years, machine learning techniques have been increasingly used in speech recognition, mainly due to the widespread availability of training data and the falling cost of large-scale computing resources. These two factors have made it feasible to use deep learning, a powerful machine learning technique, to build end-to-end speech recognition systems. Unlike the classical methods used in this field, end-to-end systems do not require extensive knowledge of phonetics.

When listening to any kind of speech, humans use prior knowledge about its topic (politics, medicine, sports, etc.) to understand it better. In contrast, speech recognition systems do not usually use this prior knowledge. This thesis explores the use of contextual information to improve an automatic speech recognition system. The output of this thesis will be used by the company Vilynx to transcribe speech from videos that contain, among other content, general, sports, and entertainment news.