Abstract

Advisors: Marta R. Costa-jussà (TALP, UPC) and Xavier Giro-i-Nieto (GPI, UPC)

Grade: A (9.8/10.9)

Speech recognition is the task aiming to identify words in spoken language and convert them into text. This bachelor's thesis focuses on using deep learning techniques to build an end-to-end Speech Recognition system. As a preliminary step, we overview the most relevant methods carried out over the last several years. Then, we study one of the latest proposals for this end-to-end approach that uses a sequence to sequence model with attention-based mechanisms. Next, we successfully reproduce the model and test it over the TIMIT database. We analyze the similarities and differences between the current implementation proposal and the original theoretical work. And finally, we experiment and contrast using different parameters (e.g. number of layer units, learning rates and batch sizes) and reduce the Phoneme Error Rate in almost 12% relative.

  • Source code by containing the fork from the Nabu project used in this work.