One Perceptron to Rule Them All: Language, Vision, Audio and Speech (tutorial)

Giró-i-Nieto X. One Perceptron to Rule Them All: Language, Vision, Audio and Speech (tutorial). In ACM International Conference on Multimedia Retrieval (ICMR) 2020. Dublin, Ireland: ACM; 2020.

Abstract

Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures to encode and decode vision, text and audio, and then review those models that have successfully translated information across modalities.

Summary in DL ACM and UPCommons.
ACM International Conference on Multimedia Retrieval (ICMR) 2020.

Part II: Neural Encoders & Decoders [GSlides] [Video]

Part III: Language & Vision [GSlides] [Video]

Part IV: Audio & Vision [GSlides] [Video]

Part V: Speech & Vision [GSlides]

Projects

	MALEGRA - Multimodal Signal Processing and Machine Learning on Graphs
	Large Scale Video Tagging with Knowledge Bases
	Language and Vision

Image Processing Group
