Deep Learning for AI Research Talks (Dec. 22, 2017)
Adrià Recasens, Massachusetts Institute of Technology (MIT)
Lluís Castrejón, Montreal Institute for Learning Algorithms (MILA)
Ramon Sanabria, Carnegie Mellon University (CMU)
Description:
UPC TelecomBCN and the UPC Image Processing Group organise a series of research talks related to the contents of the Deep Learning for Artificial Intelligence course of the Master MET. The three talks will be presented by Catalan researchers pursuing their PhDs at world-leading institutions in the field of artificial intelligence. In these works, deep learning is applied to vision and speech, so attendees should be familiar with and/or interested in these domains.
Contents of the talks and bios:
11:00 Adrià Recasens, "Gaze360: Gaze-following from everywhere"
Gaze tracking plays an increasingly important role in human-machine interaction and behavior understanding. While tracking technology continues to improve, most existing trackers are limited by the need to place sensors in particular orientations with respect to the human face. This is a problem for applications such as home robotics and surveillance, where a subject's head pose relative to the tracking sensor is relatively uncontrolled. In this paper we present a novel approach to appearance-based gaze tracking that operates over 360 degrees of head yaw with respect to the camera. We achieve this by combining a custom multi-camera gaze dataset with a deep convolutional network. We show how our network performs for various viewing angles and outperforms state-of-the-art competitors, even in the case of traditional front-facing eye tracking. We also demonstrate the application of our method to the related task of gaze following in Internet images.
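As a rough illustration of the kind of appearance-based approach described in the abstract, the sketch below maps a head crop directly to a 3D gaze direction with a small convolutional network. It is a minimal sketch only: PyTorch, the architecture, and all sizes and names are assumptions for illustration, not the speaker's actual Gaze360 model.

    # Illustrative sketch only (hypothetical, not the speaker's model):
    # an appearance-based regressor from a head crop to a 3D gaze direction.
    import torch
    import torch.nn as nn

    class GazeRegressor(nn.Module):
        def __init__(self):
            super().__init__()
            # Small convolutional backbone over a 3x224x224 head crop.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.head = nn.Linear(128, 3)  # regress an (x, y, z) gaze vector

        def forward(self, head_crop):
            feats = self.backbone(head_crop).flatten(1)
            gaze = self.head(feats)
            # Normalize so the output encodes a direction only.
            return nn.functional.normalize(gaze, dim=1)

    model = GazeRegressor()
    direction = model(torch.randn(1, 3, 224, 224))  # unit 3D gaze direction

Because the output is only a direction, such a regressor does not depend on the camera facing the eyes, which is the property the 360-degree setting requires.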
Short bio:
Adrià Recasens is a fourth-year PhD student in computer vision at the Computer Science and Artificial Intelligence Laboratory (CSAIL) of the Massachusetts Institute of Technology, advised by Professor Antonio Torralba. His research interests span various topics in computer vision and machine learning, with most of his work focused on automatic gaze-following. He received a Telecommunications Engineer's Degree and a Mathematics Licentiate Degree from the Universitat Politècnica de Catalunya.
11:40 Lluís Castrejón, "Annotating Object Instances with a Polygon-RNN"
(This work was awarded the CVPR 2017 Best Paper Award Honorable Mention).
We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes an image crop as input and produces the vertices of the polygon one at a time, allowing a human annotator to intervene at any time and correct a vertex. Our model easily integrates any correction, producing segmentations as accurate as desired by the annotator. We show that our annotation method speeds up the annotation process by a factor of 4.7 across all classes in Cityscapes, while achieving 78.4% agreement in IoU with the original ground truth, matching the typical agreement between human annotators. For cars, our speed-up factor is even higher, at 7.3 for an agreement of 82.2%. We further show generalization capabilities of our approach on unseen datasets.
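To make the "one vertex at a time, with a human in the loop" idea concrete, here is a minimal hypothetical sketch of a recurrent vertex decoder where an annotator's correction is fed back so later vertices account for it. PyTorch, the decoder structure, dimensions, and the correction interface are all assumptions for illustration, not the authors' actual Polygon-RNN.

    # Illustrative sketch only (hypothetical, not the authors' Polygon-RNN):
    # predict polygon vertices sequentially, accepting human corrections.
    import torch
    import torch.nn as nn

    class VertexDecoder(nn.Module):
        def __init__(self, feat_dim=128, hidden=256):
            super().__init__()
            self.rnn = nn.GRUCell(feat_dim + 2, hidden)  # crop features + previous (x, y)
            self.to_xy = nn.Linear(hidden, 2)            # next vertex in [0, 1] coordinates

        def forward(self, crop_feats, num_vertices=8, corrections=None):
            h = crop_feats.new_zeros(crop_feats.size(0), self.rnn.hidden_size)
            prev = crop_feats.new_zeros(crop_feats.size(0), 2)
            polygon = []
            for t in range(num_vertices):
                h = self.rnn(torch.cat([crop_feats, prev], dim=1), h)
                xy = torch.sigmoid(self.to_xy(h))
                # If the annotator corrected this vertex, use the correction and
                # feed it back so the remaining vertices adapt to it.
                if corrections is not None and t in corrections:
                    xy = corrections[t]
                polygon.append(xy)
                prev = xy
            return torch.stack(polygon, dim=1)  # (batch, num_vertices, 2)

    decoder = VertexDecoder()
    feats = torch.randn(1, 128)  # features of an image crop
    poly = decoder(feats, corrections={2: torch.tensor([[0.3, 0.7]])})

The key design point is that a correction changes the decoder's input at the next step, so a single fix propagates to the rest of the polygon rather than requiring the annotator to redraw it.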
Short bio:
Lluís Castrejón is a PhD student at the Montreal Institute for Learning Algorithms, advised by Aaron Courville. He received an MSc in Computer Science from the University of Toronto, and a BSc in Computer Science and a BSc in Electrical Engineering from UPC BarcelonaTech. His research focuses on Machine Learning and its applications to Computer Vision and Natural Language Processing.
12:20 Ramon Sanabria, "Open-Domain Audio-Visual Speech Recognition"
Audio-visual speech recognition has long been an active area of research, mostly focused on improving ASR performance using “lip-reading”. We present “open-domain audio-visual speech recognition”, in which we incorporate the semantic context of the speech using object, scene, and action recognition in open-domain videos. We show how all-neural approaches greatly simplify and improve our earlier work on adapting the acoustic and language model of a speech recognizer, and investigate several ways to adapt end-to-end models to this task. Working with a corpus of “how-to” videos from the web, an object that can be seen (“car”) or a scene that is detected (“kitchen”) can be used to condition models on the “context” of the recording, thereby reducing perplexity and improving transcription. We achieve good improvements in all cases, and compare and analyze the respective reductions in word errors against a conventional baseline system. We hope that this work might serve to ultimately unite speech-to-text and image-to-text, in order to eventually achieve something like “video-to-meaning” or multi-media summarization systems.
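As a rough illustration of conditioning an end-to-end recognizer on visual context, the sketch below feeds an embedding of detected object/scene tags (e.g. “car”, “kitchen”) into a decoder alongside the acoustic encoder outputs. This is a minimal sketch only: PyTorch, the vocabulary and tag-set sizes, and the fusion scheme are assumptions for illustration, not the speaker's actual system.

    # Illustrative sketch only (hypothetical, not the speaker's system):
    # condition an end-to-end ASR decoder on a visual "context" embedding.
    import torch
    import torch.nn as nn

    VOCAB = 1000        # output token vocabulary (hypothetical size)
    CONTEXT_TAGS = 500  # number of visual object/scene tags (hypothetical size)

    class ContextConditionedDecoder(nn.Module):
        def __init__(self, enc_dim=256, hidden=256, ctx_dim=64):
            super().__init__()
            self.tag_embed = nn.EmbeddingBag(CONTEXT_TAGS, ctx_dim)  # pool multiple tags
            self.rnn = nn.GRU(enc_dim + ctx_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, VOCAB)

        def forward(self, encoder_states, tag_ids):
            # encoder_states: (batch, time, enc_dim) acoustic encoder outputs
            # tag_ids: (batch, num_tags) indices of detected objects/scenes
            ctx = self.tag_embed(tag_ids)                             # (batch, ctx_dim)
            ctx = ctx.unsqueeze(1).expand(-1, encoder_states.size(1), -1)
            h, _ = self.rnn(torch.cat([encoder_states, ctx], dim=-1))
            return self.out(h)                                        # token logits per frame

    decoder = ContextConditionedDecoder()
    logits = decoder(torch.randn(2, 50, 256), torch.randint(0, 500, (2, 4)))

The intended effect is the one described in the abstract: knowing the recording's context biases the model toward context-appropriate words, reducing perplexity and word errors.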
Short bio:
Ramon Sanabria is a graduate student advised by Prof. Florian Metze at Carnegie Mellon University (CMU), in the School of Computer Science’s Language Technologies Institute (LTI). His main research interests are in the areas of machine learning and sequence analysis. More concretely, his current research focuses on adding physical and abstract context to speech recognition systems.
Note: RSVP here.