Abstract
Human motion analysis is a broad area of computer vision that has attracted strong interest from researchers over the last decades. It covers topics such as human motion tracking and estimation, action and behavior recognition, and segmentation of human motion. All of these fields are challenging for different reasons, but mostly because of viewing perspectives, clutter, and the imprecise semantics of actions and human motion. The computer vision community has addressed human motion analysis from several perspectives. Earlier approaches often relied on articulated human body models represented in the three-dimensional world. However, because estimating such an articulated structure from video has traditionally been difficult and costly, research has focused on developing human motion analysis approaches that rely on low-level features. Although they obtain impressive results in several tasks, low-level features are typically conditioned by appearance and viewpoint, which makes them difficult to apply across different scenarios. Nonetheless, the increase in computational power, the massive availability of data, and the arrival of consumer depth cameras are changing the scenario, and with that change human motion analysis through articulated models can be reconsidered. Analyzing and understanding human motion through three-dimensional information is still crucial for obtaining richer models of dynamics and behavior. In that sense, articulated models of the human body offer a compact and view-invariant representation of motion that can be leveraged for motion analysis. In this dissertation, we present several approaches for motion analysis. In particular, we address the problems of pose inference, action recognition, and temporal clustering of human motion. Articulated models are the leitmotiv of all the presented approaches.
First, we address pose inference by formulating a layered analysis-by-synthesis framework in which models are used to generate hypotheses that are matched against video. Based on the same articulated representation upon which the models are built, we propose an action recognition framework. Actions are seen as time series observed through the articulated model and generated by underlying dynamical systems. This hypothesis is used to develop recognition methods based on time-delay embeddings, which are analysis tools that make no assumptions about the form of the underlying dynamical system. Finally, we propose a method to cluster human motion sequences into distinct behaviors without a priori knowledge of the number of actions in the sequence. Our approach relies on the articulated model representation to learn a distance metric from pose data. This metric aims at capturing semantics from labeled data in order to cluster unseen motion sequences into meaningful behaviors. The proposed approaches are evaluated on publicly available datasets in order to measure our contributions objectively.
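As a rough illustration of what a time-delay embedding does, the sketch below reconstructs a point cloud from a single scalar series by stacking delayed copies of it. It is a generic textbook construction, not the recognition method of the dissertation: the function name `delay_embed`, the parameters `dim` and `tau`, and the synthetic sine wave standing in for a joint-angle trajectory are all assumptions made for this example.

```python
import numpy as np


def delay_embed(x, dim, tau):
    """Return the time-delay embedding of a 1-D series.

    Row i of the result is [x[i], x[i + tau], ..., x[i + (dim - 1) * tau]].
    `dim` (embedding dimension) and `tau` (delay in samples) are
    illustrative parameters chosen by the caller.
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau  # number of embedded points
    if n <= 0:
        raise ValueError("series too short for the requested dim and tau")
    # Each column is the series shifted by a multiple of tau.
    return np.stack([x[j * tau: j * tau + n] for j in range(dim)], axis=1)


# Hypothetical stand-in for one joint angle observed over time.
t = np.linspace(0.0, 4.0 * np.pi, 200)
angle = np.sin(t)

points = delay_embed(angle, dim=3, tau=5)
print(points.shape)
```

The embedded points trace out a trajectory whose geometry reflects the dynamics that produced the series, without requiring an explicit model of those dynamics. In practice `dim` and `tau` have to be chosen for the data at hand.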
Demos and Resources
ColorTip | Dataset
Panoramic TV Control | Demo