PhD Thesis

Perceiving Dynamic Environments: From Surface Geometry to Semantic Representation


Information

  • Started: 01/06/2011
  • Finished: 27/10/2016

Description

Robots have to perceive human environments in order to work effectively in domestic and assistive contexts. They need to recognize objects and identify human actions to interact properly with their surroundings. Nowadays, the environment is primarily captured using cameras that deliver color and depth images. Cues obtained from such images are the building blocks on which perception applications are developed. For example, appearance models are used for detecting objects, and motion information is extracted from images for recognizing actions. However, given the complex variations of domestic and assistive settings, extracting a robust set of visual cues is harder in these settings than in more controlled contexts.

In this thesis we develop a hierarchy of tools to improve different aspects of robot perception in human-centered, possibly dynamic, environments. We start with the segmentation of single images and then extend the developed techniques to handle videos. Next, we develop a surface tracking approach that incorporates our video segmentation method. Afterwards, we address the higher-level tasks of semantic segmentation and scene recognition. Finally, we focus on action recognition in videos. Kinect-style depth sensors are a relatively recent introduction, and their use in robotics does not go back more than half a decade. Such sensors make it possible to acquire high-resolution color and depth images at low cost. Given this opportunity, we dedicate most of our work to exploiting the depth information obtained with such sensors, thereby advancing the state of the art in computational perception.

The thesis is conceptually divided into two parts. In the first part, we address the low-level tasks of segmentation and object tracking in depth images. In many cases, depth information disambiguates the surface boundaries between objects in a scene better than color information does. We exploit this in a novel depth segmentation scheme that fits quadratic surface models to the different surfaces in a competitive fashion. We further extend the method to the video domain by initializing each frame's segmentation and surface model parameters from the results of the previous frame. In this way, we obtain a video segmentation algorithm in which the labeling of each surface remains coherent over time. We also devise a particle-filter-based tracker that uses depth data to track a surface, and make it more robust by combining it with our video segmentation approach.
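To make the competitive surface-fitting idea concrete, the following minimal numpy sketch fits a quadratic model z = a·x² + b·y² + c·xy + d·x + e·y + f to each candidate surface and assigns every pixel to the surface with the smallest residual. It is only an illustration under simple assumptions (plain least squares, absolute residuals); the function names are hypothetical and this is not the implementation developed in the thesis.

    import numpy as np

    def fit_quadratic_surface(x, y, z):
        # Least-squares fit of z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f
        # to the 3D points of one candidate surface in the depth image.
        A = np.column_stack([x**2, y**2, x*y, x, y, np.ones_like(x)])
        coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
        return coeffs

    def surface_residuals(coeffs, x, y, z):
        # Absolute deviation of every point from the fitted surface.
        A = np.column_stack([x**2, y**2, x*y, x, y, np.ones_like(x)])
        return np.abs(A @ coeffs - z)

    def assign_pixels(surface_models, x, y, z):
        # Competitive assignment: each pixel goes to the surface model
        # that explains its depth best (smallest residual).
        residuals = np.stack([surface_residuals(c, x, y, z) for c in surface_models])
        return np.argmin(residuals, axis=0)

In a video setting, the same fitting step could be warm-started with the coefficients estimated in the previous frame, which reflects the intuition behind the temporal coherence described above.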

The segmentation results serve as a useful prior for higher-level tasks. In the second part of the thesis we deal with three such tasks: (i) object recognition, (ii) pixel-wise object class segmentation, and (iii) action recognition. For (i), we propose to address object recognition with context-aware conditional random field models. We show the importance of context in object recognition by modeling the geometrical relations between the different objects in a scene; these relations are captured as potentials on the edges of a graph. For (ii), we perform object class segmentation with a convolutional neural network trained to minimize a pixel-wise cross-entropy loss. We introduce a novel distance-from-wall feature and demonstrate its effectiveness in generating better class proposals for objects that are close to walls. The final part of the thesis deals with (iii) action recognition, where we extend a 2D convolutional neural network to a concatenated 3D network that learns to extract features from the spatio-temporal domain of raw video data. The network is trained to predict an action label for each video.
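For reference, the loss mentioned for task (ii) can be written down compactly. The short numpy sketch below computes a pixel-wise cross-entropy over a dense prediction map; it is the standard generic formulation, not code from the thesis, and the shapes and function name are assumptions for illustration.

    import numpy as np

    def pixelwise_cross_entropy(logits, labels):
        # logits: (H, W, C) raw per-pixel class scores.
        # labels: (H, W) integer ground-truth class per pixel.
        # Numerically stable log-softmax over the class dimension.
        shifted = logits - logits.max(axis=-1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
        h, w, _ = logits.shape
        # Negative log-likelihood of the true class at every pixel,
        # averaged over the image.
        nll = -log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
        return nll.mean()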

In summary, we address several aspects of robot perception, making use of depth information wherever it is available. Our main contributions are (a) a depth video segmentation scheme, (b) a graphical model for object recognition, and deep learning models for (c) object class segmentation and (d) action recognition.

This work falls within the scope of the following projects:

  • GARNICS: Gardening with a cognitive system (web)
  • IntellAct: Intelligent observation and execution of Actions and manipulations (web)
  • CINNOVA: Kinematic models and learning techniques for robots with innovative structures (web)
  • PAU+: Perception and Action in Robotics Problems with Large State Spaces (web)
  • TextilRob: Robots for clothes handling (web)