Research Interests

Despite the variety of research problems I have worked on so far, two main threads run through my research path: an interest in understanding and modelling the mechanisms underlying visual perception, and a commitment to research that can serve society in the medium to long term.

Currently, my focus is on:

(i) Human behaviour understanding from large-scale personal visual data

(ii) Computational models of visual perception

Both topics are summarized below.


Human behaviour understanding from large-scale personal visual data

Recent advances in wearable devices, together with the steadily falling cost of digital storage, have made it possible to take pictures anywhere, continuously, hands-free and at an affordable price. A natural question that arises is how to exploit these digital memories to improve our quality of life. Preliminary studies conducted over the last decades using the SenseCam have shown the huge potential of wearable cameras in healthcare. However, advances in ethics, data management and data mining are needed before their use can spread in practice. My current focus is on providing solutions to two emerging scenarios in Preventive Medicine:

(i) counteracting dementia through cognitive training based on digital memories

(ii) improving well-being by monitoring lifestyle

Both applications require observations over long periods of time (from a couple of weeks to months). For this reason we are working with data acquired by the Narrative Clip, which can record a full day (at 2 fpm) and is more socially accepted than other commercially available wearable cameras such as the GoPro. However, since the camera is worn as part of the clothing, it does not allow attention to be modelled, and its very low frame rate does not allow motion to be reliably estimated. An overview of the challenges posed by visual lifelogging from a Computer Vision perspective can be found in this paper.

With the goal of enabling the above applications, I am working on three research axes:

(i) Summarization

The huge quantity of data in a visual lifelog and the rate at which it grows (up to 2,000 images per day, or approximately 700,000 images per year) impose the need for efficient summarization methods. To fully exploit the potential of visual lifelogs in Preventive Medicine, summarization should be oriented to the visualization, indexing and browsing of autobiographical events, with the least possible semantic loss. To achieve this goal, in this paper we focused on the subproblem of structuring egocentric photo streams into meaningful segments sharing semantic attributes, opening the door to further processing such as activity recognition and semantic summarization. In particular, we showed that the temporal coherence and contextual correlation of concepts in egocentric photostreams can be exploited by integrating computer vision based on Convolutional Neural Networks with language processing. A summarization approach based on an image retrieval perspective can be found in this paper: after removing non-informative images with a new CNN-based filter, images are ranked by relevance to ensure semantic diversity and finally re-ranked by a novelty criterion to reduce redundancy.
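
For illustration, here is a minimal sketch of this relevance-plus-novelty idea, in the spirit of maximal-marginal-relevance selection; it assumes precomputed CNN feature vectors and relevance scores, and all names are illustrative rather than taken from the paper:

    import numpy as np

    def select_summary(features, relevance, k=10, lam=0.7):
        # Greedy selection: trade off relevance against similarity
        # to the images already chosen, so the summary stays diverse.
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        sim = f @ f.T  # cosine similarity between all image pairs
        chosen = [int(np.argmax(relevance))]
        while len(chosen) < k:
            rest = [i for i in range(len(features)) if i not in chosen]
            scores = [lam * relevance[i] - (1 - lam) * sim[i, chosen].max()
                      for i in rest]
            chosen.append(rest[int(np.argmax(scores))])
        return chosen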

(ii) Social interaction analysis

Social interactions carry a stronger emotional charge than activities performed in isolation, and are therefore expected to have a greater potential to trigger autobiographical memory. For this reason, we are working to deliver to medical doctors a customized summary driven by the presence of social interactions. On the other hand, the lack of social interactions is strongly correlated with depression, so it is also an important factor in characterizing lifestyle. In this context, I am co-supervising the PhD thesis of Maedeh Aghaei. The first step towards the detection of social events is to track the appearance of the multiple persons involved in them. In this paper, we proposed a method to track multiple faces in the challenging domain of egocentric photostreams. The second step has been to recover 3D location and head orientation to statistically analyze F-formations. Preliminary results can be found in this paper. More recently, we proposed to distinguish people who are interacting with the camera wearer from people who are not, by modelling the temporal evolution of F-formation features with Long Short-Term Memory (LSTM) recurrent neural networks. More details can be found in this paper. We are currently working on incorporating non-geometrical cues of social interactions, such as facial expressions, and on the classification of social events.
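
A minimal PyTorch sketch of the kind of sequence classifier this describes; the feature dimension and the interacting/not-interacting setup are assumptions for illustration, not the published architecture:

    import torch
    import torch.nn as nn

    class InteractionLSTM(nn.Module):
        # Classifies a sequence of per-frame F-formation features
        # (e.g. distance and head orientation relative to the wearer)
        # as interacting vs. not interacting.
        def __init__(self, feat_dim=3, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)

        def forward(self, x):            # x: (batch, time, feat_dim)
            _, (h, _) = self.lstm(x)     # h: (1, batch, hidden)
            return self.head(h[-1])      # one logit pair per sequence

    # toy usage: 4 tracks, 20 frames each, 3 features per frame
    logits = InteractionLSTM()(torch.randn(4, 20, 3))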

(iii) Activity recognition

Recognizing daily activities is crucial to characterizing lifestyle. Traditional activity recognition methods for video can broadly be classified by the kind of features they use to represent actions, with body movement analysis and the use of the objects involved in the action being the most common choices. In an egocentric setting, general body movements such as running, walking, moving the head/camera or staying still are usually estimated from motion features, when the frame rate of the camera allows it. The use of objects is usually modeled by attention-based or hand-manipulation-based approaches. The former aim to identify objects to which the user pays particular attention, even in the absence of manipulation, since they could be key factors in recognizing one's behaviour. The latter are restricted to scenes and objects where the hands of the user carry significant information. Since the Narrative Clip allows neither attention to be modelled nor motion to be reliably estimated, the recognition of activities has to be built on novel features. Given that the temporal coherence of concepts, as well as their contextual correlation, is preserved even when the frame rate is very low, I am interested in finding a way to exploit these properties for activity recognition.
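
As a toy illustration of what exploiting temporal coherence could look like, assuming per-image concept scores from a CNN (a speculative sketch, not a published method):

    import numpy as np

    def coherent_concepts(scores, window=5):
        # scores: (time, n_concepts) per-image detector outputs.
        # A centred moving average enforces temporal coherence;
        # the dominant concept is then picked per image.
        kernel = np.ones(window) / window
        smoothed = np.apply_along_axis(
            lambda c: np.convolve(c, kernel, mode="same"), 0, scores)
        return smoothed.argmax(axis=1)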


Computational models of visual perception

Given its relevance to egocentric vision, I am currently focusing on:

(i) Event perception

We perceive our world as a sequence of discrete, temporally extended cognitive segments, called events. Understanding how this happens in the brain is an active area of research. Previous studies based on behavioural and neuroimaging data have shown that segmenting ongoing activity into meaningful events has consequences for memory and learning. Therefore, a computational model able to predict experimental findings about event perception from egocentric images could be useful for understanding how memory works in a real-world setting. Recent experimental findings, described in this paper, have shown that neural representations of events are not tied to predictive uncertainty, but arise from temporal community structure. Inspired by these findings, I am developing a new framework for event segmentation, whose preliminary results can be found here.
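
A rough sketch of the temporal-community intuition, assuming per-image feature vectors (e.g. CNN concept scores); the similarity threshold and the use of modularity communities are my own illustrative choices, not the framework itself:

    import numpy as np
    import networkx as nx
    from networkx.algorithms import community

    def event_boundaries(features, threshold=0.8):
        # Link images whose features are similar, find communities,
        # and read event boundaries off as changes of community
        # between consecutive images.
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        sim = f @ f.T
        g = nx.Graph()
        g.add_nodes_from(range(len(features)))
        g.add_weighted_edges_from(
            (i, j, sim[i, j])
            for i in range(len(features))
            for j in range(i + 1, len(features))
            if sim[i, j] > threshold)
        label = {}
        for cid, nodes in enumerate(community.greedy_modularity_communities(g)):
            for n in nodes:
                label[n] = cid
        return [t for t in range(1, len(features)) if label[t] != label[t - 1]]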


Past research

Work at the Pompeu Fabra University

During my stay in the GPI, I focused on the modelling of nonlocal processes in visual perception. In particular, I focused on two organizational phenomena closely related to the perception of depth in images:

(i) Figure-ground organization

Figure-ground organization determines the interpretation of a visual scene into figures (object-like regions) and grounds (background-like regions), thus enabling higher-level processing such as the perception of surfaces, shapes and objects. Despite the advances of the last century, the computational mechanisms underlying figure-ground perception are still poorly understood. Typically, computational models of figure-ground organization first compute configural cues locally and then use a complex model to integrate them into a unitary percept while coping with the inherent ambiguity of local configural cues. In this paper it is shown that configural cues such as convexity, size, surroundness and lower region can be extracted through nonlocal computations, without explicitly relying on object boundaries. This leads to very robust estimates, from which a unitary figure-ground percept can be inferred through a very simple integration mechanism. These results support the finding that figure-ground segregation is computed rapidly by our brain, involving feedback from cells with larger receptive fields in higher visual cortical areas.
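
For concreteness, here is a toy region-level version of the "lower region" cue on a label map; it is only meant to convey what a boundary-free, region-wise cue looks like, not the nonlocal computation used in the paper:

    import numpy as np

    def lower_region_cue(labels):
        # labels: 2-D array of region ids. A region whose centre of
        # mass sits lower in the image is more figure-like (1 = bottom).
        rows = np.arange(labels.shape[0])[:, None] + np.zeros_like(labels)
        return {r: rows[labels == r].mean() / labels.shape[0]
                for r in np.unique(labels)}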

(ii) Amodal completion

Amodal completion is the perception of the whole of a physical structure when only parts of it are visible. This involves somehow detecting the occluding and the occluded objects and completing the occluded one. Nowadays, it is well acknowledged that occlusion patterns evoke both local and global completion processes, but how the final percept is formed is still not well understood. This paper shows that amodal completion can be understood as the result of a best-hypothesis selection process. Multiple hypotheses are generated by integrating global and local cues, and the best hypothesis is selected by maximizing the posterior probability over the hypothesized interpretations. Following the simplicity principle in perceptual organization, the posterior probabilities take into account viewpoint dependencies (the complexity of the visible objects) and viewpoint independencies (the complexity of the completed object and the effort of bringing the objects into their actual configuration). The resulting model estimates the 3D structure of a scene from a planar image, providing both the complete, disoccluded objects that form the scene and their ordering according to depth.
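
Schematically, the selection step amounts to minimizing a sum of complexity costs; the field names and numbers below are illustrative placeholders, not the paper's actual terms:

    def select_interpretation(hypotheses):
        # Each hypothesis carries complexity costs; by the simplicity
        # principle, minimizing their sum (a negative log posterior)
        # picks the globally simplest scene interpretation.
        def cost(h):
            return (h["visible_complexity"]       # viewpoint-dependent
                    + h["completed_complexity"]   # viewpoint-independent
                    + h["configuration_effort"])  # viewpoint-independent
        return min(hypotheses, key=cost)

    best = select_interpretation([
        {"visible_complexity": 2.0, "completed_complexity": 1.0,
         "configuration_effort": 0.5},
        {"visible_complexity": 1.5, "completed_complexity": 3.0,
         "configuration_effort": 0.2},
    ])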

Work at the Paris Descartes University

During my stay at MAP5 I worked on two main projects. The first project I worked on was

(i) PAGDEG (causes and consequences of Protein AGgregation in cellular DEGeneration)

This was a national interdisciplinary project addressing the role of protein aggregation in cellular degeneration. Since the study of living-cell dynamics requires the analysis of large amounts of data to produce quantitative and statistically relevant results, the team at MAP5 was responsible for developing automatic video analysis algorithms for time-lapse fluorescence and phase-contrast imaging. We developed algorithms for fluorescent spot detection and tracking using a statistical framework called a-contrario theory. The a-contrario model rests on a perceptual principle stated by Helmholtz, according to which the human visual system detects structure in a group of objects when their configuration, according to one or several Gestalt laws, is very unlikely to happen by chance in a random setting. In the presence of very noisy, poor-quality data, particles and trajectories can be characterized by an a-contrario model. This leads to algorithms that require neither a prior learning stage nor tedious parameter tuning, and that are very robust to noise. More details can be found in this paper and on my publications page.
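
The usual way to operationalize this principle is the Number of False Alarms (NFA): the expected number of configurations at least as extreme as the observed one under the background noise model. A minimal sketch, with illustrative numbers:

    from math import comb

    def nfa(n_tests, k, n, p):
        # Binomial tail: probability of observing at least k of n
        # events, each of probability p under the noise model,
        # multiplied by the number of tested configurations.
        tail = sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))
        return n_tests * tail

    # A detection is kept when its NFA is below 1, i.e. it is not
    # expected to occur even once by chance.
    print(nfa(1e5, 10, 12, 0.05))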

The second project I worked on was

(ii) PERGAME (PERception GAME)

This project was about the realization of a serious online game for the study of detection performance in human vision. In this context, I focused on the nonparametric estimation of a psychometric function, which relates the performance of the players to the stimulus intensity. The estimation involved issues of statistical modelling and convex optimization. I then focused on the subproblem of concave regression and carried out a comprehensive review of the literature to compare the performance of the available algorithms. This led me deep into cone regression, classifying, analysing and comparing several methods, both qualitatively and quantitatively, including variational methods with asymptotic convergence and geometric methods with finite-time convergence. I discussed and addressed the limits of current algorithms and proposed several improvements to enhance numerical stability and to bound the computational cost. The results of this study can be found in this paper.
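
For reference, concave regression is the projection of the data onto the convex cone of concave vectors; a compact formulation (solved here with the off-the-shelf cvxpy solver, not one of the specialized algorithms compared in the paper):

    import cvxpy as cp
    import numpy as np

    def concave_regression(y):
        # Least-squares fit of a concave sequence: minimize ||x - y||^2
        # subject to nonpositive second differences.
        x = cp.Variable(len(y))
        cp.Problem(cp.Minimize(cp.sum_squares(x - y)),
                   [cp.diff(x, 2) <= 0]).solve()
        return x.value

    noisy = np.sqrt(np.arange(1, 50)) + 0.3 * np.random.randn(49)
    fit = concave_regression(noisy)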

Work at the Collège de France

During my stay at the Collège de France, I contributed to developing a computational model that explains how linear accelerations of the head are transformed into neural signals. In my opinion, this is another beautiful example of nonlocal processing performed by our sensory system. We focused on the striola, a narrow 3D band centered on a curve, which roughly bisects the epithelial surface of the otolith organs in mammals. The overall idea underlying our model was to show that the shape of the striola, characterized by its curvature and torsion, is optimally suited to detect changes in linear acceleration (jerk). In our model, striolar afferent neurons integrate nonlinearly the activities of two sensory cells on average. More precisely, the domain of acceleration detected by an afferent neuron is given by the intersection of the receptive fields of the sensory cells it contacts. The predictions of our model fit the domain of detected accelerations observed experimentally in monkeys. In addition, our model explains why the striola needs its characteristic shape to carry out its function, and why, in rodents, afferent neurons contact on average two hair cells on the striola. We also verified that the information coded by our model can be decoded by the brain, using a supervised learning algorithm based on back-propagation neural networks, and we tested the robustness of our model with respect to neuronal noise. Further details about this work can be found in this paper.
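
A toy rendering of the intersection idea; the directions, threshold and AND-like combination below are illustrative simplifications of the actual model:

    import numpy as np

    def afferent_fires(acc, pref_a, pref_b, threshold=0.5):
        # An afferent contacting two hair cells (preferred directions
        # pref_a, pref_b) responds only when the acceleration lies in
        # both receptive fields: a nonlinear, AND-like integration
        # whose detection domain is the intersection of the two.
        acc = np.asarray(acc, dtype=float)
        return bool(acc @ np.asarray(pref_a) > threshold
                    and acc @ np.asarray(pref_b) > threshold)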

Ph.D. work

My PhD thesis was about monocular depth estimation and was a pioneering work in this field. I started by developing tools for image inpainting based on partial differential equations and for image filtering based on mathematical morphology, mainly connected operators. I used image inpainting to improve the filtering performance of connected operators [pdf]. The fact that image inpainting was the perfect restitution strategy for the motion operator gave me the idea of developing a depth-oriented filter. Estimating depth in single images involved investigating new methods for monocular depth cue detection, analyzing the problem of depth cue integration, and exploiting depth ordering information to improve classical color segmentation and enable new depth-oriented filtering applications. Inspired by human perception, I proposed a region-merging-based framework for occlusion detection and monocular depth integration. The region-merging strategy consists in constructing a hierarchical region-based representation of the image that incorporates the depth ordering information provided by depth cues; it relies on a graph formalization that encodes depth relationships between regions and allows a global, consistent depth ordering to be inferred [pdf]. During my stay at ENS-Cachan, I developed new methods for monocular cue detection and proposed a diffusion-based framework for monocular depth cue integration. The diffusion-based strategy consists in iteratively extending the initial depth values arising from monocular depth cues to the entire image domain, by means of a bilateral filter, until stability is attained [pdf]. The use of the bilateral filter for denoising, which exploits the self-similarity property of images, gave me the idea of using self-similarity to statistically model a pixel, in order to enhance the performance of the statistical region merging algorithm I was using in the region-merging-based approach. This greatly improved segmentation accuracy near boundaries [pdf]. For further details about my Ph.D. thesis work, have a look at my publications page.
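
A compact sketch of the diffusion strategy just described, assuming a grayscale image in [0, 1] and a sparse depth map with a mask marking the cue-derived seed pixels; the window size and parameters are illustrative:

    import numpy as np

    def propagate_depth(image, depth, mask, iters=50, sigma_s=2.0, sigma_r=0.1):
        # Iteratively re-estimate each unseeded pixel's depth as a
        # bilateral (spatial * range) average over a 5x5 window,
        # keeping the sparse cue-derived depths (mask == True) fixed.
        h, w = image.shape
        ys, xs = np.mgrid[-2:3, -2:3]
        spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))
        d = depth.astype(float).copy()
        for _ in range(iters):
            new = d.copy()
            for y in range(2, h - 2):
                for x in range(2, w - 2):
                    if mask[y, x]:
                        continue  # seed depths stay fixed
                    patch = d[y - 2:y + 3, x - 2:x + 3]
                    diff = image[y - 2:y + 3, x - 2:x + 3] - image[y, x]
                    wgt = spatial * np.exp(-diff**2 / (2 * sigma_r**2))
                    new[y, x] = (wgt * patch).sum() / wgt.sum()
            d = new
        return d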

