2018

Journal

Robust Spatio-Temporal Clustering and Reconstruction of Multiple Deformable Bodies 
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018

@article{Agudo_pami2018,
title = {Robust Spatio-Temporal Clustering and Reconstruction of Multiple Deformable Bodies},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2018}
}

In this paper we present an approach to reconstruct the 3D shape of multiple deforming objects from a collection of sparse, noisy and possibly incomplete 2D point tracks acquired by a single monocular camera. Additionally, the proposed solution estimates the camera motion and reasons about the spatial segmentation (i.e., identifies each of the deforming objects in every frame) and temporal clustering (i.e., splits the sequence into motion primitive actions). This advances competing work, which mainly tackled the problem for one single object and non-occluded tracks. In order to handle several objects at a time from partial observations, we model point trajectories as a union of spatial and temporal subspaces, and optimize the parameters of both modalities, the non-observed point tracks, the camera motion, and the time-varying 3D shape via augmented Lagrange multipliers. The algorithm is fully unsupervised and does not require any training data at all. We thoroughly validate the method on challenging scenarios with several human subjects performing different activities which involve complex motions and close interaction. We show our approach achieves state-of-the-art 3D reconstruction results, while it also provides spatial and temporal segmentation.
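
A central numerical ingredient in this family of augmented-Lagrangian solvers is the proximal update that promotes low-rank (union-of-subspaces) structure, namely singular value thresholding. The following Python/NumPy snippet is a minimal, generic sketch of that single step with illustrative variable names; it is not the full optimization described in the paper.

import numpy as np

def singular_value_thresholding(M, tau):
    """Proximal operator of the nuclear norm: shrink singular values by tau.
    This is the low-rank update typically found inside ALM solvers."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

# Toy usage: a noisy, approximately rank-2 matrix of point tracks (frames x points).
rng = np.random.default_rng(0)
W = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 50))
W_noisy = W + 0.05 * rng.standard_normal(W.shape)
W_lowrank = singular_value_thresholding(W_noisy, tau=0.5)
print(np.linalg.matrix_rank(W_lowrank, tol=1e-6))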

Boosted Random Ferns for Object Detection  
M.Villamizar, J.Andrade, A.Sanfeliu and F.Moreno-Noguer  
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018

@article{Villamizar_pami2018,
title = {Boosted Random Ferns for Object Detection},
author = {M. Villamizar and J. Andrade and A. Sanfeliu and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {40},
number = {2},
issn = {0162-8828},
pages = {272 - 288},
doi = {10.1109/TPAMI.2017.2676778},
year = {2018},
month = {February}
}

In this paper we introduce the Boosted Random Ferns (BRFs) to rapidly build discriminative classifiers for learning and detecting object categories. At the core of our approach we use standard random ferns, but we introduce four main innovations that let us bring ferns from an instance to a category level, and still retain efficiency. First, we define binary features in the histogram of oriented gradients domain (as opposed to the intensity domain), allowing for a better representation of intra-class variability. Second, both the positions where ferns are evaluated within the sliding window, and the location of the binary features for each fern are not chosen completely at random, but instead we use a boosting strategy to pick the most discriminative combination of them. This is further enhanced by our third contribution, which adapts the boosting strategy to enable sharing of binary features among different ferns, yielding high recognition rates at a low computational cost. Finally, we show that training can be performed online, for sequentially arriving images. Overall, the resulting classifier can be very efficiently trained, densely evaluated for all image locations in about 0.1 seconds, and provides detection rates similar to competing approaches that require expensive and significantly slower processing times. We demonstrate the effectiveness of our approach by thorough experimentation in publicly available datasets in which we compare against the state of the art, and for tasks of both 2D detection and 3D multi-view estimation.
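
As an informal illustration of how a single fern turns binary comparisons of HOG-bin responses into a score, the Python/NumPy sketch below uses invented window sizes, test locations and lookup weights; the full boosted, feature-sharing classifier of the paper is not reproduced here.

import numpy as np

def fern_response(hog, comparisons, weights):
    """Evaluate one random fern on a HOG window.
    hog:          (H, W, B) array of oriented-gradient bin responses
    comparisons:  list of ((y1, x1, b1), (y2, x2, b2)) binary feature tests
    weights:      lookup table of length 2**len(comparisons) with boosted scores
    """
    idx = 0
    for (y1, x1, b1), (y2, x2, b2) in comparisons:
        bit = hog[y1, x1, b1] > hog[y2, x2, b2]   # binary feature in the HOG domain
        idx = (idx << 1) | int(bit)
    return weights[idx]

# Toy usage: the detector score is the sum of several boosted fern responses.
rng = np.random.default_rng(0)
hog = rng.random((16, 16, 8))
ferns = []
for _ in range(5):
    comps = [tuple(map(tuple, rng.integers(0, [16, 16, 8], size=(2, 3)))) for _ in range(6)]
    ferns.append((comps, rng.standard_normal(2 ** 6)))
score = sum(fern_response(hog, c, w) for c, w in ferns)
print("detection score:", score)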

A Scalable, Efficient, and Accurate Solution to Non-Rigid Structure from Motion  
A.Agudo and F.Moreno-Noguer  
Computer Vision and Image Understanding (CVIU), 2018

@article{Agudo_cviu2018,
title = {A Scalable, Efficient, and Accurate Solution to Non-Rigid Structure from Motion},
author = {A. Agudo and F. Moreno-Noguer},
journal = {Computer Vision and Image Understanding},
volume = {167},
issue = {C},
issn = {1077-3142},
pages = {121-133},
doi = {10.1016/j.cviu.2018.01.002},
year = {2018},
month = {February}
}

We introduce a new probabilistic point trajectory approach to recover a 3D time-varying shape from RGB video. It can handle scenarios with one or multiple objects, missing, noisy, sparse and dense data, and mild or sharp deformations. In addition, it can incorporate spatial correlation priors that define the similarities between object points. Our approach outperforms state-of-the-art techniques in terms of generality, versatility, accuracy and efficiency. Most Non-Rigid Structure from Motion (NRSfM) solutions are based on factorization approaches that allow reconstructing objects parameterized by a sparse set of 3D points. These solutions, however, are low resolution and generally, they do not scale well to more than a few tens of points. While there have been recent attempts at bringing NRSfM to a dense domain, using for instance variational formulations, these are computationally demanding alternatives which require certain spatial continuity of the data, preventing their use for articulated shapes with large deformations or situations with multiple discontinuous objects. In this paper, we propose incorporating existing point trajectory low-rank models into a probabilistic framework for matrix normal distributions. With this formalism, we can then simultaneously learn shape and pose parameters using expectation maximization, and easily exploit additional priors such as known point correlations. While similar frameworks have been used before to model distributions over shapes, here we show that formulating the problem in terms of distributions over trajectories brings remarkable improvements, especially in generality and efficiency. We evaluate the proposed approach in a variety of scenarios including one or multiple objects, sparse or dense reconstructions, missing observations, mild or sharp deformations, and in all cases, with minimal prior knowledge and low computational cost.

Conference

GANimation: Anatomically-aware Facial Animation from a Single Image   (Oral)
A.Pumarola, A.Agudo, A.M.Martinez, A.Sanfeliu and F.Moreno-Noguer 
European Conference on Computer Vision (ECCV), 2018

@inproceedings{Pumarola_eccv2018,
title = {GANimation: Anatomically-aware Facial Animation from a Single Image},
author = {A. Pumarola and A. Agudo and A.M. Martinez and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2018}
}

Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions GANs' generation process with images of a specific domain, namely a set of images of persons sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper we introduce a novel GAN conditioning scheme based on Action Units (AU) annotations, which describe in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a fully unsupervised strategy to train the model, which only requires images annotated with their activated AUs, and exploit attention mechanisms that make our network robust to changing backgrounds and lighting conditions. Extensive evaluation shows that our approach goes beyond competing conditional generators both in the capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, and in the capacity of dealing with images in the wild.
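
To make the conditioning idea concrete, here is a minimal PyTorch-style sketch in which layer sizes, the number of AUs and module names are chosen for illustration rather than taken from the paper: the target AU activations are tiled as extra input channels, and an attention mask blends the generated color change with the input image so background and lighting are preserved.

import torch
import torch.nn as nn

class AUConditionedGenerator(nn.Module):
    """Illustrative generator conditioned on a continuous AU activation vector."""
    def __init__(self, n_aus=17):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3 + n_aus, 64, 7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.color_head = nn.Conv2d(64, 3, 7, padding=3)       # per-pixel color change
        self.attention_head = nn.Conv2d(64, 1, 7, padding=3)   # where to apply it

    def forward(self, img, au):
        b, _, h, w = img.shape
        au_maps = au.view(b, -1, 1, 1).expand(b, au.shape[1], h, w)  # tile AUs as channels
        feat = self.backbone(torch.cat([img, au_maps], dim=1))
        color = torch.tanh(self.color_head(feat))
        attn = torch.sigmoid(self.attention_head(feat))
        # The attention mask keeps untouched regions from the input image.
        return attn * img + (1.0 - attn) * color

gen = AUConditionedGenerator()
img = torch.rand(1, 3, 128, 128)
target_au = torch.rand(1, 17)          # desired activation magnitude of each AU
print(gen(img, target_au).shape)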

Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View
A.Pumarola, A.Agudo, L.Porzi, A.Sanfeliu, V.Lepetit and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

@inproceedings{Pumarola_cvpr2018b,
title = {Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View},
author = {A. Pumarola and A. Agudo and L. Porzi and A. Sanfeliu and V. Lepetit and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

We propose a method for predicting the 3D shape of a deformable surface from a single view. By contrast with previous approaches, we do not need a pre-registered template of the surface, and our method is robust to the lack of texture and partial occlusions. At the core of our approach is a geometry-aware deep architecture that tackles the problem as usually done in analytic solutions: first perform 2D detection of the mesh and then estimate a 3D shape that is geometrically consistent with the image. We train this architecture in an end-to-end manner using a large dataset of synthetic renderings of shapes under different levels of deformation, material properties, textures and lighting conditions. We evaluate our approach on a test split of this dataset and available real benchmarks, consistently improving state-of-the-art solutions with a significantly lower computational time.

Unsupervised Person Image Synthesis in Arbitrary Poses   (Spotlight)
A.Pumarola, A.Agudo, A.Sanfeliu and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

@inproceedings{Pumarola_cvpr2018a,
title = {Unsupervised Person Image Synthesis in Arbitrary Poses},
author = {A. Pumarola and A. Agudo and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

We present a novel approach for synthesizing photorealistic images of people in arbitrary poses using generative adversarial learning. Given an input image of a person and a desired pose represented by a 2D skeleton, our model renders the image of the same person under the new pose, synthesizing novel views of the parts visible in the input image and hallucinating those that are not seen. This problem has recently been addressed in a supervised manner, i.e., during training the ground truth images under the new poses are given to the network. We go beyond these approaches by proposing a fully unsupervised strategy. We tackle this challenging scenario by splitting the problem into two principal subtasks. First, we consider a pose conditioned bidirectional generator that maps back the initially rendered image to the original pose, hence being directly comparable to the input image without the need to resort to any training image. Second, we devise a novel loss function that incorporates content and style terms, and aims at producing images of high perceptual quality. Extensive experiments conducted on the DeepFashion dataset demonstrate that the images rendered by our model are very close in appearance to those obtained by fully supervised approaches.

Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories   (Spotlight)
A.Agudo, M.Pijoan and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

@inproceedings{Agudo_cvpr2018,
title = {Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories},
author = {A. Agudo and M. Pijoan and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

This paper introduces an approach to simultaneously estimate 3D shape and camera pose, and to cluster by object and by type of deformation, from partial 2D annotations in a multi-instance collection of images. Furthermore, we can indistinctly process rigid and non-rigid categories. This advances existing work, which only addresses the problem for one single object or, if multiple objects are considered, assumes they are clustered a priori. To handle this broader version of the problem, we model object deformation using a formulation based on multiple unions of subspaces, able to span from small rigid motion to complex deformations. The parameters of this model are learned via Augmented Lagrange Multipliers, in a completely unsupervised manner that does not require any training data at all. Extensive validation is provided in a wide variety of synthetic and real scenarios, including rigid and non-rigid categories with small and large deformations. In all cases our approach outperforms the state of the art in terms of 3D reconstruction accuracy, while also providing clustering results that allow segmenting the images into object instances and their associated type of deformation (or action the object is performing).

Hallucinating Dense Optical Flow from Sparse Lidar for Autonomous Vehicles  
V.Vaquero, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Pattern Recognition (ICPR), 2018

@inproceedings{Vaquero_icpr2018,
title = {Hallucinating Dense Optical Flow from Sparse Lidar for Autonomous Vehicles},
author = {V. Vaquero and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Pattern Recognition (ICPR)},
year = {2018}
}

In this paper we propose a novel approach to estimate dense optical flow from sparse lidar data acquired on an autonomous vehicle. This is intended to be used as a drop-in replacement of any image-based optical flow system when images are not reliable due to e.g. adverse weather conditions or at night. In order to infer high resolution 2D flows from discrete range data we devise a three-block architecture of multiscale filters that combines multiple intermediate objectives, both in the lidar and image domain. To train this network we introduce a dataset with approximately 20K lidar samples from the KITTI dataset, which we have augmented with a pseudo ground-truth image-based optical flow computed using FlowNet2. We demonstrate the effectiveness of our approach on KITTI, and show that despite using the low-resolution and sparse measurements of the lidar, we can regress dense optical flow maps which are on par with those estimated with image-based methods.

2D-to-3D Facial Expression Transfer  
G.Rotger, F.Lumbreras, F.Moreno-Noguer and A.Agudo 
International Conference on Pattern Recognition (ICPR), 2018

@inproceedings{Rotger_icpr2018,
title = {2D-to-3D Facial Expression Transfer},
author = {G. Rotger and F. Lumbreras and F. Moreno-Noguer and A. Agudo},
booktitle = {Proceedings of the International Conference on Pattern Recognition (ICPR)},
year = {2018}
}

Automatically changing the expression and physical features of a face from an input image is a topic that has been traditionally tackled in a 2D domain. In this paper, we bring this problem to 3D and propose a framework that given an input RGB video of a human face under a neutral expression, initially computes his/her 3D shape and then performs a transfer to a new and potentially non-observed expression. For this purpose, we parameterize the rest shape –obtained from standard factorization approaches over the input video– using a triangular mesh which is further clustered into larger macro-segments. The expression transfer problem is then posed as a direct mapping between this shape and a source shape, such as the blend shapes of an off-the-shelf 3D dataset of human facial expressions. The mapping is resolved to be geometrically consistent between 3D models by requiring points in specific regions to map on semantic equivalent regions. We validate the approach on several synthetic and real examples of input faces that largely differ from the source shapes, yielding very realistic expression transfers even in cases with topology changes, such as a synthetic video sequence of a single-eyed cyclops.

Deformable Motion 3D Reconstruction by Union of Regularized Subspaces  
A.Agudo and F.Moreno-Noguer 
International Conference on Image Processing (ICIP), 2018

@inproceedings{Agudo_icip2018,
title = {Deformable Motion 3D Reconstruction by Union of Regularized Subspaces},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Image Processing (ICIP)},
year = {2018}
}

This paper presents an approach to jointly retrieve camera pose, time-varying 3D shape, and automatic clustering based on motion primitives, from incomplete 2D trajectories in a monocular video. We introduce the concept of order-varying temporal regularization in order to exploit video data, that can be indistinctly applied to the 3D shape evolution as well as to the similarities between images. This results in a union of regularized subspaces which effectively encodes the 3D shape deformation. All parameters are learned via augmented Lagrange multipliers, in a unified and unsupervised manner that does not assume any training data at all. Experimental validation is reported on human motion from sparse to dense shapes, providing more robust and accurate solutions than state-of-the-art approaches in terms of 3D reconstruction, while also obtaining motion grouping results.

Deep Lidar CNN to Understand the Dynamics of Moving Vehicles  
V.Vaquero, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2018

@inproceedings{Vaquero_icra2018,
title = {Deep Lidar CNN to Understand the Dynamics of Moving Vehicles},
author = {V. Vaquero and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2018}
}

Perception technologies in Autonomous Driving are experiencing their golden age due to the advances in Deep Learning. Yet, most of these systems rely on the semantically rich information of RGB images. Deep Learning solutions applied to the data of other sensors typically mounted on autonomous cars (e.g. lidars or radars) have been much less explored. In this paper we propose a novel solution to understand the dynamics of the moving vehicles in the scene from only lidar information. The main challenge of this problem stems from the fact that we need to disambiguate the proprio-motion of the “observer” vehicle from that of the external “observed” vehicles. For this purpose, we devise a CNN architecture which at testing time is fed with pairs of consecutive lidar scans. However, in order to properly learn the parameters of this network, during training we introduce a series of so-called pretext tasks which also leverage image data. These tasks include semantic information about vehicleness and a novel lidar-flow feature which combines standard image-based optical flow with lidar scans. We obtain very promising results and show that including distilled image information only during training allows improving the inference results of the network at test time, even when image data is no longer used.

2017

Journal

Force-based Representation for Non-Rigid Shape and Elastic Model Estimation 
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017

@article{Agudo_pami2017,
title = {Force-based Representation for Non-Rigid Shape and Elastic Model Estimation},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2017}
}

This paper addresses the problem of simultaneously recovering 3D shape, pose and the elastic model of a deformable object from only 2D point tracks in a monocular video. This is a severely under-constrained problem that has been typically addressed by enforcing the shape or the point trajectories to lie on low-rank dimensional spaces. We show that formulating the problem in terms of a low-rank force space that induces the deformation and introducing the elastic model as an additional unknown, allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object’s behavior. In order to simultaneously estimate force, pose, and the elastic model of the object we use an expectation maximization strategy, where each of these parameters is successively learned by partial M-steps. Once the elastic model is learned, it can be transferred to similar objects to encode their 3D deformation. Moreover, our approach can robustly deal with missing data, and encode both rigid and non-rigid points under the same formalism. We thoroughly validate the approach on Mocap and real sequences, showing more accurate 3D reconstructions than the state of the art, and additionally providing an estimate of the full elastic model with no a priori information.

BreakingNews: Article Annotation by Image and Text Processing 
A.Ramisa, F.Yan, F.Moreno-Noguer and K.Mikolajczyk
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017

@article{Ramisa_pami2017,
title = {BreakingNews: Article Annotation by Image and Text Processing},
author = {A. Ramisa and F. Yan and F. Moreno-Noguer and K. Mikolajczyk},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {40},
number = {5},
issn = {0162-8828},
pages = {1072 - 1085},
doi = {10.1109/TPAMI.2017.2721945},
year = {2017},
month = {June}
}

Building upon recent Deep Neural Network architectures, current approaches lying in the intersection of computer vision and natural language processing have achieved unprecedented breakthroughs in tasks like automatic captioning or image retrieval. Most of these learning methods, though, rely on large training sets of images associated with human annotations that specifically describe the visual content. In this paper we propose to go a step further and explore the more complex cases where textual descriptions are loosely related to the images. We focus on the particular domain of News articles in which the textual content often expresses connotative and ambiguous relations that are only suggested but not directly inferred from images. We introduce new deep learning methods that address source detection, popularity prediction, article illustration and geolocation of articles. An adaptive CNN architecture is proposed, that shares most of the structure for all the tasks, and is suitable for multitask and transfer learning. Deep Canonical Correlation Analysis is deployed for article illustration, and a new loss function based on Great Circle Distance is proposed for geolocation. Furthermore, we present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (such as GPS coordinates and popularity metrics). We show this dataset to be appropriate to explore all aforementioned problems, for which we provide a baseline performance using various Deep Learning architectures, and different representations of the textual and visual features. We report very promising results and bring to light several limitations of current state-of-the-art in this kind of domain, which we hope will help spur progress in the field.
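
The geolocation loss mentioned above is based on the great circle distance between predicted and ground-truth coordinates. The following Python/NumPy snippet is a minimal sketch of that distance (the standard haversine formula) with illustrative names; it is shown as a generic building block, not as the paper's exact training loss.

import numpy as np

EARTH_RADIUS_KM = 6371.0

def great_circle_distance(pred, target):
    """Haversine great-circle distance (km) between (lat, lon) pairs in degrees."""
    lat1, lon1 = np.radians(pred[..., 0]), np.radians(pred[..., 1])
    lat2, lon2 = np.radians(target[..., 0]), np.radians(target[..., 1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Toy usage: mean geolocation error for a batch of article predictions.
pred = np.array([[41.39, 2.17], [48.85, 2.35]])      # predicted (lat, lon)
target = np.array([[41.40, 2.15], [51.51, -0.13]])   # ground-truth (lat, lon)
print(great_circle_distance(pred, target).mean())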

Combining Local-Physical and Global-Statistical Models for Sequential Deformable Shape from Motion
A.Agudo and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2017

@article{Agudo_ijcv2017,
title = {Combining Local-Physical and Global-Statistical Models for Sequential Deformable Shape from Motion},
author = {A. Agudo and F. Moreno-Noguer},
journal = {International Journal of Computer Vision},
volume = {122},
number = {2},
issn = {0920-5691},
pages = {371-387},
doi = {10.1007/s11263-016-0972-8},
year = {2017},
month = {April}
}

In this paper, we simultaneously estimate camera pose and non-rigid 3D shape from a monocular video, using a sequential solution that combines local and global representations. We model the object as an ensemble of particles, each ruled by the linear equation of Newton’s second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency. The resulting approach allows us to sequentially estimate shape and camera poses, while progressively learning a global low-rank model of the shape that is fed back into the optimization scheme, thus introducing global constraints. The overall combination of local (physical) and global (statistical) constraints yields a solution that is both efficient and robust to several artifacts such as noisy and missing data or sudden camera motions, without requiring any training data at all. Validation is done in a variety of real application domains, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our on-line methodology yields significantly more accurate reconstructions than competing sequential approaches, being even comparable to the more computationally demanding batch methods.

3D Human Pose Tracking Priors using Geodesic Mixture Models
E.Simo-Serra, C.Torras and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2017

@article{Simo_ijcv2017,
title = {3D Human Pose Tracking Priors using Geodesic Mixture Models},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
journal = {International Journal of Computer Vision},
volume = {122},
number = {2},
issn = {0920-5691},
pages = {388-408},
doi = {10.1007/s11263-016-0941-2},
year = {2017},
month = {April}
}

We present a novel approach for learning a finite mixture model on a Riemannian manifold in which Euclidean metrics are not applicable and one needs to resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration from a variant of the expectation-maximization algorithm that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying in a Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the problems associated with the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any type of manifold for which it is possible to estimate its tangent space. Additionally, we consider using shrinkage covariance estimation to improve the robustness of the method, especially when dealing with very sparsely distributed samples. We evaluate the approach on a number of situations, going from data clustering on manifolds to combining pose and kinematics of articulated bodies for 3D human pose tracking. In all cases, we demonstrate remarkable improvement compared to several chosen baselines.
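
To give the flavor of working with geodesic rather than Euclidean distances, the Python/NumPy sketch below uses the unit sphere as an example manifold where the log map has a closed form, and soft-assigns points to two component means by geodesic distance. It is a generic illustration under those assumptions, not the paper's minimum-message-length EM procedure.

import numpy as np

def sphere_log(p, q):
    """Log map on the unit sphere: tangent vector at p pointing towards q."""
    cos_t = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(p)
    v = q - cos_t * p
    return theta * v / np.linalg.norm(v)

def geodesic_distance(p, q):
    return np.linalg.norm(sphere_log(p, q))

# Toy usage: geodesic responsibilities of points w.r.t. two component means.
rng = np.random.default_rng(0)
points = rng.standard_normal((100, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)
means = np.eye(3)[:2]                                   # two means on the sphere
d = np.array([[geodesic_distance(p, m) for m in means] for p in points])
resp = np.exp(-d ** 2)
resp /= resp.sum(axis=1, keepdims=True)
print(resp[:3])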

Random Clustering Ferns for Multimodal Object Recognition
M.Villamizar, A.Garrell, A.Sanfeliu and F.Moreno-Noguer
Neural Computing and Applications, 2017

@article{Villamizar_neurocomputing2017,
title = {Random Clustering Ferns for Multimodal Object Recognition},
author = {M. Villamizar and A. Garrell and A. Sanfeliu and F. Moreno-Noguer},
journal = {Neural Computing and Applications},
volume = {28},
number = {9},
issn = {0941-0643},
pages = {2445-2460},
doi = {10.1007/s00521-016-2284-x},
year = {2017},
month = {September}
}

We propose an efficient and robust method for the recognition of objects exhibiting multiple intra-class modes, where each one is associated to a particular object appearance. The proposed method, called Random Clustering Ferns (RCFs), synergistically combines a single real-time classifier, based on the boosted assembling of extremely-randomized trees (ferns), with an unsupervised and probabilistic approach in order to recognize efficiently object instances in images and discover simultaneously the most prominent appearance modes of the object through tree-structured visual words. In particular, we use Boosted Random Ferns (BRFs) and probabilistic Latent Semantic Analysis (pLSA) to obtain a discriminative and multimodal classifier that automatically clusters the response of its randomized trees as a function of the visual object appearance. The proposed method is validated extensively in synthetic and real experiments, showing that the method is capable of detecting objects with diverse and complex appearance distributions in real time.

Learning Depth-aware Deep Representations for Robotic Perception
L.Porzi, S.Rota, A.Peñate-Sánchez, E.Ricci and F.Moreno-Noguer
Robotics and Automation Letters, 2017

@article{Porzi_ral2017,
title = {Learning Depth-aware Deep Representations for Robotic Perception},
author = {L. Porzi and S. Rota and A. Peñate-Sánchez and E. Ricci and F. Moreno-Noguer},
journal = {Robotics and Automation Letters},
volume = {2},
number = {2},
issn = {2377-3766},
pages = {468-475},
doi = {10.1109/LRA.2016.2637444},
year = {2017},
month = {April}
}

Exploiting RGB-D data by means of Convolutional Neural Networks (CNNs) is at the core of a number of robotics applications, including object detection, scene semantic segmentation and grasping. Most existing approaches, however, exploit RGB-D data by simply considering depth as an additional input channel for the network. In this paper we show that the performance of deep architectures can be boosted by introducing DaConv, a novel, general-purpose CNN block which exploits depth to learn scale-aware feature representations. We demonstrate the benefits of DaConv on a variety of robotics oriented tasks, involving affordance detection, object coordinate regression and contour detection in RGB-D images. In each of these experiments we show the potential of the proposed block and how it can be readily integrated into existing CNN architectures.

Teaching Robot's Proactive Behavior Using Human Assistance
A.Garrell, M.Villamizar, F.Moreno-Noguer and A.Sanfeliu
International Journal of Social Robotics, 2017

@article{Garrell_ijsr2017,
title = {Teaching Robot's Proactive Behavior Using Human Assistance},
author = {A. Garrell and M. Villamizar and F. Moreno-Noguer and A. Sanfeliu},
journal = {International Journal of Social Robotics},
volume = {9},
number = {2},
issn = {1875-4791},
pages = {231-249},
doi = {10.1007/s12369-016-0389-0},
year = {2017},
month = {April}
}

In recent years, there has been a growing interest in enabling autonomous social robots to interact with people. However, many questions remain unresolved regarding the social capabilities robots should have in order to perform this interaction in an ever more natural manner. In this paper, we tackle this problem through a comprehensive study of various topics involved in the interaction between a mobile robot and untrained human volunteers for a variety of tasks. In particular, this work presents a framework that enables the robot to proactively approach people and establish friendly interaction. To this end, we provided the robot with several perception and action skills, such as that of detecting people, planning an approach and communicating the intention to initiate a conversation while expressing an emotional status. We also introduce an interactive learning system that uses the person’s volunteered assistance to incrementally improve the robot’s perception skills. As a proof of concept, we focus on the particular task of online face learning and recognition. We conducted real-life experiments with our Tibi robot to validate the framework during the interaction process. Within this study, several surveys and user studies have been conducted to reveal the social acceptability of the robot within the context of different tasks.

TED: A Tolerant Edit Distance for Segmentation Evaluation
J.Funke, J.Klein, F.Moreno-Noguer, A.Cardona and M.Cook
Methods, 2017

@article{Funke_methods2017,
title = {TED: A Tolerant Edit Distance for Segmentation Evaluation},
author = {J. Funke and J. Klein and F. Moreno-Noguer and A. Cardona and M. Cook},
journal = {Methods},
volume = {115},
number = {15},
issn = {1046-2023},
pages = {119-127},
doi = {10.1016/j.ymeth.2016.12.013},
year = {2017},
month = {February}
}

In this paper, we present a novel error measure to compare a computer-generated segmentation of images or volumes against ground truth. This measure, which we call Tolerant Edit Distance (TED), is motivated by two observations that we usually encounter in biomedical image processing: (1) Some errors, like small boundary shifts, are tolerable in practice. Which errors are tolerable is application dependent and should be explicitly expressible in the measure. (2) Non-tolerable errors have to be corrected manually. The effort needed to do so should be reflected by the error measure. Our measure is the minimal weighted sum of split and merge operations to apply to one segmentation such that it resembles another segmentation within specified tolerance bounds. This is in contrast to other commonly used measures like Rand index or variation of information, which integrate small, but tolerable, differences. Additionally, the TED provides intuitive numbers and allows the localization and classification of errors in images or volumes. We demonstrate the applicability of the TED on 3D segmentations of neurons in electron microscopy images where topological correctness is arguably more important than exact boundary locations. Furthermore, we show that the TED is not just limited to evaluation tasks. We use it as the loss function in a max-margin learning framework to find parameters of an automatic neuron segmentation algorithm. We show that training to minimize the TED, i.e., to minimize crucial errors, leads to higher segmentation accuracy compared to other learning methods.

Conference

3D Human Pose Estimation from a Single Image via Distance Matrix Regression
F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2017

@inproceedings{Moreno_cvpr2017,
title = {3D Human Pose Estimation from a Single Image via Distance Matrix Regression},
author = {F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}

This paper addresses the problem of 3D human pose estimation from a single image. We follow a standard two-step pipeline by first detecting the 2D position of the N body joints, and then using these observations to infer 3D pose. For the first step, we use a recent CNN-based detector. For the second step, most existing approaches perform 2N-to-3N regression of the Cartesian joint coordinates. We show that more precise pose estimates can be obtained by representing both the 2D and 3D human poses using N × N distance matrices, and formulating the problem as a 2D-to-3D distance matrix regression. For learning such a regressor we leverage simple Neural Network architectures, which by construction, enforce positivity and symmetry of the predicted matrices. The approach also has the advantage of naturally handling missing observations and allows hypothesizing the position of non-observed joints. Quantitative results on Humaneva and Human3.6M datasets demonstrate consistent performance gains over the state of the art. Qualitative evaluation on the in-the-wild images of the LSP dataset, using the regressor learned on Human3.6M, reveals very promising generalization results.
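
As a small illustration of the representation, the Python/NumPy sketch below builds the N × N Euclidean distance matrix from a set of detected 2D joints; the 2D matrix would be the regressor input and the matrix built from 3D joints its target. Sizes and the missing-joint encoding are illustrative assumptions, not taken verbatim from the paper.

import numpy as np

def distance_matrix(joints):
    """N x N matrix of pairwise Euclidean distances between joints (N x D array)."""
    diff = joints[:, None, :] - joints[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy usage with N = 14 detected 2D joints.
rng = np.random.default_rng(0)
joints_2d = rng.random((14, 2)) * 256
edm_2d = distance_matrix(joints_2d)
assert np.allclose(edm_2d, edm_2d.T) and np.allclose(np.diag(edm_2d), 0.0)
# A missing joint could simply be encoded by zeroing its row and column.
edm_2d[5, :] = 0.0
edm_2d[:, 5] = 0.0
print(edm_2d.shape)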

DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D Reconstruction
A.Agudo and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2017

@inproceedings{Agudo_cvpr2017,
title = {DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D Reconstruction},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}

We present an approach to reconstruct the 3D shape of multiple deforming objects from incomplete 2D trajectories acquired by a single camera. Additionally, we simultaneously provide spatial segmentation (i.e., we identify each of the objects in every frame) and temporal clustering (i.e., we split the sequence into primitive actions). This advances existing work, which only tackled the problem for one single object and non-occluded tracks. In order to handle several objects at a time from partial observations, we model point trajectories as a union of spatial and temporal subspaces, and optimize the parameters of both modalities, the non-observed point tracks and the 3D shape via augmented Lagrange multipliers. The algorithm is fully unsupervised and results in a formulation which does not need initialization. We thoroughly validate the method on challenging scenarios with several human subjects performing different activities which involve complex motions and close interaction. We show our approach achieves state-of-the-art 3D reconstruction results, while it also provides spatial and temporal segmentation.

3D CNNs on Distance Matrices for Human Action Recognition
A.Hernandez, L.Porzi, S.Rota and F.Moreno-Noguer 
ACM Conference on Multimedia (ACM'MM), 2017

@inproceedings{Hernandez_acmmm2017,
title = {3D CNNs on Distance Matrices for Human Action Recognition},
author = {A. Hernandez and L. Porzi and S. Rota and F. Moreno-Noguer},
booktitle = {Proceedings of the ACM Conference on Multimedia (ACM'MM)},
year = {2017}
}

In this paper we are interested in recognizing human actions from sequences of 3D skeleton data. For this purpose we combine a 3D Convolutional Neural Network with body representations based on Euclidean Distance Matrices (EDMs), which have been recently shown to be very effective to capture the geometric structure of the human pose. One inherent limitation of the EDMs, however, is that they are defined up to a permutation of the skeleton joints, i.e., randomly shuffling the ordering of the joints yields many different representations. In order to address this issue we introduce a novel architecture that simultaneously, and in an end-to-end manner, learns an optimal transformation of the joints, while optimizing the rest of the parameters of the convolutional network. The proposed approach achieves state-of-the-art results on 3 benchmarks, including the recent NTU RGB-D dataset, for which we improve on previous LSTM-based methods by more than 10 percentage points, also surpassing other CNN-based methods while using almost 1000 times fewer parameters.
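
A short sketch of the input construction, under the assumption that distance matrices of consecutive frames are simply stacked into a volume for the 3D CNN (the learned joint transformation is left out):

import numpy as np

def edm_volume(skeleton_seq):
    """Stack per-frame joint distance matrices into a (T, N, N) volume for a 3D CNN.
    skeleton_seq: (T, N, 3) array of 3D joint positions."""
    diff = skeleton_seq[:, :, None, :] - skeleton_seq[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy usage: 32 frames of a 25-joint skeleton (as in NTU RGB-D).
seq = np.random.default_rng(0).random((32, 25, 3))
vol = edm_volume(seq)
print(vol.shape)   # (32, 25, 25); add batch/channel dims before the 3D convolutions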

Global Model with Local Interpretation for Dynamic Shape Reconstruction
A.Agudo and F.Moreno-Noguer 
Winter Conference on Applications of Computer Vision (WACV), 2017

@inproceedings{Agudo_wacv2017,
title = {Global Model with Local Interpretation for Dynamic Shape Reconstruction},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year = {2017}
}

The most standard approach to resolve the inherent ambiguities of the non-rigid structure from motion problem is using low-rank models that approximate deforming shapes by a linear combination of rigid bases. These models are typically global, i.e., each shape basis contributes equally to all points of the surface. While this approach has been shown effective to represent smooth deformations, its performance degrades for surfaces composed of various regions, each following a different deformation rule. Piecewise methods attempt to capture this type of behavior by locally modeling surface patches, although they subsequently require enforcing global constraints to assemble back the patches. In this paper we propose an approach that combines the best of global and local models: it locally considers low-rank models but, by construction, does not need to impose global constraints to guarantee local patch continuity. We achieve this by a simple expectation maximization strategy that, besides learning global shape bases, locally adapts their contribution to each specific surface region. Furthermore, as a side contribution, in order to split the surface into different local patches, we propose a novel physically-based mesh segmentation approach that obeys an energy criterion. The complete framework is evaluated in both synthetic and real datasets, and shows improved performance with respect to competing methods.

Multi-Modal Joint Embedding for Fashion Product Retrieval
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer 
International Conference on Image Processing (ICIP), 2017

@inproceedings{Rubio_icip2017,
title = {Multi-Modal Joint Embedding for Fashion Product Retrieval},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Image Processing (ICIP)},
year = {2017}
}

Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites are updated with thousands of images and their associated metadata (textual information), deepening the problem, akin to finding a needle in a haystack. In this paper, we leverage both the images and textual metadata and propose a joint multi-modal embedding that maps both the text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space, which is both efficient and accurate. We train this embedding using large-scale real world e-commerce data by both minimizing the similarity between related products and using auxiliary classification networks that encourage the embedding to have semantic meaning. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset. We also provide an analysis of the different metadata.
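
The general mechanism of such a joint embedding can be sketched in a few lines of PyTorch-style code: two projections map precomputed image and text features into a shared space, and a ranking loss pulls matching pairs together. Dimensions, the margin and the loss form are illustrative assumptions, not the exact model of the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Map precomputed image and text features into a shared latent space."""
    def __init__(self, img_dim=2048, txt_dim=300, latent_dim=128):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, latent_dim)
        self.txt_proj = nn.Linear(txt_dim, latent_dim)

    def forward(self, img_feat, txt_feat):
        return (F.normalize(self.img_proj(img_feat), dim=-1),
                F.normalize(self.txt_proj(txt_feat), dim=-1))

def contrastive_loss(img_emb, txt_emb, margin=0.2):
    """Push matching image/text pairs together and mismatched pairs apart."""
    sim = img_emb @ txt_emb.t()                     # cosine similarities (batch x batch)
    pos = sim.diag().unsqueeze(1)                   # similarity of the true pairs
    cost = (margin + sim - pos).clamp(min=0)        # hinge over mismatched columns
    return (cost.sum() - cost.diag().sum()) / (sim.numel() - sim.size(0))

model = JointEmbedding()
img_emb, txt_emb = model(torch.rand(8, 2048), torch.rand(8, 300))
print(contrastive_loss(img_emb, txt_emb).item())

Retrieval then amounts to a nearest-neighbor search over these normalized embeddings.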

Joint Coarse-and-fine Reasoning for Deep Optical Flow
V.Vaquero, G.Ros, F.Moreno-Noguer, A.López and A.Sanfeliu 
International Conference on Image Processing (ICIP), 2017

@inproceedings{Vaquero_icip2017,
title = {Joint Coarse-and-fine Reasoning for Deep Optical Flow},
author = {V. Vaquero and G. Ros and F. Moreno-Noguer and A. López and A. Sanfeliu},
booktitle = {Proceedings of the International Conference on Image Processing (ICIP)},
year = {2017}
}

We propose a novel representation for dense pixel-wise estimation tasks using CNNs that boosts accuracy and reduces training time, by explicitly exploiting joint coarse-and-fine reasoning. The coarse reasoning is performed over a discrete classification space to obtain a general rough solution, while the fine details of the solution are obtained over a continuous regression space. In our approach both components are jointly estimated, which proved to be beneficial for improving estimation accuracy. Additionally, we propose a new network architecture, which combines coarse and fine components by treating the fine estimation as a refinement built on top of the coarse solution, and therefore adding details to the general prediction. We apply our approach to the challenging problem of optical flow estimation and empirically validate it against state-of-the-art CNN-based solutions trained from scratch and tested on large optical flow datasets.
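
A minimal sketch of the coarse-plus-fine composition idea in Python/NumPy (bin layout, shapes and names are illustrative, not the paper's network outputs): the coarse classification picks a discretized flow value per pixel and the continuous regression adds the residual detail on top.

import numpy as np

def compose_flow(coarse_logits, fine_residual, bin_centers):
    """Combine coarse (classification) and fine (regression) flow estimates.
    coarse_logits: (H, W, K) scores over K discretized flow bins
    fine_residual: (H, W, 2) continuous refinement
    bin_centers:   (K, 2) flow value represented by each bin
    """
    best_bin = coarse_logits.argmax(axis=-1)          # rough solution per pixel
    coarse_flow = bin_centers[best_bin]               # (H, W, 2)
    return coarse_flow + fine_residual                # details added on top

# Toy usage on a 4x4 image with 9 flow bins.
rng = np.random.default_rng(0)
bins = np.stack(np.meshgrid([-8, 0, 8], [-8, 0, 8]), axis=-1).reshape(-1, 2).astype(float)
flow = compose_flow(rng.random((4, 4, 9)), 0.5 * rng.standard_normal((4, 4, 2)), bins)
print(flow.shape)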

PL-SLAM: Real-Time Monocular Visual SLAM with Points and Lines  
A.Pumarola, A.Vakhitov, A.Agudo, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2017

@inproceedings{Pumarola_icra2017,
title = {PL-SLAM: Real-Time Monocular Visual SLAM with Points and Lines},
author = {A. Pumarola and A. Vakhitov and A. Agudo and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2017}
}

Low textured scenes are well known to be one of the main Achilles heels of geometric computer vision algorithms relying on point correspondences, and in particular for visual SLAM. Yet, there are many environments in which, despite being low textured, one can still reliably estimate line-based geometric primitives, for instance in city and indoor scenes, or in the so-called “Manhattan worlds”, where structured edges are predominant. In this paper we propose a solution to handle these situations. Specifically, we build upon ORB-SLAM, presumably the current state-of-the-art solution both in terms of accuracy and efficiency, and extend its formulation to simultaneously handle both point and line correspondences. We propose a solution that can even work when most of the points vanish from the input images, and, interestingly, it can be initialized from solely the detection of line correspondences in three consecutive frames. We thoroughly evaluate our approach and the new initialization strategy on the TUM RGB-D benchmark and demonstrate that the use of lines not only improves the performance of the original ORB-SLAM solution in poorly textured frames, but also systematically improves it in sequence frames combining points and lines, without compromising the efficiency.
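
To illustrate what adding lines to a point-based SLAM residual typically looks like, the Python/NumPy sketch below combines a standard point reprojection error with a line error measured as the distance of the projected 3D segment endpoints to the detected infinite 2D line. The projection model, names and test values are illustrative assumptions, not code from the system described in the paper.

import numpy as np

def project(K, R, t, X):
    """Pinhole projection of a 3D point X into pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def point_error(K, R, t, X, x_obs):
    return np.linalg.norm(project(K, R, t, X) - x_obs)

def line_error(K, R, t, P, Q, line_2d):
    """Distance of both projected segment endpoints to the detected 2D line
    (line_2d = (a, b, c) with a*x + b*y + c = 0, normalized so a^2 + b^2 = 1)."""
    d = [abs(line_2d[:2] @ project(K, R, t, X) + line_2d[2]) for X in (P, Q)]
    return sum(d)

K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
print(point_error(K, R, t, np.array([0.1, 0.2, 2.0]), np.array([345.0, 290.0])))
print(line_error(K, R, t, np.array([0., 0., 2.]), np.array([0.5, 0., 2.]),
                 np.array([0., 1., -240.])))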

Learning Depth-aware Deep Representations for Robotic Perception  
L.Porzi, A.Peñate-Sánchez, E.Ricci and F.Moreno-Noguer 
International Conference on Intelligent Robots and Systems (IROS), 2017

@inproceedings{Porzi_iros2017,
title = {Learning Depth-aware Deep Representations for Robotic Perception},
author = {L. Porzi and A. Peñate-Sánchez and E. Ricci and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Intelligent Robots and Systems (IROS)},
year = {2017}
}

Most recent approaches to 3D pose estimation from RGB-D images address the problem in a two-stage pipeline. First, they learn a classifier –typically a random forest– to predict the position of each input pixel on the object surface. These estimates are then used to define an energy function that is minimized w.r.t. the object pose. In this paper, we focus on the first stage of the problem and propose a novel classifier based on a depth-aware Convolutional Neural Network. This classifier is able to learn a scale-adaptive regression model that yields very accurate pixel-level predictions, allowing to finally estimate the pose using a simple RANSAC-based scheme, with no need to optimize complex ad hoc energy functions. Our experiments on publicly available datasets show that our approach achieves remarkable improvements over state-of-the-art methods.

Deconvolutional Networks for Point-cloud Vehicle Detection and Tracking in Driving Scenarios  
V.Vaquero, I. del Pino, F.Moreno-Noguer, J.Solà, A.Sanfeliu and J.Andrade-Cetto 
European Conference on Mobile Robots (ECMR), 2017

@inproceedings{Vaquero_ecmr2017,
title = {Deconvolutional Networks for Point-cloud Vehicle Detection and Tracking in Driving Scenarios},
author = {V. Vaquero and I. del Pino and F. Moreno-Noguer and J. Solà and A. Sanfeliu and J. Andrade-Cetto},
booktitle = {Proceedings of the European Conference on Mobile Robots (ECMR)},
year = {2017}
}

Vehicle detection and tracking is a core ingredient for developing autonomous driving applications in urban scenarios. Recent image-based Deep Learning (DL) techniques are obtaining breakthrough results in these perceptive tasks. However, DL research has not yet advanced much towards processing 3D point clouds from lidar range-finders. These sensors are very common in autonomous vehicles since, despite not providing as semantically rich information as images, their performance is more robust under harsh weather conditions than vision sensors. In this paper we present a full vehicle detection and tracking system that works with 3D lidar information only. Our detection step uses a Convolutional Neural Network (CNN) that receives as input a featured representation of the 3D information provided by a Velodyne HDL-64 sensor and returns a per-point classification of whether it belongs to a vehicle or not. The classified point cloud is then geometrically processed to generate observations for a multi-object tracking system implemented via a number of Multi-Hypothesis Extended Kalman Filters (MH-EKF) that estimate the position and velocity of the surrounding vehicles. The system is thoroughly evaluated on the KITTI tracking dataset, and we show the performance boost provided by our CNN-based vehicle detector over a standard geometric approach. Our lidar-based approach uses about 4% of the data needed for an image-based detector, with similarly competitive results.
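
As a generic illustration of the filtering stage, the Python/NumPy sketch below runs one predict/update cycle of a constant-velocity Kalman filter on a detected vehicle centroid, of the kind each tracking hypothesis would maintain. The motion model, noise values and state layout are illustrative assumptions, not the parameters of the system in the paper.

import numpy as np

def kf_predict(x, P, dt, q=1.0):
    """Constant-velocity prediction for state x = [px, py, vx, vy]."""
    F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
    Q = q * np.eye(4)
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, r=0.5):
    """Update with a detected vehicle centroid z = [px, py]."""
    H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
    R = r * np.eye(2)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Toy usage: one prediction/correction cycle for a tracked vehicle.
x, P = np.array([0., 0., 10., 0.]), np.eye(4)
x, P = kf_predict(x, P, dt=0.1)
x, P = kf_update(x, P, z=np.array([1.05, 0.02]))
print(x)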

Low Resolution Lidar-Based Multi-Object Tracking for Driving Applications  
I. del Pino, V.Vaquero, B.Masini, J.Solà, F.Moreno-Noguer, A.Sanfeliu and J.Andrade-Cetto 
Iberian Robotics Conference, ROBOT, 2017

@inproceedings{DelPino_Robot2017,
title = {Low Resolution Lidar-Based Multi-Object Tracking for Driving Applications},
author = {I. del Pino and V. Vaquero and B. Masini and J. Solà and F. Moreno-Noguer and A. Sanfeliu and J. Andrade-Cetto},
booktitle = {Proceedings of the Third Iberian Robotics Conference, ROBOT},
year = {2017}
}

Vehicle detection and tracking in real scenarios are key components to develop assisted and autonomous driving systems. Lidar sensors are especially suitable for this task, as they bring robustness to harsh weather conditions while providing accurate spatial information. However, the resolution provided by point cloud data is very scarce in comparison to camera images. In this work we explore the possibilities of Deep Learning (DL) methodologies applied to low resolution 3D lidar sensors such as the Velodyne VLP-16 (PUCK), in the context of vehicle detection and tracking. For this purpose we developed a lidar-based system that uses a Convolutional Neural Network (CNN), to perform point-wise vehicle detection using PUCK data, and Multi-Hypothesis Extended Kalman Filters (MH-EKF), to estimate the actual position and velocities of the detected vehicles. Comparative studies between the proposed lower resolution (VLP-16) tracking system and a high-end system, using Velodyne HDL-64, were carried out on the KITTI Tracking Benchmark dataset. Moreover, to analyze the influence of the CNN-based vehicle detection approach, comparisons were also performed with respect to the geometric-only detector. The results demonstrate that the proposed low resolution Deep Learning architecture is able to successfully accomplish the vehicle detection task, outperforming the geometric baseline approach. Moreover, it has been observed that our system achieves a similar tracking performance to the high-end HDL-64 sensor at close range. On the other hand, at long range, detection is limited to half the distance of the higher-end sensor.

Workshop

Multi-Modal Embedding for Main Product Detection in Fashion   (Best Paper Award)
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer 
Fashion Workshop in International Conference on Computer Vision (ICCVw), 2017

@inproceedings{Rubio_iccvw2017,
title = {Multi-Modal Embedding for Main Product Detection in Fashion},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision Workshops (ICCVW)},
year = {2017}
}

We present an approach to detect the main product in fashion images by exploiting the textual metadata associated with each image. Our approach is based on a Convolutional Neural Network and learns a joint embedding of object proposals and textual metadata to predict the main product in the image. We additionally use several complementary classification and overlap losses in order to improve training stability and performance. Our tests on a large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong baselines and is able to accurately detect the main product in a wide diversity of challenging fashion images.

The BreakingNews Dataset  
A.Ramisa, F.Yan, F.Moreno-Noguer and K.Mikolajczyk 
Workshop on Vision and Language, 2017

@inproceedings{Ramisa_VL2017,
title = {The BreakingNews Dataset},
author = {A. Ramisa and F. Yan and F. Moreno-Noguer and K. Mikolajczyk},
booktitle = {Workshop on Vision and Language},
year = {2017}
}

We present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (e.g. GPS coordinates and popularity metrics). The tenuous connection between the images and text in news data makes it an appropriate testbed for taking work at the intersection of Computer Vision and Natural Language Processing to the next step, and we hope this dataset will help spur progress in the field.

Multi-Modal Fashion Product Retrieval  
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer 
Workshop on Vision and Language, 2017

@inproceedings{Rubio_VL2017,
title = {Multi-Modal Fashion Product Retrieval},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Workshop on Vision and Language},
year = {2017}
}

Finding a product in the fashion world can be a daunting task. Everyday, e-commerce sites are updating with thousands of images and their associated metadata (textual information), deepening the problem. In this paper, we leverage both the images and textual metadata and propose a joint multi-modal embedding that maps both the text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset.

2016

Journal

Sequential Non-Rigid Structure from Motion using Physical Priors 
A.Agudo, F.Moreno-Noguer, B.Calvo and J.M.M.Montiel
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2016

@article{Agudo_pami2016,
title = {Sequential Non-Rigid Structure from Motion using Physical Priors},
author = {A. Agudo and F. Moreno-Noguer and B. Calvo and J.M.M. Montiel},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {38},
number = {5},
issn = {0162-8828},
pages = {979-994},
doi = {10.1109/TPAMI.2015.2469293},
year = {2016},
month = {May}
}

We propose a new approach to simultaneously recover camera pose and 3D shape of non-rigid and potentially extensible surfaces from a monocular image sequence. For this purpose, we make use of the EKF-SLAM (Extended Kalman Filter based Simultaneous Localization And Mapping) formulation, a Bayesian optimization framework traditionally used in mobile robotics for estimating camera pose and reconstructing rigid scenarios. In order to extend the problem to a deformable domain we represent the object’s surface mechanics by means of Navier’s equations, which are solved using a FEM (Finite Element Method). With these main ingredients, we can further model the material’s stretching, allowing us to go a step further than most of current techniques, typically constrained to surfaces undergoing isometric deformations. We extensively validate our approach in both real and synthetic experiments, and demonstrate its advantages with respect to competing methods. More specifically, we show that besides simultaneously retrieving camera pose and non-rigid shape, our approach is adequate for both isometric and extensible surfaces, requires neither batch processing of all the frames nor tracking points over the whole sequence, and runs at several frames per second.

A 3D Descriptor to Detect Task-oriented Grasping Points in Clothing 
A.Ramisa, G.Alenyà, F.Moreno-Noguer and C.Torras
Pattern Recognition, 2016

@article{Ramisa_pr2016,
title = {A 3D Descriptor to Detect Task-oriented Grasping Points in Clothing},
author = {A. Ramisa and G. Alenya and F. Moreno-Noguer and C. Torras},
journal = {Pattern Recognition},
volume = {60},
number = {C},
issn = {0031-3203},
pages = {936-948},
doi = {10.1016/j.patcog.2016.07.003},
year = {2016},
month = {December}
}

Manipulating textile objects with a robot is a challenging task, especially because the garment perception is difficult due to the endless configurations it can adopt, coupled with a large variety of colors and designs. Most current approaches follow a multiple re-grasp strategy, in which clothes are sequentially grasped from different points until one of them yields a recognizable configuration. In this work we propose a method that combines 3D and appearance information to directly select a suitable grasping point for the task at hand, which in our case consists of hanging a shirt or a polo shirt from a hook. Our method follows a coarse-to-fine approach in which, first, the collar of the garment is detected and, next, a grasping point on the lapel is chosen using a novel 3D descriptor. In contrast to current 3D descriptors, ours can run in real time, even when it needs to be densely computed over the input image. Our central idea is to take advantage of the structured nature of range images that most depth sensors provide and, by exploiting integral imaging, achieve speed-ups of two orders of magnitude with respect to competing approaches, while maintaining performance. This makes it especially adequate for robotic applications as we thoroughly demonstrate in the experimental section.
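
The speed-up mentioned above comes from the integral-image (summed-area table) trick, which turns any rectangular sum over the depth image into four lookups after a single cumulative-sum pass. The Python/NumPy sketch below is a minimal, generic illustration of that trick, not the descriptor itself.

import numpy as np

def integral_image(img):
    """Summed-area table with a zero first row/column for easy indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] in constant time."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

# Toy usage on a depth map: region means become four lookups and a division.
depth = np.random.default_rng(0).random((480, 640))
ii = integral_image(depth)
area = (200 - 100) * (300 - 150)
print(rect_sum(ii, 100, 150, 200, 300) / area, depth[100:200, 150:300].mean())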

Real-Time 3D Reconstruction of Non-Rigid Shapes with a Single Moving Camera 
A.Agudo, F.Moreno-Noguer, B.Calvo and J.M.M.Montiel
Computer Vision and Image Understanding (CVIU), 2016

@article{Agudo_cviu2016,
title = {Real-Time 3D Reconstruction of Non-Rigid Shapes with a Single Moving Camera},
author = {A. Agudo and F. Moreno-Noguer and B. Calvo and J.M.M. Montiel},
booktitle = {Computer Vision and Image Understanding},
volume = {153},
issue = {C},
issn = {1077-3142},
pages = {37-54},
doi = {10.1016/j.cviu.2016.05.004},
year = {2016},
month = {December}
}

This paper describes a real-time sequential method to simultaneously recover the camera motion and the 3D shape of deformable objects from a calibrated monocular video. For this purpose, we consider the Navier-Cauchy equations used in 3D linear elasticity and solved by finite elements, to model the time-varying shape per frame. These equations are embedded in an extended Kalman filter, resulting in a sequential Bayesian estimation approach. We represent the shape, with unknown material properties, as a combination of elastic elements whose nodal points correspond to salient points in the image. The global rigidity of the shape is encoded by a stiffness matrix, computed after assembling each of these elements. With this piecewise model, we can linearly relate the 3D displacements with the 3D acting forces that cause the object deformation, assumed to be normally distributed. While standard finite-element-method techniques require imposing boundary conditions to solve the resulting linear system, in this work we eliminate this requirement by modeling the compliance matrix with a generalized pseudoinverse that enforces a pre-fixed rank. Our framework also ensures surface continuity without the need for a post-processing step to stitch all the piecewise reconstructions into a global smooth shape. We present experimental results using both synthetic and real videos for different scenarios ranging from isometric to elastic deformations. We also show the consistency of the estimation with respect to 3D ground truth data, include several experiments assessing robustness against artifacts and, finally, provide an experimental validation of our performance in real time at frame rate for small maps.
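
A sketch of the rank-fixed generalized pseudoinverse mentioned above, which avoids imposing explicit boundary conditions when inverting the assembled stiffness matrix; the toy matrix and the chosen rank are hypothetical:

    import numpy as np

    def fixed_rank_pseudoinverse(K, rank):
        """Generalized pseudoinverse of a stiffness-like matrix with a pre-fixed rank.

        Keeping only the `rank` largest singular values discards the zero-energy
        (rigid-body) modes, so displacements can be related to forces without
        anchoring any node explicitly.
        """
        U, s, Vt = np.linalg.svd(K)
        s_inv = np.zeros_like(s)
        s_inv[:rank] = 1.0 / s[:rank]
        return Vt.T @ np.diag(s_inv) @ U.T

    # Toy usage: displacements from forces, u = C f
    K = np.random.rand(30, 30); K = K + K.T       # symmetric stand-in for a stiffness matrix
    C = fixed_rank_pseudoinverse(K, rank=24)
    u = C @ np.random.rand(30)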

Interactive Multiple Object Learning with Scanty Human Supervision 
M.Villamizar, A.Garrell, A.Sanfeliu and F.Moreno-Noguer
Computer Vision and Image Understanding (CVIU), 2016

@article{Villamizar_cviu2016,
title = {Interactive Multiple Object Learning with Scanty Human Supervision},
author = {M. Villamizar and A. Garrell and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Computer Vision and Image Understanding},
volume = {149},
issue = {C},
issn = {1077-3142},
pages = {51-64},
doi = {10.1016/j.cviu.2016.03.010},
year = {2016},
month = {August}
}

We present a fast and online human-robot interaction approach that progressively learns multiple object classifiers using scanty human supervision. Given an input video stream recorded during the human-robot interaction, the user just needs to annotate a small fraction of frames to compute object specific classifiers based on random ferns which share the same features. The resulting methodology is fast (in a few seconds, complex object appearances can be learned), versatile (it can be applied to unconstrained scenarios), scalable (real experiments show we can model up to 30 different object classes), and minimizes the amount of human intervention by leveraging the uncertainty measures associated to each classifier. We thoroughly validate the approach on synthetic data and on real sequences acquired with a mobile platform in indoor and outdoor scenarios containing a multitude of different objects. We show that with little human assistance, we are able to build object classifiers robust to viewpoint changes, partial occlusions, varying lighting and cluttered backgrounds.

A Bayesian Approach to Simultaneously Recover Camera Pose and Non-Rigid Shape from Monocular Images 
F.Moreno-Noguer and J.Porta
Image and Vision Computing (IVC), 2016

@article{Moreno_ivc2016,
title = {A Bayesian Approach to Simultaneously Recover Camera Pose and Non-Rigid Shape from Monocular Images},
author = {F. Moreno-Noguer and J. Porta},
booktitle = {Image and Vision Computing},
volume = {52},
issn = {0262-8856},
pages = {141-153},
doi = {10.1016/j.imavis.2016.05.012},
year = {2016},
month = {August}
}

In this paper we bring the tools of the Simultaneous Localization and Map Building (SLAM) problem from a rigid to a deformable domain and use them to simultaneously recover the 3D shape of non-rigid surfaces and the sequence of poses of a moving camera. Under the assumption that the surface shape may be represented as a weighted sum of deformation modes, we show that the problem of estimating the modal weights along with the camera poses can be probabilistically formulated as a maximum a posteriori estimate and solved using an iterative least squares optimization. In addition, the probabilistic formulation we propose is very general and allows introducing different constraints without requiring any extra complexity. As a proof of concept, we show that local inextensibility constraints that prevent the surface from stretching can be easily integrated. An extensive evaluation on synthetic and real data demonstrates that our method has several advantages over current non-rigid shape from motion approaches. In particular, we show that our solution is robust to large amounts of noise and outliers and that it does not need to track points over the whole sequence nor to use an initialization close to the ground truth.

MSClique: Multiple Structure Discovery through Maximum Weighted Clique Problem 
G.Sanroma, A.Penate-Sanchez, R.Alquezar, F.Serratosa, F.Moreno-Noguer, J.Andrade-Cetto and M.A.Gonzalez Ballester
PLoS ONE, 2016

@article{Sanroma_plosone2016,
title = {MSClique: Multiple Structure Discovery through Maximum Weighted Clique Problem},
author = {G. Sanroma and A. Penate-Sanchez and R. Alquezar and F. Serratosa and F. Moreno-Noguer and J. Andrade-Cetto and M.A. Gonzalez Ballester},
booktitle = {PLoS ONE},
volume = {11},
number = {1},
doi = {https://doi.org/10.1371/journal.pone.0145846},
year = {2016},
month = {January}
}

We present a novel approach for feature correspondence and multiple structure discovery in computer vision. In contrast to existing methods, we exploit the fact that point-sets on the same structure usually lie close to each other, thus forming clusters in the image. Given a pair of input images, we initially extract points of interest and build hierarchical representations by agglomerative clustering. We use the maximum weighted clique problem to find the set of corresponding clusters with the maximum number of inliers representing the multiple structures at the correct scales. Our method is parameter-free and only needs two sets of points along with their tentative correspondences, thus being extremely easy to use. We demonstrate the effectiveness of our method in multiple-structure fitting experiments in both publicly available and in-house datasets. As shown in the experiments, our approach finds a higher number of structures containing fewer outliers compared to state-of-the-art methods.
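
The combinatorial core is a maximum weighted clique problem over candidate cluster-to-cluster matches. The following greedy heuristic is only a sketch of that formulation (the node weights and compatibility graph are hypothetical, and the paper's solver is not necessarily greedy):

    import numpy as np

    def greedy_max_weight_clique(weights, adjacency):
        """Greedy heuristic for the maximum weighted clique problem.

        weights   : (n,) node weights, e.g. inlier counts of candidate cluster matches
        adjacency : (n, n) boolean matrix, True when two candidates are mutually
                    consistent and may belong to the same structure
        """
        order = np.argsort(-weights)              # try the heaviest nodes first
        clique = []
        for v in order:
            if all(adjacency[v, u] for u in clique):
                clique.append(int(v))
        return clique

    # Toy usage with five candidate matches, where matches 0 and 2 conflict
    w = np.array([5.0, 3.0, 4.0, 2.0, 1.0])
    A = np.ones((5, 5), dtype=bool); np.fill_diagonal(A, False)
    A[0, 2] = A[2, 0] = False
    print(greedy_max_weight_clique(w, A))         # -> [0, 1, 3, 4]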

Conference

Accurate and Linear Time Pose Estimation from Points and Lines  
A.Vakhitov, J.Funke and F.Moreno-Noguer 
European Conference on Computer Vision (ECCV), 2016

@inproceedings{Vakhitov_eccv2016,
title = {Accurate and Linear Time Pose Estimation from Points and Lines},
author = {A. Vakhitov and J. Funke and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2016}
}

The Perspective-n-Point (PnP) problem seeks to estimate the pose of a calibrated camera from n 3D-to-2D point correspondences. There are situations, though, where PnP solutions are prone to fail because feature point correspondences cannot be reliably estimated (e.g. scenes with repetitive patterns or with low texture). In such scenarios, one can still exploit alternative geometric entities, such as lines, yielding the so-called Perspective-n-Line (PnL) algorithms. Unfortunately, existing PnL solutions are not as accurate and efficient as their point-based counterparts. In this paper we propose a novel approach to introduce 3D-to-2D line correspondences into a PnP formulation, allowing us to simultaneously process points and lines. For this purpose we introduce an algebraic line error that can be formulated as linear constraints on the line endpoints, even when these are not directly observable. These constraints can then be naturally integrated within the linear formulations of two state-of-the-art point-based algorithms, the OPnP and the EPnP, allowing them to indistinctly handle points, lines, or a combination of them. Exhaustive experiments show that the proposed formulation brings a remarkable boost in performance compared to only point or only line based solutions, with a negligible computational overhead compared to the original OPnP and EPnP.

Recovering Pose and 3D Deformable Shape from Multi-Instance Image Ensembles  
A.Agudo and F.Moreno-Noguer 
Asian Conference on Computer Vision (ACCV), 2016

@inproceedings{Agudo_accv2016,
title = {Recovering Pose and 3D Deformable Shape from Multi-Instance Image Ensembles},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
year = {2016}
}

In recent years, there has been a growing interest in tackling the Non-Rigid Structure from Motion problem (NRSfM), where the shape of a deformable object and the pose of a moving camera are simultaneously estimated from a monocular video sequence. Existing solutions are limited to single objects and continuous, smoothly changing sequences. In this paper we extend NRSfM to a multi-instance domain, in which the images do not need to have temporal consistency, allowing, for instance, to jointly reconstruct the faces of multiple persons from an unordered list of images. For this purpose, we present a new formulation of the problem based on a dual low-rank shape representation that simultaneously captures the between- and within-individual deformations. The parameters of this model are learned using a variant of the probabilistic linear discriminant analysis that requires consecutive batches of expectation and maximization steps. The resulting approach estimates 3D deformable shape and pose of multiple instances from only 2D point observations on a collection of images, without requiring pre-trained 3D data, and is shown to be robust to noisy measurements and missing points. We provide quantitative and qualitative evaluation on both synthetic and real data, and show consistent benefits compared to the current state-of-the-art.

Structured Prediction with Output Embeddings for Semantic Image Annotation  
A.Quattoni, A.Ramisa, P.Swaroop, E.Simo-Serra and F.Moreno-Noguer 
Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2016

@inproceedings{Quattoni_naacl2016,
title = {Structured Prediction with Output Embeddings for Semantic Image Annotation},
author = {A. Quattoni and A. Ramisa and P.Swaroop and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
year = {2016}
}

We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm able to deal with hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge. We propose handling this sparsity by incorporating feature representations of both the inputs (images) and outputs (argument classes) into a factorized log-linear model.

Mode-Shape Interpretation: Re-Thinking Modal Space for Recovering Deformable Shapes  
A.Agudo, F.Moreno-Noguer, B.Calvo and J.M.M.Montiel 
IEEE Winter Conference on Applications of Computer Vision (WACV), 2016

@inproceedings{Agudo_wacv2016,
title = {Mode-Shape Interpretation: Re-Thinking Modal Space for Recovering Deformable Shapes},
author = {A. Agudo and F. Moreno-Noguer and B. Calvo and J.M.M. Montiel},
booktitle = {Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year = {2016}
}

This paper describes an on-line approach for estimating non-rigid shape and camera pose from monocular video sequences. We assume an initial estimate of the shape at rest to be given and represented by a triangulated mesh, which is encoded by a matrix of the distances between every pair of vertices. By applying spectral analysis on this matrix, we are then able to compute a low-dimensional shape basis that, in contrast to standard approaches, has a very direct physical interpretation and requires a much smaller number of modes to span a large variety of deformations, either for inextensible or extensible configurations. Based on this low-rank model, we then sequentially retrieve both camera motion and non-rigid shape in each image, optimizing the model parameters with bundle adjustment over a sliding window of image frames. Since the number of these parameters is small, especially when considering physical priors, our approach may potentially achieve real-time performance. Experimental results on real videos for different scenarios demonstrate remarkable robustness to artifacts such as missing and noisy observations.
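
A sketch of how such a physically interpretable basis can be extracted by spectral analysis of the rest-shape distance matrix; the mesh below is a random toy and the exact construction in the paper may differ:

    import numpy as np

    def modal_basis_from_rest_shape(vertices, n_modes):
        """Low-dimensional shape basis from the pairwise vertex-distance matrix.

        vertices : (n, 3) rest-shape vertex positions
        n_modes  : number of modes kept to span the deformations
        Returns an (n, n_modes) matrix whose columns are the dominant
        eigenvectors of the distance matrix, one value per vertex.
        """
        diff = vertices[:, None, :] - vertices[None, :, :]
        D = np.linalg.norm(diff, axis=-1)          # (n, n) symmetric distance matrix
        eigvals, eigvecs = np.linalg.eigh(D)
        idx = np.argsort(-np.abs(eigvals))[:n_modes]
        return eigvecs[:, idx]

    # Toy usage: 100 random vertices, keep 10 modes
    V = np.random.rand(100, 3)
    B = modal_basis_from_rest_shape(V, n_modes=10)   # (100, 10)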

BASS: Boundary-aware Superpixel Segmentation  
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer 
International Conference on Pattern Recognition (ICPR), 2016

@inproceedings{Rubio_icpr2016,
title = {{BASS}: Boundary-aware Superpixel Segmentation},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of International Conference on Pattern Recognition (ICPR)},
year = {2016}
}

We propose a new superpixel algorithm based on exploiting the boundary information of an image, as objects in images can generally be described by their boundaries. Our proposed approach initially estimates the boundaries and uses them to place superpixel seeds in the areas in which they are more dense. Afterwards, we minimize an energy function in order to expand the seeds into full superpixels. In addition to standard terms such as color consistency and compactness, we propose using the geodesic distance which concentrates small superpixels in regions of the image with more information, while letting larger superpixels cover more homogeneous regions. By both improving the initialization using the boundaries and coherency of the superpixels with geodesic distances, we are able to maintain the coherency of the image structure with fewer superpixels than other approaches. We show the resulting algorithm to yield smaller Variation of Information metrics in seven different datasets while maintaining Undersegmentation Error values similar to the state-of-the-art methods.

Structured Learning of Assignment Models of Neuron Reconstruction to Minimize Topological Errors  
J.Funke, J.Klein, F.Moreno-Noguer, A.Cardona and M.Cook 
International Symposium on Biomedical Imaging (ISBI), 2016

@inproceedings{Funke_isbi2016,
title = {Structured Learning of Assignment Models of Neuron Reconstruction to Minimize Topological Errors},
author = {J. Funke and J. Klein and F. Moreno-Noguer and A. Cardona and M. Cook},
booktitle = {Proceedings of the International Symposium on Biomedical Imaging (ISBI)},
year = {2016}
}

Structured learning provides a powerful framework for empirical risk minimization on the predictions of structured models. It allows end-to-end learning of model parameters to minimize an application specific loss function. This framework is particularly well suited for discrete optimization models that are used for neuron reconstruction from anisotropic electron microscopy (EM) volumes. However, current methods are still learning unary potentials by training a classifier that is agnostic about the model it is used in. We believe the reason for that lies in the difficulties of (1) finding a representative training sample, and (2) designing an application specific loss function that captures the quality of a proposed solution. In this paper, we show how to find a representative training sample from human generated ground truth, and propose a loss function that is suitable to minimize topological errors in the reconstruction. We compare different training methods on two challenging EM-datasets. Our structured learning approach shows consistently higher reconstruction accuracy than other current learning methods.

2015

Book Chapter

Dense Segmentation-aware Descriptors 
E.Trulls, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer
Chapter in Dense Image Correspondences for Computer Vision, 2015

@article{Trulls_springerchapter2015,
title = {Dense Segmentation-aware Descriptors},
author = {E. Trulls and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Dense Image Correspondences for Computer Vision},
editor = {Ce Liu and Tal Hassner},
publisher = {Springer},
doi = {http://dx.doi.org/10.1007/978-3-319-23048-1},
year = {2015}
}

Dense descriptors are becoming increasingly popular in a host of tasks, such as dense image correspondence, bag-of-words image classification, and label transfer. However the extraction of descriptors on generic image points, rather than select geometric features, e.g. blobs, requires rethinking how to achieve invariance to nuisance parameters. In this work we pursue invariance to occlusions and background changes by introducing segmentation information within dense feature construction. The core idea is to use the segmentation cues to downplay the features coming from image areas that are unlikely to belong to the same region as the feature point. We show how to integrate this idea with dense SIFT, as well as with the dense Scale- and Rotation-Invariant Descriptor (SID). We thereby deliver dense descriptors that are invariant to background changes, rotation and/or scaling. We explore the merit of our technique in conjunction with large displacement motion estimation and wide-baseline stereo, and demonstrate that exploiting segmentation information yields clear improvements.

Journal

Non-Rigid Graph Registration using Active Testing Search 
E.Serradell, M.A.Pinheiro, R.Sznitman, J.Kybic, F.Moreno-Noguer and P.Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2015

@article{Serradell_pami2015,
title = {Non-Rigid Graph Registration using Active Testing Search},
author = {E. Serradell and M.A. Pinheiro and R. Sznitman and J. Kybic and F. Moreno-Noguer and P. Fua},
booktitle = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {37},
number = {3},
issn = {0162-8828},
pages = {625-638},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2014.2343235},
year = {2015},
month = {March}
}

We present a new approach for matching sets of branching curvilinear structures that form graphs embedded in R2 or R3 and may be subject to deformations. Unlike earlier methods, ours does not rely on local appearance similarity nor does it require a good initial alignment. Furthermore, it can cope with non-linear deformations, topological differences, and partial graphs. To handle arbitrary non-linear deformations, we use Gaussian Processes to represent the geometrical mapping relating the two graphs. In the absence of appearance information, we iteratively establish correspondences between points, update the mapping accordingly, and use it to estimate where to find the most likely correspondences that will be used in the next step. To make the computation tractable for large graphs, the set of new potential matches considered at each iteration is not selected at random as in many RANSAC-based algorithms. Instead, we introduce a so-called Active Testing Search strategy that performs a priority search to favor the most likely matches and speed up the process. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.
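
A minimal sketch of the Gaussian Process regression used here to represent the geometrical mapping between the two graphs: from the correspondences established so far, it predicts where unmatched points should land, together with a variance that can drive the priority search. The kernel, length scale and noise level are illustrative assumptions.

    import numpy as np

    def rbf_kernel(A, B, length_scale=10.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)

    def gp_predict(src, dst, query, noise=1e-2):
        """Predict where `query` points of graph 1 map to in graph 2.

        src, dst : (n, d) currently matched points in the two graphs
        query    : (m, d) still unmatched points of graph 1
        Returns predicted (m, d) locations and a per-point predictive variance.
        """
        K = rbf_kernel(src, src) + noise * np.eye(len(src))
        Ks = rbf_kernel(query, src)
        alpha = np.linalg.solve(K, dst - src)          # regress the displacement field
        mean = query + Ks @ alpha
        var = 1.0 - np.einsum('ij,ij->i', Ks @ np.linalg.inv(K), Ks)
        return mean, var

    # Toy usage in R^2 with three established correspondences
    src = np.array([[0., 0.], [10., 0.], [0., 10.]])
    dst = src + np.array([[1., 1.], [1., 2.], [2., 1.]])
    mean, var = gp_predict(src, dst, np.array([[5., 5.]]))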

DaLI: Deformation and Light Invariant Descriptor
E.Simo-Serra, C.Torras and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2015

@article{Simo_ijcv2015,
title = {{DaLI}: Deformation and Light Invariant Descriptor},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
booktitle = {International Journal of Computer Vision (IJCV)},
volume = {115},
number = {2},
issn = {0920-5691},
pages = {135-154},
doi = {https://doi.org/10.1007/s11263-015-0805-1},
year = {2015},
month = {November}
}

Recent advances in 3D shape analysis and recognition have shown that heat diffusion theory can be effectively used to describe local features of deforming and scaling surfaces. In this paper, we show how this description can be used to characterize 2D image patches, and introduce DaLI, a novel feature point descriptor with high resilience to non-rigid image transformations and illumination changes. In order to build the descriptor, 2D image patches are initially treated as 3D surfaces. Patches are then described in terms of a heat kernel signature, which captures both local and global information, and shows a high degree of invariance to non-linear image warps. In addition, by further applying a logarithmic sampling and a Fourier transform, invariance to photometric changes is achieved. Finally, the descriptor is compacted by mapping it onto a low dimensional subspace computed using Principal Component Analysis, allowing for an efficient matching. A thorough experimental validation demonstrates that DaLI is significantly more discriminative and robust to illumination changes and image transformations than state-of-the-art descriptors, even those specifically designed to describe non-rigid deformations.
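
The central quantity of the descriptor is the heat kernel signature, which only requires the eigendecomposition of a Laplacian built on the patch treated as a surface. A sketch of that computation, with the Laplacian itself replaced by a random toy graph:

    import numpy as np

    def heat_kernel_signature(laplacian, times):
        """HKS(x, t) = sum_i exp(-lambda_i * t) * phi_i(x)^2.

        laplacian : (n, n) symmetric Laplacian of the patch surface
        times     : (T,) diffusion times; small t captures local structure,
                    large t captures global structure
        Returns an (n, T) array: one T-dimensional signature per point.
        """
        eigvals, eigvecs = np.linalg.eigh(laplacian)
        decay = np.exp(-eigvals[:, None] * times[None, :])    # (n, T)
        return (eigvecs ** 2) @ decay

    # Toy usage on a random 50-node graph Laplacian
    W = np.random.rand(50, 50); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
    L = np.diag(W.sum(1)) - W
    hks = heat_kernel_signature(L, np.logspace(-2, 2, 20))    # (50, 20)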

Conference

Discriminative Learning of Deep Convolutional Feature Point Descriptors  
E.Simo-Serra, E.Trulls, L.Ferraz, I.Kokkinos, P.Fua and F.Moreno-Noguer 
International Conference in Computer Vision (ICCV), 2015

@inproceedings{Simo_iccv2015,
title = {Discriminative Learning of Deep Convolutional Feature Point Descriptors},
author = {E. Simo-Serra and E. Trulls and L. Ferraz and I. Kokkinos and P. Fua and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2015}
}

Deep learning has revolutionized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on handcrafted features, e.g. SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches. We deal with the large number of potential pairs with the combination of a stochastic sampling of the training set and an aggressive mining strategy biased towards patches that are hard to classify. By using the L2 distance during both training and testing we develop 128-D descriptors whose Euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and generalize well against scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute and amenable to modern GPUs, and are publicly available.
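
A sketch of the kind of hard-pair mining step the training relies on, expressed here on precomputed descriptor pairs with a hinge-style loss on L2 distances; the margin, sizes and keep ratio are hypothetical and this is not the released training code:

    import numpy as np

    def mine_hard_pairs(desc_a, desc_b, labels, margin=1.0, keep_ratio=0.25):
        """Keep the hardest fraction of patch pairs for the next update.

        desc_a, desc_b : (n, 128) descriptors of the two patches in each pair
        labels         : (n,) 1 for corresponding pairs, 0 otherwise
        A pair is hard when its loss on the L2 distance is large: matching
        pairs that are far apart, non-matching pairs closer than the margin.
        """
        d = np.linalg.norm(desc_a - desc_b, axis=1)
        loss = np.where(labels == 1, d, np.maximum(0.0, margin - d))
        hardest = np.argsort(-loss)[: int(len(d) * keep_ratio)]
        return hardest, loss[hardest]

    # Toy usage with random 128-D descriptors
    a, b = np.random.rand(200, 128), np.random.rand(200, 128)
    y = np.random.randint(0, 2, 200)
    idx, hard_losses = mine_hard_pairs(a, b, y)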

Learning Shape, Motion and Elastic Models in Force Space  
A.Agudo and F.Moreno-Noguer 
International Conference in Computer Vision (ICCV), 2015

@inproceedings{Agudo_iccv2015,
title = {Learning Shape, Motion and Elastic Models in Force Space},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2015}
}

In this paper, we address the problem of simultaneously recovering the 3D shape and pose of a deformable and potentially elastic object from 2D motion. This is a highly ambiguous problem typically tackled by using low-rank shape and trajectory constraints. We show that formulating the problem in terms of a low-rank force space that induces the deformation, allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object’s behavior. However, this comes at the price of, besides force and pose, having to estimate the elastic model of the object. For this, we use an Expectation Maximization strategy, where each of these parameters are successively learned within partial M-steps, while robustly dealing with missing observations. We thoroughly validate the approach on both mocap and real sequences, showing more accurate 3D reconstructions than state-of-the-art, and additionally providing an estimate of the full elastic model with no a priori information.

Simultaneous Pose and Non-rigid Shape with Particle Dynamics  
A.Agudo and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2015

@inproceedings{Agudo_cvpr2015,
title = {Simultaneous Pose and Non-rigid Shape with Particle Dynamics},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015}
}

In this paper, we propose a sequential solution to simultaneously estimate camera pose and non-rigid 3D shape from a monocular video. In contrast to most existing approaches that rely on global representations of the shape, we model the object at a local level, as an ensemble of particles, each ruled by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency of the estimated shape and camera poses. The resulting approach is both efficient and robust to several artifacts such as noisy and missing data or sudden camera motions, while it does not require any training data at all. Validation is done in a variety of real video sequences, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our system is shown to perform comparably to competing batch methods, which are computationally expensive, and shows a remarkable improvement with respect to sequential ones.

Neuroaesthetics in Fashion: High Performance CRF Model for Cloth Parsing  
E.Simo-Serra, S.Fidler, F.Moreno-Noguer and R.Urtasun 
Conference on Computer Vision and Pattern Recognition (CVPR), 2015

@inproceedings{Simo_cvpr2015,
title = {Neuroaesthetics in Fashion: High Performance CRF Model for Cloth Parsing},
author = {E. Simo-Serra and S. Fidler and F. Moreno-Noguer and R. Urtasun},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015}
}

In this paper, we analyze the fashion of clothing of a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph’s setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback back to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.

Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions  
A.Ramisa, J.Wang, Y.Lu, E.Dellandrea, F.Moreno-Noguer and R.Gaizauskas 
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015

@inproceedings{Ramisa_emnlp2015,
title = {Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions},
author = {A. Ramisa and J. Wang and Y. Lu and E. Dellandrea and F. Moreno-Noguer and R. Gaizauskas},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2015}
}

We investigate the role that geometric, textual and visual features play in the task of predicting a preposition that links two visual entities depicted in an image. The task is an important part of the subsequent process of generating image descriptions. We explore the prediction of prepositions for a pair of entities, both in the case when the labels of such entities are known and unknown. In all situations we found clear evidence that all three features contribute to the prediction task.

Matchability Prediction for Full-Search Template Matching Algorithms  
A.Penate, L.Porzi and F.Moreno-Noguer 
International Conference on 3D Vision (3DV), 2015

@inproceedings{Penate_3dv2015,
title = {Matchability Prediction for Full-Search Template Matching Algorithms},
author = {A. Penate and L. Porzi and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2015}
}

While recent approaches have shown that it is possible to do template matching by exhaustively scanning the parameter space, the resulting algorithms are still quite demanding. In this paper we alleviate the computational load of these algorithms by proposing an efficient approach for predicting the matchability of a template before matching is actually performed. This avoids large amounts of unnecessary computations. We learn the matchability of templates by using dense convolutional neural network descriptors that do not require ad-hoc criteria to characterize a template. By using deep learning descriptions of patches we are able to predict matchability over the whole image quite reliably. We also show that no scene-specific training data is required to solve problems like panorama stitching, which usually require data from the scene in question. Due to the highly parallelizable nature of this task, we offer an efficient technique with a negligible computational cost at test time.

Lie Algebra-Based Kinematic Prior for 3D Human Pose Tracking   (Best Paper Award)
E.Simo-Serra, C.Torras and F.Moreno-Noguer 
International Conference on Machine Vision Applications (MVA), 2015

@inproceedings{Simo_mva2015,
title = {Lie Algebra-Based Kinematic Prior for 3D Human Pose Tracking},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Machine Vision Applications (MVA)},
year = {2015}
}

We propose a novel kinematic prior for 3D human pose tracking that allows predicting the position in subsequent frames given the current position. We first define a Riemannian manifold that models the pose and extend it with its Lie algebra to also be able to represent the kinematics. We then learn a joint Gaussian mixture model of both the human pose and the kinematics on this manifold. Finally, by conditioning the kinematics on the pose we are able to obtain a distribution of poses for subsequent frames that can be used as a reliable prior in 3D human pose tracking. Our model scales well to large amounts of data and can be sampled at over 100,000 samples/second. We show it outperforms the widely used Gaussian diffusion model on the challenging Human3.6M dataset.
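
For a single component of the mixture, conditioning the kinematics on the pose is a closed-form Gaussian operation. A sketch of that step, with an arbitrary 6-D pose / 6-D velocity partition that is purely illustrative:

    import numpy as np

    def condition_kinematics_on_pose(mu, cov, pose, d_pose):
        """Condition a joint Gaussian over [pose, kinematics] on an observed pose.

        mu, cov : joint mean (d,) and covariance (d, d); the first d_pose
                  dimensions are the pose, the remaining ones the kinematics
        pose    : (d_pose,) current pose
        Returns the mean and covariance of p(kinematics | pose).
        """
        mu_p, mu_v = mu[:d_pose], mu[d_pose:]
        Spp = cov[:d_pose, :d_pose]
        Spv = cov[:d_pose, d_pose:]
        Svp = cov[d_pose:, :d_pose]
        Svv = cov[d_pose:, d_pose:]
        gain = Svp @ np.linalg.inv(Spp)
        return mu_v + gain @ (pose - mu_p), Svv - gain @ Spv

    # Toy usage: 6-D pose, 6-D velocity, random positive-definite covariance
    A = np.random.rand(12, 12)
    joint_cov = A @ A.T + 1e-3 * np.eye(12)
    mean_v, cov_v = condition_kinematics_on_pose(np.zeros(12), joint_cov,
                                                 np.random.rand(6), d_pose=6)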

Multimodal Object Classification using Random Clustering Trees   (Best Poster Award)
M.Villamizar, A.Garrell, A.Sanfeliu and F.Moreno-Noguer 
Iberian Conference on Pattern Recognition and Image Analysis (IBPRIA), 2015

@inproceedings{Villamizar_ibpria2015,
title = {Multimodal Object Classification using Random Clustering Trees},
author = {M. Villamizar and A. Garrell and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis (IBPRIA)},
year = {2015}
}

In this paper, we present an object recognition approach that additionally allows discovering intra-class modalities exhibiting highly correlated visual information. Unlike more conventional approaches based on computing multiple specialized classifiers, the proposed approach combines a single classifier, Boosted Random Ferns (BRFs), with probabilistic Latent Semantic Analysis (pLSA) in order to recognize an object class and to automatically discover its most prominent intra-class appearance modalities (clusters) through tree-structured visual words. The proposed approach has been validated in synthetic and real experiments where we show that the method is able to recognize objects with multiple appearances.

Modeling Robot’s World with Minimal Effort  
M.Villamizar, A.Garrell, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2015

@inproceedings{Villamizar_icra2015,
title = {Modeling Robot’s World with Minimal Effort},
author = {M. Villamizar and A. Garrell and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2015}
}

We propose an efficient Human-Robot Interaction approach to model the appearance of all relevant objects in the robot’s environment. Given an input video stream recorded while the robot is navigating, the user just needs to annotate a very small number of frames to build specific classifiers for each of the objects of interest. At the core of the method, there are several random ferns classifiers that share the same features and are updated online. The resulting methodology is fast (runs at 8 fps), versatile (it can be applied to unconstrained scenarios), scalable (real experiments show we can model up to 30 different object classes), and minimizes the amount of human intervention by leveraging the uncertainty measures associated to each classifier. We thoroughly validate the approach on synthetic data and on real sequences acquired with a mobile platform in outdoor and challenging scenarios containing a multitude of different objects. We show that the human can, with minimal effort, provide the robot with a detailed model of the objects in the scene.

Efficient Monocular Pose Estimation for Complex 3D Models  
A.Rubio, M.Villamizar, L.Ferraz, A.Penate-Sanchez, A.Ramisa, E.Simo-Serra, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2015

@inproceedings{Rubio_icra2015,
title = {Efficient Monocular Pose Estimation for Complex 3D Models},
author = {A. Rubio and M. Villamizar and L. Ferraz and A. Penate-Sanchez and A. Ramisa and E. Simo-Serra and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2015}
}

We propose a robust and efficient method to estimate the pose of a camera with respect to complex 3D textured models of the environment that can potentially contain more than 100,000 points. To tackle this problem we follow a top down approach where we combine high-level deep network classifiers with low level geometric approaches to come up with a solution that is fast, robust and accurate. Given an input image, we initially use a pre-trained deep network to compute a rough estimation of the camera pose. This initial estimate constrains the number of 3D model points that can be seen from the camera viewpoint. We then establish 3D-to-2D correspondences between these potentially visible points of the model and the 2D detected image features. Accurate pose estimation is finally obtained from the 2D-to-3D correspondences using a novel PnP algorithm that rejects outliers without the need to use a RANSAC strategy, and which is between 10 and 100 times faster than other methods that use it. Two real experiments dealing with very large and complex 3D models demonstrate the effectiveness of the approach.

Workshop

Semantic Tuples for Evaluation of Image Sentence Generation  
L.Ellebracht, A.Ramisa, P.Swaroop, J.Cordero-Rama, F.Moreno-Noguer and A.Quattoni 
Vision and Language Workshop (in EMNLP), 2015

@inproceedings{Ellebracht_vl2015,
title = {Semantic Tuples for Evaluation of Image Sentence Generation},
author = {L. Ellebracht and A. Ramisa and P. Swaroop and J. Cordero-Rama and F. Moreno-Noguer and A. Quattoni},
booktitle = {Vision and Language Workshop (in EMNLP)},
year = {2015}
}

The automatic generation of image captions has received considerable attention. The problem of evaluating caption generation systems, though, has not been explored that much. We propose a novel evaluation approach based on comparing the underlying visual semantics of the candidate and ground-truth captions. With this goal in mind we have defined a semantic representation for visually descriptive language and have augmented a subset of the Flickr-8K dataset with semantic annotations. Our evaluation metric (BAST) can be used not only to compare systems but also to do error analysis and get a better understanding of the type of mistakes a system makes. To compute BAST we need to predict the semantic representation for the automatically generated captions. We use the Flickr-ST dataset to train classifiers that predict STs so that evaluation can be fully automated.

2014

Journal

Learning RGB-D Descriptors of Garment Parts for Informed Grasping
A.Ramisa, G.Alenya, F.Moreno-Noguer and C.Torras
Engineering Applications of Artificial Intelligence (EEAI), 2014

@article{Ramisa_eeai2014,
title = {Learning RGB-D Descriptors of Garment Parts for Informed Grasping},
author = {A. Ramisa and G. Alenya and F. Moreno-Noguer and C. Torras},
booktitle = {Engineering Applications of Artificial Intelligence (EEAI)},
volume = {35},
issn = {0952-1976},
pages = {246-258},
doi = {10.1016/j.engappai.2014.06.025},
year = {2014},
month = {October}
}

Robotic handling of textile objects in household environments is an emerging application that has recently received considerable attention thanks to the development of domestic robots. Most current approaches follow a multiple re-grasp strategy for this purpose, in which clothes are sequentially grasped from different points until one of them yields a desired configuration. In this work we propose a vision-based method, built on the Bag of Visual Words approach, that combines appearance and 3D information to detect parts suitable for grasping in clothes, even when they are highly wrinkled. We also contribute a new, annotated, garment part dataset that can be used for benchmarking classification, part detection, and segmentation algorithms. The dataset is used to evaluate our approach and several state-of-the-art 3D descriptors for the task of garment part detection. Results indicate that appearance is a reliable source of information, but that augmenting it with 3D information can help the method perform better with new clothing items.

Conference

Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection  
L.Ferraz, X.Binefa and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2014

@inproceedings{Ferraz_cvpr2014,
title = {Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection},
author = {L. Ferraz and X. Binefa and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {501-508},
year = {2014}
}

We propose a real-time, robust to outliers and accurate solution to the Perspective-n-Point (PnP) problem. The main advantages of our solution are twofold: first, it integrates the outlier rejection within the pose estimation pipeline with a negligible computational overhead; and second, its scalability to an arbitrarily large number of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system where the solution lies on its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space and are progressively detected by projecting them on an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting back all 3D points on the image plane at each step, we achieve speed gains of more than 100× compared to RANSAC strategies. An extensive experimental evaluation will show that our solution yields accurate results in situations with up to 50% of outliers, and can process more than 1000 correspondences in less than 5ms.
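
A sketch of the algebraic rejection loop described above: estimate the 1D null space of the stacked linear system, score each correspondence by its algebraic residual, drop the worst ones and re-estimate. The construction of the system from actual 3D-to-2D matches is omitted; A below is a generic stacked matrix with a planted null vector.

    import numpy as np

    def algebraic_outlier_rejection(A, rows_per_match, n_iters=10, reject_frac=0.1):
        """Iteratively prune the correspondences with the largest algebraic error.

        A              : stacked linear system, one block of rows per correspondence
        rows_per_match : number of rows contributed by each correspondence
        The solution is the right singular vector of the smallest singular value;
        no pose is recovered and no points are reprojected inside the loop.
        """
        n = A.shape[0] // rows_per_match
        keep = np.arange(n)
        for _ in range(n_iters):
            rows = (keep[:, None] * rows_per_match + np.arange(rows_per_match)).ravel()
            _, _, Vt = np.linalg.svd(A[rows], full_matrices=False)
            x = Vt[-1]                                        # null-space estimate
            res = (A[rows] @ x).reshape(len(keep), rows_per_match)
            per_match = np.linalg.norm(res, axis=1)
            new_keep = keep[per_match <= np.quantile(per_match, 1.0 - reject_frac)]
            if len(new_keep) == len(keep):
                break
            keep = new_keep
        return keep, x

    # Toy usage: rows orthogonal to a planted solution, first two matches corrupted
    x_true = np.random.rand(12); x_true /= np.linalg.norm(x_true)
    A = np.random.rand(40, 12); A -= np.outer(A @ x_true, x_true)
    A[:4] += np.random.rand(4, 12)
    inliers, x_est = algebraic_outlier_rejection(A, rows_per_match=2)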

Segmentation-aware Deformable Part Models  
E.Trulls, S.Tsogkas, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2014

@inproceedings{Trulls_cvpr2014,
title = {Segmentation-aware Deformable Part Models},
author = {E. Trulls and S. Tsogkas and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {168-175},
year = {2014}
}

In this work we propose a technique to combine bottom-up segmentation, coming in the form of SLIC superpixels, with sliding window detectors, such as Deformable Part Models (DPMs). The merit of our approach lies in ‘cleaning up’ the low-level HOG features by exploiting the spatial support of SLIC superpixels; this can be understood as using segmentation to split the feature variation into object-specific and background changes. Rather than committing to a single segmentation we use a large pool of SLIC superpixels and combine them in a scale-, position- and object-dependent manner to build soft segmentation masks. The segmentation masks can be computed fast enough to repeat this process over every candidate window, during training and detection, for both the root and part filters of DPMs. We use these masks to construct enhanced, background-invariant features to train DPMs. We test our approach on the PASCAL VOC 2007, outperforming the standard DPM in 17 out of 20 classes, yielding an average increase of 1.7% AP. Additionally, we demonstrate the robustness of this approach, extending it to dense SIFT descriptors for large displacement optical flow.

A High Performance CRF Model for Cloth Parsing  
E.Simo-Serra, S.Fidler, F.Moreno-Noguer and R.Urtasun 
Asian Conference on Computer Vision (ACCV), 2014

@inproceedings{Simo_accv2014,
title = {A High Performance CRF Model for Cloth Parsing},
author = {E. Simo-Serra and S. Fidler and F. Moreno-Noguer and R. Urtasun},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
year = {2014}
}

In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset and show that we can obtain a significant improvement over the state-of-the-art.

LETHA: Learning from High Quality Inputs for 3D Pose Estimation in Low Quality Images  
A.Penate, F.Moreno-Noguer, J.Andrade and F.Fleuret 
International Conference on 3D Vision (3DV), 2014

@inproceedings{Penate_3dv2014,
title = {LETHA: Learning from High Quality Inputs for 3D Pose Estimation in Low Quality Images},
author = {A. Penate and F. Moreno-Noguer and J. Andrade and F. Fleuret},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2014}
}

We introduce LETHA (Learning on Easy data, Test on Hard), a new learning paradigm consisting of building strong priors from high quality training data, and combining them with discriminative machine learning to deal with low-quality test data. Our main contribution is an implementation of that concept for pose estimation. We first automatically build a 3D model of the object of interest from high-definition images, and devise from it a pose-indexed feature extraction scheme. We then train a single classifier to process these feature vectors. Given a low quality test image, we visit many hypothetical poses, extract features consistently and evaluate the response of the classifier. Since this process uses locations recorded during learning, it does not require matching points anymore. We use a boosting procedure to train this classifier common to all poses, which is able to deal with missing features, due in this context to self-occlusion. Our results demonstrate that the method combines the strengths of global image representations, discriminative even for very tiny images, and the robustness to occlusions of approaches based on local feature point descriptors.

Leveraging Feature Uncertainty in the PnP Problem  
L.Ferraz, X.Binefa and F.Moreno-Noguer 
British Machine Vision Conference (BMVC), 2014

@inproceedings{Ferraz_bmvc2014,
title = {Leveraging Feature Uncertainty in the PnP Problem},
author = {L. Ferraz and X. Binefa and F. Moreno-Noguer},
booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
year = {2014}
}

We propose a real-time and accurate solution to the Perspective-n-Point (PnP) problem –estimating the pose of a calibrated camera from n 3D-to-2D point correspondences– that exploits the fact that, in practice, not all 2D feature positions are estimated with the same accuracy. Assuming a model of such feature uncertainties is known in advance, we reformulate the PnP problem as a maximum likelihood minimization approximated by an unconstrained Sampson error function, which naturally penalizes the most noisy correspondences. The advantages of this approach are thoroughly demonstrated in synthetic experiments where feature uncertainties are exactly known. Pre-estimating the feature uncertainties in real experiments is, though, not easy. In this paper we model feature uncertainty as 2D Gaussian distributions representing the sensitivity of the 2D feature detectors to different camera viewpoints. When using these noise models with our PnP formulation we still obtain promising pose estimation results that outperform the most recent approaches.

Geodesic Finite Mixture Models  
E.Simo-Serra, C.Torras and F.Moreno-Noguer 
British Machine Vision Conference (BMVC), 2014

@inproceedings{Simo_bmvc2014,
title = {Geodesic Finite Mixture Models},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
year = {2014}
}

We present a novel approach for learning a finite mixture model on a Riemannian manifold in which Euclidean metrics are not applicable and one needs to resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration from a variant of the expectation-maximization algorithm that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying in a Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the problems associated with the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any type of manifold for which it is possible to estimate its tangent space. In particular, we show results on synthetic examples of a sphere and a quadric surface, and on a large and complex dataset of human poses, where the proposed model is used as a regression tool for hypothesizing the geometry of occluded parts of the body.
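
A sketch of the tangent-space machinery on the simplest example, the unit sphere: each mixture component lives in the tangent space at its own base point, entered with the logarithm map and left with the exponential map. The formulas below are the standard sphere maps and only illustrate the idea.

    import numpy as np

    def log_map(p, q):
        """Map a point q on the unit sphere to the tangent space at p."""
        cos_theta = np.clip(np.dot(p, q), -1.0, 1.0)
        theta = np.arccos(cos_theta)
        if theta < 1e-10:
            return np.zeros_like(p)
        v = q - cos_theta * p
        return theta * v / np.linalg.norm(v)

    def exp_map(p, v):
        """Map a tangent vector v at p back onto the unit sphere."""
        norm_v = np.linalg.norm(v)
        if norm_v < 1e-10:
            return p.copy()
        return np.cos(norm_v) * p + np.sin(norm_v) * v / norm_v

    # Fitting one component then amounts to mapping its points with log_map,
    # estimating a mean and covariance in that flat space, and using exp_map
    # to bring samples back to the manifold.
    p = np.array([0.0, 0.0, 1.0])
    q = np.array([0.0, 1.0, 0.0])
    v = log_map(p, q)                      # tangent vector of length pi/2
    assert np.allclose(exp_map(p, v), q)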

On-board Real-time Pose Estimation for UAVs using Deformable Visual Contour Registration  
A.Amor-Martinez, A.Ruiz, F.Moreno-Noguer and A.Sanfeliu 
International Conference on Robotics and Automation (ICRA), 2014

@inproceedings{Amor_icra2014,
title = {On-board Real-time Pose Estimation for UAVs using Deformable Visual Contour Registration},
author = {A. Amor-Martinez and A. Ruiz and F. Moreno-Noguer and A. Sanfeliu},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2014}
}

We present a real time algorithm for estimating the pose of non-planar objects on which we have placed a visual marker. It is designed to overcome the limitations of small aerial robots, such as slow CPUs, low image resolution and geometric distortions produced by wide angle lenses or viewpoint changes. The method initially registers the shape of a known marker to the contours extracted in an image. For this purpose, and in contrast to the state of the art, we do not seek to match textured patches or points of interest. Instead, we optimize a geometric alignment cost computed directly from raw polygonal representations of the observed regions using very simple and efficient clipping algorithms. Further speed is achieved by performing the optimization in the polygon representation space, avoiding the need of 2D image processing operations. Deformation modes are easily included in the optimization scheme, allowing an accurate registration of different markers attached to curved surfaces using a single deformable prototype. Once this initial registration is solved, the object pose is retrieved using a standard PnP approach. As a result, the method achieves accurate object pose estimation in real-time, which is very important for interactive UAV tasks, for example for short distance surveillance or bar assembly. We present experiments where our method yields, at about 30Hz, an average error of less than 5mm in estimating the position of a 19x19mm marker placed at 0.7m from the camera.

Fast Online Learning and Detection of Natural Landmarks for Autonomous Aerial Robots  
M.Villamizar, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2014

@inproceedings{Villamizar_icra2014,
title = {Fast Online Learning and Detection of Natural Landmarks for Autonomous Aerial Robots},
author = {M. Villamizar and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2014}
}

We present a method for efficiently detecting natural landmarks that can handle scenes with highly repetitive patterns and targets that progressively change their appearance. At the core of our approach lies a Random Ferns classifier that models the posterior probabilities of different views of the target using multiple and independent Ferns, each containing features at particular positions of the target. A Shannon entropy measure is used to pick the most informative locations of these features. This minimizes the number of Ferns while maximizing their discriminative power, thus allowing for robust detections at low computational cost. In addition, after offline initialization, the new incoming detections are used to update the posterior probabilities on the fly, and adapt to changing appearances that can occur due to the presence of shadows or occluding objects. All these virtues make the proposed detector appropriate for UAV navigation. Besides the synthetic experiments that will demonstrate the theoretical benefits of our formulation, we will show applications for detecting landing areas in regions with highly repetitive patterns, and specific objects under the presence of cast shadows or sudden camera motions.
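
A sketch of the fern mechanics the detector builds on: each fern hashes a patch through a few binary tests into a leaf, keeps per-class counts that are trivial to update online, and the Shannon entropy of the resulting posteriors gives a handle for choosing informative features. Test definitions, sizes and the selection strategy below are simplified assumptions, not the paper's exact procedure.

    import numpy as np

    class RandomFern:
        """One fern: a few pairwise intensity tests plus per-class leaf counts."""

        def __init__(self, n_tests, n_classes, patch_size, rng):
            self.pairs = rng.integers(0, patch_size * patch_size, size=(n_tests, 2))
            self.counts = np.ones((n_classes, 2 ** n_tests))   # Laplace smoothing

        def index(self, patch):
            flat = patch.ravel()
            bits = (flat[self.pairs[:, 0]] > flat[self.pairs[:, 1]]).astype(int)
            return int((bits * (1 << np.arange(len(bits)))).sum())

        def update(self, patch, label):
            self.counts[label, self.index(patch)] += 1          # online update

        def posterior(self, patch):
            col = self.counts[:, self.index(patch)]
            return col / col.sum()

    def shannon_entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    # Toy usage: lower posterior entropy indicates a more informative set of tests
    rng = np.random.default_rng(0)
    fern = RandomFern(n_tests=8, n_classes=2, patch_size=16, rng=rng)
    patch = rng.random((16, 16))
    fern.update(patch, label=1)
    print(shannon_entropy(fern.posterior(patch)))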

Workshop

Efficient Monocular 3D Pose Estimation using Complex 3D Models   (Best Paper Award)
A.Rubio, M.Villamizar, L.Ferraz, A.Penate, A.Sanfeliu and F.Moreno-Noguer 
Jornadas de Automatica, 2014

@inproceedings{Rubio_jornadas2014,
title = {Efficient Monocular 3D Pose Estimation using Complex 3D Models},
author = {A. Rubio and M. Villamizar and L. Ferraz and A. Penate and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Jornadas de Automatica},
year = {2014}
}

2013

Journal

Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation 
A.Penate-Sanchez, J.Andrade-Cetto and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013

@article{Penate_pami2013,
title = {Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation},
author = {A. Penate-Sanchez and J. Andrade-Cetto and F. Moreno-Noguer},
booktitle = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {35},
number = {10},
issn = {0162-8828},
pages = {2387-2400},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.36},
year = {2013},
month = {October}
}

We present a general approach for solving the point-cloud matching problem for the case of mildly nonlinear transformations. Our method quickly finds a coarse approximation of the solution by exploring a reduced set of partial matches using an approach we refer to as Active Testing Search (ATS). We apply the method to registration of graph structures by branching point matching. It is based solely on the geometric position of the points; no additional information is used, nor is knowledge of an initial alignment required. In the second stage, we use dynamic programming to refine the solution. We tested our algorithm on angiography, retinal fundus, and neuronal data gathered using electron and light microscopy. We show that our method solves cases not solved by most approaches, and is faster than the remaining ones.

Stochastic Exploration of Ambiguities for Nonrigid Shape Recovery 
F.Moreno-Noguer and P.Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013

@article{Moreno_pami2013,
title = {Stochastic Exploration of Ambiguities for Nonrigid Shape Recovery},
author = {F. Moreno-Noguer and P. Fua},
booktitle = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {35},
number = {2},
issn = {0162-8828},
pages = {463-475},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2012.102},
year = {2013},
month = {February}
}

Recovering the 3D shape of deformable surfaces from single images is known to be a highly ambiguous problem because many different shapes may have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce a stochastic sampling approach to efficiently explore the set of solutions of an objective function based on point correspondences. This allows us to propose a small set of ambiguous candidate 3D shapes and then use additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem. The advantages of our method are demonstrated on a variety of problems including both real and synthetic data.

Conference

Dense Segmentation-Aware Descriptors  
E.Trulls, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2013

@inproceedings{Trulls_cvpr2013,
title = {Dense Segmentation-Aware Descriptors},
author = {E. Trulls and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {2890-2897},
year = {2013}
}

In this work we exploit segmentation to construct appearance descriptors that can robustly deal with occlusion and background changes. For this, we downplay measurements coming from areas that are unlikely to belong to the same region as the descriptor’s center, as suggested by soft segmentation masks. Our treatment is applicable to any image point, i.e., dense, and its computational overhead is on the order of a few seconds. We integrate this idea with Dense SIFT, and also with Dense Scale and Rotation Invariant Descriptors (SID), delivering descriptors that are densely computable, invariant to scaling and rotation, and robust to background changes. We apply our approach to standard benchmarks on large displacement motion estimation using SIFT-flow and wide-baseline stereo, systematically demonstrating that the introduction of segmentation yields clear improvements.
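
The soft-masking idea can be written down in a few lines. The snippet below is an illustrative sketch rather than the released code: each local measurement cell is down-weighted according to how similar its soft-segmentation embedding is to that of the descriptor's center (the embedding representation, the dot-product similarity and the exponential weighting are assumptions made for the example):

import numpy as np

def segmentation_aware(descriptor_cells, cell_embeddings, center_embedding):
    # descriptor_cells:  (C, D) local measurements, e.g. SIFT cells around a point
    # cell_embeddings:   (C, E) soft-segmentation embedding of each cell
    # center_embedding:  (E,)   embedding at the descriptor's center
    sim = cell_embeddings @ center_embedding
    w = np.exp(sim - sim.max())       # cells unlikely to share the center's region get low weight
    weighted = descriptor_cells * w[:, None]
    flat = weighted.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)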

A Joint Model for 2D and 3D Pose Estimation from a Single Image  
E.Simo-Serra, A.Quattoni, C.Torras and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2013

@inproceedings{Simo_cvpr2013,
title = {A Joint Model for 2D and 3D Pose Estimation from a Single Image},
author = {E. Simo-Serra and A. Quattoni and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {3634-3641},
year = {2013}
}

We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Experiments on real images demonstrate competitive results and the ability of our methodology to provide accurate 2D and 3D pose estimates even when the 2D detectors are inaccurate.
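
Only the inference loop is easy to illustrate in isolation. The sketch below is a generic evolutionary search (closer to a cross-entropy method than to the exact algorithm in the paper) over latent pose parameters, where the score function is assumed to combine the generative prior with the 2D part-detector responses at the projected joints:

import numpy as np

def evolve_pose(score, dim, pop=100, elite=10, iters=50, sigma=1.0):
    # score(z): higher for latent poses whose projection agrees with the 2D detections
    rng = np.random.default_rng(0)
    mean, std = np.zeros(dim), np.full(dim, sigma)
    for _ in range(iters):
        Z = rng.normal(mean, std, size=(pop, dim))     # sample a population of latent poses
        fitness = np.array([score(z) for z in Z])
        best = Z[np.argsort(fitness)[-elite:]]         # keep the fittest candidates
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mean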

Simultaneous Pose, Focal Length and 2D-to-3D Correspondences from Noisy Observations  
A.Penate-Sanchez, E.Serradell, J.Andrade-Cetto and F.Moreno-Noguer 
British Machine Vision Conference (BMVC), 2013

@inproceedings{Penate_bmvc2013,
title = {Simultaneous Pose, Focal Length and 2D-to-3D Correspondences from Noisy Observations},
author = {A. Penate-Sanchez and E. Serradell and J. Andrade-Cetto and F. Moreno-Noguer},
booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
year = {2013}
}

Simultaneously recovering the camera pose and correspondences between a set of 2D-image and 3D-model points is a difficult problem, especially when the 2D-3D matches cannot be established based on appearance only. The problem becomes even more challenging when input images are acquired with an uncalibrated camera with varying zoom, which yields strong ambiguities between translation and focal length. We present a solution to this problem using only geometrical information. Our approach owes its robustness to an initial stage in which the joint pose and focal length solution space is split into several Gaussian regions. At runtime, each of these regions is explored using a hypothesize-and-test approach, in which the potential number of 2D-3D matches is progressively reduced using informed search through Kalman updates, iteratively refining the pose and focal length parameters. The technique is exhaustive but efficient, significantly improving over previous methods in terms of robustness to outliers and noise.
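
The per-region refinement admits a compact schematic. The sketch below shows a Kalman-style (EKF) update with a chi-square gate over candidate 2D-3D matches; the projection function h, its Jacobian H and the noise covariance R are assumed inputs, and the code illustrates the general mechanism rather than the paper's implementation:

import numpy as np

def refine_region(mu, P, matches, h, H, R):
    # mu, P: Gaussian over the joint (pose, focal length) parameters
    # matches: list of candidate (X_3d, x_2d) correspondences
    for X, x in matches:
        z_pred = h(mu, X)                    # predicted 2D projection of the 3D point
        Hk = H(mu, X)                        # Jacobian of h at the current estimate
        S = Hk @ P @ Hk.T + R
        d2 = (x - z_pred) @ np.linalg.solve(S, x - z_pred)
        if d2 > 9.21:                        # chi-square gate (2 dof, 99%): discard unlikely matches
            continue
        K = P @ Hk.T @ np.linalg.inv(S)      # Kalman gain
        mu = mu + K @ (x - z_pred)           # fold the accepted match into the estimate
        P = (np.eye(len(mu)) - K @ Hk) @ P
    return mu, P

Each Gaussian region produced by the initial stage would be refined this way before selecting the best hypothesis.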

Active Testing Search for Point Cloud Matching  
M.Pinheiro, R.Sznitman, E.Serradell, J.Kybic, F.Moreno-Noguer and P.Fua 
Information Processing in Medical Imaging (IPMI), 2013

@inproceedings{Pinheiro_ipmi2013,
title = {Active Testing Search for Point Cloud Matching},
author = {M. Pinheiro and R. Sznitman and E. Serradell and J. Kybic and F. Moreno-Noguer and P. Fua},
booktitle = {Proceedings of the Information Processing in Medical Imaging (IPMI)},
year = {2013}
}

We present a general approach for solving the point-cloud matching problem for the case of mildly nonlinear transformations. Our method quickly finds a coarse approximation of the solution by exploring a reduced set of partial matches using an approach we refer to as Active Testing Search (ATS). We apply the method to the registration of graph structures by matching their branching points, relying solely on the geometric position of the points, without any additional information or knowledge of an initial alignment. In a second stage, we use dynamic programming to refine the solution. We tested our algorithm on angiography, retinal fundus, and neuronal data gathered using electron and light microscopy. We show that our method solves cases not solved by most approaches, and is faster than the remaining ones.
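
As a very rough sketch of the coarse stage, the snippet below uses a plain beam search over partial branching-point matches in place of the paper's active testing strategy; score_partial is an assumed geometric-consistency cost, and the dynamic-programming refinement is omitted:

import heapq

def coarse_match(src, dst, score_partial, depth=4, beam=50):
    # src, dst: branching points of the two structures; depth <= len(src) assumed
    # score_partial(src, dst, matches): lower = more geometrically consistent
    frontier = [(0.0, [])]                     # (cost, list of (i, j) correspondences)
    for _ in range(depth):
        candidates = []
        for cost, partial in frontier:
            used = {j for _, j in partial}
            i = len(partial)                   # next source point to match
            for j in range(len(dst)):
                if j in used:
                    continue
                new = partial + [(i, j)]
                candidates.append((score_partial(src, dst, new), new))
        frontier = heapq.nsmallest(beam, candidates, key=lambda c: c[0])
    return frontier[0][1]                      # best coarse set of matches found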