Selected Publications

Note: The top three computer vision conferences (CVPR, ICCV and ECCV) are highly competitive, with acceptance rates typically between 20% and 30%. CVPR is ranked #1 in Google Scholar Metrics among all journals and conferences in Computer Vision & Pattern Recognition.

2018

Robust Spatio-Temporal Clustering and Reconstruction of Multiple Deformable Bodies 
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018

Paper  Abstract  Bibtex

@article{Agudo_pami2018,
title = {Robust Spatio-Temporal Clustering and Reconstruction of Multiple Deformable Bodies},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {},
number = {},
issn = {},
pages = {},
doi = {},
year = {2018}
}

In this paper we present an approach to reconstruct the 3D shape of multiple deforming objects from a collection of sparse, noisy and possibly incomplete 2D point tracks acquired by a single monocular camera. Additionally, the proposed solution estimates the camera motion and reasons about the spatial segmentation (i.e., it identifies each of the deforming objects in every frame) and the temporal clustering (i.e., it splits the sequence into primitive motion actions). This advances competing work, which mainly tackled the problem for a single object and non-occluded tracks. In order to handle several objects at a time from partial observations, we model point trajectories as a union of spatial and temporal subspaces, and optimize the parameters of both modalities, the non-observed point tracks, the camera motion, and the time-varying 3D shape via augmented Lagrange multipliers. The algorithm is fully unsupervised and does not require any training data at all. We thoroughly validate the method on challenging scenarios with several human subjects performing different activities which involve complex motions and close interaction. We show that our approach achieves state-of-the-art 3D reconstruction results, while also providing spatial and temporal segmentation.
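
The augmented Lagrange multipliers machinery referenced above can be written, in its generic two-block form (a schematic of the standard formulation, not the paper's exact objective or constraints), as

\mathcal{L}_{\rho}(\mathbf{X}, \mathbf{Z}, \boldsymbol{\Lambda}) = f(\mathbf{X}) + g(\mathbf{Z}) + \langle \boldsymbol{\Lambda}, \mathbf{X} - \mathbf{Z} \rangle + \tfrac{\rho}{2} \lVert \mathbf{X} - \mathbf{Z} \rVert_F^2

where one alternately minimizes over X (here, the camera motion and time-varying shape) and Z (the subspace parameters), and then updates the multipliers via \Lambda \leftarrow \Lambda + \rho (\mathbf{X} - \mathbf{Z}) until the coupling constraint is satisfied.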

Boosted Random Ferns for Object Detection 
M.Villamizar, J.Andrade, A.Sanfeliu and F.Moreno-Noguer 
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018

Paper  Abstract  Bibtex

@article{Villamizar_pami2018,
title = {Boosted Random Ferns for Object Detection},
author = {M. Villamizar and J. Andrade and A. Sanfeliu and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {40},
number = {2},
issn = {0162-8828},
pages = {272-288},
doi = {10.1109/TPAMI.2017.2676778},
year = {2018}
}

In this paper we introduce the Boosted Random Ferns (BRFs) to rapidly build discriminative classifiers for learning and detecting object categories. At the core of our approach we use standard random ferns, but we introduce four main innovations that let us bring ferns from an instance to a category level, and still retain efficiency. First, we define binary features in the histogram of oriented gradients domain (as opposed to the intensity domain), allowing for a better representation of intra-class variability. Second, both the positions where ferns are evaluated within the sliding window and the locations of the binary features for each fern are not chosen completely at random; instead, we use a boosting strategy to pick the most discriminative combination of them. This is further enhanced by our third contribution, which adapts the boosting strategy to enable sharing of binary features among different ferns, yielding high recognition rates at a low computational cost. Finally, we show that training can be performed online, for sequentially arriving images. Overall, the resulting classifier can be very efficiently trained, densely evaluated for all image locations in about 0.1 seconds, and provides detection rates similar to competing approaches that require expensive and significantly slower processing. We demonstrate the effectiveness of our approach through thorough experimentation on publicly available datasets, comparing against the state of the art on tasks of both 2D detection and 3D multi-view estimation.
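
To give an idea of the fern mechanics, the sketch below (a minimal illustration with hypothetical test locations, not the boosted, feature-sharing classifier of the paper) shows how a fern turns a handful of binary comparisons into an index into a class-conditional histogram:

import numpy as np

def fern_index(hog, pairs):
    """A fern is an ordered set of binary tests; here each test compares
    two values in the HOG domain, and the outcomes concatenate into an
    integer index into the fern's class-conditional histogram."""
    bits = [int(hog[i] > hog[j]) for (i, j) in pairs]
    return sum(b << k for k, b in enumerate(bits))

rng = np.random.default_rng(0)
hog = rng.random(36)                           # e.g. one flattened HOG block
pairs = [(0, 9), (4, 20), (7, 31), (15, 2)]    # hypothetical test locations
z = fern_index(hog, pairs)                     # z in [0, 2**4)
# At detection time, each boosted fern adds log P(z|object) - log P(z|bg)
# to the sliding-window score at every image location.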

GANimation: Anatomically-aware Facial Animation from a Single Image (Oral)
A.Pumarola, A.Agudo, A.M.Martinez, A.Sanfeliu and F.Moreno-Noguer 
European Conference on Computer Vision (ECCV), 2018

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_eccv2018,
title = {GANimation: Anatomically-aware Facial Animation from a Single Image},
author = {A. Pumarola and A. Agudo and A.M. Martinez and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2018}
}

Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions the GAN's generation process on images of a specific domain, namely a set of images of people sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper we introduce a novel GAN conditioning scheme based on Action Unit (AU) annotations, which describe in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a fully unsupervised strategy to train the model, which only requires images annotated with their activated AUs, and exploits attention mechanisms that make our network robust to changing backgrounds and lighting conditions. Extensive evaluations show that our approach goes beyond competing conditional generators, both in its capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, and in its capacity to deal with images in the wild.

Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View
A.Pumarola, A.Agudo, L.Porzi, A.Sanfeliu, V.Lepetit and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_cvpr2018b,
title = {Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View},
author = {A. Pumarola and A. Agudo and L. Porzi and A. Sanfeliu and V. Lepetit and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

We propose a method for predicting the 3D shape of a deformable surface from a single view. In contrast to previous approaches, we do not need a pre-registered template of the surface, and our method is robust to the lack of texture and partial occlusions. At the core of our approach is a geometry-aware deep architecture that tackles the problem as analytic solutions usually do: first performing 2D detection of the mesh and then estimating a 3D shape that is geometrically consistent with the image. We train this architecture in an end-to-end manner using a large dataset of synthetic renderings of shapes under different levels of deformation, material properties, textures and lighting conditions. We evaluate our approach on a test split of this dataset and on available real benchmarks, consistently improving on state-of-the-art solutions at a significantly lower computational cost.

Unsupervised Person Image Synthesis in Arbitrary Poses (Spotlight)
A.Pumarola, A.Agudo, A.Sanfeliu and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Paper  Abstract  Project page  Bibtex

@inproceedings{Pumarola_cvpr2018a,
title = {Unsupervised Person Image Synthesis in Arbitrary Poses},
author = {A. Pumarola and A. Agudo and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

We present a novel approach for synthesizing photorealistic images of people in arbitrary poses using generative adversarial learning. Given an input image of a person and a desired pose represented by a 2D skeleton, our model renders the image of the same person under the new pose, synthesizing novel views of the parts visible in the input image and hallucinating those that are not seen. This problem has recently been addressed in a supervised manner, i.e., during training, the ground truth images under the new poses are given to the network. We go beyond these approaches by proposing a fully unsupervised strategy. We tackle this challenging scenario by splitting the problem into two principal subtasks. First, we consider a pose-conditioned bidirectional generator that maps the initially rendered image back to the original pose, making it directly comparable to the input image without the need to resort to any training image. Second, we devise a novel loss function that incorporates content and style terms, and aims at producing images of high perceptual quality. Extensive experiments conducted on the DeepFashion dataset demonstrate that the images rendered by our model are very close in appearance to those obtained by fully supervised approaches.

Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories (Spotlight)
A.Agudo, M.Pijoan and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2018

Paper  Abstract  Bibtex

@inproceedings{Agudo_cvpr2018,
title = {Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories},
author = {A. Agudo and M. Pijoan and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

This paper introduces an approach to simultaneously estimate 3D shape and camera pose, and to cluster images by object and type of deformation, from partial 2D annotations in a multi-instance collection of images. Furthermore, we can process rigid and non-rigid categories indistinctly. This advances existing work, which only addresses the problem for a single object or, when multiple objects are considered, assumes them to be clustered a priori. To handle this broader version of the problem, we model object deformation using a formulation based on multiple unions of subspaces, able to span from small rigid motions to complex deformations. The parameters of this model are learned via augmented Lagrange multipliers, in a completely unsupervised manner that does not require any training data at all. Extensive validation is provided in a wide variety of synthetic and real scenarios, including rigid and non-rigid categories with small and large deformations. In all cases our approach outperforms the state of the art in terms of 3D reconstruction accuracy, while also providing clustering results that allow segmenting the images into object instances and their associated type of deformation (or the action the object is performing).

2017

Force-based Representation for Non-Rigid Shape and Elastic Model Estimation 
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017

Paper  Abstract  Bibtex

@article{Agudo_pami2017,
title = {Force-based Representation for Non-Rigid Shape and Elastic Model Estimation},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {},
number = {},
issn = {},
pages = {},
doi = {},
year = {2017}
}

This paper addresses the problem of simultaneously recovering 3D shape, pose and the elastic model of a deformable object from only 2D point tracks in a monocular video. This is a severely under-constrained problem that has typically been addressed by enforcing the shape or the point trajectories to lie on low-rank dimensional spaces. We show that formulating the problem in terms of a low-rank force space that induces the deformation, and introducing the elastic model as an additional unknown, allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object's behavior. In order to simultaneously estimate force, pose, and the elastic model of the object we use an expectation maximization strategy, where each of these parameters is successively learned by partial M-steps. Once the elastic model is learned, it can be transferred to similar objects to encode their 3D deformation. Moreover, our approach can robustly deal with missing data, and encodes both rigid and non-rigid points under the same formalism. We thoroughly validate the approach on mocap and real sequences, showing more accurate 3D reconstructions than the state of the art, and additionally providing an estimate of the full elastic model with no a priori information.

BreakingNews: Article Annotation by Image and Text Processing 
A.Ramisa, F.Yan, F.Moreno-Noguer and K.Mikolajczyk
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017

Paper  Abstract  Dataset  Bibtex

@article{Ramisa_pami2017,
title = {BreakingNews: Article Annotation by Image and Text Processing},
author = {A. Ramisa and F. Yan and F. Moreno-Noguer and K. Mikolajczyk},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {40},
number = {5},
issn = {0162-8828},
pages = {1072-1085},
doi = {10.1109/TPAMI.2017.2721945},
year = {2017}
}

Building upon recent Deep Neural Network architectures, current approaches lying at the intersection of computer vision and natural language processing have achieved unprecedented breakthroughs in tasks like automatic captioning or image retrieval. Most of these learning methods, though, rely on large training sets of images associated with human annotations that specifically describe the visual content. In this paper we propose to go a step further and explore the more complex cases where textual descriptions are loosely related to the images. We focus on the particular domain of news articles, in which the textual content often expresses connotative and ambiguous relations that are only suggested but not directly inferred from the images. We introduce new deep learning methods that address source detection, popularity prediction, article illustration and geolocation of articles. An adaptive CNN architecture is proposed that shares most of the structure across all tasks and is suitable for multitask and transfer learning. Deep Canonical Correlation Analysis is deployed for article illustration, and a new loss function based on Great Circle Distance is proposed for geolocation. Furthermore, we present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (such as GPS coordinates and popularity metrics). We show this dataset to be appropriate for exploring all of the aforementioned problems, for which we provide a baseline performance using various deep learning architectures and different representations of the textual and visual features. We report very promising results and bring to light several limitations of the current state of the art in this domain, which we hope will help spur progress in the field.

Combining Local-Physical and Global-Statistical Models for Sequential Deformable Shape from Motion 
A.Agudo and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2017

Paper  Abstract  Bibtex

@article{Agudo_ijcv2017,
title = {Combining Local-Physical and Global-Statistical Models for Sequential Deformable Shape from Motion},
author = {A. Agudo and F. Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {122},
number = {2},
issn = {0920-5691},
pages = {371-387},
doi = {10.1007/s11263-016-0972-8},
year = {2017}
}

In this paper, we simultaneously estimate camera pose and non-rigid 3D shape from a monocular video, using a sequential solution that combines local and global representations. We model the object as an ensemble of particles, each ruled by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency. The resulting approach allows us to sequentially estimate shape and camera poses, while progressively learning a global low-rank model of the shape that is fed back into the optimization scheme, thus introducing global constraints. The overall combination of local (physical) and global (statistical) constraints yields a solution that is both efficient and robust to several artifacts such as noisy and missing data or sudden camera motions, without requiring any training data at all. Validation is done in a variety of real application domains, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our online methodology yields significantly more accurate reconstructions than competing sequential approaches, and is even comparable to the more computationally demanding batch methods.
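
The per-particle dynamic model amounts to a linear relation of the form below (a generic explicit discretization of Newton's second law; the paper's exact integration scheme may differ):

\mathbf{f}_i = m_i \, \ddot{\mathbf{x}}_i \quad \Longrightarrow \quad \mathbf{x}_i^{t+1} = 2\,\mathbf{x}_i^{t} - \mathbf{x}_i^{t-1} + \frac{(\Delta t)^2}{m_i}\, \mathbf{f}_i^{t}

so each particle i contributes a linear equation relating consecutive shape estimates, which is the kind of local physical prior the bundle adjustment can absorb cheaply.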

3D Human Pose Tracking Priors using Geodesic Mixture Models 
E.Simo-Serra, C.Torras and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2017

Paper  Abstract  Project page  Bibtex

@article{Simo_ijcv2017,
title = {3D Human Pose Tracking Priors using Geodesic Mixture Models},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {122},
number = {2},
issn = {0920-5691},
pages = {388-408},
doi = {10.1007/s11263-016-0941-2},
year = {2017}
}

We present a novel approach for learning a finite mixture model on a Riemannian manifold in which Euclidean metrics are not applicable and one needs to resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration from a variant of the expectation-maximization algorithm that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying in a Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the problems associated with the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any type of manifold for which it is possible to estimate its tangent space. Additionally, we consider using shrinkage covariance estimation to improve the robustness of the method, especially when dealing with very sparsely distributed samples. We evaluate the approach in a number of situations, going from data clustering on manifolds to combining pose and kinematics of articulated bodies for 3D human pose tracking. In all cases, we demonstrate remarkable improvements over several chosen baselines.
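
Concretely, defining each mixture component on its own tangent space amounts to replacing the Euclidean difference in a Gaussian density with the logarithm map at the component mean (a schematic form; the full method additionally selects the number of components and applies shrinkage covariance estimation):

p_k(\mathbf{x}) \propto \exp\!\left( -\tfrac{1}{2}\, \mathrm{Log}_{\boldsymbol{\mu}_k}(\mathbf{x})^{\top} \boldsymbol{\Sigma}_k^{-1}\, \mathrm{Log}_{\boldsymbol{\mu}_k}(\mathbf{x}) \right)

where \mathrm{Log}_{\boldsymbol{\mu}_k} maps a manifold point into the tangent space at \boldsymbol{\mu}_k, so linearization errors stay local to each component instead of accumulating at a single global tangent point.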

3D Human Pose Estimation from a Single Image via Distance Matrix Regression
F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2017

Paper  Abstract  Video  Bibtex

@inproceedings{Moreno_cvpr2017,
title = {3D Human Pose Estimation from a Single Image via Distance Matrix Regression},
author = {F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}

This paper addresses the problem of 3D human pose estimation from a single image. We follow a standard two-step pipeline by first detecting the 2D position of the N body joints, and then using these observations to infer 3D pose. For the first step, we use a recent CNN-based detector. For the second step, most existing approaches perform 2N-to-3N regression of the Cartesian joint coordinates. We show that more precise pose estimates can be obtained by representing both the 2D and 3D human poses using N × N distance matrices, and formulating the problem as a 2D-to-3D distance matrix regression. To learn such a regressor we leverage simple neural network architectures, which by construction enforce positivity and symmetry of the predicted matrices. The approach also has the advantage of naturally handling missing observations, and allows hypothesizing the position of non-observed joints. Quantitative results on the Humaneva and Human3.6M datasets demonstrate consistent performance gains over the state of the art. Qualitative evaluation on the in-the-wild images of the LSP dataset, using the regressor learned on Human3.6M, reveals very promising generalization.
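
The distance-matrix representation at the heart of the method is easy to reproduce; the snippet below (a minimal sketch, with random poses standing in for real joint detections) builds the N × N matrices the 2D-to-3D regressor operates on:

import numpy as np

def edm(joints):
    """Euclidean distance matrix of an (N, d) array of joint
    coordinates (d = 2 or 3): symmetric, with a zero diagonal."""
    diff = joints[:, None, :] - joints[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Toy example with N = 4 joints: the regressor maps edm(pose_2d) to
# edm(pose_3d); a missing joint simply leaves its row and column
# unobserved rather than breaking the representation.
pose_2d = np.random.rand(4, 2)
pose_3d = np.random.rand(4, 3)
D2, D3 = edm(pose_2d), edm(pose_3d)
assert np.allclose(D2, D2.T) and np.allclose(np.diag(D2), 0.0)

Recovering the 3D joint coordinates from a predicted matrix is a classical multidimensional scaling problem, defined only up to a rigid transformation and a reflection.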

DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D Reconstruction
A.Agudo and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2017

Paper  Abstract  Suppl. Material  Bibtex

@inproceedings{Agudo_cvpr2017,
title = {DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D Reconstruction},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}

We present an approach to reconstruct the 3D shape of multiple deforming objects from incomplete 2D trajectories acquired by a single camera. Additionally, we simultaneously provide spatial segmentation (i.e., we identify each of the objects in every frame) and temporal clustering (i.e., we split the sequence into primitive actions). This advances existing work, which only tackled the problem for a single object and non-occluded tracks. In order to handle several objects at a time from partial observations, we model point trajectories as a union of spatial and temporal subspaces, and optimize the parameters of both modalities, the non-observed point tracks and the 3D shape via augmented Lagrange multipliers. The algorithm is fully unsupervised and results in a formulation which does not need initialization. We thoroughly validate the method on challenging scenarios with several human subjects performing different activities which involve complex motions and close interaction. We show that our approach achieves state-of-the-art 3D reconstruction results, while also providing spatial and temporal segmentation.

3D CNNs on Distance Matrices for Human Action Recognition
A.Hernandez, L.Porzi, S.Rota and F.Moreno-Noguer
ACM Conference on Multimedia (ACMMM), 2017

Paper  Abstract  Bibtex

@inproceedings{Hernandez_acmmm2017,
title = {3D CNNs on Distance Matrices for Human Action Recognition},
author = {A. Hernandez and L. Porzi and S. Rota and F. Moreno-Noguer},
booktitle = {Proceedings of the ACM Conference on Multimedia (ACMMM)},
year = {2017}
}

In this paper we are interested in recognizing human actions from sequences of 3D skeleton data. For this purpose we combine a 3D Convolutional Neural Network with body representations based on Euclidean Distance Matrices (EDMs), which have recently been shown to be very effective at capturing the geometric structure of the human pose. One inherent limitation of EDMs, however, is that they are defined up to a permutation of the skeleton joints, i.e., randomly shuffling the ordering of the joints yields many different representations. In order to address this issue we introduce a novel architecture that simultaneously, and in an end-to-end manner, learns an optimal transformation of the joints while optimizing the rest of the parameters of the convolutional network. The proposed approach achieves state-of-the-art results on 3 benchmarks, including the recent NTU RGB-D dataset, on which we improve on previous LSTM-based methods by more than 10 percentage points, also surpassing other CNN-based methods while using almost 1000 times fewer parameters.

Multi-Modal Embedding for Main Product Detection in Fashion (Best Paper Award)
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer
Fashion Workshop at the International Conference on Computer Vision (ICCVW), 2017

Paper  Abstract  Bibtex

@inproceedings{Rubio_iccvw2017,
title = {Multi-Modal Embedding for Main Product Detection in Fashion},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision Workshops (ICCVW)},
year = {2017}
}

We present an approach to detect the main product in fashion images by exploiting the textual metadata associated with each image. Our approach is based on a Convolutional Neural Network and learns a joint embedding of object proposals and textual metadata to predict the main product in the image. We additionally use several complementary classification and overlap losses in order to improve training stability and performance. Our tests on a large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong baselines and is able to accurately detect the main product in a wide diversity of challenging fashion images.

2016

Sequential Non-Rigid Structure from Motion using Physical Priors 
A.Agudo, F.Moreno-Noguer, B.Calvo and J.M.M.Montiel
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2016

Paper  Abstract  Bibtex

@article{Agudo_pami2016,
title = {Sequential Non-Rigid Structure from Motion using Physical Priors},
author = {A. Agudo and F. Moreno-Noguer and B. Calvo and J.M.M. Montiel},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {38},
number = {5},
issn = {0162-8828},
pages = {979-994},
doi = {10.1109/TPAMI.2015.2469293},
year = {2016}
}

We propose a new approach to simultaneously recover camera pose and the 3D shape of non-rigid and potentially extensible surfaces from a monocular image sequence. For this purpose, we make use of the EKF-SLAM (Extended Kalman Filter based Simultaneous Localization And Mapping) formulation, a Bayesian optimization framework traditionally used in mobile robotics for estimating camera pose and reconstructing rigid scenarios. In order to extend the problem to a deformable domain we represent the object's surface mechanics by means of Navier's equations, which are solved using a FEM (Finite Element Method). With these main ingredients, we can further model the material's stretching, allowing us to go a step further than most current techniques, typically constrained to surfaces undergoing isometric deformations. We extensively validate our approach in both real and synthetic experiments, and demonstrate its advantages with respect to competing methods. More specifically, we show that besides simultaneously retrieving camera pose and non-rigid shape, our approach is adequate for both isometric and extensible surfaces, requires neither batch processing of all the frames nor tracking points over the whole sequence, and runs at several frames per second.

Accurate and Linear Time Pose Estimation from Points and Lines
A.Vakhitov, J.Funke and F.Moreno-Noguer
European Conference on Computer Vision (ECCV), 2016

Paper  Abstract  Code  Bibtex

@inproceedings{Vakhitov_eccv2016,
title = {Accurate and Linear Time Pose Estimation from Points and Lines},
author = {A. Vakhitov and J. Funke and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2016}
}

The Perspective-n-Point (PnP) problem seeks to estimate the pose of a calibrated camera from n 3D-to-2D point correspondences. There are situations, though, where PnP solutions are prone to fail because feature point correspondences cannot be reliably estimated (e.g. scenes with repetitive patterns or with low texture). In such scenarios, one can still exploit alternative geometric entities, such as lines, yielding the so-called Perspective-n-Line (PnL) algorithms. Unfortunately, existing PnL solutions are not as accurate and efficient as their point-based counterparts. In this paper we propose a novel approach to introduce 3D-to-2D line correspondences into a PnP formulation, allowing us to simultaneously process points and lines. For this purpose we introduce an algebraic line error that can be formulated as linear constraints on the line endpoints, even when these are not directly observable. These constraints can then be naturally integrated within the linear formulations of two state-of-the-art point-based algorithms, the OPnP and the EPnP, allowing them to handle points, lines, or a combination of both indistinctly. Exhaustive experiments show that the proposed formulation brings a remarkable boost in performance compared to point-only or line-only solutions, with a negligible computational overhead over the original OPnP and EPnP.
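
For context, the point-only baseline that the proposed line constraints plug into can be exercised with OpenCV's EPnP solver. The snippet below is a self-contained toy setup (the points, intrinsics and ground-truth pose are invented for the demo; OpenCV exposes no line-correspondence API, so the paper's point-plus-line formulation itself is not shown):

import numpy as np
import cv2

# Synthetic scene: 8 random 3D points, a made-up pinhole camera,
# and a ground-truth pose used only to generate 2D observations.
obj = np.random.rand(8, 3).astype(np.float32)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)
rvec_gt = np.array([0.1, -0.2, 0.05], dtype=np.float32)
tvec_gt = np.array([0.0, 0.0, 4.0], dtype=np.float32)
img, _ = cv2.projectPoints(obj, rvec_gt, tvec_gt, K, None)

# EPnP: the O(n) point-based solver whose linear formulation the
# paper augments with algebraic line-endpoint constraints.
ok, rvec, tvec = cv2.solvePnP(obj, img, K, None, flags=cv2.SOLVEPNP_EPNP)
print(ok, rvec.ravel(), tvec.ravel())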

2015

Dense Segmentation-aware Descriptors 
E.Trulls, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer
Chapter in Dense Image Correspondences for Computer Vision, Eds. C.Liu and T.Hassner, Springer, 2015

Paper  Abstract  Bibtex

@incollection{Trulls_springerchapter2015,
title = {Dense Segmentation-aware Descriptors},
author = {E. Trulls and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Dense Image Correspondences for Computer Vision},
editor = {Ce Liu and Tal Hassner},
publisher = {Springer},
doi = {10.1007/978-3-319-23048-1},
year = {2015}
}

Dense descriptors are becoming increasingly popular in a host of tasks, such as dense image correspondence, bag-of-words image classification, and label transfer. However, the extraction of descriptors at generic image points, rather than at select geometric features, e.g. blobs, requires rethinking how to achieve invariance to nuisance parameters. In this work we pursue invariance to occlusions and background changes by introducing segmentation information within dense feature construction. The core idea is to use the segmentation cues to downplay the features coming from image areas that are unlikely to belong to the same region as the feature point. We show how to integrate this idea with dense SIFT, as well as with the dense Scale- and Rotation-Invariant Descriptor (SID). We thereby deliver dense descriptors that are invariant to background changes, rotation and/or scaling. We explore the merit of our technique in conjunction with large displacement motion estimation and wide-baseline stereo, and demonstrate that exploiting segmentation information yields clear improvements.
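
The central "downplaying" operation can be sketched in a few lines (a toy illustration with hypothetical shapes and names; the real descriptors weight per-cell measurements with soft segmentation masks computed per window):

import numpy as np

def segmentation_aware(desc_cells, same_region_prob):
    """Down-weight each descriptor cell by the (soft) probability that
    its image support lies in the same region as the feature point,
    then renormalize, so occluded/background cells contribute little."""
    w = same_region_prob / (same_region_prob.sum() + 1e-8)
    return (desc_cells * w[:, None]).ravel()

# Toy usage: 16 cells of an 8-bin gradient histogram each.
cells = np.random.rand(16, 8)
prob = np.random.rand(16)          # e.g. derived from a soft segmentation
desc = segmentation_aware(cells, prob)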

Non-Rigid Graph Registration using Active Testing Search 
E.Serradell, M.A.Pinheiro, R.Sznitman, J.Kybic, F.Moreno-Noguer and P.Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2015

Paper  Abstract  Bibtex

@article{Serradell_pami2015,
title = {Non-Rigid Graph Registration using Active Testing Search},
author = {E. Serradell and M.A. Pinheiro and R. Sznitman and J. Kybic and F. Moreno-Noguer and P. Fua},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {37},
number = {3},
issn = {0162-8828},
pages = {625-638},
doi = {10.1109/TPAMI.2014.2343235},
year = {2015}
}

We present a new approach for matching sets of branching curvilinear structures that form graphs embedded in R2 or R3 and may be subject to deformations. Unlike earlier methods, ours does not rely on local appearance similarity, nor does it require a good initial alignment. Furthermore, it can cope with non-linear deformations, topological differences, and partial graphs. To handle arbitrary non-linear deformations, we use Gaussian Processes to represent the geometrical mapping relating the two graphs. In the absence of appearance information, we iteratively establish correspondences between points, update the mapping accordingly, and use it to estimate where to find the most likely correspondences that will be used in the next step. To make the computation tractable for large graphs, the set of new potential matches considered at each iteration is not selected at random as in many RANSAC-based algorithms. Instead, we introduce a so-called Active Testing Search strategy that performs a priority search to favor the most likely matches and speed up the process. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.

DaLI: Deformation and Light Invariant Descriptor 
E.Simo-Serra, C.Torras and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2015

Paper  Abstract  Project page  Bibtex

@article{Simo_ijcv2015,
title = {{DaLI}: Deformation and Light Invariant Descriptor},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {115},
number = {2},
issn = {0920-5691},
pages = {135-154},
doi = {10.1007/s11263-015-0805-1},
year = {2015}
}

Recent advances in 3D shape analysis and recognition have shown that heat diffusion theory can be effectively used to describe local features of deforming and scaling surfaces. In this paper, we show how this description can be used to characterize 2D image patches, and introduce DaLI, a novel feature point descriptor with high resilience to non-rigid image transformations and illumination changes. In order to build the descriptor, 2D image patches are initially treated as 3D surfaces. Patches are then described in terms of a heat kernel signature, which captures both local and global information, and shows a high degree of invariance to non-linear image warps. In addition, by further applying a logarithmic sampling and a Fourier transform, invariance to photometric changes is achieved. Finally, the descriptor is compacted by mapping it onto a low-dimensional subspace computed using Principal Component Analysis, allowing for efficient matching. A thorough experimental validation demonstrates that DaLI is significantly more discriminative and robust to illumination changes and image transformations than state-of-the-art descriptors, even those specifically designed to describe non-rigid deformations.

Discriminative Learning of Deep Convolutional Feature Point Descriptors
E.Simo-Serra, E.Trulls, L.Ferraz, I.Kokkinos, P.Fua and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2015

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_iccv2015,
title = {Discriminative Learning of Deep Convolutional Feature Point Descriptors},
author = {E. Simo-Serra and E. Trulls and L. Ferraz and I. Kokkinos and P. Fua and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2015}
}

Deep learning has revolutionized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on handcrafted features, e.g. SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminative patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches. We deal with the large number of potential pairs through a combination of stochastic sampling of the training set and an aggressive mining strategy biased towards patches that are hard to classify. By using the L2 distance during both training and testing we develop 128-D descriptors whose Euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and generalize well against scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute and amenable to modern GPUs, and are publicly available.
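
A minimal PyTorch sketch of the pair loss and the hard-mining idea (the exact loss, margin and mining schedule in the paper may differ; this only illustrates the mechanism):

import torch

def siamese_hinge_loss(desc_a, desc_b, is_match, margin=1.0, mine_frac=0.5):
    """Pull matching descriptors together in L2 and push non-matching
    ones beyond `margin`, backpropagating only through the hardest
    fraction of the batch (the 'aggressive mining' idea)."""
    d = torch.norm(desc_a - desc_b, p=2, dim=1)              # (B,) distances
    loss = torch.where(is_match, d, torch.clamp(margin - d, min=0.0))
    k = max(1, int(mine_frac * loss.numel()))
    hard_losses, _ = torch.topk(loss, k)                     # hardest pairs
    return hard_losses.mean()

# Toy batch: 8 pairs of 128-D descriptors (the output dimensionality
# used in the paper); `labels` flags corresponding patches.
a, b = torch.randn(8, 128), torch.randn(8, 128)
labels = torch.rand(8) > 0.5
print(siamese_hinge_loss(a, b, labels))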

Learning Shape, Motion and Elastic Models in Force Space
A.Agudo and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2015

Paper  Abstract  Bibtex

@inproceedings{Agudo_iccv2015,
title = {Learning Shape, Motion and Elastic Models in Force Space},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2015}
}

In this paper, we address the problem of simultaneously recovering the 3D shape and pose of a deformable and potentially elastic object from 2D motion. This is a highly ambiguous problem typically tackled by using low-rank shape and trajectory constraints. We show that formulating the problem in terms of a low-rank force space that induces the deformation allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object's behavior. However, this comes at the price of, besides force and pose, having to estimate the elastic model of the object. For this, we use an Expectation Maximization strategy, where each of these parameters is successively learned within partial M-steps, while robustly dealing with missing observations. We thoroughly validate the approach on both mocap and real sequences, showing more accurate 3D reconstructions than the state of the art, and additionally providing an estimate of the full elastic model with no a priori information.

Simultaneous Pose and Non-rigid Shape with Particle Dynamics
A.Agudo and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2015

Paper  Abstract  Bibtex

@inproceedings{Agudo_cvpr2015,
title = {Simultaneous Pose and Non-rigid Shape with Particle Dynamics},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015}
}

In this paper, we propose a sequential solution to simultaneously estimate camera pose and non-rigid 3D shape from a monocular video. In contrast to most existing approaches that rely on global representations of the shape, we model the object at a local level, as an ensemble of particles, each ruled by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency of the estimated shape and camera poses. The resulting approach is both efficient and robust to several artifacts such as noisy and missing data or sudden camera motions, while it does not require any training data at all. Validation is done in a variety of real video sequences, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our system is shown to perform comparably to competing, computationally expensive batch methods, while showing remarkable improvement with respect to sequential ones.

Neuroaesthetics in Fashion: High Performance CRF Model for Cloth Parsing
E.Simo-Serra, S.Fidler, F.Moreno-Noguer and R.Urtasun
Conference on Computer Vision and Pattern Recognition (CVPR), 2015

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_cvpr2015,
title = {Neuroaesthetics in Fashion: High Performance CRF Model for Cloth Parsing},
author = {E. Simo-Serra and S. Fidler and F. Moreno-Noguer and R. Urtasun},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015}
}

In this paper, we analyze the clothing fashion of a large social website. Our goal is to learn and predict how fashionable a person looks in a photograph and to suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors, such as the type of outfit and garments the user is wearing, the type of user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information, which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.

Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions
A.Ramisa, J.Wang, Y.Lu, E.Dellandrea, F.Moreno-Noguer and R.Gaizauskas
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015

Paper  Abstract  Bibtex

@inproceedings{Ramisa_emnlp2015,
title = {Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions},
author = {A. Ramisa and J. Wang and Y. Lu and E. Dellandrea and F. Moreno-Noguer and R. Gaizauskas},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2015}
}

We investigate the role that geometric, textual and visual features play in the task of predicting a preposition that links two visual entities depicted in an image. The task is an important part of the subsequent process of generating image descriptions. We explore the prediction of prepositions for a pair of entities, both in the case when the labels of such entities are known and unknown. In all situations we found clear evidence that all three features contribute to the prediction task.

2014

Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection
L.Ferraz, X.Binefa and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2014

Paper  Abstract  Code  Bibtex

@inproceedings{Ferraz_cvpr2014,
title = {Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection},
author = {L. Ferraz and X. Binefa and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {501-508},
year = {2014}
}

We propose a real-time solution to the Perspective-n-Point (PnP) problem that is accurate and robust to outliers. The main advantages of our solution are twofold: first, it integrates outlier rejection within the pose estimation pipeline with negligible computational overhead; and second, it scales to an arbitrarily large number of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system whose solution lies in its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space, and are progressively detected by projecting them on an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting all 3D points back onto the image plane at each step, we achieve speed gains of more than 100× compared to RANSAC strategies. An extensive experimental evaluation shows that our solution yields accurate results in situations with up to 50% outliers, and can process more than 1000 correspondences in less than 5 ms.
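
The core algebraic trick can be sketched in a few lines of NumPy (an illustrative sketch only: the actual residual definition, inlier schedule and stopping criterion follow the paper):

import numpy as np

def algebraic_outlier_rejection(M, iters=5, keep=0.8):
    """Rows of the homogeneous system M x = 0 come from 3D-to-2D
    correspondences and the pose vector x spans its 1D null space.
    Rows with a large algebraic residual |M_i x| perturb that null
    space and are progressively discarded -- no full-pose recovery
    or 3D point reprojection is needed at any step."""
    rows = np.arange(M.shape[0])
    for _ in range(iters):
        _, _, Vt = np.linalg.svd(M[rows], full_matrices=False)
        x = Vt[-1]                                  # null-space estimate
        residuals = np.abs(M[rows] @ x)             # algebraic criterion
        order = np.argsort(residuals)
        n_keep = max(int(keep * len(rows)), M.shape[1])
        rows = rows[order[:n_keep]]                 # drop the worst rows
    return x, rows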

Segmentation-aware Deformable Part Models
E.Trulls, S.Tsogkas, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2014

Paper  Abstract  Spotlight  Bibtex

@inproceedings{Trulls_cvpr2014,
title = {Segmentation-aware Deformable Part Models},
author = {E. Trulls and S. Tsogkas and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {168-175},
year = {2014}
}

In this work we propose a technique to combine bottom-up segmentation, coming in the form of SLIC superpixels, with sliding window detectors, such as Deformable Part Models (DPMs). The merit of our approach lies in ‘cleaning up’ the low-level HOG features by exploiting the spatial support of SLIC superpixels; this can be understood as using segmentation to split the feature variation into object-specific and background changes. Rather than committing to a single segmentation we use a large pool of SLIC superpixels and combine them in a scale-, position- and object-dependent manner to build soft segmentation masks. The segmentation masks can be computed fast enough to repeat this process over every candidate window, during training and detection, for both the root and part filters of DPMs. We use these masks to construct enhanced, background-invariant features to train DPMs. We test our approach on the PASCAL VOC 2007, outperforming the standard DPM in 17 out of 20 classes, yielding an average increase of 1.7% AP. Additionally, we demonstrate the robustness of this approach, extending it to dense SIFT descriptors for large displacement optical flow.

A High Performance CRF Model for Cloth Parsing
E.Simo-Serra, S.Fidler, F.Moreno-Noguer and R.Urtasun
Asian Conference on Computer Vision (ACCV), 2014

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_accv2014,
title = {A High Performance CRF Model for Cloth Parsing},
author = {E. Simo-Serra and S. Fidler and F. Moreno-Noguer and R. Urtasun},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
year = {2014}
}

In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset and show that we can obtain a significant improvement over the state-of-the-art.

2013

Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation 
A.Penate-Sanchez, J.Andrade-Cetto and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013

Paper  Abstract  Code  Bibtex

@article{Penate_pami2013,
title = {Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation},
author = {A. Penate-Sanchez and J. Andrade-Cetto and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {35},
number = {10},
issn = {0162-8828},
pages = {2387-2400},
doi = {10.1109/TPAMI.2013.36},
year = {2013}
}

We propose a novel approach for estimating the pose and focal length of a camera from a set of 3D-to-2D point correspondences. Our method compares favorably to competing approaches in that it is both more accurate than existing closed-form solutions, and faster and more accurate than iterative ones. Our approach is inspired by the EPnP algorithm, a recent O(n) solution for the calibrated case. Yet, we show that considering the focal length as an additional unknown renders the linearization and relinearization techniques of the original approach no longer valid, especially with large amounts of noise. We present new methodologies to circumvent this limitation, termed exhaustive linearization and exhaustive relinearization, which perform a systematic exploration of the solution space in closed form. The method is evaluated on both real and synthetic data, and our results show that besides producing precise focal length estimates, the retrieved camera pose is almost as accurate as the one computed using EPnP, which assumes a calibrated camera.

Stochastic Exploration of Ambiguities for Nonrigid Shape Recovery 
F.Moreno-Noguer and P.Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013

Paper  Abstract  Video  Bibtex

@article{Moreno_pami2013,
title = {Stochastic Exploration of Ambiguities for Nonrigid Shape Recovery},
author = {F. Moreno-Noguer and P. Fua},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {35},
number = {2},
issn = {0162-8828},
pages = {463-475},
doi = {10.1109/TPAMI.2012.102},
year = {2013}
}

Recovering the 3D shape of deformable surfaces from single images is known to be a highly ambiguous problem, because many different shapes may have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce a stochastic sampling approach to efficiently explore the set of solutions of an objective function based on point correspondences. This allows us to propose a small set of ambiguous candidate 3D shapes and then use additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end, and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem. The advantages of our method are demonstrated on a variety of problems including both real and synthetic data.

Dense Segmentation-Aware Descriptors
E.Trulls, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2013

Paper  Abstract  Project page  Bibtex

@inproceedings{Trulls_cvpr2013,
title = {Dense Segmentation-Aware Descriptors},
author = {E. Trulls and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {2890-2897},
year = {2013}
}

In this work we exploit segmentation to construct appearance descriptors that can robustly deal with occlusion and background changes. For this, we downplay measurements coming from areas that are unlikely to belong to the same region as the descriptor’s center, as suggested by soft segmentation masks. Our treatment is applicable to any image point, i.e. dense, and its computational overhead is in the order of a few seconds. We integrate this idea with Dense SIFT, and also with Dense Scale and Rotation Invariant Descriptors (SID), delivering descriptors that are densely computable, invariant to scaling and rotation, and robust to background changes. We apply our approach to standard benchmarks on large displacement motion estimation using SIFT-flow and wide-baseline stereo, systematically demonstrating that the introduction of segmentation yields clear improvements.

A Joint Model for 2D and 3D Pose Estimation from a Single Image
E.Simo-Serra, A.Quattoni, C.Torras and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2013

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_cvpr2013,
title = {A Joint Model for 2D and 3D Pose Estimation from a Single Image},
author = {E. Simo-Serra and A. Quattoni and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {3634-3641},
year = {2013}
}

We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Real experimentation demonstrates competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimations even when the 2D detectors are inaccurate.

2012

Single Image 3D Human Pose Estimation from Noisy Observations
E.Simo-Serra, A.Ramisa, G.Alenya, C.Torras and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2012

Paper  Abstract  Project page  Bibtex

@inproceedings{Simo_cvpr2012,
title = {Single Image 3D Human Pose Estimation from Noisy Observations},
author = {E. Simo-Serra and A. Ramisa and G. Alenya and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {2673-2680},
year = {2012}
}

Markerless 3D human pose detection from a single image is a severely underconstrained problem because different 3D poses can have similar image projections. In order to handle this ambiguity, current approaches rely on prior shape models that can only be correctly adjusted if 2D image features are accurately detected. Unfortunately, although current 2D part detection algorithms have shown promising results, they are not yet accurate enough to guarantee a complete disambiguation of the inferred 3D shape. In this paper, we introduce a novel approach for estimating 3D human pose even when observations are noisy. We propose a stochastic sampling strategy to propagate the noise from the image plane to the shape space. This provides a set of ambiguous 3D shapes, which are virtually indistinguishable from their image projections. Disambiguation is then achieved by imposing kinematic constraints that guarantee the resulting pose resembles a 3D human shape. We validate the method in a variety of situations in which state-of-the-art 2D detectors yield either inaccurate estimations or partly miss some of the body parts.

Robust Non-Rigid Registration of 2D and 3D Graphs
E.Serradell, P.Glowacki, J.Kybic, F.Moreno-Noguer and P.Fua
Conference on Computer Vision and Pattern Recognition (CVPR), 2012

Paper  Abstract  Bibtex

@inproceedings{Serradell_cvpr2012,
title = {Robust Non-Rigid Registration of 2D and 3D Graphs},
author = {E. Serradell and P. Glowacki and J. Kybic and F. Moreno-Noguer and P. Fua},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {996-1003},
year = {2012}
}

We present a new approach to matching graphs embedded in R^2 or R^3. Unlike earlier methods, our approach does not rely on the similarity of local appearance features, does not require an initial alignment, can handle partial matches, and can cope with non-linear deformations and topological differences. To handle arbitrary non-linear deformations, we represent them as Gaussian Processes. In the absence of appearance information, we iteratively establish correspondences between graph nodes, update the structure accordingly, and use the current mapping estimate to find the most likely correspondences that will be used in the next iteration. This makes the computation tractable. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.

Spatiotemporal Descriptor for Wide-Baseline Stereo Reconstruction of Non-Rigid and Ambiguous Scenes
E.Trulls, A.Sanfeliu and F.Moreno-Noguer
European Conference on Computer Vision (ECCV), 2012

Paper  Abstract  Spotlight  Project page  Bibtex

@inproceedings{Trulls_eccv2012,
title = {Spatiotemporal Descriptor for Wide-Baseline Stereo Reconstruction of Non-Rigid and Ambiguous Scenes},
author = {E. Trulls and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {7574},
series = {Lecture Notes in Computer Science},
pages = {441-454},
year = {2012}
}

This paper studies the use of temporal consistency to match appearance descriptors and handle complex ambiguities when computing dynamic depth maps from stereo. Previous attempts have designed 3D descriptors over the space-time volume and have been mostly used for monocular action recognition, as they cannot deal with perspective changes. Our approach is based on a state-of-the-art 2D dense appearance descriptor which we extend in time by means of optical flow priors, and can be applied to wide-baseline stereo setups. The basic idea behind our approach is to capture the changes around a feature point in time instead of trying to describe the spatiotemporal volume. We demonstrate its effectiveness on very ambiguous synthetic video sequences with ground truth data, as well as real sequences.

2011

Simultaneous Correspondence and Non-Rigid 3D Reconstruction of the Coronary Tree from Single X-Ray Images
E.Serradell, A.Romero, R.Leta, C.Gatta and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2011

Paper  Abstract  Bibtex

@inproceedings{Serradell_iccv2011,
title = {Simultaneous Correspondence and Non-Rigid 3D Reconstruction of the Coronary Tree from Single X-Ray Images},
author = {E. Serradell and A. Romero and R. Leta and C. Gatta and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
pages = {850-857},
year = {2011}
}

We present a novel approach to simultaneously reconstruct the 3D structure of a non-rigid coronary tree and estimate point correspondences between an input X-ray image and a reference 3D shape. At the core of our approach lies an optimization scheme that iteratively fits a generative 3D model of increasing complexity and guides the matching process. As a result, and in contrast to existing approaches that assume rigidity or quasi-rigidity of the structure, our method is able to retrieve large non-linear deformations even when the input data is corrupted by noise and partial occlusions. We extensively evaluate our approach on synthetic and real data and demonstrate a remarkable improvement over the state of the art.

Deformation and Illumination Invariant Feature Point Descriptor
F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2011

Paper  Abstract  Bibtex

@inproceedings{Moreno_cvpr2011a,
title = {Deformation and Illumination Invariant Feature Point Descriptor},
author = {F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1593-1600},
year = {2011}
}

Recent advances in 3D shape recognition have shown that kernels based on diffusion geometry can be effectively used to describe local features of deforming surfaces. In this paper, we introduce a new framework that allows using these kernels on 2D local patches, yielding a novel feature point descriptor that is both invariant to non-rigid image deformations and illumination changes. In order to build the descriptor, 2D image patches are embedded as 3D surfaces by multiplying the intensity level by an arbitrarily large and constant weight that favors anisotropic diffusion and retains the gradient magnitude information. Patches are then described in terms of a heat kernel signature, which is made invariant to intensity changes, rotation and scaling. The resulting feature point descriptor is proven to be significantly more discriminative than state-of-the-art ones, even those which are specifically designed for describing non-rigid image deformations.
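A didactic sketch of the embedding plus heat-kernel-signature computation on a small patch, via a graph-Laplacian approximation; the edge weighting and lack of normalization are simplifications, not the paper's exact construction.

import numpy as np

def patch_hks(patch, beta=100.0, times=(1.0, 4.0, 16.0, 64.0)):
    """patch: (h, w) intensities; returns the HKS of the center vertex."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # embed each pixel as a 3D point; a large constant beta on the intensity
    # axis favors anisotropic diffusion and keeps gradient information
    pts = np.stack([xs.ravel(), ys.ravel(), beta * patch.ravel()], axis=1)
    n = h * w
    W = np.zeros((n, n))
    for r in range(h):                            # 4-neighbour grid edges
        for c in range(w):
            for rr, cc in ((r + 1, c), (r, c + 1)):
                if rr < h and cc < w:
                    i, j = r * w + c, rr * w + cc
                    d2 = np.sum((pts[i] - pts[j]) ** 2)
                    W[i, j] = W[j, i] = np.exp(-d2 / (2.0 * beta ** 2))
    L = np.diag(W.sum(axis=1)) - W                # graph Laplacian
    evals, evecs = np.linalg.eigh(L)
    c0 = (h // 2) * w + (w // 2)                  # center vertex index
    # HKS(t) = sum_k exp(-lambda_k t) phi_k(center)^2
    return np.array([(np.exp(-evals * t) * evecs[c0] ** 2).sum()
                     for t in times])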

Probabilistic Simultaneous Pose and Non-Rigid Shape
F.Moreno-Noguer and J.M.Porta
Conference on Computer Vision and Pattern Recognition (CVPR), 2011

Paper  Abstract  Bibtex

@inproceedings{Moreno_cvpr2011b,
title = {Probabilistic Simultaneous Pose and Non-Rigid Shape},
author = {F. Moreno-Noguer and J.M. Porta},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1289-1296},
year = {2011}
}

We present an algorithm to simultaneously recover non-rigid shape and camera poses from point correspondences between a reference shape and a sequence of input images. The key novel contribution of our approach is in bringing the tools of the probabilistic SLAM methodology from a rigid to a deformable domain. Under the assumption that the shape may be represented as a weighted sum of deformation modes, we show that the problem of estimating the modal weights along with the camera poses may be probabilistically formulated as a maximum a posteriori estimate and solved using an iterative least-squares optimization. An extensive evaluation on synthetic and real data shows that our approach has several significant advantages over current approaches, such as performing robustly under large amounts of noise and outliers, and requiring neither tracking points over the whole sequence nor initializations close to the ground-truth solution.
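The iterative least-squares core can be sketched generically; the Gauss-Newton loop below uses a finite-difference Jacobian and omits the covariances of the full MAP formulation, so it is only the optimization skeleton.

import numpy as np

def gauss_newton(residual_fn, theta0, n_iters=20, eps=1e-6):
    """theta packs the camera poses and modal weights; residual_fn(theta)
    returns the stacked 2D reprojection errors over all correspondences."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        r = residual_fn(theta)
        J = np.empty((len(r), len(theta)))
        for k in range(len(theta)):               # numerical Jacobian
            dt = np.zeros_like(theta)
            dt[k] = eps
            J[:, k] = (residual_fn(theta + dt) - r) / eps
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)  # least-squares update
        theta = theta + step
        if np.linalg.norm(step) < 1e-10:
            break
    return theta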

2010

Simultaneous Pose, Correspondence and Non-Rigid Shape
J.Sanchez, J.Östlund, P.Fua and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2010

Paper  Abstract  Bibtex

@inproceedings{Sanchez_cvpr2010,
title = {Simultaneous Pose, Correspondence and Non-Rigid Shape},
author = {J. Sanchez and J. Östlund and P. Fua and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1189-1196},
year = {2010}
}

Recent works have shown that the 3D shape of a non-rigid surface can be accurately retrieved from a single image given a set of 3D-to-2D correspondences between that image and another one for which the shape is known. However, existing approaches assume that such correspondences can be readily established, which is not necessarily true when large deformations produce significant appearance changes between the input and the reference images. Furthermore, it is either assumed that the pose of the camera is known, or the estimated solution is pose-ambiguous. In this paper we relax all these assumptions and, given a set of 3D and 2D unmatched points, we present an approach to simultaneously solve their correspondences, compute the camera pose and retrieve the shape of the surface in the input image. This is achieved by introducing weak priors on the pose and shape that we model as Gaussian Mixtures. By combining them into a Kalman filter we can progressively reduce the number of 2D candidates that can potentially be matched to each 3D point, while pose and shape are refined. This lets us perform a complete and efficient exploration of the solution space and retain the best solution.
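The progressive reduction of 2D candidates can be pictured as a Mahalanobis gate in the image plane; project_fn and the per-point 2D covariances below are hypothetical stand-ins for quantities the Kalman filter would maintain.

import numpy as np

def prune_candidates(pts3d_mean, pts2d, project_fn, cov2d, gate=9.21):
    """For each 3D point, keep the 2D candidates inside its uncertainty
    ellipse (9.21 is the 99% chi-square gate for 2 degrees of freedom)."""
    kept = []
    for i, X in enumerate(pts3d_mean):
        mu = project_fn(X)                        # predicted 2D location
        Sinv = np.linalg.inv(cov2d[i])            # projected 2D covariance
        e = pts2d - mu
        d2 = np.einsum('nd,dk,nk->n', e, Sinv, e)  # squared Mahalanobis
        kept.append(np.where(d2 < gate)[0])
    return kept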

Efficient Rotation Invariant Object Detection using Boosted Random Ferns
M.Villamizar, F.Moreno-Noguer, J.Andrade-Cetto and A.Sanfeliu
Conference on Computer Vision and Pattern Recognition (CVPR), 2010

Paper  Abstract  Bibtex

@inproceedings{Villamizar_cvpr2010,
title = {Efficient Rotation Invariant Object Detection using Boosted Random Ferns},
author = {M. Villamizar and F. Moreno-Noguer and J. Andrade-Cetto and A. Sanfeliu},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1038-1045},
year = {2010}
}

We present a new approach for building an efficient and robust classifier for the two-class problem of localizing objects that may appear in the image under different orientations. In contrast to other works that address this problem using multiple classifiers, each one specialized for a specific orientation, we propose a simple two-step approach with an estimation stage and a classification stage. The estimator yields an initial set of potential object poses that are then validated by the classifier. This methodology reduces the time complexity of the algorithm while keeping classification results high. The classifier we use in both stages is based on a boosted combination of Random Ferns over local histograms of oriented gradients (HOGs), which we compute during a pre-processing step. Both the use of supervised learning and working on the gradient space make our approach robust while being efficient at run-time. We show these properties by thorough testing on standard databases and on a new database made of motorbikes under planar rotations, with challenging conditions such as cluttered backgrounds, changing illumination conditions and partial occlusions.
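The run-time side of a boosted fern over HOG features is compact enough to sketch; the feature coordinates and the log-ratio tables below are what the boosting stage would have selected and learned offline.

import numpy as np

def fern_response(hog, fern, table):
    """hog: (H, W, B) histogram-of-oriented-gradients map for one window.
    fern: list of ((y1, x1, b1), (y2, x2, b2)) binary-feature coordinates."""
    idx = 0
    for (y1, x1, b1), (y2, x2, b2) in fern:
        # each binary feature compares two HOG bins; the bits form an index
        idx = (idx << 1) | int(hog[y1, x1, b1] > hog[y2, x2, b2])
    return table[idx]      # learned log P(z|object)/P(z|background)

def classify_window(hog, ferns, tables, threshold=0.0):
    score = sum(fern_response(hog, f, t) for f, t in zip(ferns, tables))
    return score > threshold, score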

Exploring Ambiguities for Monocular Non-Rigid Shape Estimation
F.Moreno-Noguer, J.M.Porta and P.Fua
European Conference on Computer Vision (ECCV), 2010

Paper  Abstract  Bibtex

@inproceedings{Moreno_eccv2010,
title = {Exploring Ambiguities for Monocular Non-Rigid Shape Estimation},
author = {F. Moreno-Noguer and J.M. Porta and P. Fua},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {6313},
series = {Lecture Notes in Computer Science},
pages = {370-383},
year = {2010}
}

Recovering the 3D shape of deformable surfaces from single images is difficult because many different shapes have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce an efficient approach to exploring the set of solutions of an objective function based on point-correspondences and to proposing a small set of candidate 3D shapes. This allows the use of additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem.

Combining Geometric and Appearance Priors for Robust Homography Estimation
E.Serradell, M.Özuysal, V.Lepetit, P.Fua and F.Moreno-Noguer
European Conference on Computer Vision (ECCV), 2010

Paper  Abstract  Bibtex

@inproceedings{Serradell_eccv2010,
title = {Combining Geometric and Appearance Priors for Robust Homography Estimation},
author = {E. Serradell and M. Özuysal and V. Lepetit and P. Fua and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {6313},
series = {Lecture Notes in Computer Science},
pages = {58-72},
year = {2010}
}

The homography between a pair of images is typically computed from keypoint correspondences, which are established using image descriptors. When these descriptors are not reliable, either because of repetitive patterns or large amounts of clutter, additional priors need to be considered. The Blind PnP algorithm makes use of geometric priors to guide the search for matches while computing camera pose. Inspired by this, we propose a novel approach for homography estimation that combines geometric priors with appearance priors of ambiguous descriptors. More specifically, for each point we retain its best candidates according to appearance. We then prune the set of potential matches by iteratively shrinking the regions of the image that are consistent with the geometric prior. We can then successfully compute homographies between pairs of images containing highly repetitive patterns, even under oblique viewing conditions.
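Schematically, the two priors cooperate as below; the fixed-radius consistency test is a simplified stand-in for the paper's iteratively shrinking regions.

import numpy as np

def appearance_candidates(desc1, desc2, k=5):
    """Keep each left point's top-k right matches by descriptor distance."""
    d = np.linalg.norm(desc1[:, None] - desc2[None], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def geometric_prune(pts1, pts2, cands, H_est, radius):
    """Drop candidates that fall outside the region predicted by the
    current homography estimate."""
    ones = np.ones((len(pts1), 1))
    proj = np.hstack([pts1, ones]) @ H_est.T
    proj = proj[:, :2] / proj[:, 2:3]             # dehomogenize
    return [[j for j in row if np.linalg.norm(pts2[j] - proj[i]) < radius]
            for i, row in enumerate(cands)]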

2009

EPnP: An Accurate O(n) Solution to the PnP Problem 
V.Lepetit, F.Moreno-Noguer and P.Fua
International Journal of Computer Vision (IJCV), 2009

Paper  Abstract  Bibtex

@article{Lepetit_ijcv2009,
title = {{EPnP}: An Accurate O(n) Solution to the PnP Problem},
author = {V. Lepetit and F. Moreno-Noguer and P. Fua},
journal = {International Journal of Computer Vision (IJCV)},
volume = {81},
number = {2},
issn = {0920-5691},
pages = {155-166},
doi = {https://doi.org/10.1007/s11263-008-0152-6},
year = {2009}
}

We propose a non-iterative solution to the PnP problem —the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences— whose computational complexity grows linearly with n. This is in contrast to state-of-the-art methods that are O(n^5) or even O(n^8), without being more accurate. Our method is applicable for all n ≥ 4 and properly handles both planar and non-planar configurations. Our central idea is to express the n 3D points as a weighted sum of four virtual control points. The problem then reduces to estimating the coordinates of these control points in the camera referential, which can be done in O(n) time by expressing these coordinates as a weighted sum of the eigenvectors of a 12 × 12 matrix and solving a small constant number of quadratic equations to pick the right weights. Furthermore, if maximal precision is required, the output of the closed-form solution can be used to initialize a Gauss-Newton scheme, which improves accuracy with a negligible amount of additional time. The advantages of our method are demonstrated by thorough testing on both synthetic and real data.
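EPnP has since become a standard solver; OpenCV, for instance, exposes it as a solvePnP flag, and the closed-form output can be polished with an LM refinement, mirroring the Gauss-Newton step suggested above (solvePnPRefineLM requires OpenCV ≥ 4.1):

import numpy as np
import cv2

# synthetic check: 10 points in front of a camera at the identity pose
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
obj = np.random.rand(10, 3) + np.array([0., 0., 5.])
img, _ = cv2.projectPoints(obj, np.zeros(3), np.zeros(3), K, None)
img = img.reshape(-1, 2)

ok, rvec, tvec = cv2.solvePnP(obj, img, K, None, flags=cv2.SOLVEPNP_EPNP)
rvec, tvec = cv2.solvePnPRefineLM(obj, img, K, None, rvec, tvec)
print(ok, rvec.ravel(), tvec.ravel())   # both vectors should be near zero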

Capturing 3D Stretchable Surfaces from Single Images in Closed Form
F.Moreno-Noguer, M.Salzmann, V.Lepetit and P.Fua
Conference on Computer Vision and Pattern Recognition (CVPR), 2009

Paper  Abstract  Video  Bibtex

@inproceedings{Moreno_cvpr2009,
title = {Capturing 3D Stretchable Surfaces from Single Images in Closed Form},
author = {F. Moreno-Noguer and M. Salzmann and V. Lepetit and P. Fua},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {1842-1849},
year = {2009}
}

We present a closed-form solution to the problem of recovering the 3D shape of a non-rigid potentially stretchable surface from 3D-to-2D correspondences. In other words, we can reconstruct a surface from a single image without a priori knowledge of its deformations in that image. State-of-the-art solutions to non-rigid 3D shape recovery rely on the fact that distances between neighboring surface points must be preserved and are therefore limited to inelastic surfaces. Here, we show that replacing the inextensibility constraints by shading ones removes this limitation while still allowing 3D reconstruction in closed-form. We demonstrate our method and compare it to an earlier one using both synthetic and real data.

2008

Dependent Multiple Cue Integration for Robust Tracking 
F.Moreno-Noguer, D.Samaras and A.Sanfeliu
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2008

Paper  Abstract  Bibtex

@article{Moreno_pami2008,
title = {Dependent Multiple Cue Integration for Robust Tracking},
author = {F. Moreno-Noguer and D. Samaras and A. Sanfeliu},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {30},
number = {4},
issn = {0162-8828},
pages = {670-685},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2007.70727},
year = {2008}
}

We propose a new technique for fusing multiple cues to robustly segment an object from its background in video sequences that suffer from abrupt changes of both illumination and position of the target. Robustness is achieved by the integration of appearance and geometric object features and by their estimation using Bayesian filters, such as Kalman or particle filters. In particular, each filter estimates the state of a specific object feature, conditionally dependent on another feature estimated by a distinct filter. This dependence provides improved target representations, permitting us to segment the target out from the background even in non-stationary sequences. Considering that the procedure of the Bayesian filters may be described by a “hypotheses generation - hypotheses correction” strategy, the major novelty of our methodology compared to previous approaches is that the mutual dependence between filters is considered during the feature observation, that is, in the “hypotheses-correction” stage, instead of when generating the hypotheses. This proves to be much more effective in terms of accuracy and reliability. The proposed method is analytically justified and applied to develop a robust tracking system that simultaneously adapts online the color space in which the image points are represented, the color distributions, the contour of the object, and its bounding box. Results with synthetic data and real video sequences demonstrate the robustness and versatility of our method.
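The "correction conditioned on the other cue" structure can be skeletonized with two filters; the filter objects and their methods here are hypothetical, chosen only to expose where the dependence enters.

import numpy as np

def resample(particles, weights):
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

def dependent_step(color_pf, contour_pf, frame):
    # 1) color filter: generate and correct its hypotheses on its own
    color_pf.particles = color_pf.propagate(color_pf.particles)
    w = color_pf.likelihood(color_pf.particles, frame)
    color_pf.particles = resample(color_pf.particles, w / w.sum())
    color_model = color_pf.estimate()

    # 2) contour filter: its *correction* is conditioned on the color
    #    estimate, rather than its hypothesis generation
    contour_pf.particles = contour_pf.propagate(contour_pf.particles)
    w = contour_pf.likelihood(contour_pf.particles, frame,
                              conditioned_on=color_model)
    contour_pf.particles = resample(contour_pf.particles, w / w.sum())
    return color_model, contour_pf.estimate()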

Pose Priors for Simultaneously Solving Alignment and Correspondence
F.Moreno-Noguer, V.Lepetit and P.Fua
European Conference on Computer Vision (ECCV), 2008

Paper  Abstract  Code  Bibtex

@inproceedings{Moreno_eccv2008,
title = {Pose Priors for Simultaneously Solving Alignment and Correspondence},
author = {F. Moreno-Noguer and V. Lepetit and P. Fua},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {5303},
series = {Lecture Notes in Computer Science},
pages = {405-418},
year = {2008}
}

Estimating a camera pose given a set of 3D-object and 2D-image feature points is a well understood problem when correspondences are given. However, when such correspondences cannot be established a priori, one must simultaneously compute them along with the pose. Most current approaches to solving this problem are too computationally intensive to be practical. An interesting exception is the SoftPosit algorithm, which looks for the solution as the minimum of a suitable objective function. It is arguably one of the best algorithms, but its iterative nature means it can fail in the presence of clutter, occlusions, or repetitive patterns. In this paper, we propose an approach that overcomes this limitation by taking advantage of the fact that, in practice, some prior on the camera pose is often available. We model it as a Gaussian Mixture Model that we progressively refine by hypothesizing new correspondences. This rapidly reduces the number of potential matches for each 3D point and lets us explore the pose space more thoroughly than SoftPosit at a similar computational cost. We demonstrate the superior performance of our approach on both synthetic and real data.

Closed-Form Solution to Non-Rigid 3D Surface Detection
M. Salzmann, F.Moreno-Noguer, V.Lepetit and P.Fua
European Conference on Computer Vision (ECCV), 2008

Paper  Abstract  Video  Bibtex

@inproceedings{Salzmann_eccv2008,
title = {Closed-Form Solution to Non-Rigid 3D Surface Detection},
author = {M. Salzmann and F. Moreno-Noguer and V. Lepetit and P. Fua},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
volume = {5303},
series = {Lecture Notes in Computer Science},
pages = {581-594},
year = {2008}
}

We present a closed-form solution to the problem of recovering the 3D shape of a non-rigid inelastic surface from 3D-to-2D correspondences. This lets us detect and reconstruct such a surface by matching individual images against a reference configuration, which is in contrast to all existing approaches that require initial shape estimates and track deformations from image to image. We represent the surface as a mesh, and write the constraints provided by the correspondences as a linear system whose solution we express as a weighted sum of eigenvectors. Obtaining the weights then amounts to solving a set of quadratic equations accounting for inextensibility constraints between neighboring mesh vertices. Since available closed-form solutions to quadratic systems fail when there are too many variables, we reduce the number of unknowns by expressing the deformations as a linear combination of modes. The overall closed-form solution then becomes tractable even for complex deformations that require many modes.
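The algebra condenses to: correspondences give a linear system on the vertex coordinates, a modal basis shrinks the unknowns, and inextensibility supplies the quadratic equations that fix the remaining weights. The sketch below hands those quadratics to a generic least-squares solver instead of solving them in closed form as the paper does.

import numpy as np
from scipy.optimize import least_squares

def recover_shape(A, modes, edges, rest_len2, n_null=6):
    """A: (2n, 3V) correspondence constraints on vertex coordinates;
    modes: (3V, K) deformation basis; edges: (E, 2) mesh edges;
    rest_len2: (E,) squared rest lengths."""
    B = A @ modes                                 # constraints on mode weights
    _, _, Vt = np.linalg.svd(B)
    null_basis = Vt[-min(n_null, B.shape[1]):].T  # near-null-space directions

    def inextensibility(alpha):
        x = (modes @ (null_basis @ alpha)).reshape(-1, 3)
        d2 = np.sum((x[edges[:, 0]] - x[edges[:, 1]]) ** 2, axis=1)
        return d2 - rest_len2                     # quadratic edge constraints

    sol = least_squares(inextensibility, np.ones(null_basis.shape[1]))
    return (modes @ (null_basis @ sol.x)).reshape(-1, 3)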

2007

Active Refocusing of Images and Videos 
F.Moreno-Noguer, P.N.Belhumeur and S.K.Nayar
ACM Transactions on Graphics (SIGGRAPH), 2007

Paper  Abstract  Video  Bibtex

@article{Moreno_siggraph2007,
title = {Active Refocusing of Images and Videos},
author = {F. Moreno-Noguer and P.N. Belhumeur and S.K. Nayar},
journal = {ACM Transactions on Graphics (SIGGRAPH)},
volume = {26},
number = {3},
issn = {0730-0301},
pages = {463-475},
doi = {10.1145/1276377.1276461},
year = {2007}
}

We present a system for refocusing images and videos of dynamic scenes using a novel, single-view depth estimation method. Our method for obtaining depth is based on the defocus of a sparse set of dots projected onto the scene. In contrast to other active illumination techniques, the projected pattern of dots can be removed from each captured image and its brightness easily controlled in order to avoid under- or over-exposure. The depths corresponding to the projected dots and a color segmentation of the image are used to compute an approximate depth map of the scene with clean region boundaries. The depth map is used to refocus the acquired image after the dots are removed, simulating realistic depth of field effects. Experiments on a wide variety of scenes, including close-ups and live action, demonstrate the effectiveness of our method.
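Once the dense depth map is in hand, the refocusing step itself reduces to depth-dependent blur; a crude layered approximation (the paper's rendering treats occlusion boundaries more carefully):

import numpy as np
import cv2

def refocus(image, depth, focal_depth, max_radius=9, n_layers=8):
    """Blur each depth layer by its distance from the chosen focal plane."""
    span = float(depth.max() - depth.min()) or 1.0
    out = image.astype(np.float64).copy()
    for d in np.linspace(depth.min(), depth.max(), n_layers):
        r = int(round(max_radius * abs(d - focal_depth) / span))
        if r == 0:
            continue                              # this layer stays sharp
        k = 2 * r + 1                             # odd Gaussian kernel size
        blurred = cv2.GaussianBlur(image, (k, k), 0)
        mask = np.abs(depth - d) <= span / (2 * (n_layers - 1))
        out[mask] = blurred[mask]
    return out.astype(image.dtype)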

Accurate Non-Iterative O(n) Solution to the PnP Problem
F.Moreno-Noguer, V.Lepetit and P.Fua
International Conference on Computer Vision (ICCV), 2007

Paper  Abstract  Code Matlab  Code C++  Bibtex

@inproceedings{Moreno_iccv2007,
title = {Accurate Non-Iterative O(n) Solution to the PnP Problem},
author = {F. Moreno-Noguer and V. Lepetit and P. Fua},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
pages = {1-8},
year = {2007}
}

We propose a non-iterative solution to the PnP problem —the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences— whose computational complexity grows linearly with n. This is in contrast to state-of-the-art methods that are O(n^5) or even O(n^8), without being more accurate. Our method is applicable for all n ≥ 4 and properly handles both planar and non-planar configurations. Our central idea is to express the n 3D points as a weighted sum of four virtual control points. The problem then reduces to estimating the coordinates of these control points in the camera referential, which can be done in O(n) time by expressing these coordinates as a weighted sum of the eigenvectors of a 12 × 12 matrix and solving a small constant number of quadratic equations to pick the right weights. The advantages of our method are demonstrated by thorough testing on both synthetic and real data.

2005

Integration of Conditionally Dependent Object Features for Robust Figure/Background Segmentation
F.Moreno-Noguer, A.Sanfeliu and D.Samaras
International Conference on Computer Vision (ICCV), 2005

Paper  Abstract  Video  Bibtex

@inproceedings{Moreno_iccv2005,
title = {Integration of Conditionally Dependent Object Features for Robust Figure/Background Segmentation},
author = {F. Moreno-Noguer and A. Sanfeliu and D. Samaras},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
pages = {1713-1720},
year = {2005}
}

We propose a new technique for fusing multiple cues to robustly segment an object from its background in video sequences that suffer from abrupt changes of both illumination and position of the target. Robustness is achieved by the integration of appearance and geometric object features and by their description using particle filters. Previous approaches assume independence of the object cues or apply the particle filter formulation to only one of the features, assuming a smooth change in the rest, which can prove very limiting, especially when the state of some features needs to be updated using other cues or when their dynamics follow non-linear and unpredictable paths. Our technique offers a general framework to model the probabilistic relationship between features. The proposed method is analytically justified and applied to develop a robust tracking system that simultaneously adapts online the color space in which the image points are represented, the color distributions, and the contour of the object. Results with synthetic data and real video sequences demonstrate the robustness and versatility of our method.