2022

Conference

Learned Vertex Descent: A New Direction for 3D Human Model Fitting
E.Corona, G.Pons-Moll, G.Alenyà and F.Moreno-Noguer
European Conference on Computer Vision (ECCV), 2022

@inproceedings{Corona_eccv2022,
title = {Learned Vertex Descent: A New Direction for 3D Human Model Fitting},
author = {Enric Corona and Gerard Pons-Moll and Guillem Alenyà and Francesc Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022}
}

We propose a novel optimization-based paradigm for 3D human model fitting on images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g. SMPL) from input images, we train an ensemble of per-vertex neural fields. The network predicts, in a distributed manner, the vertex descent direction towards the ground truth, based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until convergence, which typically occurs in a fraction of a second even when all vertices are initialized at a single point. An exhaustive evaluation demonstrates that our approach is able to capture the underlying body of clothed people with very different body shapes, achieving a significant improvement over the state of the art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement over the SOTA with a much simpler and faster method. Code is released at https://www.iri.upc.edu/people/ecorona/lvd/
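
A minimal PyTorch sketch of the iterative vertex-descent loop described above. VertexStepNet, the feature tensor and all sizes are illustrative placeholders rather than the released LVD code (which conditions on image or volumetric features sampled at each vertex projection); it only shows the structure of the optimization:

import torch
import torch.nn as nn

class VertexStepNet(nn.Module):
    """Hypothetical stand-in for the learned per-vertex predictor: given the
    current vertex positions and per-vertex features, it outputs a step
    towards the target surface."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 3))

    def forward(self, verts, feats):
        # verts: (V, 3), feats: (V, feat_dim)
        return self.mlp(torch.cat([verts, feats], dim=-1))

def fit_vertices(net, feats, num_verts=6890, iters=50, step=1.0):
    # Start all vertices at a single point, as described in the abstract.
    verts = torch.zeros(num_verts, 3)
    for _ in range(iters):
        with torch.no_grad():
            verts = verts + step * net(verts, feats)  # move along the predicted direction
    return verts

net = VertexStepNet()
feats = torch.randn(6890, 32)          # placeholder for image/scan features
print(fit_vertices(net, feats).shape)  # torch.Size([6890, 3])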

Conditional-Flow NeRF: Accurate 3D Modelling with Reliable Uncertainty Quantification   
J.Shen, A.Agudo, F.Moreno-Noguer and A.Ruiz
European Conference on Computer Vision (ECCV), 2022

@inproceedings{Shen_eccv2022,
title = {Conditional-Flow NeRF: Accurate 3D Modelling with Reliable Uncertainty Quantification},
author = {Jianxiong Shen and Antonio Agudo and Francesc Moreno-Noguer and Adria Ruiz},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022}
}

A critical limitation of current methods based on Neural Radiance Fields (NeRF) is that they are unable to quantify the uncertainty associated with the learned appearance and geometry of the scene. This information is paramount in real applications such as medical diagnosis or autonomous driving where, to reduce potentially catastrophic failures, the confidence in the model outputs must be included in the decision-making process. In this context, we introduce Conditional-Flow NeRF (CF-NeRF), a novel probabilistic framework to incorporate uncertainty quantification into NeRF-based approaches. For this purpose, our method learns a distribution over all possible radiance fields modelling the scene, which is used to quantify the uncertainty associated with the modelled scene. In contrast to previous approaches that enforce strong constraints over the radiance field distribution, CF-NeRF learns it in a flexible and fully data-driven manner by coupling Latent Variable Modelling and Conditional Normalizing Flows. This strategy allows us to obtain reliable uncertainty estimates while preserving model expressivity. Compared to previous state-of-the-art methods proposed for uncertainty quantification in NeRF, our experiments show that the proposed method achieves significantly lower prediction errors and more reliable uncertainty values for synthetic novel view and depth-map estimation.

PoseScript: 3D Human Poses from Natural Language   
G.Delmas, P.Weinzaepfel, T.Lucas, F.Moreno-Noguer and G.Rogez
European Conference on Computer Vision (ECCV), 2022

@inproceedings{Delmas_eccv2022,
title = {PoseScript: 3D Human Poses from Natural Language},
author = {Ginger Delmas and Philippe Weinzaepfel and Thomas Lucas and Francesc Moreno-Noguer and Grégory Rogez},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022}
}

...

Multi-Person Extreme Motion Prediction   
W.Guo, X.Bie, X.Alameda-Pineda and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2022

@inproceedings{Guo_cvpr2022,
title = {Multi-Person Extreme Motion Prediction},
author = {Wen Guo and Xiaoyu Bie and Xavier Alameda-Pineda and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}

Human motion prediction aims to forecast future poses given a sequence of past 3D skeletons. While this problem has recently received increasing attention, it has mostly been tackled for single humans in isolation. In this paper, we explore this problem for humans performing collaborative tasks: we seek to predict the future motion of two interacting persons given two sequences of their past skeletons. We propose a novel cross-interaction attention mechanism that exploits historical information of both persons and learns to predict cross dependencies between the two pose sequences. Since no dataset covering such interactive situations is available for training, we collected the ExPI (Extreme Pose Interaction) dataset, a new lab-based person interaction dataset of professional dancers performing Lindy-hop dancing actions, which contains 115 sequences with 30K frames annotated with 3D body poses and shapes. We thoroughly evaluate our cross-interaction network on ExPI and show that, in both short- and long-term predictions, it consistently outperforms state-of-the-art methods for single-person motion prediction. Our code and dataset are available at: https://team.inria.fr/robotlearn/multi-person-extreme-motion-prediction/
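
A minimal sketch of a cross-attention block between the two pose sequences, i.e. the kind of cross-interaction mechanism the abstract refers to. CrossInteractionAttention and all dimensions are hypothetical, not the paper's architecture:

import torch
import torch.nn as nn

class CrossInteractionAttention(nn.Module):
    """Toy cross-attention: person A's pose sequence queries person B's."""
    def __init__(self, num_joints=18, d_model=64, heads=4):
        super().__init__()
        self.embed = nn.Linear(num_joints * 3, d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.head = nn.Linear(d_model, num_joints * 3)

    def forward(self, seq_a, seq_b):
        # seq_a, seq_b: (B, T, J*3) flattened 3D skeletons of the two persons
        qa, kb = self.embed(seq_a), self.embed(seq_b)
        ctx, _ = self.attn(query=qa, key=kb, value=kb)  # cross dependencies A <- B
        return self.head(ctx)  # per-frame features used to refine A's future poses

model = CrossInteractionAttention()
seq_a, seq_b = torch.randn(2, 50, 54), torch.randn(2, 50, 54)
print(model(seq_a, seq_b).shape)  # torch.Size([2, 50, 54])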

LISA: Learning Implicit Shape and Appearance of Hands   
E.Corona, T.Hogan, M.Vo, F.Moreno-Noguer, C.Sweeney, R.Newcombe and L.Ma
Conference on Computer Vision and Pattern Recognition (CVPR), 2022

@inproceedings{Corona_cvpr2022,
title = {{LISA}: Learning Implicit Shape and Appearance of Hands},
author = {Enric Corona and Tomas Hogan and Minh Vo and Francesc Moreno-Noguer and Chris Sweeney and Richard Newcombe and Lingni Ma},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}

This paper proposes a do-it-all neural model of human hands, named LISA. The model can capture accurate hand shape and appearance, generalize to arbitrary hand subjects, provide dense surface correspondences, be reconstructed from images in the wild, and can be easily animated. We train LISA by minimizing the shape and appearance losses on a large set of multi-view RGB image sequences annotated with coarse 3D poses of the hand skeleton. For a 3D point in the local hand coordinates, our model predicts the color and the signed distance with respect to each hand bone independently, and then combines the per-bone predictions using the predicted skinning weights. The shape, color, and pose representations are disentangled by design, enabling fine control of the selected hand parameters. We experimentally demonstrate that LISA can accurately reconstruct a dynamic hand from monocular or multi-view sequences, achieving a noticeably higher quality of reconstructed hand shapes compared to baseline approaches. Project page: https://www.iri.upc.edu/people/ecorona/lisa/.
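
A toy sketch of the per-bone prediction and skinning-weight blending described above. Module names and sizes are hypothetical, and the weight predictor is simplified (the actual LISA model also conditions on pose and shape codes and is trained with shape and appearance losses):

import torch
import torch.nn as nn

class PerBoneHandField(nn.Module):
    def __init__(self, num_bones=16):
        super().__init__()
        # One small field per bone predicting (signed distance, rgb) for a query point.
        self.bone_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))
            for _ in range(num_bones))
        # Simplification: skinning weights predicted from the first bone's local coords.
        self.skin = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, num_bones))

    def forward(self, x_local):
        # x_local: (N, num_bones, 3) query point expressed in each bone's local frame
        per_bone = torch.stack([net(x_local[:, b])
                                for b, net in enumerate(self.bone_nets)], dim=1)
        w = torch.softmax(self.skin(x_local[:, 0]), dim=-1)      # (N, num_bones)
        blended = (w.unsqueeze(-1) * per_bone).sum(dim=1)        # weighted per-bone combination
        return blended[..., 0], torch.sigmoid(blended[..., 1:])  # sdf, rgb

sdf, rgb = PerBoneHandField()(torch.randn(1024, 16, 3))
print(sdf.shape, rgb.shape)  # torch.Size([1024]) torch.Size([1024, 3])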

2021

Journal

3D Human Pose, Shape and Texture from Low-Resolution Images and Videos 
X.Xu, H.Chen, F.Moreno-Noguer, L.Jeni and F. De la Torre
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2021

@article{Xu_pami2021,
title = {3D Human Pose, Shape and Texture from Low-Resolution Images and Videos},
author = {Xiangyu Xu and Hao Chen and Francesc Moreno-Noguer and Laszlo Attila Jeni and Fernando De la Torre},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {},
number = {},
issn = {0162-8828},
pages = {},
doi = {},
year = {2021}
}

3D human pose and shape estimation from monocular images has been an active research area in computer vision. Existing deep learning methods for this task rely on high-resolution input which, however, is not always available in many scenarios such as video surveillance and sports broadcasting. Two common approaches to deal with low-resolution images are applying super-resolution techniques to the input, which may result in unpleasant artifacts, or simply training one model for each resolution, which is impractical in many realistic applications. To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. The proposed method is able to learn 3D body pose and shape across different resolutions with a single model. The self-supervision loss enforces scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both these new losses provide robustness when learning in a weakly-supervised manner. Moreover, we extend RSC-Net to handle low-resolution videos and apply it to reconstruct textured 3D pedestrians from low-resolution input. Extensive experiments demonstrate that RSC-Net achieves consistently better results than state-of-the-art methods on challenging low-resolution images.
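
A minimal sketch of the two scale-consistency terms described above, with a dummy backbone standing in for the real RSC-Net; the contrastive term is reduced to a simple feature-alignment loss, and all names and sizes are illustrative:

import torch
import torch.nn.functional as F

class DummyNet(torch.nn.Module):
    """Stand-in network returning (parameters, features) for an input image."""
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.AdaptiveAvgPool2d(8), torch.nn.Flatten(),
            torch.nn.Linear(3 * 8 * 8, 128), torch.nn.ReLU())
        self.head = torch.nn.Linear(128, 85)  # e.g. pose/shape/camera parameters

    def forward(self, x):
        feat = self.backbone(x)
        return self.head(feat), feat

def scale_consistency_losses(model, image, low_size=64):
    image_lr = F.interpolate(image, size=(low_size, low_size),
                             mode='bilinear', align_corners=False)
    params_hr, feat_hr = model(image)
    params_lr, feat_lr = model(image_lr)
    # Self-supervision: the outputs should agree across resolutions.
    out_loss = F.mse_loss(params_lr, params_hr.detach())
    # Simplified contrastive-style term: features of the two scales should align.
    feat_loss = 1.0 - F.cosine_similarity(feat_lr, feat_hr.detach(), dim=-1).mean()
    return out_loss + feat_loss

print(float(scale_consistency_losses(DummyNet(), torch.randn(2, 3, 224, 224))))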

Conference

Neural Cellular Automata Manifold    (Oral)
A.Hernandez, A.Vilalta and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

@inproceedings{Hernandez_cvpr2021,
title = {Neural Cellular Automata Manifold},
author = {Alejandro Hernandez and Armand Vilalta and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Very recently, Neural Cellular Automata (NCA) have been proposed to simulate the morphogenesis process with deep networks. NCA learns to grow an image starting from a fixed single pixel. In this work, we show that the neural network (NN) architecture of the NCA can be encapsulated in a larger NN. This allows us to propose a new model that encodes a manifold of NCA, each of them capable of generating a distinct image. Therefore, we are effectively learning an embedding space of CA, which shows generalization capabilities. We accomplish this by introducing dynamic convolutions inside an Auto-Encoder architecture, used for the first time to join two different sources of information: the encoding and the cell’s environment information. In biological terms, our approach would play the role of the transcription factors, modulating the mapping of genes into specific proteins that drive cellular differentiation, which occurs right before the morphogenesis. We thoroughly evaluate our approach on a dataset of synthetic emojis and also on real images of CIFAR-10. Our model introduces a general-purpose network, which can be used in a broad range of problems beyond image generation.
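
A toy sketch of a dynamic convolution whose kernel is generated from an external embedding, the mechanism mentioned above for encoding a manifold of NCAs within a single network; module names and sizes are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """The convolution kernel is produced from an embedding vector, so one
    network can realize a different update rule for each encoded image."""
    def __init__(self, emb_dim=64, in_ch=16, out_ch=16, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.gen = nn.Linear(emb_dim, out_ch * in_ch * k * k)

    def forward(self, cell_state, embedding):
        # cell_state: (1, in_ch, H, W), embedding: (emb_dim,) code of one target image
        w = self.gen(embedding).view(self.out_ch, self.in_ch, self.k, self.k)
        return F.conv2d(cell_state, w, padding=self.k // 2)

layer = DynamicConv()
out = layer(torch.randn(1, 16, 32, 32), torch.randn(64))
print(out.shape)  # torch.Size([1, 16, 32, 32])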

SMPLicit: Topology-aware Generative Model for Clothed People   
E.Corona, A.Pumarola, G.Alenyà, G.Pons-Moll and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

@inproceedings{Corona_cvpr2021,
title = {SMPLicit: Topology-aware Generative Model for Clothed People},
author = {Enric Corona and Albert Pumarola and Guillem Aleny{\`a} and Gerard Pons-Moll and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

In this paper we introduce SMPLicit, a novel generative model to jointly represent body pose, shape and clothing geometry. In contrast to existing learning-based approaches that require training specific models for each type of garment, SMPLicit can represent in a unified manner different garment topologies (e.g. from sleeveless tops to hoodies and open jackets), while controlling other properties like the garment size or tightness/looseness. We show our model to be applicable to a large variety of garments including T-shirts, hoodies, jackets, shorts, pants, skirts, shoes and even hair. The representation flexibility of SMPLicit builds upon an implicit model conditioned with the SMPL human body parameters and a learnable latent space which is semantically interpretable and aligned with the clothing attributes. The proposed model is fully differentiable, allowing for its use within larger end-to-end trainable systems. In the experimental section, we demonstrate SMPLicit can be readily used for fitting 3D scans and for 3D reconstruction in images of dressed people. In both cases we are able to go beyond the state of the art by retrieving complex garment geometries, handling situations with multiple clothing layers and providing a tool for easy outfit editing. To stimulate further research in this direction, we will make our code and model publicly available at http://www.iri.upc.edu/people/ecorona/smplicit/.

D-NeRF: Neural Radiance Fields for Dynamic Scenes   
A.Pumarola, E.Corona, G.Pons-Moll and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

@inproceedings{Pumarola_cvpr2021,
title = {D-NeRF: Neural Radiance Fields for Dynamic Scenes},
author = {Albert Pumarola and Enric Corona and Gerard Pons-Moll and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Neural rendering techniques combining machine learning with geometric reasoning have arisen as one of the most promising approaches for synthesizing novel views of a scene from a sparse set of images. Among these, Neural Radiance Fields (NeRF) stands out, training a deep network to map 5D input coordinates (representing spatial location and viewing direction) into a volume density and view-dependent emitted radiance. However, despite achieving an unprecedented level of photorealism on the generated images, NeRF is only applicable to static scenes, where the same spatial location can be queried from different images. In this paper we introduce D-NeRF, a method that extends neural radiance fields to a dynamic domain, allowing us to reconstruct and render novel images of objects under rigid and non-rigid motions from a single camera moving around the scene. For this purpose we consider time as an additional input to the system, and split the learning process into two main stages: one that encodes the scene into a canonical space and another that maps this canonical representation into the deformed scene at a particular time. Both mappings are simultaneously learned using fully-connected networks. Once the networks are trained, D-NeRF can render novel images, controlling both the camera view and the time variable, and thus, the object movement. We demonstrate the effectiveness of our approach on scenes with objects under rigid, articulated and non-rigid motions. Code, model weights and the dynamic scenes dataset will be released.
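
A minimal sketch of the two mappings described above: a deformation network conditioned on time that warps a point into the canonical space, followed by a canonical NeRF. Positional encoding and the volume-rendering step are omitted, and all module names and sizes are illustrative:

import torch
import torch.nn as nn

class DeformNet(nn.Module):
    """Maps a point at time t to its displacement into the canonical space."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, x, t):
        return self.mlp(torch.cat([x, t], dim=-1))  # delta_x

class CanonicalNeRF(nn.Module):
    """Radiance field in the canonical space: (position, view dir) -> (density, rgb)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, 4))

    def forward(self, x, d):
        out = self.mlp(torch.cat([x, d], dim=-1))
        return out[..., :1], torch.sigmoid(out[..., 1:])  # sigma, rgb

deform, nerf = DeformNet(), CanonicalNeRF()
x = torch.rand(1024, 3)           # 3D points sampled along camera rays
d = torch.rand(1024, 3)           # viewing directions
t = torch.full((1024, 1), 0.5)    # normalized time of the queried frame
sigma, rgb = nerf(x + deform(x, t), d)  # warp to canonical space, then query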

Uncertainty-Aware Camera Pose Estimation from Points and Lines   
A.Vakhitov, L.Ferraz, A.Agudo and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2021

@inproceedings{Vakhitov_cvpr2021,
title = {Uncertainty-Aware Camera Pose Estimation from Points and Lines},
author = {Alexander Vakhitov and Luis Ferraz and Antonio Agudo and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2021}
}

Perspective-n-Point-and-Line (PnPL) algorithms aim at fast, accurate, and robust camera localization with respect to a 3D model from 2D-3D feature correspondences, being a major part of modern robotic and AR/VR systems. Current point-based pose estimation methods use only 2D feature detection uncertainties, and the line-based methods do not take uncertainties into account. In our setup, both 3D coordinates and 2D projections of the features are considered uncertain. We propose globally convergent PnP solvers based on EPnP and DLS for the uncertainty-aware pose estimation. We also modify the motion-only bundle adjustment to take 3D uncertainties into account. We perform exhaustive synthetic and real experiments on two different visual odometry datasets. The new PnP(L) methods outperform the state of the art on real data in isolation, showing an increase in mean translation accuracy by 12% on a representative subset of KITTI, while the new uncertainty-aware refinement improves pose accuracy for most of the solvers, e.g. decreasing the mean translation error for EPnP by 5% compared to the standard pose refinement on the same dataset. We will release the code of the proposed methods.
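
A small NumPy sketch of a covariance-weighted (Mahalanobis) reprojection error, the kind of objective an uncertainty-aware refinement minimizes; the function and the way uncertainties enter are illustrative, not the paper's exact formulation:

import numpy as np

def weighted_reprojection_error(K, R, t, X, x, Sigma2d):
    """Sum of squared Mahalanobis reprojection errors.
    K: 3x3 intrinsics, (R, t): camera pose, X: (N, 3) 3D points,
    x: (N, 2) detected 2D points, Sigma2d: (N, 2, 2) detection covariances
    (3D-point uncertainty could be propagated into Sigma2d as well; omitted here)."""
    Xc = X @ R.T + t                      # points in the camera frame
    proj = Xc @ K.T
    proj = proj[:, :2] / proj[:, 2:3]     # perspective projection
    r = x - proj                          # 2D residuals
    W = np.linalg.inv(Sigma2d)            # per-point information matrices
    return float(np.einsum('ni,nij,nj->', r, W, r))

# Tiny usage example with isotropic 1-pixel noise on every detection.
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
X = np.random.rand(10, 3) + np.array([0., 0., 4.])
proj = X @ K.T
x = proj[:, :2] / proj[:, 2:3]
Sigma = np.tile(np.eye(2), (10, 1, 1))
print(weighted_reprojection_error(K, np.eye(3), np.zeros(3), X, x, Sigma))  # ~0 for perfect data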

Stochastic Neural Radiance Fields: Quantifying Uncertainty in Implicit 3D Representations   
J.Shen, A.Ruiz, A.Agudo and F.Moreno-Noguer
International Conference on 3D Vision (3DV), 2021

@inproceedings{Shen_3dv2021,
title = {Stochastic Neural Radiance Fields: Quantifying Uncertainty in Implicit 3D Representations},
author = {Jianxiong Shen and Adria Ruiz and Antonio Agudo and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2021}
}

Neural Radiance Fields (NeRF) has become a popular framework for learning implicit 3D representations and addressing different tasks such as novel-view synthesis or depth-map estimation. However, in downstream applications where decisions need to be made based on automatic predictions, it is critical to leverage the confidence associated with the model estimations. Whereas uncertainty quantification is a long-standing problem in Machine Learning, it has been largely overlooked in the recent NeRF literature. In this context, we propose Stochastic Neural Radiance Fields (S-NeRF), a generalization of standard NeRF that learns a probability distribution over all the possible radiance fields modeling the scene. This distribution allows us to quantify the uncertainty associated with the scene information provided by the model. S-NeRF optimization is posed as a Bayesian learning problem that is efficiently addressed using the Variational Inference framework. Exhaustive experiments over benchmark datasets demonstrate that S-NeRF is able to provide more reliable predictions and confidence values than generic approaches previously proposed for uncertainty estimation in other domains.
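
A drastically simplified sketch of the idea of sampling radiance-field outputs to obtain predictive uncertainty: here an independent Gaussian is predicted per query point and sampled, whereas S-NeRF learns a variational posterior over entire radiance fields; names and sizes are illustrative:

import torch
import torch.nn as nn

class StochasticField(nn.Module):
    """Toy stochastic field: a Gaussian over (density, rgb) per 3D point."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 8))

    def forward(self, x):
        out = self.mlp(x)
        return out[..., :4], out[..., 4:].exp()  # mean, std of (sigma, rgb)

field = StochasticField()
mu, std = field(torch.rand(2048, 3))
samples = mu + std * torch.randn(32, *mu.shape)  # 32 sampled "radiance fields"
uncertainty = samples.var(dim=0)                 # per-point predictive variance
print(uncertainty.shape)  # torch.Size([2048, 4])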

Body Size and Depth Disambiguation in Multi-Person Reconstruction from Single Images   
N.Ugrinovic, A.Ruiz, A.Agudo, A.Sanfeliu and F.Moreno-Noguer
International Conference on 3D Vision (3DV), 2021

@inproceedings{Ugrinovic_3dv2021,
title = {Body Size and Depth Disambiguation in Multi-Person Reconstruction from Single Images},
author = {Nicolas Ugrinovic and Adria Ruiz and Antonio Agudo and Alberto Sanfeliu and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2021}
}

We address the problem of multi-person 3D body pose and shape estimation from a single image. While this problem can be addressed by applying single-person approaches multiple times to the same scene, recent works have shown the advantages of building upon deep architectures that simultaneously reason about all people in the scene in a holistic manner by enforcing, e.g., depth order constraints or minimizing interpenetration among reconstructed bodies. However, existing approaches are still unable to capture the size variability of people caused by the inherent body scale and depth ambiguity. In this work, we tackle this challenge by devising a novel optimization scheme that learns the appropriate body scale and relative camera pose by enforcing the feet of all people to remain on the ground floor. A thorough evaluation on the MuPoTS-3D and 3DPW datasets demonstrates that our approach is able to robustly estimate the body translation and shape of multiple people while retrieving their spatial arrangement, consistently improving over the current state of the art, especially in scenes with people of very different heights.
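
A toy illustration of the ground-plane constraint described above: scaling a person's depth while keeping its 2D projection fixed scales its reconstructed 3D feet location by the same factor, so per-person scales and a shared ground height can be optimized jointly. The numbers are made up, and the full method combines this term with reprojection and prior terms to avoid degenerate scales:

import torch

feet_y = torch.tensor([1.52, 1.78, 1.31])       # initial feet heights in the camera frame (m)
alpha = torch.ones(3, requires_grad=True)        # per-person depth/scale correction
ground = torch.tensor(1.6, requires_grad=True)   # shared ground-plane height

opt = torch.optim.Adam([alpha, ground], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = ((alpha * feet_y - ground) ** 2).sum()  # all feet on the same floor
    loss.backward()
    opt.step()

print(alpha.detach(), ground.detach())  # scaled feet heights now agree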

SIDER: Single-Image Neural Optimization for Facial Geometric Detail Recovery   
A.Chatziagapi, S.Athar, F.Moreno-Noguer and D.Samaras
International Conference on 3D Vision (3DV), 2021

@inproceedings{Chatziagapi_3dv2021,
title = {{SIDER}: Single-Image Neural Optimization for Facial Geometric Detail Recovery},
author = {Aggelina Chatziagapi and ShahRukh Athar and Francesc Moreno-Noguer and Dimitris Samaras},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2021}
}

We present SIDER (Single-Image neural optimization for facial geometric DEtail Recovery), a novel photometric optimization method that recovers detailed facial geometry from a single image in an unsupervised manner. Inspired by classical techniques of coarse-to-fine optimization and recent advances in implicit neural representations of 3D shape, SIDER combines a geometry prior based on statistical models and Signed Distance Functions (SDFs) to recover facial details from single images. First, it estimates a coarse geometry using a morphable model represented as an SDF. Next, it reconstructs facial geometry details by optimizing a photometric loss with respect to the ground truth image. In contrast to prior work, SIDER does not rely on any dataset priors and does not require additional supervision from multiple views, lighting changes or ground truth 3D shape. Extensive qualitative and quantitative evaluation demonstrates that our method achieves state-of-the-art on facial geometric detail recovery, using only a single in-the-wild image.

PhysXNet: A Customizable Approach for Learning Cloth Dynamics on Dressed People   
J.Sanchez-Riera, A.Pumarola and F.Moreno-Noguer
International Conference on 3D Vision (3DV), 2021

@inproceedings{Sanchez_3dv2021,
title = {{PhysXNet}: A Customizable Approach for Learning Cloth Dynamics on Dressed People},
author = {Jordi Sanchez-Riera and Albert Pumarola and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2021}
}

We introduce PhysXNet, a learning-based approach to predict the dynamics of deformable clothes given 3D skeleton motion sequences of humans wearing these clothes. The proposed model is adaptable to a large variety of garments and changing topologies, without needing to be retrained. Such simulations are typically carried out by physics engines that require manual human expertise and involve computationally intensive simulations. PhysXNet, by contrast, is a fully differentiable deep network that at inference is able to estimate the geometry of dense cloth meshes in a matter of milliseconds, and thus can be readily deployed as a layer of a larger deep learning architecture. This efficiency is achieved thanks to the specific parameterization of the clothes we consider, based on 3D UV maps encoding spatial garment displacements. The problem is then formulated as a mapping from the human kinematics space (also represented by 3D UV maps of the undressed body mesh) to the clothes displacement UV maps, which we learn using a conditional GAN with a discriminator that enforces feasible deformations. We simultaneously train our model for three garment templates (tops, bottoms and dresses), for which we simulate deformations under 50 different human actions. Nevertheless, the UV map representation we consider allows encapsulating many different cloth topologies, and at test time we can simulate garments we did not specifically train for. A thorough evaluation demonstrates that PhysXNet delivers cloth deformations very close to those computed with the physics engine, opening the door to its effective integration within deep learning pipelines.

PI-Net: Pose Interacting Network for Multi-Person Monocular 3D Pose Estimation   
W.Guo, E.Corona, F.Moreno-Noguer and X.Alameda-Pineda
Winter Conference on Applications of Computer Vision (WACV), 2021

@inproceedings{Guo_wacv2021,
title = {{PI-Net}: Pose Interacting Network for Multi-Person Monocular 3D Pose Estimation},
author = {Wen Guo and Enric Corona and Francesc Moreno-Noguer and Xavier Alameda-Pineda},
booktitle = {Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year = {2021}
}

Recent literature has addressed the monocular 3D pose estimation task very satisfactorily. In these studies, different persons are usually treated as independent pose instances to estimate. However, in many everyday situations, people are interacting, and the pose of an individual depends on the pose of his/her interactees. In this paper, we investigate how to exploit this dependency to enhance current -- and possibly future -- deep networks for 3D monocular pose estimation. Our pose interacting network, or PI-Net, inputs the initial pose estimates of a variable number of interactees into a recurrent architecture used to refine the pose of the person-of-interest. Evaluating such a method is challenging due to the limited availability of public annotated multi-person 3D human pose datasets. We demonstrate the effectiveness of our method on the MuPoTS dataset, setting a new state of the art on it. Qualitative results on other multi-person datasets (for which 3D pose ground-truth is not available) showcase the proposed PI-Net. PI-Net is implemented in PyTorch and the code will be made available upon acceptance of the paper.

Multi-FinGAN: Generative Coarse-To-Fine Sampling of Multi-Finger Grasps   
J.Lundell, E.Corona, T.Nguyen Le, F.Verdoja, P.Weinzaepfel, G.Rogez, F.Moreno-Noguer and V.Kyrki
International Conference on Robotics and Automation (ICRA), 2021

@inproceedings{Lundell_icra2021,
title = {{Multi-FinGAN}: Generative Coarse-To-Fine Sampling of Multi-Finger Grasps},
author = {Jens Lundell and Enric Corona and Tran Nguyen Le and Francesco Verdoja and Philippe Weinzaepfel and Grégory Rogez and Francesc Moreno-Noguer and Ville Kyrki},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2021}
}

While there exist many methods for manipulating rigid objects with parallel-jaw grippers, grasping with multi-finger robotic hands remains a relatively unexplored research topic. Reasoning and planning collision-free trajectories on the additional degrees of freedom of several fingers represents an important challenge that, so far, involves computationally costly and slow processes. In this work, we present Multi-FinGAN, a fast generative multi-finger grasp sampling method that synthesizes high quality grasps directly from RGB-D images in about a second. We achieve this by training in an end-to-end fashion a coarse-to-fine model composed of a classification network that distinguishes grasp types according to a specific taxonomy and a refinement network that produces refined grasp poses and joint angles. We experimentally validate and benchmark our method against a standard grasp-sampling method on 790 grasps in simulation and 20 grasps on a real Franka Emika Panda. All experimental results using our method show consistent improvements both in terms of grasp quality metrics and grasp success rate. Remarkably, our approach is up to 20-30 times faster than the baseline, a significant improvement that opens the door to feedback-based grasp re-planning and task-informative grasping.

Attention deep learning based model for predicting the 3D Human Body Pose using the Robot Human Handover Phases   
J.Laplaza, A.Pumarola, F.Moreno-Noguer and A.Sanfeliu
International Conference on Robot & Human Interactive Communication (RO-MAN), 2021

@inproceedings{Laplaza_roman2021,
title = {Attention deep learning based model for predicting the 3D Human Body Pose using the Robot Human Handover Phases},
author = {Javier Laplaza and Albert Pumarola and Francesc Moreno-Noguer and Alberto Sanfeliu},
booktitle = {Proceedings of the IEEE International Conference on Robot & Human Interactive Communication (RO-MAN)},
year = {2021}
}

This work proposes a human motion prediction model for handover operations. We use the different phases of the handover operation to improve the human motion predictions. Our attention-based deep learning model takes into account the position of the robot’s end effector and the phase of the handover operation to predict future human poses. Our model outputs a distribution of possible positions rather than one deterministic position, a key feature for allowing robots to collaborate with humans. The attention-based deep learning model has been trained and evaluated with a dataset created using human volunteers and an anthropomorphic robot, simulating handover operations where the robot is the giver and the human the receiver. For each operation, the human skeleton is obtained with an Intel RealSense D435i camera attached inside the robot’s head. The results show a clear improvement in the prediction of the human’s right hand and 3D body pose compared with other methods.

E-DNAS: Differentiable Neural Architecture Search for Embedded Systems   
J.García, A.Agudo and F.Moreno-Noguer
International Conference on Pattern Recognition (ICPR), 2021

@inproceedings{Garcia_icpr2021,
title = {{E-DNAS}: Differentiable Neural Architecture Search for Embedded Systems},
author = {Javier García and Antonio Agudo and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Pattern Recognition (ICPR)},
year = {2021}
}

Designing optimal and lightweight networks to fit in resource-limited platforms like mobiles, DSPs or GPUs is a challenging problem with a wide range of interesting applications, e.g. in embedded systems for autonomous driving. While most approaches are based on manual hyperparameter tuning, there exists a new line of research, the so-called NAS (Neural Architecture Search) methods, that aims to optimize several metrics during the design process, including memory requirements of the network, number of FLOPs, number of MACs (Multiply-ACcumulate operations) or inference latency. However, while NAS methods have shown very promising results, they are still significantly time-consuming and costly. In this work we introduce E-DNAS, a differentiable architecture search method, which improves the efficiency of NAS methods in designing light-weight networks for the task of image classification. Concretely, E-DNAS computes, in a differentiable manner, the optimal size of a number of meta-kernels that capture patterns of the input data at different resolutions. We also leverage the additive property of convolution operations to merge several kernels of compatible sizes into a single one, thus reducing the number of operations and the time required to estimate the optimal configuration. We evaluate our approach on several classification datasets. We report results in terms of the SoC (System on Chip) metric, typically used in the Texas Instruments TDA2x family for autonomous driving applications. The results show that our approach allows designing low-latency architectures significantly faster than the state of the art.
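
A small PyTorch check of the additive property of convolution mentioned above: a 3x3 and a 5x5 kernel can be merged into a single 5x5 kernel by zero-padding the smaller one and adding, so one convolution replaces two; the tensors are illustrative:

import torch
import torch.nn.functional as F

x  = torch.randn(1, 1, 32, 32)
k3 = torch.randn(1, 1, 3, 3)
k5 = torch.randn(1, 1, 5, 5)

two_convs = F.conv2d(x, k3, padding=1) + F.conv2d(x, k5, padding=2)
merged    = k5 + F.pad(k3, (1, 1, 1, 1))        # embed the 3x3 kernel in the centre of a 5x5
one_conv  = F.conv2d(x, merged, padding=2)

print(torch.allclose(two_convs, one_conv, atol=1e-5))  # True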

2020

Journal

GANimation: One-Shot Anatomically Consistent Facial Animation
A.Pumarola, A.Agudo, A.M.Martinez, A.Sanfeliu and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2020

@article{Pumarola_ijcv2020,
title = {GANimation: One-Shot Anatomically Consistent Facial Animation},
author = {Albert Pumarola and Antonio Agudo and Aleix M. Martinez and Alberto Sanfeliu and Francesc Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {128},
number = {},
issn = {0920-5691},
pages = {698-713},
doi = {},
year = {2020},
month = {March}
}

Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions the GAN's generation process on images of a specific domain, namely a set of images of people sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content and granularity of the dataset. To address this limitation, in this paper we introduce a novel GAN conditioning scheme based on Action Unit (AU) annotations, which describe, in a continuous manifold, the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a weakly supervised strategy to train the model, which only requires images annotated with their activated AUs, and exploit a novel self-learned attention mechanism that makes our network robust to changing backgrounds, lighting conditions and occlusions. Extensive evaluation shows that our approach goes beyond competing conditional generators both in the capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, and in the capacity to deal with images in the wild. The code of this work is publicly available at https://github.com/albertpumarola/GANimation.

Dual-branch CNNs for Vehicle Detection and Tracking on LiDAR Data
V.Vaquero, I.del Pino, F.Moreno-Noguer, J.Solà, A.Sanfeliu and J.Andrade
IEEE Transactions on Intelligent Transportation Systems (T-ITS), 2020

@article{Vaquero_tits2020,
title = {Dual-branch CNNs for Vehicle Detection and Tracking on LiDAR Data},
author = {Victor Vaquero and Ivan del Pino and Francesc Moreno-Noguer and Joan Solà and Alberto Sanfeliu and Juan Andrade},
journal = {IEEE Transactions on Intelligent Transportation Systems (T-ITS)},
volume = {},
number = {},
issn = {1524-9050},
pages = {1-12},
doi = {https://doi.org/10.1109/TITS.2020.2998771},
year = {2020},
month = {}
}

We present a novel vehicle detection and tracking system that works solely on 3D LiDAR information. Our approach segments vehicles using a dual-view representation of the 3D LiDAR point cloud on two independently trained convolutional neural networks, one for each view. A bounding box growing algorithm is applied to the fused output of the networks to properly enclose the segmented vehicles. Bounding boxes are grown using a probabilistic method that also takes occluded areas into account. The final vehicle bounding boxes act as observations for a multi-hypothesis tracking system which allows estimating the position and velocity of the observed vehicles. We thoroughly evaluate our system on the KITTI benchmarks, both for detection and tracking separately, and show that our dual-branch classifier consistently outperforms previous single-branch approaches, improving on or directly competing with other state-of-the-art LiDAR-based methods.

Conference

3D Human Shape and Pose from a Single Low-Resolution Image  
X.Xu, H.Chen, F.Moreno-Noguer, L.Jeni and F.De la Torre 
European Conference on Computer Vision (ECCV), 2020

@inproceedings{Xu_eccv2020,
title = {3D Human Shape and Pose from a Single Low-Resolution Image},
author = {Xiangyu Xu and Hao Chen and Francesc Moreno-Noguer and Laszlo Attila Jeni and Fernando De la Torre},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2020}
}

3D human shape and pose estimation from monocular images has been an active area of research in computer vision, having a substantial impact on the development of new applications, from activity recognition to creating virtual avatars. Existing deep learning methods for 3D human shape and pose estimation rely on relatively high-resolution input images; however, high-resolution visual content is not always available in several practical scenarios such as video surveillance and sports broadcasting. Low-resolution images in real scenarios can vary in a wide range of sizes, and a model trained at one resolution does not typically degrade gracefully across resolutions. Two common approaches to solve the problem of low-resolution input are applying super-resolution techniques to the input images, which may result in visual artifacts, or simply training one model for each resolution, which is impractical in many realistic applications. To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a resolution-aware network, a self-supervision loss, and a contrastive learning scheme. The proposed network is able to learn the 3D body shape and pose across different resolutions with a single model. The self-supervision loss encourages scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both these new training losses provide robustness when learning 3D shape and pose in a weakly-supervised manner. Extensive experiments demonstrate that the RSC-Net can achieve consistently better results than the state-of-the-art methods for challenging low-resolution images.

GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes    (Oral)
E.Corona, A.Pumarola, G.Alenyà, F.Moreno-Noguer and G.Rogez 
Conference on Computer Vision and Pattern Recognition (CVPR), 2020

@inproceedings{Corona_cvpr2020a,
title = {GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes},
author = {Enric Corona and Albert Pumarola and Guillem Aleny{\`a} and Francesc Moreno-Noguer and Gregory Rogez},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}

The rise of deep learning has brought remarkable progress in estimating hand geometry from images where the hands are part of the scene. This paper focuses on a new problem not explored so far, consisting in predicting how a human would grasp one or several objects, given a single RGB image of these objects. This is a problem with enormous potential in e.g. augmented reality, robotics or prosthetic design. In order to predict feasible grasps, we need to understand the semantic content of the image, its geometric structure and all potential interactions with a hand physical model. To this end, we introduce a generative model that jointly reasons at all these levels and 1) regresses the 3D shape and pose of the objects in the scene; 2) estimates the grasp types; and 3) refines the 51 DoF of a 3D hand model to minimize a graspability loss. To train this model we build the YCB-Affordance dataset, which contains more than 133k images of 21 objects in the YCB-Video dataset. We have annotated these images with more than 28M plausible 3D human grasps according to a 33-class taxonomy. A thorough evaluation on synthetic and real images shows that our model can robustly predict realistic grasps, even in cluttered scenes with multiple objects in close contact.

C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds 
A.Pumarola, S.Popov, F.Moreno-Noguer and V.Ferrari
Conference on Computer Vision and Pattern Recognition (CVPR), 2020

@inproceedings{Pumarola_cvpr2020,
title = {C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds},
author = {Albert Pumarola and Stefan Popov and Francesc Moreno-Noguer and Vittorio Ferrari},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}

Flow-based generative models have highly desirable properties like exact log-likelihood evaluation and exact latent-variable inference; however, they are still in their infancy and have not received as much attention as alternative generative models. In this paper, we introduce C-Flow, a novel conditioning scheme that brings normalizing flows to an entirely new scenario with great possibilities for multi-modal data modeling. C-Flow is based on a parallel sequence of invertible mappings in which a source flow guides the target flow at every step, enabling fine-grained control over the generation process. We also devise a new strategy to model unordered 3D point clouds that, in combination with the conditioning scheme, makes it possible to address 3D reconstruction from a single image and its inverse problem of rendering an image given a point cloud. We demonstrate our conditioning method to be very adaptable, being also applicable to image manipulation, style transfer and multi-modal image-to-image mapping in a diversity of domains, including RGB images, segmentation maps and edge masks.

Context-aware Human Motion Prediction 
E.Corona, A.Pumarola, G.Alenyà and F.Moreno-Noguer
Conference on Computer Vision and Pattern Recognition (CVPR), 2020

@inproceedings{Corona_cvpr2020b,
title = {Context-aware Human Motion Prediction},
author = {Enric Corona and Albert Pumarola and Guillem Aleny{\`a} and Francesc Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2020}
}

The problem of predicting human motion given a sequence of past observations is at the core of many applications in robotics and computer vision. Current state-of-the-art methods formulate this problem as a sequence-to-sequence task, in which a history of 3D skeletons feeds a Recurrent Neural Network (RNN) that predicts future movements, typically in the order of 1 to 2 seconds. However, one aspect that has been overlooked so far is the fact that human motion is inherently driven by interactions with objects and/or other humans in the environment. In this paper, we explore this scenario using a novel context-aware motion prediction architecture. We use a semantic-graph model where the nodes parameterize the human and objects in the scene and the edges their mutual interactions. These interactions are iteratively learned through a graph attention layer, fed with the past observations, which now include both object and human body motions. Once this semantic graph is learned, we inject it into a standard RNN to predict future movements of the human/s and object/s. We consider two variants of our architecture, either freezing the contextual interactions in the future or updating them. A thorough evaluation on the “Whole-Body Human Motion Database” shows that in both cases, our context-aware networks clearly outperform baselines in which the context information is not considered.

Integrating Human Body MoCaps into Blender using RGB Images 
J.Sanchez and F.Moreno-Noguer
International Conference on Advances in Computer-Human Interaction, 2020

@inproceedings{Sanchez_achi2020,
title = {Integrating Human Body MoCaps into Blender using RGB Images},
author = {Jordi Sanchez and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Advances in Computer-Human Interaction},
year = {2020}
}

Reducing the complexity and cost of a motion capture (mocap) system has been of great interest in recent years. Unlike other systems that use depth range cameras, we present an algorithm capable of working as a mocap system with a single RGB camera, fully integrated into off-the-shelf rendering software. This makes our system easily deployable in outdoor and unconstrained scenarios. Our approach builds upon three main modules. First, given solely one input RGB image, we estimate the 2D body pose; the second module estimates the 3D human pose from the previously calculated 2D coordinates, and the last module calculates the necessary joint rotations given the goal 3D point coordinates and the 3D virtual human model. We quantitatively evaluate the first two modules using synthetic images, and provide qualitative results of the overall system with real images recorded from a webcam.

Workshop

Textual Visual Semantic Dataset for Text Spotting 
A.Sabir, F.Moreno-Noguer and L.Padró
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020

@inproceedings{Sabir_cvprw2020,
title = {Textual Visual Semantic Dataset for Text Spotting},
author = {Ahmed Sabir and Francesc Moreno-Noguer and Lluis Padro},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
year = {2020}
}

Text Spotting in the wild consists of detecting and recognizing text appearing in images (e.g. signboards, traffic signals or brands in clothing or objects). This is a challenging problem due to the complexity of the context where texts appear (uneven backgrounds, shading, occlusions, perspective distortions, etc.). Only a few approaches try to exploit the relation between text and its surrounding environment to better recognize text in the scene. In this paper, we propose a visual context dataset for Text Spotting in the wild, where the publicly available dataset COCO-text has been extended with information about the scene (such as objects and places appearing in the image) to enable researchers to include semantic relations between texts and scene in their Text Spotting systems, and to offer a common framework for such approaches. For each text in an image, we extract three kinds of context information: objects in the scene, image location label and a textual image description (caption). We use state-of-the-art out-of-the-box available tools to extract this additional information. Since this information has textual form, it can be used to leverage text similarity or semantic relation methods into Text Spotting systems, either as a post-processing or in an end-to-end training strategy. Our data is publicly available in https://git.io/JeZTb.

Differentiable Data Augmentation with Kornia 
J.Shi, E.Riba, D.Mishkin, F.Moreno-Noguer and A.Nicolaou
NeurIPS Workshop on Differentiable Computer Vision (NeurIPSW), 2020

@inproceedings{Shi_neuripsw2020,
title = {Differentiable Data Augmentation with Kornia},
author = {Jian Shi and Edgar Riba and Dmytro Mishkin and Francesc Moreno-Noguer and Anguelos Nicolaou},
booktitle = {Proceedings of the NeurIPS Workshop on Differentiable Computer Vision (NeurIPSW)},
year = {2020}
}

In this paper we present a review of the Kornia differentiable data augmentation (DDA) module for both spatial (2D) and volumetric (3D) tensors. This module leverages differentiable computer vision solutions from Kornia, with the aim of integrating data augmentation (DA) pipelines and strategies into existing PyTorch components (e.g. autograd for differentiability, optim for optimization). In addition, we provide a benchmark comparing different DA frameworks and a short review of a number of approaches that make use of Kornia DDA.

2019

Book Chapters

Relative Localization for Aerial Manipulation with PL-SLAM 
A.Pumarola, A.Vakhitov, A.Agudo, F.Moreno-Noguer and A.Sanfeliu
Chapter in Springer Tracts in Advanced Robotics, 2019

@incollection{Pumarola_springerchapter2019,
title = {Relative Localization for Aerial Manipulation with PL-SLAM},
author = {Albert Pumarola and Alexander Vakhitov and Antonio Agudo and Francesc Moreno-Noguer and Alberto Sanfeliu},
booktitle = {Springer Tracts in Advanced Robotics},
volume = {129},
isbn = {978-3-030-12944-6},
pages = {239-248},
doi = {10.1007/978-3-030-12945-3_17},
year = {2019}
}

This chapter explains a precise SLAM technique, PL-SLAM, that allows points and lines to be processed simultaneously and tackles situations where point-only based methods are prone to fail, like poorly textured scenes or motion-blurred images where feature points vanish. The method is remarkably robust against image noise, and it outperforms state-of-the-art methods for point-based contour alignment. The method can run in real time on low-cost hardware.

Precise Localization for Aerial Inspection using Augmented Reality Markers 
A.Amor, A.Ruiz, F.Moreno-Noguer and A.Sanfeliu
Chapter in Springer Tracts in Advanced Robotics, 2019

@incollection{Amor_springerchapter2019,
title = {Precise Localization for Aerial Inspection using Augmented Reality Markers},
author = {Adrian Amor-Martinez and Alberto Ruiz and Francesc Moreno-Noguer and Alberto Sanfeliu},
booktitle = {Springer Tracts in Advanced Robotics},
volume = {129},
isbn = {978-3-030-12944-6},
pages = {249-259},
doi = {10.1007/978-3-030-12945-3_17},
year = {2019}
}

This chapter is devoted to explaining a method for precise localization using augmented reality markers. The method can achieve a position accuracy better than 5 mm at a distance of 0.7 m, using a 17 mm × 17 mm visual marker, and it can be used by the controller while the aerial robot performs a manipulation task. The localization method is based on optimizing the alignment of deformable contours from textureless images, working from the raw vertexes of the observed contour. The algorithm optimizes the alignment of the XOR area computed by means of computer graphics clipping techniques. The method can run at 25 frames per second.

Journals

Robust Spatio-Temporal Clustering and Reconstruction of Multiple Deformable Bodies 
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2019

@article{Agudo_pami2019,
title = {Robust Spatio-Temporal Clustering and Reconstruction of Multiple Deformable Bodies},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {41},
number = {4},
issn = {0162-8828},
pages = {971 - 984},
doi = {10.1109/TPAMI.2017.2676778},
year = {2019}
}

In this paper we present an approach to reconstruct the 3D shape of multiple deforming objects from a collection of sparse, noisy and possibly incomplete 2D point tracks acquired by a single monocular camera. Additionally, the proposed solution estimates the camera motion and reasons about the spatial segmentation (i.e., identifies each of the deforming objects in every frame) and temporal clustering (i.e., splits the sequence into motion primitive actions). This advances competing work, which mainly tackled the problem for one single object and non-occluded tracks. In order to handle several objects at a time from partial observations, we model point trajectories as a union of spatial and temporal subspaces, and optimize the parameters of both modalities, the non-observed point tracks, the camera motion, and the time-varying 3D shape via augmented Lagrange multipliers. The algorithm is fully unsupervised and does not require any training data at all. We thoroughly validate the method on challenging scenarios with several human subjects performing different activities which involve complex motions and close interaction. We show our approach achieves state-of-the-art 3D reconstruction results, while it also provides spatial and temporal segmentation.

Shape Basis Interpretation for Monocular Deformable 3D Reconstruction 
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Multimedia, 2019

@article{Agudo_tm2019,
title = {Shape Basis Interpretation for Monocular Deformable 3D Reconstruction},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Multimedia},
volume = {21},
number = {4},
issn = {1941-0077},
pages = {821 - 834},
doi = {10.1109/TMM.2018.2870081},
year = {2019}
}

In this paper, we propose a novel interpretable shape model to encode object non-rigidity. We first use the initial frames of a monocular video to recover a rest shape, used later to compute a dissimilarity measure based on a distance matrix measurement. Spectral analysis is then applied to this matrix to obtain a reduced shape basis that, in contrast to existing approaches, can be physically interpreted. In turn, these pre-computed shape bases are used to linearly span the deformation of a wide variety of objects. We introduce the low-rank basis into a sequential approach to recover both camera motion and non-rigid shape from the monocular video, by simply optimizing the weights of the linear combination using bundle adjustment. Since the number of parameters to optimize per frame is relatively small, especially when physical priors are considered, our approach is fast and can potentially run in real time. Validation is done on a wide variety of real-world objects undergoing both inextensible and extensible deformations. Our approach achieves remarkable robustness to artifacts such as noisy and missing measurements and shows improved performance with respect to competing methods.
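
A toy NumPy sketch of the basis construction described above: a pairwise-distance (dissimilarity) matrix computed on the rest shape is diagonalized and its dominant eigenvectors are kept as a low-rank shape basis. The exact dissimilarity measure and normalization used in the paper may differ:

import numpy as np

def shape_basis_from_rest_shape(rest_shape, rank=10):
    # rest_shape: (P, 3) points of the shape recovered from the initial frames.
    diff = rest_shape[:, None, :] - rest_shape[None, :, :]
    D = np.linalg.norm(diff, axis=-1)       # P x P distance (dissimilarity) matrix
    evals, evecs = np.linalg.eigh(D)        # spectral analysis of the symmetric matrix
    order = np.argsort(-np.abs(evals))      # keep the dominant modes
    return evecs[:, order[:rank]]           # (P, rank) basis spanning the deformations

basis = shape_basis_from_rest_shape(np.random.rand(100, 3))
print(basis.shape)  # (100, 10)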

Online learning and detection of faces with low human supervision 
M.Villamizar, A.Sanfeliu and F.Moreno-Noguer
The Visual Computer, 2019

@article{Villamizar_vc2019,
title = {Online learning and detection of faces with low human supervision},
author = {Michael Villamizar and Alberto Sanfeliu and Francesc Moreno-Noguer},
journal = {The Visual Computer},
volume = {35},
issn = {0178-2789},
pages = {349-370},
doi = {10.1007/s00371-018-01617-y},
year = {2019}
}

We present an efficient, online, and interactive approach for computing a classifier, called Wild Lady Ferns (WiLFs), for face learning and detection with little human supervision. More precisely, on the one hand, WiLFs combine online boosting and extremely randomized trees (random ferns) to progressively compute an efficient and discriminative classifier. On the other hand, WiLFs use an interactive human–machine approach that combines two complementary learning strategies to considerably reduce the degree of human supervision during learning. While the first strategy corresponds to query-by-boosting active learning, which requests human assistance for difficult samples as a function of the classifier confidence, the second strategy refers to a memory-based learning which uses K exemplar-based nearest neighbors (KENN) to automatically assist the classifier. A pretrained convolutional neural network is used to perform KENN with high-level feature descriptors. The proposed approach is therefore fast (WiLFs run at 1 FPS using code that is not fully optimized), accurate (we obtain detection rates over 82% in complex datasets), and labor-saving (human assistance percentages of less than 20%). As a by-product, we demonstrate that WiLFs also perform semiautomatic annotation during learning: while the classifier is being computed, WiLFs discover face instances in input images which are subsequently used for training the classifier online. The advantages of our approach are demonstrated in synthetic and publicly available databases, showing comparable detection rates to offline approaches that require larger amounts of handmade training data.

Using a new high-throughput video-tracking platform to assess behavioural changes in Daphnia magna exposed to neuro-active drugs 
Fátima Simão et al.
Science of The Total Environment, 2019

@article{Simao_ste2019,
title = {Using a new high-throughput video-tracking platform to assess behavioural changes in Daphnia magna exposed to neuro-active drugs},
author = {Fátima Simão and Fernando Martínez-Jerónimo and Victor Blasco and Francesc Moreno-Noguer and Josep M. Porta and João L.T. Pestana and Amadeu M.V.M. Soares and Demetrio Raldúa and Carlos Barata},
journal = {Science of The Total Environment},
volume = {662},
issn = {0048-9697},
pages = {160-167},
doi = {10.1016/j.scitotenv.2019.01.187},
year = {2019}
}

Recent advances in imaging make it possible to monitor in real time the behaviour of individuals under a given stress. Light is a common stressor that alters the behaviour of fish larvae and many aquatic invertebrate species. The water flea Daphnia magna exhibits a vertical negative phototaxis, swimming against light trying to avoid fish predation. The aim of this study was to develop a high-throughput image analysis system to study changes in the vertical negative phototaxis of D. magna first reproductive adult females exposed to 0.1 and 1 μg/L of four neuro-active drugs: diazepam, fluoxetine, propranolol and carbamazepine. Experiments were conducted using a custom-designed experimental chamber containing four independent arenas and infrared illumination. The apical-located visible light and the GigE camera located in front of the arenas were controlled by the Ethovision XT 11.5 software (Noldus Information Technology, Leesburg, VA). Total distance moved, time spent per zone (bottom vs upper zones) and distance among individuals were analyzed in dark and light conditions, and the effect of different intensities of the apical-located visible light was also investigated. Results indicated that light intensity increased the locomotor activity and that low light intensities made it possible to better discriminate individual responses to the studied drugs. The four tested drugs decreased the response of exposed organisms to light: individuals moved less, were closer to the bottom and, at low light intensities, were closer to each other. At high light intensities, however, exposed individuals were less aggregated. Propranolol, carbamazepine and fluoxetine induced the most severe behavioural effects. The tested drugs at environmentally relevant concentrations altered locomotor activity, geotaxis, phototaxis and aggregation in D. magna individuals in the lab. Therefore, the new image analysis system presented here proved to be sensitive and versatile enough to detect changes in diel vertical migration across light intensities and low concentration levels of neuro-active drugs.

Conference

3DPeople: Modeling the Geometry of Dressed Humans
A.Pumarola, J.Sanchez, G.P.Choi, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Computer Vision (ICCV), 2019

@inproceedings{Pumarola_iccv2019,
title = {3DPeople: Modeling the Geometry of Dressed Humans},
author = {A. Pumarola and J. Sanchez and G.P.T. Choi and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2019}
}

Recent advances in 3D human shape estimation build upon parametric representations that model the shape of the naked body very well, but are not appropriate to represent clothing geometry. In this paper, we present an approach to model dressed humans and predict their geometry from single images. We contribute in three fundamental aspects of the problem, namely, a new dataset, a novel shape parameterization algorithm and an end-to-end deep generative network for predicting shape. First, we present 3DPeople, a large-scale synthetic dataset with 2 million photo-realistic images of 80 subjects performing 70 activities and wearing diverse outfits. We annotate the dataset with SMPL body parameters, segmentation masks, skeletons, depth, normal maps and optical flow. All this together makes 3DPeople suitable for a plethora of tasks. We then represent the 3D shapes using 2D geometry images. To build these images we propose a novel spherical area-preserving parameterization algorithm based on the optimal mass transportation method. We show that this approach improves upon existing spherical maps, which tend to shrink the elongated parts of full-body models such as the arms and legs, making the geometry images incomplete. Finally, we design a multi-resolution deep generative network that, given an input image of a dressed human, predicts his/her geometry image (and thus the clothed body shape) in an end-to-end manner. We obtain very promising results in jointly capturing body pose and clothing shape, both on synthetic validation data and on in-the-wild images.

Human Motion Prediction via Spatio-Temporal Inpainting 
A.Hernandez, J.Gall and F.Moreno-Noguer
International Conference on Computer Vision (ICCV), 2019

@inproceedings{Hernandez_iccv2019,
title = {Human Motion Prediction via Spatio-Temporal Inpainting},
author = {A. Hernandez and J. Gall and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2019}
}

We propose a Generative Adversarial Network (GAN) to forecast 3D human motion given a sequence of past 3D skeleton poses. While recent GANs have shown promising results, they can only forecast plausible motion over relatively short periods of time (a few hundred milliseconds) and typically ignore the absolute position of the skeleton w.r.t. the camera. Our scheme provides long-term predictions (two seconds or more) for both the body pose and its absolute position. Our approach builds upon three main contributions. First, we represent the data using a spatio-temporal tensor of 3D skeleton coordinates, which allows formulating the prediction problem as an inpainting one, for which GANs work particularly well. Second, we design an architecture to learn the joint distribution of body poses and global motion, capable of hypothesizing large chunks of the input 3D tensor with missing data. Finally, we argue that the L2 metric, considered so far by most approaches, fails to capture the actual distribution of long-term human motion. We propose two alternative metrics, based on the distribution of frequencies, that are able to capture more realistic motion patterns. Extensive experiments demonstrate that our approach significantly improves the state of the art, while also handling situations in which past observations are corrupted by occlusions, noise and missing frames.
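
A hedged sketch of a frequency-domain comparison in the spirit of the metrics described above: instead of a frame-wise L2 error, predicted and real joint trajectories are compared through their power spectra. The array shapes and the aggregation are illustrative assumptions, not the paper's exact definition.

import numpy as np

def spectral_distance(pred, real):
    """pred, real: arrays of shape (T, J, 3) with T frames and J joints."""
    # Power spectrum of each coordinate sequence along the temporal axis.
    ps_pred = np.abs(np.fft.rfft(pred, axis=0)) ** 2
    ps_real = np.abs(np.fft.rfft(real, axis=0)) ** 2
    # Normalize so each sequence defines a distribution over frequencies.
    ps_pred /= ps_pred.sum(axis=0, keepdims=True) + 1e-8
    ps_real /= ps_real.sum(axis=0, keepdims=True) + 1e-8
    # Average absolute difference between the two frequency distributions.
    return np.mean(np.abs(ps_pred - ps_real))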

Semantic Relatedness based Re-ranker for Text Spotting 
A.Sabir, F.Moreno-Noguer and L.Padró
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

@inproceedings{Sabir_emnlp2019,
title = {Semantic Relatedness based Re-ranker for Text Spotting},
author = {Ahmed Sabir and Francesc Moreno-Noguer and Lluís Padró},
booktitle = {Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2019}
}

Applications such as textual entailment, plagiarism detection or document clustering rely on the notion of semantic similarity, and are usually approached with dimension reduction techniques like LDA or with embedding-based neural approaches. We present a scenario where semantic similarity is not enough, and we devise a neural approach to learn semantic relatedness. The scenario is text spotting in the wild, where a text in an image (e.g. street sign, advertisement or bus destination) must be identified and recognized. Our goal is to improve the performance of vision systems by leveraging semantic information. Our rationale is that the text to be spotted is often related to the image context in which it appears (word pairs such as Delta–airplane, or quarters–parking are not similar, but are clearly related). We show how learning a word-to-word or word-to-sentence relatedness score can improve the performance of text spotting systems by up to 2.9 points, outperforming other measures on a benchmark dataset.
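
A minimal sketch (an assumption, not the paper's exact model) of how such a relatedness score could be used to re-rank text hypotheses: the recognition confidence of each word candidate is combined with its similarity to an embedding of the image context (e.g. a detected object label such as "airplane"). All names and the mixing weight are illustrative.

import numpy as np

def rerank(hypotheses, context_vec, embed, alpha=0.7):
    """hypotheses: list of (word, recognition_score); embed: word -> vector."""
    def relatedness(word):
        v = embed(word)
        return float(v @ context_vec /
                     (np.linalg.norm(v) * np.linalg.norm(context_vec) + 1e-8))
    # Convex combination of the visual recognition score and the relatedness.
    scored = [(w, alpha * s + (1 - alpha) * relatedness(w)) for w, s in hypotheses]
    return sorted(scored, key=lambda ws: ws[1], reverse=True)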

Improving Map Re-localization with Deep `Movable' Objects Segmentation on 3D LiDAR Point Clouds 
V.Vaquero, K.Fischer, F.Moreno-Noguer, A.Sanfeliu and S.Milz
IEEE Intelligent Transportation Systems Conference (ITSC), 2019

@inproceedings{Vaquero_itsc2019,
title = {Improving Map Re-localization with Deep `Movable' Objects Segmentation on 3D LiDAR Point Clouds},
author = {Victor Vaquero and Kai Fischer and Francesc Moreno-Noguer and Alberto Sanfeliu and Stefan Milz},
booktitle = {Proceedings of IEEE Intelligent Transportation Systems Conference (ITSC)},
year = {2019}
}

Localization and mapping is an essential component for enabling autonomous vehicle navigation, and requires an accuracy exceeding that of commercial GPS-based systems. Current odometry and mapping algorithms are nowadays able to provide this accurate information. However, the lack of robustness of these algorithms against dynamic obstacles and environmental changes, even over short time periods, forces the generation of new maps on every session without taking advantage of previously built ones. In this paper we propose the use of a deep learning network to segment movable objects from 3D LiDAR point clouds in order to obtain longer-lasting 3D maps. This in turn allows for better, faster and more accurate re-localization and trajectory estimation on subsequent days. We show the effectiveness of our approach in a very dynamic and cluttered scenario, namely a supermarket parking lot. To this end, we record several sequences on different days and compare localization errors with and without our movable-objects segmentation method. Results show that we are able to accurately re-locate over the filtered map, consistently reducing trajectory errors by an average of 35.1% with respect to the unfiltered map and by 47.9% with respect to a standalone map created in the current session.

Vehicle detection on an FPGA from LiDAR Point Clouds 
J.García, A.Agudo and F.Moreno-Noguer
International Conference on Watermarking and Image Processing (ICWIP), 2019

@inproceedings{Garcia_icwip2019,
title = {Vehicle detection on an FPGA from LiDAR Point Clouds},
author = {Javier García and Antonio Agudo and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Watermarking and Image Processing (ICWIP)},
year = {2019}
}

This paper presents a deep neural network architecture designed to run on a field-programmable gate array (FPGA) for vehicle detection on LiDAR point clouds. The network, based on VoxelNet, is adapted to run on an FPGA and to locate vehicles in point clouds from 32- and 64-channel sensors. The KITTI and nuScenes datasets have been used for training. This work aims to motivate the usage of dedicated FPGA targets for training and validating neural networks due to their accelerated computational capability compared to the well-known GPUs. This platform also has some constraints (e.g. limited memory) that need to be assessed and handled during development. This research presents an implementation that overcomes such limitations and obtains results as good as those of a GPU. The work makes use of a state-of-the-art dataset, nuScenes, which is formed by 6 cameras, 5 radars and 1 LiDAR, all with a full 360-degree field of view, and provides seven times more annotations than the KITTI dataset. The presented work demonstrates real-time performance and good detection accuracy when moving part of the CNN of the proposed architecture to a commercial FPGA.
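
An illustrative sketch of the kind of voxelization a VoxelNet-style front end performs on the raw point cloud before the learned layers; the grid resolution and spatial ranges below are assumptions, not values from the paper.

import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=((-40, 40), (-40, 40), (-3, 1))):
    """points: (N, 3) LiDAR points -> dict mapping voxel index to its points."""
    lows = np.array([r[0] for r in pc_range], dtype=float)
    highs = np.array([r[1] for r in pc_range], dtype=float)
    mask = np.all((points >= lows) & (points < highs), axis=1)
    pts = points[mask]
    # Integer voxel coordinates of each surviving point.
    idx = np.floor((pts - lows) / np.array(voxel_size)).astype(int)
    voxels = {}
    for i, key in enumerate(map(tuple, idx)):
        voxels.setdefault(key, []).append(pts[i])
    return {k: np.stack(v) for k, v in voxels.items()}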

Vehicle Pose Estimation via Regression of Semantic Points of Interest 
J.García, A.Agudo and F.Moreno-Noguer
International Symposium on Image and Signal Processing and Analysis (ISPA), 2019

@inproceedings{Garcia_ispa2019,
title = {Vehicle Pose Estimation via Regression of Semantic Points of Interest},
author = {Javier García and Antonio Agudo and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Symposium on Image and Signal Processing and Analysis (ISPA)},
year = {2019}
}

In this paper we address the problem of extracting the 3D pose of a vehicle from a single 2D RGB image. We present an accurate methodology capable of locating the 3D coordinates of 20 pre-defined semantic vehicle points of interest, or keypoints, from 2D information. The presented two-step pipeline provides a straightforward way of extracting three-dimensional information from planar images, while also avoiding the use of additional sensors that would lead to a more expensive and harder-to-manage system. The main contribution of this work is a pair of dedicated network architectures that are able to simultaneously locate occluded and visible semantic points of interest and to convert these 2D points into 3D space in a simple but efficient way. The presented method uses a robust network based on the Stacked Hourglass architecture for precise prediction of semantic 2D keypoints of vehicles, even when they are occluded. In a second step, another dedicated network converts the 2D points into 3D world coordinates, so the 3D pose of the vehicle can be automatically extracted, outperforming state-of-the-art techniques in terms of accuracy.

Single View Facial Hair 3D Reconstruction 
G.Rotger, F.Moreno-Noguer, F.Lumbreras and A.Agudo
Iberian Conference on Pattern Recognition and Image Analysis (IBPRIA), 2019

@inproceedings{Rotger_ibpria2019,
title = {Single View Facial Hair 3D Reconstruction},
author = {Gemma Rotger and Francesc Moreno-Noguer and Felipe Lumbreras and Antonio Agudo},
booktitle = {Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis (IBPRIA)},
year = {2019}
}

In this work, we introduce a novel energy-based framework that addresses the challenging problem of 3D reconstruction of facial hair from a single RGB image. To this end, we identify hair pixels over the image via texture analysis and then determine individual hair fibers, which are modeled by means of a parametric hair model based on 3D helixes. We propose to minimize an energy composed of several terms, in order to adapt the hair parameters that best fit the image detections. The final hairs correspond to the resulting fibers after a post-processing step where we encourage further realism. The resulting approach generates realistic facial hair fibers from solely an RGB image without assuming any training data or user interaction. We provide an experimental evaluation on real-world pictures where several facial hair styles and image conditions are observed, showing consistent results and establishing a comparison with respect to competing approaches.
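
A hedged sketch of a parametric 3D helix, the primitive used above to model individual hair fibers; the particular parameterization (radius, pitch, number of turns, sample count) is an illustrative assumption.

import numpy as np

def helix_fiber(radius=0.05, pitch=0.02, turns=3.0, samples=100):
    """Return an array of shape (samples, 3) with points along a 3D helix."""
    t = np.linspace(0.0, 2.0 * np.pi * turns, samples)
    x = radius * np.cos(t)
    y = radius * np.sin(t)
    z = pitch * t / (2.0 * np.pi)   # rise per full turn equals the pitch
    return np.stack([x, y, z], axis=1)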

Detailed 3D Face Reconstruction from a Single RGB Image 
G.Rotger, F.Moreno-Noguer, F.Lumbreras and A.Agudo
International Conference on Computer Graphics, Visualization and Computer Vision (WSCG), 2019

@inproceedings{Rotger_wscg2019,
title = {Detailed 3D Face Reconstruction from a Single RGB Image},
author = {Gemma Rotger and Francesc Moreno-Noguer and Felipe Lumbreras and Antonio Agudo},
booktitle = {Proceedings of the International Conference on Computer Graphics, Visualization and Computer Vision (WSCG)},
year = {2019}
}

This paper introduces a method to obtain a detailed 3D reconstruction of facial skin from a single RGB image. To this end, we propose the exclusive use of an input image, without requiring any information about the observed material or training data to model the wrinkle properties. Wrinkles are detected and characterized directly from the image via a simple and effective parametric model, determining several features such as location, orientation, width, and height. With these ingredients, we propose to minimize a photometric error to retrieve the final detailed 3D map, which is initialized by current techniques based on deep learning. In contrast with other approaches, we only require estimating a depth parameter, making our approach fast and intuitive. Extensive experimental evaluation is presented on a wide variety of synthetic and real images, including different skin properties and facial expressions. In all cases, our method outperforms current approaches in terms of 3D reconstruction accuracy, providing striking results for both large and fine wrinkles.

Modeling a new Workflow based on Emotional Analysis of Floor-plans using Machine Learning Algorithms and Semiotics 
N.Fatemi, J.Nikolic and F.Moreno-Noguer
International Conference on Virtual City and Territory, 2019

@inproceedings{Fatemi_ctv2019,
title = {Modeling a new Workflow based on Emotional Analysis of Floor-plans using Machine Learning Algorithms and Semiotics},
author = {Nima Fatemi and Jelena Nikolic and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Virtual City and Territory},
year = {2019}
}

The initial purpose of technology is to aid us in repetitive tasks. In recent years, for example, CAD programs have been helping designers spend more time on the design itself; being limited by the tool seems like a distant memory. Designers can generate complex forms and plans for their designs; however, like our predecessors, we are still open to all kinds of mistakes. With the emergence of Artificial Intelligence, not only can we make machines perform specific tasks for us, but they can also learn to guess, predict and plan for the future, avoiding repeating the same mistakes (Tech Innovations to Help Manage Project Data and Create New Ways of Designing, 2018). Specifically, machine learning (ML) is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed (Stuart Russell and Peter Norvig, 2009). As architects, we are responsible for what we design and carry out and, even further, for the effects that our buildings render into the world. Therefore, in academia, we approach design as a practice of refinement: a process of generating alternatives and testing them, over and over, until finding the final option. This is indeed very similar to the way an automated machine works, except that machines are free of human error. With the help of current technologies, we can train machines to learn the design process and aid us in various tasks such as planning, optimization, and prediction of the outcome. One of the most fundamental aspects of the design of a building is the process of generating plans based on the user's needs, in which many factors actively affect the process. Many factors drive the generation and design of an architectural plan, and our emotions towards a specific space are among the important ones, yet they are often dismissed by the designer. By applying AI to this process, which follows the same principles, the designer is constantly supported by recorded knowledge that can help him or her avoid such mistakes (Embracing artificial intelligence in architecture, 2018). Our creative goal is to develop an AI which can establish a dialogue between the designer and the user's emotions, making the design more efficient for the user. The research aims to find hidden relationships between the factors which shape a floor plan and the user's emotions, and to find a balance point to establish a new workflow. The first step is to train a computer program that learns the relation between our emotions and the design; this can be achieved using machine learning techniques, provided with datasets of floor plans, powered by semantic networks.

Workshop

Unsupervised Image-to-Video Clothing Transfer 
A.Pumarola, V.Goswami, F.Vicente, F. De la Torre and F.Moreno-Noguer
International Conference on Computer Vision Workshops (ICCVW), 2019

@inproceedings{Pumarola_iccvw2019,
title = {Unsupervised Image-to-Video Clothing Transfer},
author = {Albert Pumarola and Vedanuj Goswami and Francisco Vicente and Fernando De la Torre and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision Workshops (ICCVW)},
year = {2019}
}

We present a system to photo-realistically transfer the clothing of a person in a reference image onto another person in an unconstrained image or video. Our architecture is based on a GAN equipped with a physical memory that updates an initially incomplete texture map of the clothes, which is progressively completed with the newly inferred occluded parts. The system is trained in an unsupervised manner. The results are visually appealing and open up the possibility of using it in the future as a quick virtual try-on clothing system.

2018

Journal

Force-based Representation for Non-Rigid Shape and Elastic Model Estimation 
A.Agudo and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018

@article{Agudo_pami2018,
title = {Force-based Representation for Non-Rigid Shape and Elastic Model Estimation},
author = {A. Agudo and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {40},
number = {9},
issn = {0162-8828},
pages = {2137 - 2150},
doi = {10.1109/TPAMI.2017.2676778},
year = {2018}
}

This paper addresses the problem of simultaneously recovering 3D shape, pose and the elastic model of a deformable object from only 2D point tracks in a monocular video. This is a severely under-constrained problem that has typically been addressed by enforcing the shape or the point trajectories to lie on low-rank dimensional spaces. We show that formulating the problem in terms of a low-rank force space that induces the deformation, and introducing the elastic model as an additional unknown, allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object's behavior. In order to simultaneously estimate force, pose, and the elastic model of the object we use an expectation maximization strategy, where each of these parameters is successively learned by partial M-steps. Once the elastic model is learned, it can be transferred to similar objects to code their 3D deformation. Moreover, our approach can robustly deal with missing data, and encodes both rigid and non-rigid points under the same formalism. We thoroughly validate the approach on Mocap and real sequences, showing more accurate 3D reconstructions than state-of-the-art methods, and additionally providing an estimate of the full elastic model with no a priori information.

Boosted Random Ferns for Object Detection  
M.Villamizar, J.Andrade, A.Sanfeliu and F.Moreno-Noguer  
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018

@article{Villamizar_pami2018,
title = {Boosted Random Ferns for Object Detection},
author = {M. Villamizar and J. Andrade and A. Sanfeliu and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {40},
number = {2},
issn = {0162-8828},
pages = {272 - 288},
doi = {10.1109/TPAMI.2017.2676778},
year = {2018},
month = {February}
}

In this paper we introduce Boosted Random Ferns (BRFs) to rapidly build discriminative classifiers for learning and detecting object categories. At the core of our approach we use standard random ferns, but we introduce four main innovations that let us bring ferns from the instance to the category level while still retaining efficiency. First, we define binary features in the histogram of oriented gradients domain (as opposed to the intensity domain), allowing for a better representation of intra-class variability. Second, both the positions where ferns are evaluated within the sliding window and the locations of the binary features for each fern are not chosen completely at random; instead, we use a boosting strategy to pick the most discriminative combination of them. This is further enhanced by our third contribution, which adapts the boosting strategy to enable sharing of binary features among different ferns, yielding high recognition rates at a low computational cost. Finally, we show that training can be performed online, for sequentially arriving images. Overall, the resulting classifier can be trained very efficiently, densely evaluated over all image locations in about 0.1 seconds, and provides detection rates similar to competing approaches that require expensive and significantly slower processing. We demonstrate the effectiveness of our approach through thorough experimentation on publicly available datasets, in which we compare against the state-of-the-art, for tasks of both 2D detection and 3D multi-view estimation.
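
A minimal sketch of how a single random fern turns M binary feature tests into an index of a learned probability table, and how a boosted combination of ferns produces a detection score. The HOG-based tests themselves and the learned tables are omitted; all names are illustrative assumptions.

import numpy as np

def fern_response(patch, tests, log_ratio_table):
    """patch: feature map; tests: list of M callables patch -> {0,1};
    log_ratio_table: array of length 2**M with log P(z|object)/P(z|background)."""
    z = 0
    for bit, test in enumerate(tests):   # each binary test contributes one bit
        z |= (test(patch) & 1) << bit
    return log_ratio_table[z]

def classifier_score(patch, ferns):
    """Boosted combination: weighted sum of the individual fern responses.
    ferns: list of (weight, tests, log_ratio_table) triplets."""
    return sum(w * fern_response(patch, t, table) for w, t, table in ferns)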

A Scalable, Efficient, and Accurate Solution to Non-Rigid Structure from Motion  
A.Agudo and F.Moreno-Noguer  
Computer Vision and Image Understanding (CVIU), 2018

@article{Agudo_cviu2018,
title = {A Scalable, Efficient, and Accurate Solution to Non-Rigid Structure from Motion},
author = {A. Agudo and F. Moreno-Noguer},
journal = {Computer Vision and Image Understanding},
volume = {167},
issue = {C},
issn = {1077-3142},
pages = {121-133},
doi = {10.1016/j.cviu.2018.01.002},
year = {2018},
month = {February}
}

We introduce a new probabilistic point trajectory approach to recover a 3D time-varying shape from RGB video. It can handle scenarios with one or multiple objects, missing, noisy, sparse and dense data, and mild or sharp deformations. In addition, it can incorporate spatial correlation priors that define the similarities between object points. Our approach outperforms state-of-the-art techniques in terms of generality, versatility, accuracy and efficiency. Most Non-Rigid Structure from Motion (NRSfM) solutions are based on factorization approaches that allow reconstructing objects parameterized by a sparse set of 3D points. These solutions, however, are low resolution and, generally, do not scale well to more than a few tens of points. While there have been recent attempts at bringing NRSfM to a dense domain, using for instance variational formulations, these are computationally demanding alternatives which require certain spatial continuity of the data, preventing their use for articulated shapes with large deformations or situations with multiple discontinuous objects. In this paper, we propose incorporating existing point trajectory low-rank models into a probabilistic framework for matrix normal distributions. With this formalism, we can then simultaneously learn shape and pose parameters using expectation maximization, and easily exploit additional priors such as known point correlations. While similar frameworks have been used before to model distributions over shapes, here we show that formulating the problem in terms of distributions over trajectories brings remarkable improvements, especially in generality and efficiency. We evaluate the proposed approach in a variety of scenarios including one or multiple objects, sparse or dense reconstructions, missing observations, mild or sharp deformations, and in all cases, with minimal prior knowledge and low computational cost.

Conference

GANimation: Anatomically-aware Facial Animation from a Single Image   (Oral)
A.Pumarola, A.Agudo, A.M.Martinez, A.Sanfeliu and F.Moreno-Noguer 
European Conference on Computer Vision (ECCV), 2018

@inproceedings{Pumarola_eccv2018,
title = {GANimation: Anatomically-aware Facial Animation from a Single Image},
author = {A. Pumarola and A. Agudo and A.M. Martinez and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2018}
}

Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions the GAN's generation process on images of a specific domain, namely a set of images of people sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper we introduce a novel GAN conditioning scheme based on Action Unit (AU) annotations, which describe, in a continuous manifold, the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a fully unsupervised strategy to train the model, which only requires images annotated with their activated AUs, and exploits attention mechanisms that make our network robust to changing backgrounds and lighting conditions. Extensive evaluation shows that our approach goes beyond competing conditional generators both in its capability to synthesize a much wider range of expressions, ruled by anatomically feasible muscle movements, and in its capacity to deal with images in the wild.
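
As a hedged reference (our notation, not necessarily the paper's exact formulation), one common way to write the attention-based composition mentioned above is: the generator outputs a color image C and an attention mask A in [0,1], and the final frame keeps the input I_o wherever no expression change is needed,

I_final = A ⊙ I_o + (1 − A) ⊙ C,

where ⊙ denotes element-wise (per-pixel) multiplication. Pixels with A close to 1 are copied from the original image, which is what makes the synthesis robust to backgrounds and illumination that do not depend on the expression.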

Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View
A.Pumarola, A.Agudo, L.Porzi, A.Sanfeliu, V.Lepetit and F.Moreno-Noguer 
Conference in Computer Vision and Pattern Recognition (CVPR), 2018

@inproceedings{Pumarola_cvpr2018b,
title = {Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View},
author = {A. Pumarola and A. Agudo and L. Porzi and A. Sanfeliu and V. Lepetit and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

We propose a method for predicting the 3D shape of a deformable surface from a single view. By contrast with previous approaches, we do not need a pre-registered template of the surface, and our method is robust to the lack of texture and partial occlusions. At the core of our approach is a geometry-aware deep architecture that tackles the problem as usually done in analytic solutions: first perform 2D detection of the mesh and then estimate a 3D shape that is geometrically consistent with the image. We train this architecture in an end-to-end manner using a large dataset of synthetic renderings of shapes under different levels of deformation, material properties, textures and lighting conditions. We evaluate our approach on a test split of this dataset and available real benchmarks, consistently improving state-of-the-art solutions with a significantly lower computational time.

Unsupervised Person Image Synthesis in Arbitrary Poses   (Spotlight)
A.Pumarola, A.Agudo, A.Sanfeliu and F.Moreno-Noguer 
Conference in Computer Vision and Pattern Recognition (CVPR), 2018

@inproceedings{Pumarola_cvpr2018a,
title = {Unsupervised Person Image Synthesis in Arbitrary Poses},
author = {A. Pumarola and A. Agudo and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

We present a novel approach for synthesizing photorealistic images of people in arbitrary poses using generative adversarial learning. Given an input image of a person and a desired pose represented by a 2D skeleton, our model renders the image of the same person under the new pose, synthesizing novel views of the parts visible in the input image and hallucinating those that are not seen. This problem has recently been addressed in a supervised manner, i.e., during training the ground truth images under the new poses are given to the network. We go beyond these approaches by proposing a fully unsupervised strategy. We tackle this challenging scenario by splitting the problem into two principal subtasks. First, we consider a pose conditioned bidirectional generator that maps back the initially rendered image to the original pose, hence being directly comparable to the input image without the need to resort to any training image. Second, we devise a novel loss function that incorporates content and style terms, and aims at producing images of high perceptual quality. Extensive experiments conducted on the DeepFashion dataset demonstrate that the images rendered by our model are very close in appearance to those obtained by fully supervised approaches.

Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories   (Spotlight)
A.Agudo, M.Pijoan and F.Moreno-Noguer 
Conference in Computer Vision and Pattern Recognition (CVPR), 2018

@inproceedings{Agudo_cvpr2018,
title = {Image Collection Pop-up: 3D Reconstruction and Clustering of Rigid and Non-Rigid Categories },
author = {A. Agudo and M. Pijoan and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2018}
}

This paper introduces an approach to simultaneously estimate 3D shape, camera pose, and the clustering of objects and deformation types, from partial 2D annotations in a multi-instance collection of images. Furthermore, we can indistinctly process rigid and non-rigid categories. This advances existing work, which only addresses the problem for one single object or, if multiple objects are considered, assumes they are clustered a priori. To handle this broader version of the problem, we model object deformation using a formulation based on multiple unions of subspaces, able to span from small rigid motions to complex deformations. The parameters of this model are learned via Augmented Lagrange Multipliers, in a completely unsupervised manner that does not require any training data at all. Extensive validation is provided in a wide variety of synthetic and real scenarios, including rigid and non-rigid categories with small and large deformations. In all cases our approach outperforms the state-of-the-art in terms of 3D reconstruction accuracy, while also providing clustering results that allow segmenting the images into object instances and their associated type of deformation (or the action the object is performing).

Visual Re-ranking with Natural Language Understanding for Text Spotting   (Oral)
A.Sabir, F.Moreno-Noguer and L.Padro 
Asian Conference on Computer Vision (ACCV), 2018

@inproceedings{Sabir_accv2018,
title = {Visual Re-ranking with Natural Language Understanding for Text Spotting},
author = {Ahmed Sabir and Francesc Moreno-Noguer and Lluis Padro},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
year = {2018}
}

Many scene text recognition approaches are based on purely visual information and ignore the semantic relation between scene and text. In this paper, we tackle this problem from a natural language processing perspective to fill the gap between language and vision. We propose a post-processing approach to improve scene text recognition accuracy by using occurrence probabilities of words (a unigram language model) and the semantic correlation between scene and text. To do so, we initially rely on an off-the-shelf deep neural network, already trained with a large amount of data, which provides a series of text hypotheses per input image. These hypotheses are then re-ranked using word frequencies and semantic relatedness with objects or scenes in the image. As a result of this combination, the performance of the original network is boosted with almost no additional cost. We validate our approach on the ICDAR'17 dataset.

Hallucinating Dense Optical Flow from Sparse Lidar for Autonomous Vehicles  
V.Vaquero, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Pattern Recognition (ICPR), 2018

@inproceedings{Vaquero_icpr2018,
title = {Hallucinating Dense Optical Flow from Sparse Lidar for Autonomous Vehicles},
author = {V. Vaquero and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Pattern Recognition, (ICPR)},
year = {2018}
}

In this paper we propose a novel approach to estimate dense optical flow from sparse lidar data acquired on an autonomous vehicle. This is intended to be used as a drop-in replacement of any image-based optical flow system when images are not reliable, e.g. due to adverse weather conditions or at night. In order to infer high resolution 2D flows from discrete range data we devise a three-block architecture of multiscale filters that combines multiple intermediate objectives, both in the lidar and image domains. To train this network we introduce a dataset with approximately 20K lidar samples of the KITTI dataset which we have augmented with a pseudo ground-truth image-based optical flow computed using FlowNet2. We demonstrate the effectiveness of our approach on KITTI, and show that despite using the low-resolution and sparse measurements of the lidar, we can regress dense optical flow maps which are on par with those estimated with image-based methods.

2D-to-3D Facial Expression Transfer  
G.Rotger, F.Lumbreras, F.Moreno-Noguer and A.Agudo 
International Conference on Pattern Recognition (ICPR), 2018

@inproceedings{Rotger_icpr2018,
title = {2D-to-3D Facial Expression Transfer},
author = {G. Rotger and F. Lumbreras and F. Moreno-Noguer and A. Agudo},
booktitle = {Proceedings of the International Conference on Pattern Recognition, (ICPR)},
year = {2018}
}

Automatically changing the expression and physical features of a face from an input image is a topic that has been traditionally tackled in a 2D domain. In this paper, we bring this problem to 3D and propose a framework that given an input RGB video of a human face under a neutral expression, initially computes his/her 3D shape and then performs a transfer to a new and potentially non-observed expression. For this purpose, we parameterize the rest shape –obtained from standard factorization approaches over the input video– using a triangular mesh which is further clustered into larger macro-segments. The expression transfer problem is then posed as a direct mapping between this shape and a source shape, such as the blend shapes of an off-the-shelf 3D dataset of human facial expressions. The mapping is resolved to be geometrically consistent between 3D models by requiring points in specific regions to map on semantic equivalent regions. We validate the approach on several synthetic and real examples of input faces that largely differ from the source shapes, yielding very realistic expression transfers even in cases with topology changes, such as a synthetic video sequence of a single-eyed cyclops.

Deformable Motion 3D Reconstruction by Union of Regularized Subspaces  
A.Agudo and F.Moreno-Noguer 
International Conference on Image Processing (ICIP), 2018

@inproceedings{Agudo_icip2018,
title = {Deformable Motion 3D Reconstruction by Union of Regularized Subspaces},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Image Processing, (ICIP)},
year = {2018}
}

This paper presents an approach to jointly retrieve camera pose, time-varying 3D shape, and automatic clustering based on motion primitives, from incomplete 2D trajectories in a monocular video. We introduce the concept of order-varying temporal regularization in order to exploit video data, that can be indistinctly applied to the 3D shape evolution as well as to the similarities between images. This results in a union of regularized subspaces which effectively encodes the 3D shape deformation. All parameters are learned via augmented Lagrange multipliers, in a unified and unsupervised manner that does not assume any training data at all. Experimental validation is reported on human motion from sparse to dense shapes, providing more robust and accurate solutions than state-of-the-art approaches in terms of 3D reconstruction, while also obtaining motion grouping results.

Deep Lidar CNN to Understand the Dynamics of Moving Vehicles  
V.Vaquero, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2018

@inproceedings{Vaquero_icra2018,
title = {Deep Lidar CNN to Understand the Dynamics of Moving Vehicles},
author = {V. Vaquero and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2018}
}

Perception technologies in Autonomous Driving are experiencing their golden age due to the advances in Deep Learning. Yet, most of these systems rely on the semantically rich information of RGB images. Deep Learning solutions applied to the data of other sensors typically mounted on autonomous cars (e.g. lidars or radars) are not explored much. In this paper we propose a novel solution to understand the dynamics of moving vehicles of the scene from only lidar information. The main challenge of this problem stems from the fact that we need to disambiguate the proprio-motion of the “observer” vehicle from that of the external “observed” vehicles. For this purpose, we devise a CNN architecture which at testing time is fed with pairs of consecutive lidar scans. However, in order to properly learn the parameters of this network, during training we introduce a series of so-called pretext tasks which also leverage on image data. These tasks include semantic information about vehicleness and a novel lidar-flow feature which combines standard image-based optical flow with lidar scans. We obtain very promising results and show that including distilled image information only during training, allows improving the inference results of the network at test time, even when image data is no longer used.

Visual Semantic Re-ranker for Text Spotting  
A.Sabir, F.Moreno-Noguer and L.Padro 
Iberoamerican Congress on Pattern Recognition (CIARP), 2018

@inproceedings{Sabir_ciarp2018,
title = {Visual Semantic Re-ranker for Text Spotting},
author = {Ahmed Sabir and Francesc Moreno-Noguer and Lluis Padro},
booktitle = {Proceedings of the Iberoamerican Congress on Pattern Recognition (CIARP)},
year = {2018}
}

Many current state-of-the-art methods for text recognition are based on purely local information and ignore the semantic correlation between text and its surrounding visual context. In this paper, we propose a post-processing approach to improve the accuracy of text spotting by using the semantic relation between the text and the scene. We initially rely on an off-the-shelf deep neural network that provides a series of text hypotheses for each input image. These text hypotheses are then re-ranked using the semantic relatedness with the object in the image. As a result of this combination, the performance of the original network is boosted with a very low computational cost. The proposed framework can be used as a drop-in complement for any text-spotting algorithm that outputs a ranking of word hypotheses. We validate our approach on ICDAR’17 shared task dataset.

Enhancing Text Spotting with a Language Model and Visual Context Information  
A.Sabir, F.Moreno-Noguer and L.Padro 
International Conference of the Catalan Association of Artificial Intelligence (CCIA), 2018

@inproceedings{Sabir_ccia2018,
title = {Enhancing Text Spotting with a Language Model and Visual Context Information},
author = {Ahmed Sabir and Francesc Moreno-Noguer and Lluis Padro},
booktitle = {Proceedings of the International Conference of the Catalan Association of Artificial Intelligence (CCIA)},
year = {2018}
}

This paper addresses the problem of detecting and recognizing text in images acquired 'in the wild'. This is a severely under-constrained problem which needs to tackle a number of challenges including large occlusions, changing lighting conditions, cluttered backgrounds and different font types and sizes. In order to address this problem we leverage recent and successful developments at the intersection of machine learning and natural language understanding. In particular, we initially rely on off-the-shelf deep networks already trained with large amounts of data that provide a series of text hypotheses per input image. The outputs of this network are then combined with different priors obtained from both the semantic interpretation of the image and from a scene-based language model. As a result of this combination, the performance of the original network is consistently boosted. We validate our approach on the ICDAR'17 shared task dataset.

Vehicle Pose Estimation using G-Net: Multi-Class Localization and Depth Estimation  
J.Garcia, A.Agudo and F.Moreno-Noguer 
International Conference of the Catalan Association of Artificial Intelligence (CCIA), 2018

@inproceedings{Garcia_ccia2018,
title = {Vehicle Pose Estimation using G-Net: Multi-Class Localization and Depth Estimation},
author = {Javier Garcia and Antonio Agudo and Francesc Moreno-Noguer},
booktitle = {Proceedings of the International Conference of the Catalan Association of Artificial Intelligence (CCIA)},
year = {2018}
}

In this paper we present a new network architecture, called G-Net, for 3D pose estimation from RGB images, which is trained in a weakly supervised manner. We introduce a two-step pipeline based on region-based convolutional neural networks (CNNs) for feature localization, bounding box refinement based on non-maximum suppression, and depth estimation. G-Net is able to estimate the depth from single monocular images with a self-tuned loss function. The combination of this predicted depth and the presented two-step localization allows the extraction of the 3D pose of the object. We show in experiments that our method achieves good results compared to other state-of-the-art approaches which are trained in a fully supervised manner.

2017

Journal

BreakingNews: Article Annotation by Image and Text Processing 
A.Ramisa, F.Yan, F.Moreno-Noguer and K.Mikolajczyk
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017

@article{Ramisa_pami2017,
title = {BreakingNews: Article Annotation by Image and Text Processing},
author = {A. Ramisa and F. Yan and F. Moreno-Noguer and K. Mikolajczyk},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {40},
number = {5},
issn = {0162-8828},
pages = {1072 - 1085},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2017.2721945},
year = {2017},
month = {June}
}

Building upon recent Deep Neural Network architectures, current approaches lying in the intersection of computer vision and natural language processing have achieved unprecedented breakthroughs in tasks like automatic captioning or image retrieval. Most of these learning methods, though, rely on large training sets of images associated with human annotations that specifically describe the visual content. In this paper we propose to go a step further and explore the more complex cases where textual descriptions are loosely related to the images. We focus on the particular domain of News articles in which the textual content often expresses connotative and ambiguous relations that are only suggested but not directly inferred from images. We introduce new deep learning methods that address source detection, popularity prediction, article illustration and geolocation of articles. An adaptive CNN architecture is proposed, that shares most of the structure for all the tasks, and is suitable for multitask and transfer learning. Deep Canonical Correlation Analysis is deployed for article illustration, and a new loss function based on Great Circle Distance is proposed for geolocation. Furthermore, we present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (such as GPS coordinates and popularity metrics). We show this dataset to be appropriate to explore all aforementioned problems, for which we provide a baseline performance using various Deep Learning architectures, and different representations of the textual and visual features. We report very promising results and bring to light several limitations of current state-of-the-art in this kind of domain, which we hope will help spur progress in the field.

Combining Local-Physical and Global-Statistical Models for Sequential Deformable Shape from Motion
A.Agudo and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2017

@article{Agudo_ijcv2017,
title = {Combining Local-Physical and Global-Statistical Models for Sequential Deformable Shape from Motion},
author = {A. Agudo and F. Moreno-Noguer},
journal = {International Journal of Computer Vision},
volume = {122},
number = {2},
issn = {0920-5691},
pages = {371-387},
doi = {https://doi.org/10.1007/s11263-016-0972-8},
year = {2017},
month = {April}
}

In this paper, we simultaneously estimate camera pose and non-rigid 3D shape from a monocular video, using a sequential solution that combines local and global representations. We model the object as an ensemble of particles, each ruled by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency. The resulting approach allows us to sequentially estimate shape and camera poses, while progressively learning a global low-rank model of the shape that is fed back into the optimization scheme, thus introducing global constraints. The overall combination of local (physical) and global (statistical) constraints yields a solution that is both efficient and robust to several artifacts such as noisy and missing data or sudden camera motions, without requiring any training data at all. Validation is done in a variety of real application domains, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our online methodology yields significantly more accurate reconstructions than competing sequential approaches, being even comparable to the more computationally demanding batch methods.
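
For reference, a hedged sketch of the per-particle dynamics mentioned above (the notation and the discretization are ours, not necessarily the paper's): each particle i with mass m_i obeys Newton's second law,

m_i ẍ_i(t) = f_i(t),

which, with time step Δt, admits the standard finite-difference update

x_i^{t+1} = 2 x_i^{t} − x_i^{t−1} + (Δt² / m_i) f_i^{t},

so the particle positions at the next frame follow linearly from the two previous frames and the applied force, which is what allows the dynamic model to be embedded in a bundle adjustment cost.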

3D Human Pose Tracking Priors using Geodesic Mixture Models
E.Simo-Serra, C.Torras and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2017

@article{Simo_ijcv2017,
title = {3D Human Pose Tracking Priors using Geodesic Mixture Models},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
journal = {International Journal of Computer Vision},
volume = {122},
number = {2},
issn = {0920-5691},
pages = {388-408},
doi = {https://doi.org/10.1007/s11263-016-0941-2},
year = {2017},
month = {April}
}

We present a novel approach for learning a finite mixture model on a Riemannian manifold, in which Euclidean metrics are not applicable and one needs to resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration from a variant of the expectation-maximization algorithm that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying in a Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the problems associated with the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any type of manifold for which it is possible to estimate its tangent space. Additionally, we consider using shrinkage covariance estimation to improve the robustness of the method, especially when dealing with very sparsely distributed samples. We evaluate the approach in a number of situations, going from data clustering on manifolds to combining pose and kinematics of articulated bodies for 3D human pose tracking. In all cases, we demonstrate remarkable improvement compared to several chosen baselines.
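
A hedged illustration of the tangent-space idea described above, on the unit sphere (one of the simplest Riemannian manifolds): samples are mapped to the tangent space at a component mean with the logarithm map, where an ordinary Gaussian can be fitted. The choice of manifold and the function names are assumptions for illustration only.

import numpy as np

def log_map_sphere(p, q):
    """Log map on the unit sphere: tangent vector at p pointing towards q."""
    cos_t = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-8:
        return np.zeros_like(p)
    v = q - cos_t * p                 # component of q orthogonal to p
    return theta * v / np.linalg.norm(v)

def tangent_gaussian(mean, samples):
    """Fit a Gaussian to points mapped into the tangent space at `mean`."""
    tangent = np.stack([log_map_sphere(mean, s) for s in samples])
    return tangent.mean(axis=0), np.cov(tangent, rowvar=False)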

Random Clustering Ferns for Multimodal Object Recognition
M.Villamizar, A.Garrell, A.Sanfeliu and F.Moreno-Noguer
Neural Computing and Applications, 2017

@article{Villamizar_neurocomputing2017,
title = {Random Clustering Ferns for Multimodal Object Recognition},
author = {M. Villamizar and A. Garrell and A.Sanfeliu and F. Moreno-Noguer},
journal = {Neural Computing and Applications},
volume = {28},
number = {9},
issn = {0941-0643},
pages = {2445-2460},
doi = {https://doi.org/10.1007/s00521-016-2284-x},
year = {2017},
month = {September}
}

We propose an efficient and robust method for the recognition of objects exhibiting multiple intra-class modes, where each one is associated with a particular object appearance. The proposed method, called Random Clustering Ferns (RCFs), synergically combines a single real-time classifier, based on the boosted assembling of extremely-randomized trees (ferns), with an unsupervised and probabilistic approach in order to efficiently recognize object instances in images and simultaneously discover the most prominent appearance modes of the object through tree-structured visual words. In particular, we use Boosted Random Ferns (BRFs) and probabilistic Latent Semantic Analysis (pLSA) to obtain a discriminative and multimodal classifier that automatically clusters the responses of its randomized trees as a function of the visual object appearance. The proposed method is validated extensively in synthetic and real experiments, showing that it is capable of detecting objects with diverse and complex appearance distributions in real time.

Learning Depth-aware Deep Representations for Robotic Perception
L.Porzi, S.Rota, A.Peñate-Sánchez, E.Ricci and F.Moreno-Noguer
Robotics and Automation Letters, 2017

@article{Porzi_ral2017,
title = {Learning Depth-aware Deep Representations for Robotic Perception},
author = {L. Porzi and S. Rota and A. Peñate-Sánchez and E. Ricci and F. Moreno-Noguer},
journal = {Robotics and Automation Letters},
volume = {2},
number = {2},
issn = {2377-3766},
pages = {468-475},
doi = {https://doi.org/10.1109/LRA.2016.2637444},
year = {2017},
month = {April}
}

Exploiting RGB-D data by means of Convolutional Neural Networks (CNNs) is at the core of a number of robotics applications, including object detection, scene semantic segmentation and grasping. Most existing approaches, however, exploit RGB-D data by simply considering depth as an additional input channel for the network. In this paper we show that the performance of deep architectures can be boosted by introducing DaConv, a novel, general-purpose CNN block which exploits depth to learn scale-aware feature representations. We demonstrate the benefits of DaConv on a variety of robotics oriented tasks, involving affordance detection, object coordinate regression and contour detection in RGB-D images. In each of these experiments we show the potential of the proposed block and how it can be readily integrated into existing CNN architectures.

Teaching Robot's Proactive Behavior Using Human Assistance
A.Garrell, M.Villamizar, F.Moreno-Noguer and A.Sanfeliu
International Journal of Social Robotics, 2017

@article{Garrell_ijsr2017,
title = {Teaching Robot's Proactive Behavior Using Human Assistance},
author = {A. Garrell and M. Villamizar and F. Moreno-Noguer and A. Sanfeliu},
journal = {International Journal of Social Robotics},
volume = {2},
number = {9},
issn = {1875-4791},
pages = {231-249},
doi = {https://doi.org/10.1007/s12369-016-0389-0},
year = {2017},
month = {April}
}

In recent years, there has been a growing interest in enabling autonomous social robots to interact with people. However, many questions remain unresolved regarding the social capabilities robots should have in order to perform this interaction in an ever more natural manner. In this paper, we tackle this problem through a comprehensive study of various topics involved in the interaction between a mobile robot and untrained human volunteers for a variety of tasks. In particular, this work presents a framework that enables the robot to proactively approach people and establish friendly interaction. To this end, we provided the robot with several perception and action skills, such as detecting people, planning an approach and communicating the intention to initiate a conversation while expressing an emotional status. We also introduce an interactive learning system that uses the person's volunteered assistance to incrementally improve the robot's perception skills. As a proof of concept, we focus on the particular task of online face learning and recognition. We conducted real-life experiments with our Tibi robot to validate the framework during the interaction process. Within this study, several surveys and user studies have been carried out to reveal the social acceptability of the robot within the context of different tasks.

TED: A Tolerant Edit Distance for Segmentation Evaluation
J.Funke, J.Klein, F.Moreno-Noguer, A.Cardona and M.Cook
Methods, 2017

@article{Funke_methods2017,
title = {TED: A Tolerant Edit Distance for Segmentation Evaluation},
author = {J. Funke and J. Klein and F. Moreno-Noguer and A. Cardona and M. Cook},
journal = {Methods},
volume = {115},
number = {15},
issn = {1046-2023},
pages = {119-127},
doi = {https://doi.org/10.1016/j.ymeth.2016.12.013},
year = {2017},
month = {February}
}

In this paper, we present a novel error measure to compare a computer-generated segmentation of images or volumes against ground truth. This measure, which we call Tolerant Edit Distance (TED), is motivated by two observations that we usually encounter in biomedical image processing: (1) Some errors, like small boundary shifts, are tolerable in practice. Which errors are tolerable is application dependent and should be explicitly expressible in the measure. (2) Non-tolerable errors have to be corrected manually. The effort needed to do so should be reflected by the error measure. Our measure is the minimal weighted sum of split and merge operations to apply to one segmentation such that it resembles another segmentation within specified tolerance bounds. This is in contrast to other commonly used measures like Rand index or variation of information, which integrate small, but tolerable, differences. Additionally, the TED provides intuitive numbers and allows the localization and classification of errors in images or volumes. We demonstrate the applicability of the TED on 3D segmentations of neurons in electron microscopy images, where topological correctness is arguably more important than exact boundary locations. Furthermore, we show that the TED is not just limited to evaluation tasks. We use it as the loss function in a max-margin learning framework to find parameters of an automatic neuron segmentation algorithm. We show that training to minimize the TED, i.e., to minimize crucial errors, leads to higher segmentation accuracy compared to other learning methods.

Conference

3D Human Pose Estimation from a Single Image via Distance Matrix Regression
F.Moreno-Noguer 
Conference in Computer Vision and Pattern Recognition (CVPR), 2017

@inproceedings{Moreno_cvpr2017,
title = {3D Human Pose Estimation from a Single Image via Distance Matrix Regression},
author = {F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}

This paper addresses the problem of 3D human pose estimation from a single image. We follow a standard two-step pipeline by first detecting the 2D position of the N body joints, and then using these observations to infer 3D pose. For the first step, we use a recent CNN-based detector. For the second step, most existing approaches perform 2N-to-3N regression of the Cartesian joint coordinates. We show that more precise pose estimates can be obtained by representing both the 2D and 3D human poses using N × N distance matrices, and formulating the problem as a 2D-to-3D distance matrix regression. For learning such a regressor we leverage simple neural network architectures which, by construction, enforce positivity and symmetry of the predicted matrices. The approach also has the advantage of naturally handling missing observations and allows hypothesizing the position of non-observed joints. Quantitative results on the Humaneva and Human3.6M datasets demonstrate consistent performance gains over the state-of-the-art. Qualitative evaluation on the in-the-wild images of the LSP dataset, using the regressor learned on Human3.6M, reveals very promising generalization results.
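
A small sketch of the N × N Euclidean distance matrix representation used above for both 2D and 3D poses; it is symmetric with a zero diagonal, so only the upper triangle carries information. The array shapes are illustrative assumptions.

import numpy as np

def distance_matrix(joints):
    """joints: (N, D) array of N joint positions in D dimensions (D = 2 or 3).
    Returns the (N, N) matrix with entry (i, j) = ||x_i - x_j||."""
    diff = joints[:, None, :] - joints[None, :, :]
    return np.linalg.norm(diff, axis=-1)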

DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D Reconstruction
A.Agudo and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2017

@inproceedings{Agudo_cvpr2017,
title = {DUST: Dual Union of Spatio-Temporal Subspaces for Monocular Multiple Object 3D Reconstruction},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}

We present an approach to reconstruct the 3D shape of multiple deforming objects from incomplete 2D trajectories acquired by a single camera. Additionally, we simultaneously provide spatial segmentation (i.e., we identify each of the objects in every frame) and temporal clustering (i.e., we split the sequence into primitive actions). This advances existing work, which only tackled the problem for one single object and non-occluded tracks. In order to handle several objects at a time from partial observations, we model point trajectories as a union of spatial and temporal subspaces, and optimize the parameters of both modalities, the non-observed point tracks and the 3D shape via augmented Lagrange multipliers. The algorithm is fully unsupervised and results in a formulation which does not need initialization. We thoroughly validate the method on challenging scenarios with several human subjects performing different activities which involve complex motions and close interaction. We show our approach achieves state-of-the-art 3D reconstruction results, while it also provides spatial and temporal segmentation.

3D CNNs on Distance Matrices for Human Action Recognition
A.Hernandez, L.Porzi, S.Rota and F.Moreno-Noguer 
ACM Conference on Multimedia (ACM'MM), 2017

@inproceedings{Hernandez_acmmm2017,
title = {3D CNNs on Distance Matrices for Human Action Recognition},
author = {A. Hernandez and L. Porzi and S. Rota and F. Moreno-Noguer},
booktitle = {Proceedings of the ACM Conference on Multimedia (ACM'MM)},
year = {2017}
}

In this paper we are interested in recognizing human actions from sequences of 3D skeleton data. For this purpose we combine a 3D Convolutional Neural Network with body representations based on Euclidean Distance Matrices (EDMs), which have recently been shown to be very effective at capturing the geometric structure of the human pose. One inherent limitation of the EDMs, however, is that they are defined up to a permutation of the skeleton joints, i.e., randomly shuffling the ordering of the joints yields many different representations. In order to address this issue we introduce a novel architecture that simultaneously, and in an end-to-end manner, learns an optimal transformation of the joints, while optimizing the rest of the parameters of the convolutional network. The proposed approach achieves state-of-the-art results on 3 benchmarks, including the recent NTU RGB-D dataset, for which we improve on previous LSTM-based methods by more than 10 percentage points, also surpassing other CNN-based methods while using almost 1000 times fewer parameters.
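
A minimal sketch (illustrative, not the paper's implementation) of how a skeleton sequence can be turned into the EDM-based input of a 3D CNN: one Euclidean Distance Matrix per frame, stacked along time. The learned joint transformation mentioned above is only hinted at by the hypothetical reorder argument.

    import numpy as np

    def edm(joints):
        # (N, 3) joint coordinates -> (N, N) Euclidean distance matrix.
        d = joints[:, None, :] - joints[None, :, :]
        return np.sqrt((d ** 2).sum(-1))

    def edm_volume(skeleton_seq, reorder=None):
        # skeleton_seq: (T, N, 3) sequence of 3D skeletons.
        # reorder: optional (N, N) transformation of the joints, standing in
        # for the transformation the network learns end-to-end.
        if reorder is not None:
            skeleton_seq = np.einsum('mn,tnd->tmd', reorder, skeleton_seq)
        return np.stack([edm(frame) for frame in skeleton_seq])   # (T, N, N)

    clip = np.random.rand(32, 25, 3)          # 32 frames, 25 joints (illustrative)
    volume = edm_volume(clip)                 # (32, 25, 25), input to the 3D CNN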

Global Model with Local Interpretation for Dynamic Shape Reconstruction
A.Agudo and F.Moreno-Noguer 
Winter Conference on Applications of Computer Vision (WACV), 2017

@inproceedings{Agudo_wacv2017,
title = {Global Model with Local Interpretation for Dynamic Shape Reconstruction},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year = {2017}
}

The most standard approach to resolve the inherent ambiguities of the non-rigid structure from motion problem is using low-rank models that approximate deforming shapes by a linear combination of rigid bases. These models are typically global, i.e., each shape basis contributes equally to all points of the surface. While this approach has been shown effective to represent smooth deformations, its performance degrades for surfaces composed of various regions, each following a different deformation rule. Piecewise methods attempt to capture this type of behavior by locally modeling surface patches, although they subsequently require enforcing global constraints to assemble back the patches. In this paper we propose an approach that combines the best of global and local models: it locally considers low-rank models but, by construction, does not need to impose global constraints to guarantee local patch continuity. We achieve this by a simple expectation maximization strategy that, besides learning global shape bases, locally adapts their contribution to each specific surface region. Furthermore, as a side contribution, in order to split the surface into different local patches, we propose a novel physically-based mesh segmentation approach that obeys an energy criterion. The complete framework is evaluated on both synthetic and real datasets, and shows an improved performance compared to competing methods.

Multi-Modal Joint Embedding for Fashion Product Retrieval
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer 
International Conference on Image Processing (ICIP), 2017

@inproceedings{Rubio_icip2017,
title = {Multi-Modal Joint Embedding for Fashion Product Retrieval},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Image Processing (ICIP)},
year = {2017}
}

Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites are updated with thousands of images and their associated metadata (textual information), deepening the problem, which becomes akin to finding a needle in a haystack. In this paper, we leverage both the images and textual metadata and propose a joint multi-modal embedding that maps both the text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space, which is both efficient and accurate. We train this embedding using large-scale real world e-commerce data by both minimizing the distance between related products and using auxiliary classification networks that encourage the embedding to have semantic meaning. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset. We also provide an analysis of the different metadata.
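
A hedged sketch of the kind of joint image-text embedding and ranking loss described above; the layer sizes, margin, and use of in-batch hardest negatives are illustrative assumptions, not the paper's exact architecture or objective.

    import torch
    import torch.nn as nn

    class JointEmbedding(nn.Module):
        # Projection heads mapping image and text features into a shared
        # latent space where distance reflects product similarity.
        def __init__(self, img_dim=2048, txt_dim=300, emb_dim=128):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, emb_dim)
            self.txt_proj = nn.Linear(txt_dim, emb_dim)

        def forward(self, img_feat, txt_feat):
            return self.img_proj(img_feat), self.txt_proj(txt_feat)

    def ranking_loss(img_emb, txt_emb, margin=0.2):
        # Pull matching image/text pairs together and push the hardest
        # in-batch non-matching pair at least `margin` further away.
        d = torch.cdist(img_emb, txt_emb)
        pos = d.diag()
        neg = (d + torch.eye(len(d), device=d.device) * 1e6).min(dim=1).values
        return torch.relu(pos - neg + margin).mean()

Auxiliary classification heads on top of each projection, as mentioned in the abstract, could be added to inject semantic structure into the space.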

Joint Coarse-and-fine Reasoning for Deep Optical Flow
V.Vaquero, G.Ros, F.Moreno-Noguer, A.López and A.Sanfeliu 
International Conference on Image Processing (ICIP), 2017

@inproceedings{Vaquero_icip2017,
title = {Joint Coarse-and-fine Reasoning for Deep Optical Flow},
author = {V. Vaquero and G. Ros and F. Moreno-Noguer and A. López and A. Sanfeliu},
booktitle = {Proceedings of the International Conference on Image Processing (ICIP)},
year = {2017}
}

We propose a novel representation for dense pixel-wise estimation tasks using CNNs that boosts accuracy and reduces training time, by explicitly exploiting joint coarse-and-fine reasoning. The coarse reasoning is performed over a discrete classification space to obtain a general rough solution, while the fine details of the solution are obtained over a continuous regression space. In our approach both components are jointly estimated, which proved to be beneficial for improving estimation accuracy. Additionally, we propose a new network architecture, which combines coarse and fine components by treating the fine estimation as a refinement built on top of the coarse solution, and therefore adding details to the general prediction. We apply our approach to the challenging problem of optical flow estimation and empirically validate it against state-of-the-art CNN-based solutions trained from scratch and tested on large optical flow datasets.
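
To make the coarse-and-fine idea concrete, here is a small hedged sketch (all names and values are illustrative) of how a discrete coarse prediction and a continuous residual can be combined into the final per-pixel estimate:

    import numpy as np

    bin_centers = np.linspace(-32.0, 32.0, 17)   # coarse classification space (illustrative)

    def combine(coarse_logits, fine_residual):
        # coarse_logits:  (..., 17) per-pixel scores over discrete flow bins
        # fine_residual:  (...,)    per-pixel continuous refinement
        # The final estimate is the winning bin center plus the residual, so the
        # regression branch only has to model the detail around the rough solution.
        coarse = bin_centers[np.argmax(coarse_logits, axis=-1)]
        return coarse + fine_residual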

PL-SLAM: Real-Time Monocular Visual SLAM with Points and Lines  
A.Pumarola, A.Vakhitov, A.Agudo, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2017

@inproceedings{Pumarola_icra2017,
title = {PL-SLAM: Real-Time Monocular Visual SLAM with Points and Lines},
author = {A. Pumarola and A. Vakhitov and A. Agudo and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2017}
}

Low textured scenes are well known to be one of the main Achilles' heels of geometric computer vision algorithms relying on point correspondences, and in particular for visual SLAM. Yet, there are many environments in which, despite being low textured, one can still reliably estimate line-based geometric primitives, for instance in city and indoor scenes, or in the so-called “Manhattan worlds”, where structured edges are predominant. In this paper we propose a solution to handle these situations. Specifically, we build upon ORB-SLAM, arguably the current state-of-the-art solution in terms of both accuracy and efficiency, and extend its formulation to simultaneously handle both point and line correspondences. Our solution can even work when most of the points vanish from the input images and, interestingly, it can be initialized solely from the detection of line correspondences in three consecutive frames. We thoroughly evaluate our approach and the new initialization strategy on the TUM RGB-D benchmark and demonstrate that the use of lines not only improves the performance of the original ORB-SLAM solution in poorly textured frames, but also systematically improves it in frames combining points and lines, without compromising the efficiency.

Learning Depth-aware Deep Representations for Robotic Perception  
L.Porzi, A.Peñate-Sánchez, E.Ricci and F.Moreno-Noguer 
International Conference on Intelligent Robots and Systems (IROS), 2017

@inproceedings{Porzi_iros2017,
title = {Learning Depth-aware Deep Representations for Robotic Perception},
author = {L. Porzi and A. Peñate-Sánchez and E. Ricci and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Intelligent Robots and Systems (IROS)},
year = {2017}
}

Most recent approaches to 3D pose estimation from RGB-D images address the problem in a two-stage pipeline. First, they learn a classifier (typically a random forest) to predict the position of each input pixel on the object surface. These estimates are then used to define an energy function that is minimized w.r.t. the object pose. In this paper, we focus on the first stage of the problem and propose a novel classifier based on a depth-aware Convolutional Neural Network. This classifier is able to learn a scale-adaptive regression model that yields very accurate pixel-level predictions, which finally allows the pose to be estimated using a simple RANSAC-based scheme, with no need to optimize complex ad hoc energy functions. Our experiments on publicly available datasets show that our approach achieves remarkable improvements over state-of-the-art methods.
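
The final pose step mentioned above (pixel-to-surface predictions followed by a simple RANSAC scheme) can be sketched with OpenCV's standard solvePnPRansac; the arrays below are placeholders for the classifier's per-pixel predictions and the camera intrinsics.

    import numpy as np
    import cv2

    obj_pts = np.random.rand(100, 3).astype(np.float32)           # predicted surface coordinates
    img_pts = (np.random.rand(100, 2) * 640).astype(np.float32)   # corresponding pixels
    K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float32)

    # Robustly estimate the object pose; correspondences inconsistent with it
    # are rejected as outliers, with no ad hoc energy function to minimize.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K, None, reprojectionError=3.0)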

Deconvolutional Networks for Point-cloud Vehicle Detection and Tracking in Driving Scenarios  
V.Vaquero, I. del Pino, F.Moreno-Noguer, J.Solà, A.Sanfeliu and J.Andrade-Cetto 
European Conference on Mobile Robots (ECMR), 2017

@inproceedings{Vaquero_ecmr2017,
title = {Deconvolutional Networks for Point-cloud Vehicle Detection and Tracking in Driving Scenarios},
author = {V. Vaquero and I. del Pino and F. Moreno-Noguer and J. Solà and A. Sanfeliu and J. Andrade-Cetto},
booktitle = {Proceedings of the European Conference on Mobile Robots (ECMR)},
year = {2017}
}

Vehicle detection and tracking is a core ingredient for developing autonomous driving applications in urban scenarios. Recent image-based Deep Learning (DL) techniques are obtaining breakthrough results in these perception tasks. However, DL research has not yet advanced much towards processing 3D point clouds from lidar range-finders. These sensors are very common in autonomous vehicles since, despite not providing as semantically rich information as images, they are more robust than vision sensors under harsh weather conditions. In this paper we present a full vehicle detection and tracking system that works with 3D lidar information only. Our detection step uses a Convolutional Neural Network (CNN) that receives as input a featured representation of the 3D information provided by a Velodyne HDL-64 sensor and returns a per-point classification of whether it belongs to a vehicle or not. The classified point cloud is then geometrically processed to generate observations for a multi-object tracking system implemented via a number of Multi-Hypothesis Extended Kalman Filters (MH-EKF) that estimate the position and velocity of the surrounding vehicles. The system is thoroughly evaluated on the KITTI tracking dataset, and we show the performance boost provided by our CNN-based vehicle detector over a standard geometric approach. Our lidar-based approach uses only about 4% of the data needed by an image-based detector, while achieving similarly competitive results.
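
A simplified sketch of the per-vehicle tracking step: the paper uses Multi-Hypothesis Extended Kalman Filters, whereas the snippet below shows only a single linear constant-velocity filter over the (x, y) centroid of a detected cluster, with illustrative noise values.

    import numpy as np

    dt = 0.1                                    # lidar frame period (illustrative)
    F = np.array([[1, 0, dt, 0],                # constant-velocity model,
                  [0, 1, 0, dt],                # state = [x, y, vx, vy]
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]])
    H = np.array([[1, 0, 0, 0],                 # only the (x, y) centroid of the
                  [0, 1, 0, 0]])                # detected cluster is observed
    Q, R = np.eye(4) * 0.01, np.eye(2) * 0.1    # process / measurement noise (illustrative)

    def kf_step(x, P, z):
        # Predict with the motion model, then correct with the new detection z.
        x, P = F @ x, F @ P @ F.T + Q
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (z - H @ x)
        P = (np.eye(4) - K @ H) @ P
        return x, P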

Low Resolution Lidar-Based Multi-Object Tracking for Driving Applications  
I. del Pino, V.Vaquero, B.Masini, J.Solà, F.Moreno-Noguer, A.Sanfeliu and J.Andrade-Cetto 
Iberian Robotics Conference, ROBOT, 2017

@inproceedings{DelPino_Robot2017,
title = {Low Resolution Lidar-Based Multi-Object Tracking for Driving Applications},
author = {I. del Pino and V. Vaquero and B. Masini and J. Solà and F. Moreno-Noguer and A. Sanfeliu and J. Andrade-Cetto},
booktitle = {Proceedings of the Third Iberian Robotics Conference, ROBOT},
year = {2017}
}

Vehicle detection and tracking in real scenarios are key components to develop assisted and autonomous driving systems. Lidar sensors are especially suitable for this task, as they bring robustness to harsh weather conditions while providing accurate spatial information. However, the resolution of point cloud data is very low in comparison to camera images. In this work we explore the possibilities of Deep Learning (DL) methodologies applied to low resolution 3D lidar sensors such as the Velodyne VLP-16 (PUCK), in the context of vehicle detection and tracking. For this purpose we developed a lidar-based system that uses a Convolutional Neural Network (CNN), to perform point-wise vehicle detection using PUCK data, and Multi-Hypothesis Extended Kalman Filters (MH-EKF), to estimate the actual position and velocities of the detected vehicles. Comparative studies between the proposed lower resolution (VLP-16) tracking system and a high-end system, using Velodyne HDL-64, were carried out on the KITTI Tracking Benchmark dataset. Moreover, to analyze the influence of the CNN-based vehicle detection approach, comparisons were also performed with respect to the geometric-only detector. The results demonstrate that the proposed low resolution Deep Learning architecture is able to successfully accomplish the vehicle detection task, outperforming the geometric baseline approach. Moreover, it has been observed that our system achieves a similar tracking performance to the high-end HDL-64 sensor at close range. On the other hand, at long range, detection is limited to half the distance of the higher-end sensor.

Workshop

Multi-Modal Embedding for Main Product Detection in Fashion   (Best Paper Award)
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer 
Fashion Workshop in International Conference on Computer Vision (ICCVw), 2017

@inproceedings{Rubio_iccvw2017,
title = {Multi-Modal Embedding for Main Product Detection in Fashion},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision Workshops (ICCVW)},
year = {2017}
}

We present an approach to detect the main product in fashion images by exploiting the textual metadata associated with each image. Our approach is based on a Convolutional Neural Network and learns a joint embedding of object proposals and textual metadata to predict the main product in the image. We additionally use several complementary classification and overlap losses in order to improve training stability and performance. Our tests on a large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong baselines and is able to accurately detect the main product in a wide diversity of challenging fashion images.

The BreakingNews Dataset  
A.Ramisa, F.Yan, F.Moreno-Noguer and K.Mikolajczyk 
Workshop on Vision and Language, 2017

@inproceedings{Ramisa_VL2017,
title = {The BreakingNews Dataset},
author = {A. Ramisa and F. Yan and F. Moreno-Noguer and K. Mikolajczyk},
booktitle = {Workshop on Vision and Language},
year = {2017}
}

We present BreakingNews, a novel dataset with approximately 100K news articles including images, text and captions, and enriched with heterogeneous meta-data (e.g. GPS coordinates and popularity metrics). The tenuous connection between the images and text in news data makes it well suited for pushing work at the intersection of Computer Vision and Natural Language Processing a step further, and we hope this dataset will help spur progress in the field.

Multi-Modal Fashion Product Retrieval  
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer 
Workshop on Vision and Language, 2017

@inproceedings{Rubio_VL2017,
title = {Multi-Modal Fashion Product Retrieval},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Workshop on Vision and Language},
year = {2017}
}

Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites are updated with thousands of images and their associated metadata (textual information), deepening the problem. In this paper, we leverage both the images and textual metadata and propose a joint multi-modal embedding that maps both the text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset.

2016

Journal

Sequential Non-Rigid Structure from Motion using Physical Priors 
A.Agudo, F.Moreno-Noguer, B.Calvo and J.M.M.Montiel
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2016

@article{Agudo_pami2016,
title = {Sequential Non-Rigid Structure from Motion using Physical Priors},
author = {A. Agudo and F. Moreno-Noguer and B. Calvo and J.M.M. Montiel},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {38},
number = {5},
issn = {0162-8828},
pages = {979-994},
doi = {10.1109/TPAMI.2015.2469293},
year = {2016},
month = {May}
}

We propose a new approach to simultaneously recover camera pose and 3D shape of non-rigid and potentially extensible surfaces from a monocular image sequence. For this purpose, we make use of the EKF-SLAM (Extended Kalman Filter based Simultaneous Localization And Mapping) formulation, a Bayesian optimization framework traditionally used in mobile robotics for estimating camera pose and reconstructing rigid scenarios. In order to extend the problem to a deformable domain we represent the object’s surface mechanics by means of Navier’s equations, which are solved using a FEM (Finite Element Method). With these main ingredients, we can further model the material’s stretching, allowing us to go a step further than most current techniques, typically constrained to surfaces undergoing isometric deformations. We extensively validate our approach in both real and synthetic experiments, and demonstrate its advantages with respect to competing methods. More specifically, we show that besides simultaneously retrieving camera pose and non-rigid shape, our approach is adequate for both isometric and extensible surfaces, requires neither batch processing of all the frames nor tracking points over the whole sequence, and runs at several frames per second.

A 3D Descriptor to Detect Task-oriented Grasping Points in Clothing 
A.Ramisa, G.Alenyà, F.Moreno-Noguer and C.Torras
Pattern Recognition, 2016

@article{Ramisa_pr2016,
title = {A 3D Descriptor to Detect Task-oriented Grasping Points in Clothing},
author = {A. Ramisa and G. Alenya and F. Moreno-Noguer and C. Torras},
journal = {Pattern Recognition},
volume = {60},
number = {C},
issn = {0031-3203},
pages = {936-948},
doi = {10.1016/j.patcog.2016.07.003},
year = {2016},
month = {December}
}

Manipulating textile objects with a robot is a challenging task, especially because the garment perception is difficult due to the endless configurations it can adopt, coupled with a large variety of colors and designs. Most current approaches follow a multiple re-grasp strategy, in which clothes are sequentially grasped from different points until one of them yields a recognizable configuration. In this work we propose a method that combines 3D and appearance information to directly select a suitable grasping point for the task at hand, which in our case consists of hanging a shirt or a polo shirt from a hook. Our method follows a coarse-to-fine approach in which, first, the collar of the garment is detected and, next, a grasping point on the lapel is chosen using a novel 3D descriptor. In contrast to current 3D descriptors, ours can run in real time, even when it needs to be densely computed over the input image. Our central idea is to take advantage of the structured nature of range images that most depth sensors provide and, by exploiting integral imaging, achieve speed-ups of two orders of magnitude with respect to competing approaches, while maintaining performance. This makes it especially adequate for robotic applications as we thoroughly demonstrate in the experimental section.
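
A minimal sketch of the integral-image trick behind the reported speed-up: after a single cumulative-sum pass, the sum (and hence the mean) of any rectangular region of the range image is obtained with four lookups, independently of its size. Sizes and data below are illustrative.

    import numpy as np

    def integral_image(img):
        # One cumulative-sum pass over rows and columns, zero-padded so that
        # region queries need no boundary special cases.
        ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
        return np.pad(ii, ((1, 0), (1, 0)))

    def box_sum(ii, r0, c0, r1, c1):
        # Sum of img[r0:r1, c0:c1] with four lookups, O(1) per query.
        return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

    depth = np.random.rand(480, 640)            # structured range image
    ii = integral_image(depth)
    mean_depth = box_sum(ii, 100, 100, 150, 180) / (50 * 80)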

Real-Time 3D Reconstruction of Non-Rigid Shapes with a Single Moving Camera 
A.Agudo, F.Moreno-Noguer, B.Calvo and J.M.M.Montiel
Computer Vision and Image Understanding (CVIU), 2016

@article{Agudo_cviu2016,
title = {Real-Time 3D Reconstruction of Non-Rigid Shapes with a Single Moving Camera},
author = {A. Agudo and F. Moreno-Noguer and B. Calvo and J.M.M. Montiel},
journal = {Computer Vision and Image Understanding},
volume = {153},
number = {C},
issn = {1077-3142},
pages = {37-54},
doi = {10.1016/j.cviu.2016.05.004},
year = {2016},
month = {December}
}

This paper describes a real-time sequential method to simultaneously recover the camera motion and the 3D shape of deformable objects from a calibrated monocular video. For this purpose, we consider the Navier-Cauchy equations used in 3D linear elasticity and solved by finite elements, to model the time-varying shape per frame. These equations are embedded in an extended Kalman filter, resulting in a sequential Bayesian estimation approach. We represent the shape, with unknown material properties, as a combination of elastic elements whose nodal points correspond to salient points in the image. The global rigidity of the shape is encoded by a stiffness matrix, computed after assembling each of these elements. With this piecewise model, we can linearly relate the 3D displacements with the 3D acting forces that cause the object deformation, assumed to be normally distributed. While standard finite-element-method techniques require imposing boundary conditions to solve the resulting linear system, in this work we eliminate this requirement by modeling the compliance matrix with a generalized pseudoinverse that enforces a pre-fixed rank. Our framework also ensures surface continuity without the need for a post-processing step to stitch all the piecewise reconstructions into a global smooth shape. We present experimental results using both synthetic and real videos for different scenarios ranging from isometric to elastic deformations. We also show the consistency of the estimation with respect to 3D ground truth data, include several experiments assessing robustness against artifacts and, finally, provide an experimental validation of our real-time performance at frame rate for small maps.
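
The fixed-rank generalized pseudoinverse mentioned above can be sketched as a rank-truncated SVD inverse; this is an illustrative reconstruction of the idea, not the paper's code, and the variable names are assumptions.

    import numpy as np

    def rank_truncated_pinv(K, rank):
        # Keep only the `rank` largest singular values when inverting, which
        # suppresses the null-space modes that boundary conditions would
        # otherwise be needed to constrain.
        U, s, Vt = np.linalg.svd(K)
        s_inv = np.zeros_like(s)
        s_inv[:rank] = 1.0 / s[:rank]
        return Vt.T @ np.diag(s_inv) @ U.T

    # displacements = compliance @ forces, with
    # compliance = rank_truncated_pinv(stiffness_matrix, prefixed_rank)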

Interactive Multiple Object Learning with Scanty Human Supervision 
M.Villamizar, A.Garrell, A.Sanfeliu and F.Moreno-Noguer
Computer Vision and Image Understanding (CVIU), 2016

@article{Villamizar_cviu2016,
title = {Interactive Multiple Object Learning with Scanty Human Supervision},
author = {M. Villamizar and A. Garrell and A. Sanfeliu and F. Moreno-Noguer},
journal = {Computer Vision and Image Understanding},
volume = {149},
number = {C},
issn = {1077-3142},
pages = {51-64},
doi = {10.1016/j.cviu.2016.03.010},
year = {2016},
month = {August}
}

We present a fast and online human-robot interaction approach that progressively learns multiple object classifiers using scanty human supervision. Given an input video stream recorded during the human-robot interaction, the user just needs to annotate a small fraction of frames to compute object specific classifiers based on random ferns which share the same features. The resulting methodology is fast (in a few seconds, complex object appearances can be learned), versatile (it can be applied to unconstrained scenarios), scalable (real experiments show we can model up to 30 different object classes), and minimizes the amount of human intervention by leveraging the uncertainty measures associated to each classifier. We thoroughly validate the approach on synthetic data and on real sequences acquired with a mobile platform in indoor and outdoor scenarios containing a multitude of different objects. We show that with little human assistance, we are able to build object classifiers robust to viewpoint changes, partial occlusions, varying lighting and cluttered backgrounds.

A Bayesian Approach to Simultaneously Recover Camera Pose and Non-Rigid Shape from Monocular Images 
F.Moreno-Noguer and J.Porta
Image and Vision Computing (IVC), 2016

@article{Moreno_ivc2016,
title = {A Bayesian Approach to Simultaneously Recover Camera Pose and Non-Rigid Shape from Monocular Images},
author = {F. Moreno-Noguer and J. Porta},
journal = {Image and Vision Computing},
volume = {52},
issn = {0262-8856},
pages = {141-153},
doi = {10.1016/j.imavis.2016.05.012},
year = {2016},
month = {August}
}

In this paper we bring the tools of the Simultaneous Localization and Map Building (SLAM) problem from a rigid to a deformable domain and use them to simultaneously recover the 3D shape of non-rigid surfaces and the sequence of poses of a moving camera. Under the assumption that the surface shape may be represented as a weighted sum of deformation modes, we show that the problem of estimating the modal weights along with the camera poses can be probabilistically formulated as a maximum a posteriori estimate and solved using an iterative least squares optimization. In addition, the probabilistic formulation we propose is very general and allows introducing different constraints without requiring any extra complexity. As a proof of concept, we show that local inextensibility constraints that prevent the surface from stretching can be easily integrated. An extensive evaluation on synthetic and real data demonstrates that our method has several advantages over current non-rigid shape from motion approaches. In particular, we show that our solution is robust to large amounts of noise and outliers and that it does not need to track points over the whole sequence nor to use an initialization close to the ground truth.
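
A hedged sketch of the low-rank shape model underlying the formulation: the surface is a rest shape plus a weighted sum of deformation modes, and the modal weights are what is estimated (together with the camera poses) in the maximum a posteriori problem. Shapes and names are illustrative.

    import numpy as np

    def modal_shape(rest_shape, modes, weights):
        # rest_shape: (N, 3) surface at rest
        # modes:      (K, N, 3) deformation modes
        # weights:    (K,) modal weights, estimated per frame with the camera pose
        return rest_shape + np.tensordot(weights, modes, axes=1)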

MSClique: Multiple Structure Discovery through Maximum Weighted Clique Problem 
G.Sanroma, A.Penate-Sanchez, R.Alquezar, F.Serratosa, F.Moreno-Noguer, J.Andrade-Cetto and M.A.Gonzalez Ballester
PLoS ONE, 2016

@article{Sanroma_plosone2016,
title = {MSClique: Multiple Structure Discovery through Maximum Weighted Clique Problem},
author = {G. Sanroma and A. Penate-Sanchez and R. Alquezar and F. Serratosa and F. Moreno-Noguer and J. Andrade-Cetto and M.A. Gonzalez Ballester},
journal = {PLoS ONE},
volume = {11},
number = {1},
doi = {https://doi.org/10.1371/journal.pone.0145846},
year = {2016},
month = {January}
}

We present a novel approach for feature correspondence and multiple structure discovery in computer vision. In contrast to existing methods, we exploit the fact that point-sets on the same structure usually lie close to each other, thus forming clusters in the image. Given a pair of input images, we initially extract points of interest and build hierarchical representations by agglomerative clustering. We use the maximum weighted clique problem to find the set of corresponding clusters with the maximum number of inliers representing the multiple structures at the correct scales. Our method is parameter-free and only needs two sets of points along with their tentative correspondences, thus being extremely easy to use. We demonstrate the effectiveness of our method in multiple-structure fitting experiments on both publicly available and in-house datasets. As shown in the experiments, our approach finds a higher number of structures containing fewer outliers compared to state-of-the-art methods.

Conference

Accurate and Linear Time Pose Estimation from Points and Lines  
A.Vakhitov, J.Funke and F.Moreno-Noguer 
European Conference on Computer Vision (ECCV), 2016

@inproceedings{Vakhitov_eccv2016,
title = {Accurate and Linear Time Pose Estimation from Points and Lines},
author = {A. Vakhitov and J. Funke and F. Moreno-Noguer},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2016}
}

The Perspective-n-Point (PnP) problem seeks to estimate the pose of a calibrated camera from n 3D-to-2D point correspondences. There are situations, though, where PnP solutions are prone to fail because feature point correspondences cannot be reliably estimated (e.g. scenes with repetitive patterns or with low texture). In such scenarios, one can still exploit alternative geometric entities, such as lines, yielding the so-called Perspective-n-Line (PnL) algorithms. Unfortunately, existing PnL solutions are not as accurate and efficient as their point-based counterparts. In this paper we propose a novel approach to introduce 3D-to-2D line correspondences into a PnP formulation, allowing points and lines to be processed simultaneously. For this purpose we introduce an algebraic line error that can be formulated as linear constraints on the line endpoints, even when these are not directly observable. These constraints can then be naturally integrated within the linear formulations of two state-of-the-art point-based algorithms, the OPnP and the EPnP, allowing them to indistinctly handle points, lines, or a combination of them. Exhaustive experiments show that the proposed formulation brings a remarkable boost in performance compared to only point or only line based solutions, with a negligible computational overhead compared to the original OPnP and EPnP.

Recovering Pose and 3D Deformable Shape from Multi-Instance Image Ensembles  
A.Agudo and F.Moreno-Noguer 
Asian Conference on Computer Vision (ACCV), 2016

@inproceedings{Agudo_accv2016,
title = {Recovering Pose and 3D Deformable Shape from Multi-Instance Image Ensembles},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
year = {2016}
}

In recent years, there has been a growing interest in tackling the Non-Rigid Structure from Motion problem (NRSfM), where the shape of a deformable object and the pose of a moving camera are simultaneously estimated from a monocular video sequence. Existing solutions are limited to single objects and continuous, smoothly changing sequences. In this paper we extend NRSfM to a multi-instance domain, in which the images do not need to have temporal consistency, allowing, for instance, the joint reconstruction of the faces of multiple persons from an unordered list of images. For this purpose, we present a new formulation of the problem based on a dual low-rank shape representation, that simultaneously captures the between- and within-individual deformations. The parameters of this model are learned using a variant of the probabilistic linear discriminant analysis that requires consecutive batches of expectation and maximization steps. The resulting approach estimates 3D deformable shape and pose of multiple instances from only 2D point observations on a collection of images, without requiring pre-trained 3D data, and is shown to be robust to noisy measurements and missing points. We provide quantitative and qualitative evaluation on both synthetic and real data, and show consistent benefits compared to current state-of-the-art.

Structured Prediction with Output Embeddings for Semantic Image Annotation  
A.Quattoni, A.Ramisa, P.Swaroop, E.Simo-Serra and F.Moreno-Noguer 
Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2016

@inproceedings{Quattoni_naacl2016,
title = {Structured Prediction with Output Embeddings for Semantic Image Annotation},
author = {A. Quattoni and A. Ramisa and P.Swaroop and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
year = {2016}
}

We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm able to deal with hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge. We propose handling this sparsity by incorporating feature representations of both the inputs (images) and outputs (argument classes) into a factorized log-linear model.

Mode-Shape Interpretation: Re-Thinking Modal Space for Recovering Deformable Shapes  
A.Agudo, F.Moreno-Noguer, B.Calvo and J.M.M.Montiel 
IEEE Winter Conference on Applications of Computer Vision (WACV), 2016

@inproceedings{Agudo_wacv2016,
title = {Mode-Shape Interpretation: Re-Thinking Modal Space for Recovering Deformable Shapes},
author = {A. Agudo and F. Moreno-Noguer and B. Calvo and J.M.M. Montiel},
booktitle = {Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV)},
year = {2016}
}

This paper describes an on-line approach for estimating non-rigid shape and camera pose from monocular video sequences. We assume an initial estimate of the shape at rest to be given and represented by a triangulated mesh, which is encoded by a matrix of the distances between every pair of vertices. By applying spectral analysis on this matrix, we are then able to compute a low-dimensional shape basis that, in contrast to standard approaches, has a very direct physical interpretation and requires a much smaller number of modes to span a large variety of deformations, either for inextensible or extensible configurations. Based on this low-rank model, we then sequentially retrieve both camera motion and non-rigid shape in each image, optimizing the model parameters with bundle adjustment over a sliding window of image frames. Since the number of these parameters is small, especially when considering physical priors, our approach may potentially achieve real-time performance. Experimental results on real videos for different scenarios demonstrate remarkable robustness to artifacts such as missing and noisy observations.

BASS: Boundary-aware Superpixel Segmentation  
A.Rubio, L.Yu, E.Simo-Serra and F.Moreno-Noguer 
International Conference on Pattern Recognition (ICPR), 2016

@inproceedings{Rubio_icpr2016,
title = {{BASS}: Boundary-aware Superpixel Segmentation},
author = {A. Rubio and L. Yu and E. Simo-Serra and F. Moreno-Noguer},
booktitle = {Proceedings of International Conference on Pattern Recognition (ICPR)},
year = {2016}
}

We propose a new superpixel algorithm based on exploiting the boundary information of an image, as objects in images can generally be described by their boundaries. Our proposed approach initially estimates the boundaries and uses them to place superpixel seeds in the areas in which they are denser. Afterwards, we minimize an energy function in order to expand the seeds into full superpixels. In addition to standard terms such as color consistency and compactness, we propose using the geodesic distance, which concentrates small superpixels in regions of the image with more information, while letting larger superpixels cover more homogeneous regions. By both improving the initialization using the boundaries and the coherency of the superpixels with geodesic distances, we are able to maintain the coherency of the image structure with fewer superpixels than other approaches. We show the resulting algorithm to yield smaller Variation of Information metrics on seven different datasets while maintaining Undersegmentation Error values similar to those of state-of-the-art methods.

Structured Learning of Assignment Models of Neuron Reconstruction to Minimize Topological Errors  
J.Funke, J.Klein, F.Moreno-Noguer, A.Cardona and M.Cook 
International Symposium on Biomedical Imaging (ISBI), 2016

@inproceedings{Funke_isbi2016,
title = {Structured Learning of Assignment Models of Neuron Reconstruction to Minimize Topological Errors},
author = {J. Funke and J. Klein and F. Moreno-Noguer and A. Cardona and M. Cook},
booktitle = {Proceedings of the International Symposium on Biomedical Imaging (ISBI)},
year = {2016}
}

Structured learning provides a powerful framework for empirical risk minimization on the predictions of structured models. It allows end-to-end learning of model parameters to minimize an application specific loss function. This framework is particularly well suited for discrete optimization models that are used for neuron reconstruction from anisotropic electron microscopy (EM) volumes. However, current methods are still learning unary potentials by training a classifier that is agnostic about the model it is used in. We believe the reason for that lies in the difficulties of (1) finding a representative training sample, and (2) designing an application specific loss function that captures the quality of a proposed solution. In this paper, we show how to find a representative training sample from human generated ground truth, and propose a loss function that is suitable to minimize topological errors in the reconstruction. We compare different training methods on two challenging EM datasets. Our structured learning approach shows consistently higher reconstruction accuracy than other current learning methods.

2015

Book Chapter

Dense Segmentation-aware Descriptors 
E.Trulls, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer
Chapter in Dense Image Correspondences for Computer Vision, 2015

@incollection{Trulls_springerchapter2015,
title = {Dense Segmentation-aware Descriptors},
author = {E. Trulls and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Dense Image Correspondences for Computer Vision},
editor = {Ce Liu and Tal Hassner},
publisher = {Springer},
doi = {http://dx.doi.org/10.1007/978-3-319-23048-1},
year = {2015}
}

Dense descriptors are becoming increasingly popular in a host of tasks, such as dense image correspondence, bag-of-words image classification, and label transfer. However, the extraction of descriptors at generic image points, rather than at selected geometric features (e.g. blobs), requires rethinking how to achieve invariance to nuisance parameters. In this work we pursue invariance to occlusions and background changes by introducing segmentation information within dense feature construction. The core idea is to use the segmentation cues to downplay the features coming from image areas that are unlikely to belong to the same region as the feature point. We show how to integrate this idea with dense SIFT, as well as with the dense Scale- and Rotation-Invariant Descriptor (SID). We thereby deliver dense descriptors that are invariant to background changes, rotation and/or scaling. We explore the merit of our technique in conjunction with large displacement motion estimation and wide-baseline stereo, and demonstrate that exploiting segmentation information yields clear improvements.

Journal

Non-Rigid Graph Registration using Active Testing Search 
E.Serradell, M.A.Pinheiro, R.Sznitman, J.Kybic, F.Moreno-Noguer and P.Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2015

@article{Serradell_pami2015,
title = {Non-Rigid Graph Registration using Active Testing Search},
author = {E. Serradell and M.A. Pinheiro and R. Sznitman and J. Kybic and F. Moreno-Noguer and P. Fua},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {37},
number = {3},
issn = {0162-8828},
pages = {625-638},
doi = {http://doi.ieeecomputersociety.org/10.1109/TPAMI.2014.2343235},
year = {2015},
month = {March}
}

We present a new approach for matching sets of branching curvilinear structures that form graphs embedded in R2 or R3 and may be subject to deformations. Unlike earlier methods, ours does not rely on local appearance similarity nor does it require a good initial alignment. Furthermore, it can cope with non-linear deformations, topological differences, and partial graphs. To handle arbitrary non-linear deformations, we use Gaussian Processes to represent the geometrical mapping relating the two graphs. In the absence of appearance information, we iteratively establish correspondences between points, update the mapping accordingly, and use it to estimate where to find the most likely correspondences that will be used in the next step. To make the computation tractable for large graphs, the set of new potential matches considered at each iteration is not selected at random as in many RANSAC-based algorithms. Instead, we introduce a so-called Active Testing Search strategy that performs a priority search to favor the most likely matches and speed up the process. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.

DaLI: Deformation and Light Invariant Descriptor
E.Simo-Serra, C.Torras and F.Moreno-Noguer
International Journal of Computer Vision (IJCV), 2015

@article{Simo_ijcv2015,
title = {{DaLI}: Deformation and Light Invariant Descriptor},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
journal = {International Journal of Computer Vision (IJCV)},
volume = {115},
number = {2},
issn = {0920-5691},
pages = {135-154},
doi = {https://doi.org/10.1007/s11263-015-0805-1},
year = {2015},
month = {November}
}

Recent advances in 3D shape analysis and recognition have shown that heat diffusion theory can be effectively used to describe local features of deforming and scaling surfaces. In this paper, we show how this description can be used to characterize 2D image patches, and introduce DaLI, a novel feature point descriptor with high resilience to non-rigid image transformations and illumination changes. In order to build the descriptor, 2D image patches are initially treated as 3D surfaces. Patches are then described in terms of a heat kernel signature, which captures both local and global information, and shows a high degree of invariance to non-linear image warps. In addition, by further applying a logarithmic sampling and a Fourier transform, invariance to photometric changes is achieved. Finally, the descriptor is compacted by mapping it onto a low dimensional subspace computed using Principal Component Analysis, allowing for an efficient matching. A thorough experimental validation demonstrates that DaLI is significantly more discriminative and robust to illumination changes and image transformations than state-of-the-art descriptors, even those specifically designed to describe non-rigid deformations.
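
The logarithmic sampling plus Fourier transform step can be illustrated generically; this is a toy reconstruction of the general idea, not the DaLI code, and all names and values are assumptions.

    import numpy as np

    def log_fourier_signature(signal, n_samples=64, n_freqs=8):
        # Assumes a positive 1D signal (e.g. a heat-kernel-like response).
        # Logarithmic resampling turns a rescaling of the signal's argument
        # into a shift of the sample index, which the Fourier magnitude is
        # invariant to; taking the log of the values turns a multiplicative
        # (illumination-like) gain into an additive offset confined to the
        # DC coefficient, which is discarded below.
        idx = np.unique(np.geomspace(1, len(signal), n_samples).astype(int)) - 1
        resampled = np.log(signal[idx] + 1e-8)
        return np.abs(np.fft.rfft(resampled))[1:n_freqs + 1]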

Conference

Discriminative Learning of Deep Convolutional Feature Point Descriptors  
E.Simo-Serra, E.Trulls, L.Ferraz, I.Kokkinos, P.Fua and F.Moreno-Noguer 
International Conference on Computer Vision (ICCV), 2015

@inproceedings{Simo_iccv2015,
title = {Discriminative Learning of Deep Convolutional Feature Point Descriptors},
author = {E. Simo-Serra and E. Trulls and L. Ferraz and I. Kokkinos and P. Fua and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2015}
}

Deep learning has revolutionalized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on handcrafted features, e.g. SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches. We deal with the large number of potential pairs with the combination of a stochastic sampling of the training set and an aggressive mining strategy biased towards patches that are hard to classify. By using the L2 distance during both training and testing we develop 128-D descriptors whose euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and generalize well against scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute and amenable to modern GPUs, and are publicly available.
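
A hedged PyTorch sketch of the Siamese setup described above: a small CNN (much smaller than the paper's) maps patches to 128-D descriptors, and an L2-based hinge loss pulls corresponding pairs together while pushing non-corresponding ones beyond a margin. Architecture and margin are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PatchDescriptor(nn.Module):
        # Toy CNN: grayscale patch -> 128-D descriptor.
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 32, 7, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, 128))

        def forward(self, patch):
            return self.net(patch)

    def pair_hinge_loss(d1, d2, same, margin=4.0):
        # d1, d2: descriptors of the two patches in each pair; same: bool mask.
        # The L2 distance used here is the same metric applied at test time,
        # so the descriptors can replace SIFT directly.
        dist = (d1 - d2).pow(2).sum(1).sqrt()
        return torch.where(same, dist, torch.relu(margin - dist)).mean()

Hard negative mining would correspond to keeping, within each batch, only the non-corresponding pairs with the smallest distances before averaging.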

Learning Shape, Motion and Elastic Models in Force Space  
A.Agudo and F.Moreno-Noguer 
International Conference on Computer Vision (ICCV), 2015

@inproceedings{Agudo_iccv2015,
title = {Learning Shape, Motion and Elastic Models in Force Space},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
year = {2015}
}

In this paper, we address the problem of simultaneously recovering the 3D shape and pose of a deformable and potentially elastic object from 2D motion. This is a highly ambiguous problem typically tackled by using low-rank shape and trajectory constraints. We show that formulating the problem in terms of a low-rank force space that induces the deformation allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object’s behavior. However, this comes at the price of, besides force and pose, having to estimate the elastic model of the object. For this, we use an Expectation Maximization strategy, where each of these parameters is successively learned within partial M-steps, while robustly dealing with missing observations. We thoroughly validate the approach on both mocap and real sequences, showing more accurate 3D reconstructions than state-of-the-art, and additionally providing an estimate of the full elastic model with no a priori information.

Simultaneous Pose and Non-rigid Shape with Particle Dynamics  
A.Agudo and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2015

@inproceedings{Agudo_cvpr2015,
title = {Simultaneous Pose and Non-rigid Shape with Particle Dynamics},
author = {A. Agudo and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015}
}

In this paper, we propose a sequential solution to simultaneously estimate camera pose and non-rigid 3D shape from a monocular video. In contrast to most existing approaches that rely on global representations of the shape, we model the object at a local level, as an ensemble of particles, each ruled by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency of the estimated shape and camera poses. The resulting approach is both efficient and robust to several artifacts such as noisy and missing data or sudden camera motions, while it does not require any training data at all. Validation is done in a variety of real video sequences, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our system is shown to perform comparably to competing batch, computationally expensive, methods and shows a remarkable improvement with respect to the sequential ones.
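
The particle model can be made concrete with a small sketch (illustrative, not the paper's implementation): each particle obeys Newton's second law and is integrated here with an explicit Euler step.

    import numpy as np

    def particle_step(x, v, force, mass, dt):
        # x, v:  (N, 3) particle positions and velocities
        # force: (N, 3) forces acting on each particle; mass: scalar (or (N, 1))
        # Newton's second law, a = F / m, integrated explicitly over one frame.
        a = force / mass
        v_next = v + a * dt
        x_next = x + v_next * dt
        return x_next, v_next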

Neuroaesthetics in Fashion: High Performance CRF Model for Cloth Parsing  
E.Simo-Serra, S.Fidler, F.Moreno-Noguer and R.Urtasun 
Conference on Computer Vision and Pattern Recognition (CVPR), 2015

@inproceedings{Simo_cvpr2015,
title = {Neuroaesthetics in Fashion: High Performance CRF Model for Cloth Parsing},
author = {E. Simo-Serra and S. Fidler and F. Moreno-Noguer and R. Urtasun},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2015}
}

In this paper, we analyze the fashion of clothing of a large social website. Our goal is to learn and predict how fashionable a person looks on a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph’s setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.

Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions 
A.Ramisa, J.Wang, Y.Lu, E.Dellandrea, F.Moreno-Noguer and R.Gaizauskas 
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015

@inproceedings{Ramisa_emnlp2015,
title = {Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions},
author = {A. Ramisa and J. Wang and Y. Lu and E. Dellandrea and F. Moreno-Noguer and R. Gaizauskas},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year = {2015}
}

We investigate the role that geometric, textual and visual features play in the task of predicting a preposition that links two visual entities depicted in an image. The task is an important part of the subsequent process of generating image descriptions. We explore the prediction of prepositions for a pair of entities, both when the labels of such entities are known and when they are unknown. In all situations we found clear evidence that all three features contribute to the prediction task.

Matchability Prediction for Full-Search Template Matching Algorithms  
A.Penate, L.Porzi and F.Moreno-Noguer 
International Conference on 3D Vision (3DV), 2015

@inproceedings{Penate_3dv2015,
title = {Matchability Prediction for Full-Search Template Matching Algorithms},
author = {A. Penate and L. Porzi and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2015}
}

While recent approaches have shown that it is possible to do template matching by exhaustively scanning the parameter space, the resulting algorithms are still quite demanding. In this paper we alleviate the computational load of these algorithms by proposing an efficient approach for predicting the matchability of a template before the matching is actually performed. This avoids large amounts of unnecessary computations. We learn the matchability of templates by using dense convolutional neural network descriptors that do not require ad-hoc criteria to characterize a template. By using deep learning descriptions of patches we are able to predict matchability over the whole image quite reliably. We also show that no scene-specific training data is required to solve problems like panorama stitching, which usually require data from the scene in question. Due to the highly parallelizable nature of this task, we offer an efficient technique with a negligible computational cost at test time.

Lie Algebra-Based Kinematic Prior for 3D Human Pose Tracking   (Best Paper Award)
E.Simo-Serra, C.Torras and F.Moreno-Noguer 
International Conference on Machine Vision Applications (MVA), 2015

@inproceedings{Simo_mva2015,
title = {Lie Algebra-Based Kinematic Prior for 3D Human Pose Tracking},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Machine Vision Applications (MVA)},
year = {2015}
}

We propose a novel kinematic prior for 3D human pose tracking that allows predicting the position in subsequent frames given the current position. We first define a Riemannian manifold that models the pose and extend it with its Lie algebra to also be able to represent the kinematics. We then learn a joint Gaussian mixture model of both the human pose and the kinematics on this manifold. Finally, by conditioning the kinematics on the pose, we are able to obtain a distribution of poses for subsequent frames which can be used as a reliable prior in 3D human pose tracking. Our model scales well to large amounts of data and can be sampled at over 100,000 samples/second. We show it outperforms the widely used Gaussian diffusion model on the challenging Human3.6M dataset.
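
The conditioning step can be written down for a single Gaussian component (the actual model is a mixture living on the pose manifold's Lie algebra, so this is only an illustrative simplification with assumed names):

    import numpy as np

    def condition_gaussian(mu, Sigma, idx_a, idx_b, x_b):
        # p(a | b = x_b) for a joint Gaussian over (a, b):
        #   mu_{a|b}    = mu_a + S_ab S_bb^{-1} (x_b - mu_b)
        #   Sigma_{a|b} = S_aa - S_ab S_bb^{-1} S_ba
        # Here `a` plays the role of the kinematics and `b` the current pose.
        mu_a, mu_b = mu[idx_a], mu[idx_b]
        S_aa = Sigma[np.ix_(idx_a, idx_a)]
        S_ab = Sigma[np.ix_(idx_a, idx_b)]
        S_bb = Sigma[np.ix_(idx_b, idx_b)]
        gain = S_ab @ np.linalg.inv(S_bb)
        return mu_a + gain @ (x_b - mu_b), S_aa - gain @ S_ab.T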

Multimodal Object Classification using Random Clustering Trees   (Best Poster Award)
M.Villamizar, A.Garrell, A.Sanfeliu and F.Moreno-Noguer 
Iberian Conference on Pattern Recognition and Image Analysis (IBPRIA), 2015

@inproceedings{Villamizar_ibpria2015,
title = {Multimodal Object Classification using Random Clustering Trees},
author = {M. Villamizar and A. Garrell and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis (IBPRIA)},
year = {2015}
}

In this paper, we present an object recognition approach that additionally allows discovering intra-class modalities exhibiting highly correlated visual information. Unlike more conventional approaches based on computing multiple specialized classifiers, the proposed approach combines a single classifier, Boosted Random Ferns (BRFs), with probabilistic Latent Semantic Analysis (pLSA) in order to recognize an object class and to automatically discover the most prominent intra-class appearance modalities (clusters) through tree-structured visual words. The approach has been validated in synthetic and real experiments, where we show that the method is able to recognize objects with multiple appearances.

Modeling Robot’s World with Minimal Effort  
M.Villamizar, A.Garrell, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2015

@inproceedings{Villamizar_icra2015,
title = {Modeling Robot’s World with Minimal Effort},
author = {M. Villamizar and A. Garrell and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2015}
}

We propose a Human-Robot Interaction approach to efficiently model the appearance of all relevant objects in the robot’s environment. Given an input video stream recorded while the robot is navigating, the user just needs to annotate a very small number of frames to build specific classifiers for each of the objects of interest. At the core of the method, there are several random ferns classifiers that share the same features and are updated online. The resulting methodology is fast (runs at 8 fps), versatile (it can be applied to unconstrained scenarios), scalable (real experiments show we can model up to 30 different object classes), and minimizes the amount of human intervention by leveraging the uncertainty measures associated to each classifier. We thoroughly validate the approach on synthetic data and on real sequences acquired with a mobile platform in outdoor and challenging scenarios containing a multitude of different objects. We show that the human can, with minimal effort, provide the robot with a detailed model of the objects in the scene.

Efficient Monocular Pose Estimation for Complex 3D Models  
A.Rubio, M.Villamizar, L.Ferraz, A.Penate-Sanchez, A.Ramisa, E.Simo-Serra, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2015

@inproceedings{Rubio_icra2015,
title = {Efficient Monocular Pose Estimation for Complex 3D Models},
author = {A. Rubio and M. Villamizar and L. Ferraz and A. Penate-Sanchez and A. Ramisa and E. Simo-Serra and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2015}
}

We propose a robust and efficient method to estimate the pose of a camera with respect to complex 3D textured models of the environment that can potentially contain more than 100,000 points. To tackle this problem we follow a top-down approach where we combine high-level deep network classifiers with low-level geometric approaches to come up with a solution that is fast, robust and accurate. Given an input image, we initially use a pre-trained deep network to compute a rough estimate of the camera pose. This initial estimate constrains the number of 3D model points that can be seen from the camera viewpoint. We then establish 3D-to-2D correspondences between these potentially visible points of the model and the 2D detected image features. Accurate pose estimation is finally obtained from these correspondences using a novel PnP algorithm that rejects outliers without the need for a RANSAC strategy, and which is between 10 and 100 times faster than other methods that use it. Two real experiments dealing with very large and complex 3D models demonstrate the effectiveness of the approach.
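
A hedged sketch of this coarse-to-fine pipeline is given below: a rough pose restricts which model points can be visible, and only those are matched and passed to a PnP solver. OpenCV's solvePnP stands in for the paper's outlier-rejecting PnP, and coarse_pose_net and match_features are hypothetical placeholders.

import numpy as np
import cv2

def visible_subset(model_pts, R, t, K, img_size):
    """Keep 3D points that project inside the image and lie in front of the camera."""
    cam = (R @ model_pts.T + t.reshape(3, 1)).T
    in_front = cam[:, 2] > 0
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    h, w = img_size
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return np.where(in_front & inside)[0]

# R0, t0 = coarse_pose_net(image)                        # rough pose from a CNN (hypothetical)
# idx = visible_subset(model_pts, R0, t0, K, image.shape[:2])
# pts3d, pts2d = match_features(model_pts[idx], image)   # 3D-to-2D matches (hypothetical)
# ok, rvec, tvec = cv2.solvePnP(pts3d, pts2d, K, None)   # stand-in for the paper's PnP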

Workshop

Semantic Tuples for Evaluation of Image Sentence Generation  
L.Ellebracht, A.Ramisa, P.Swaroop, J.Cordero-Rama, F.Moreno-Noguer and A.Quattoni 
Vision and Language Workshop (in EMNLP), 2015

@inproceedings{Ellebracht_vl2015,
title = {Semantic Tuples for Evaluation of Image Sentence Generation},
author = {L. Ellebracht and A. Ramisa and P. Swaroop and J. Cordero-Rama and F. Moreno-Noguer and A. Quattoni},
booktitle = {Vision and Language Workshop (in EMNLP)},
year = {2015}
}

The automatic generation of image captions has received considerable attention. The problem of evaluating caption generation systems, though, has not been explored as much. We propose a novel evaluation approach based on comparing the underlying visual semantics of the candidate and ground-truth captions. With this goal in mind we have defined a semantic representation for visually descriptive language and have augmented a subset of the Flickr-8K dataset with semantic annotations. Our evaluation metric (BAST) can be used not only to compare systems but also to do error analysis and get a better understanding of the type of mistakes a system makes. To compute BAST we need to predict the semantic representation for the automatically generated captions. We use the Flickr-ST dataset to train classifiers that predict semantic tuples, so that evaluation can be fully automated.
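
For illustration, the snippet below scores a generated caption by comparing its predicted semantic tuples against the ground-truth tuples with a simple set-level F1; the tuple format is hypothetical and this is not the exact BAST computation.

def tuple_f1(predicted, reference):
    """Set-level F1 between predicted and ground-truth semantic tuples."""
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(reference)
    return 2 * precision * recall / (precision + recall)

reference = {("dog", "run", "grass"), ("dog", "wear", "collar")}
predicted = {("dog", "run", "grass"), ("man", "throw", "ball")}
print(tuple_f1(predicted, reference))    # 0.5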

2014

Journal

Learning RGB-D Descriptors of Garment Parts for Informed Grasping
A.Ramisa, G.Alenya, F.Moreno-Noguer and C.Torras
Engineering Applications of Artificial Intelligence (EEAI), 2014

@article{Ramisa_eeai2014,
title = {Learning RGB-D Descriptors of Garment Parts for Informed Grasping},
author = {A. Ramisa and G. Alenya and F. Moreno-Noguer and C. Torras},
journal = {Engineering Applications of Artificial Intelligence (EEAI)},
volume = {35},
issn = {0952-1976},
pages = {246-258},
doi = {10.1016/j.engappai.2014.06.025},
year = {2014},
month = {October}
}

Robotic handling of textile objects in household environments is an emerging application that has recently received considerable attention thanks to the development of domestic robots. Most current approaches follow a multiple re-grasp strategy for this purpose, in which clothes are sequentially grasped from different points until one of them yields a desired configuration. In this work we propose a vision-based method, built on the Bag of Visual Words approach, that combines appearance and 3D information to detect parts suitable for grasping in clothes, even when they are highly wrinkled. We also contribute a new, annotated, garment part dataset that can be used for benchmarking classification, part detection, and segmentation algorithms. The dataset is used to evaluate our approach and several state-of-the-art 3D descriptors for the task of garment part detection. Results indicate that appearance is a reliable source of information, but that augmenting it with 3D information can help the method perform better with new clothing items.
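
The appearance part of the pipeline can be sketched roughly as a standard bag-of-visual-words classifier, as below; descriptors and labels are random placeholders and the 3D (depth) features the paper combines with appearance are omitted.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
train_desc = rng.standard_normal((2000, 64))              # placeholder local descriptors
vocab = KMeans(n_clusters=50, n_init=10, random_state=0).fit(train_desc)

def bovw_histogram(descriptors):
    """Normalized visual-word histogram of one candidate grasping region."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# One histogram per region, with a binary "suitable grasping part" label.
X = np.vstack([bovw_histogram(rng.standard_normal((100, 64))) for _ in range(200)])
y = rng.integers(0, 2, size=200)
clf = LinearSVC().fit(X, y)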

Conference

Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection  
L.Ferraz, X.Binefa and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2014

@inproceedings{Ferraz_cvpr2014,
title = {Very Fast Solution to the PnP Problem with Algebraic Outlier Rejection},
author = {L. Ferraz and X. Binefa and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {501-508},
year = {2014}
}

We propose a real-time, accurate solution to the Perspective-n-Point (PnP) problem that is robust to outliers. The main advantages of our solution are twofold: first, it integrates the outlier rejection within the pose estimation pipeline with negligible computational overhead; and second, it scales to an arbitrarily large number of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system whose solution lies in its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space, and are progressively detected by projecting them onto an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting all 3D points back onto the image plane at each step, we achieve speed gains of more than 100× compared to RANSAC strategies. An extensive experimental evaluation shows that our solution yields accurate results in situations with up to 50% of outliers, and can process more than 1000 correspondences in less than 5 ms.
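
The core idea can be illustrated as follows: the pose parameters span the 1D null space of a linear system built from the matches, and outliers are the rows whose algebraic residual against the current null-space estimate is largest. The sketch uses a plain DLT parametrization rather than the paper's formulation, so it is illustrative only.

import numpy as np

def dlt_rows(X, u, v):
    """Two homogeneous equations per 3D point X and normalized pixel (u, v)."""
    Xh = np.append(X, 1.0)
    z = np.zeros(4)
    return np.stack([np.hstack([Xh, z, -u * Xh]),
                     np.hstack([z, Xh, -v * Xh])])

def algebraic_outlier_rejection(pts3d, pts2d, iters=5, keep=0.8):
    """Estimate the null-space solution while progressively discarding the
    correspondences with the largest algebraic residual."""
    A = np.vstack([dlt_rows(X, u, v) for X, (u, v) in zip(pts3d, pts2d)])
    inliers = np.arange(len(pts3d))
    for _ in range(iters):
        rows = np.sort(np.r_[2 * inliers, 2 * inliers + 1])
        _, _, Vt = np.linalg.svd(A[rows])
        p = Vt[-1]                                            # 1D null-space estimate
        res = np.linalg.norm((A @ p).reshape(-1, 2), axis=1)  # algebraic error per match
        inliers = np.where(res <= np.quantile(res[inliers], keep))[0]
    return p.reshape(3, 4), inliers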

Segmentation-aware Deformable Part Models  
E.Trulls, S.Tsogkas, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2014

@inproceedings{Trulls_cvpr2014,
title = {Segmentation-aware Deformable Part Models},
author = {E. Trulls and S. Tsogkas and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {168-175},
year = {2014}
}

In this work we propose a technique to combine bottom-up segmentation, coming in the form of SLIC superpixels, with sliding window detectors, such as Deformable Part Models (DPMs). The merit of our approach lies in ‘cleaning up’ the low-level HOG features by exploiting the spatial support of SLIC superpixels; this can be understood as using segmentation to split the feature variation into object-specific and background changes. Rather than committing to a single segmentation we use a large pool of SLIC superpixels and combine them in a scale-, position- and object-dependent manner to build soft segmentation masks. The segmentation masks can be computed fast enough to repeat this process over every candidate window, during training and detection, for both the root and part filters of DPMs. We use these masks to construct enhanced, background-invariant features to train DPMs. We test our approach on the PASCAL VOC 2007 dataset, outperforming the standard DPM in 17 out of 20 classes, yielding an average increase of 1.7% AP. Additionally, we demonstrate the robustness of this approach, extending it to dense SIFT descriptors for large displacement optical flow.

A High Performance CRF Model for Cloth Parsing  
E.Simo-Serra, S.Fidler, F.Moreno-Noguer and R.Urtasun 
Asian Conference on Computer Vision (ACCV), 2014

@inproceedings{Simo_accv2014,
title = {A High Performance CRF Model for Cloth Parsing},
author = {E. Simo-Serra and S. Fidler and F. Moreno-Noguer and R. Urtasun},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
year = {2014}
}

In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset and show that we can obtain a significant improvement over the state of the art.

LETHA: Learning from High Quality Inputs for 3D Pose Estimation in Low Quality Images  
A.Penate, F.Moreno-Noguer, J.Andrade and F.Fleuret 
International Conference on 3D Vision (3DV), 2014

@inproceedings{Penate_3dv2014,
title = {LETHA: Learning from High Quality Inputs for 3D Pose Estimation in Low Quality Images},
author = {A. Penate and F. Moreno-Noguer and J. Andrade and F. Fleuret},
booktitle = {Proceedings of the International Conference on 3D Vision (3DV)},
year = {2014}
}

We introduce LETHA (Learning on Easy data, Test on Hard), a new learning paradigm consisting of building strong priors from high quality training data, and combining them with discriminative machine learning to deal with low-quality test data. Our main contribution is an implementation of that concept for pose estimation. We first automatically build a 3D model of the object of interest from high-definition images, and devise from it a pose-indexed feature extraction scheme. We then train a single classifier to process these feature vectors. Given a low quality test image, we visit many hypothetical poses, extract features consistently and evaluate the response of the classifier. Since this process uses locations recorded during learning, it does not require matching points anymore. We use a boosting procedure to train this classifier common to all poses, which is able to deal with missing features, due in this context to self-occlusion. Our results demonstrate that the method combines the strengths of global image representations, discriminative even for very tiny images, and the robustness to occlusions of approaches based on local feature point descriptors.

Leveraging Feature Uncertainty in the PnP Problem  
L.Ferraz, X.Binefa and F.Moreno-Noguer 
British Machine Vision Conference (BMVC), 2014

@inproceedings{Ferraz_bmvc2014,
title = {Leveraging Feature Uncertainty in the PnP Problem},
author = {L. Ferraz and X. Binefa and F. Moreno-Noguer},
booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
year = {2014}
}

We propose a real-time and accurate solution to the Perspective-n-Point (PnP) problem –estimating the pose of a calibrated camera from n 3D-to-2D point correspondences– that exploits the fact that, in practice, the 2D position of not all features is estimated with the same accuracy. Assuming a model of such feature uncertainties is known in advance, we reformulate the PnP problem as a maximum likelihood minimization approximated by an unconstrained Sampson error function, which naturally penalizes the noisiest correspondences. The advantages of this approach are thoroughly demonstrated in synthetic experiments where feature uncertainties are exactly known. Pre-estimating the feature uncertainties in real experiments is, though, not easy. In this paper we model feature uncertainty as 2D Gaussian distributions representing the sensitivity of the 2D feature detectors to different camera viewpoints. When using these noise models with our PnP formulation we still obtain promising pose estimation results that outperform the most recent approaches.
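
A hedged sketch of the weighting idea: reprojection residuals are whitened with each feature's 2D covariance, so noisier detections contribute less to the pose refinement. It uses a plain whitened reprojection error rather than the paper's Sampson approximation, and the inputs (pts3d, pts2d, covs, K and the initial pose) are assumed to be given.

import numpy as np
import cv2
from scipy.optimize import least_squares

def residuals(pose, pts3d, pts2d, covs, K):
    """Reprojection residuals whitened by each feature's 2D covariance."""
    rvec, tvec = pose[:3], pose[3:]
    proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
    err = proj.reshape(-1, 2) - pts2d
    out = []
    for e, S in zip(err, covs):
        L = np.linalg.cholesky(np.linalg.inv(S))     # inv(S) = L @ L.T
        out.append(L.T @ e)                          # ||L.T e||^2 = e.T inv(S) e
    return np.concatenate(out)

# refined = least_squares(residuals, x0=initial_pose,   # rough 6-vector (rvec, tvec)
#                         args=(pts3d, pts2d, covs, K)).x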

Geodesic Finite Mixture Models  
E.Simo-Serra, C.Torras and F.Moreno-Noguer 
British Machine Vision Conference (BMVC), 2014

@inproceedings{Simo_bmvc2014,
title = {Geodesic Finite Mixture Models},
author = {E. Simo-Serra and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
year = {2014}
}

We present a novel approach for learning a finite mixture model on a Riemannian manifold in which Euclidean metrics are not applicable and one needs to resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration from a variant of the expectation-maximization algorithm that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying in a Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the problems associated with the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any type of manifold for which it is possible to estimate its tangent space. In particular, we show results on synthetic examples of a sphere and a quadric surface, and on a large and complex dataset of human poses, where the proposed model is used as a regression tool for hypothesizing the geometry of occluded parts of the body.
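
The building blocks for a single component can be sketched on the unit sphere as below: log/exp maps, a Karcher mean, and a covariance estimated in the tangent space at that mean. The full minimum-message-length EM over several components, as in the paper, is not shown, and the data are synthetic placeholders.

import numpy as np

def sphere_log(p, x):
    """Log map: tangent vector at p pointing towards x (both unit vectors)."""
    d = np.clip(p @ x, -1.0, 1.0)
    v = x - d * p
    n = np.linalg.norm(v)
    return np.zeros_like(p) if n < 1e-12 else np.arccos(d) * v / n

def sphere_exp(p, v):
    """Exp map: move away from p along the tangent vector v."""
    n = np.linalg.norm(v)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * v / n

def karcher_mean(points, iters=20):
    """Intrinsic mean on the sphere via repeated log/exp averaging."""
    mu = points[0]
    for _ in range(iters):
        mu = sphere_exp(mu, np.mean([sphere_log(mu, x) for x in points], axis=0))
    return mu

pts = np.random.randn(100, 3) * 0.2 + np.array([0.0, 0.0, 1.0])   # cluster near a pole
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
mu = karcher_mean(pts)
tangent = np.stack([sphere_log(mu, x) for x in pts])
cov = np.cov(tangent.T)        # component covariance in the tangent space at mu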

On-board Real-time Pose Estimation for UAVs using Deformable Visual Contour Registration  
A.Amor-Martinez, A.Ruiz, F.Moreno-Noguer and A.Sanfeliu 
International Conference on Robotics and Automation (ICRA), 2014

@inproceedings{Amor_icra2014,
title = {On-board Real-time Pose Estimation for UAVs using Deformable Visual Contour Registration},
author = {A. Amor-Martinez and A. Ruiz and F. Moreno-Noguer and A. Sanfeliu},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2014}
}

We present a real-time algorithm for estimating the pose of non-planar objects on which we have placed a visual marker. It is designed to overcome the limitations of small aerial robots, such as slow CPUs, low image resolution and geometric distortions produced by wide-angle lenses or viewpoint changes. The method initially registers the shape of a known marker to the contours extracted in an image. For this purpose, and in contrast to the state of the art, we do not seek to match textured patches or points of interest. Instead, we optimize a geometric alignment cost computed directly from raw polygonal representations of the observed regions using very simple and efficient clipping algorithms. Further speed is achieved by performing the optimization in the polygon representation space, avoiding the need for 2D image processing operations. Deformation modes are easily included in the optimization scheme, allowing an accurate registration of different markers attached to curved surfaces using a single deformable prototype. Once this initial registration is solved, the object pose is retrieved using a standard PnP approach. As a result, the method achieves accurate object pose estimation in real time, which is very important for interactive UAV tasks, for example short-distance surveillance or bar assembly. We present experiments where our method yields, at about 30 Hz, an average error of less than 5 mm in estimating the position of a 19×19 mm marker placed at 0.7 m from the camera.
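
The geometric alignment cost can be pictured with the following sketch, where a candidate similarity transform of the marker contour is scored by polygon overlap (here via shapely) and minimized with a generic optimizer. The marker and observed polygons are toy placeholders, and the paper's deformation modes and clipping-based cost are not reproduced.

from scipy.optimize import minimize
from shapely.geometry import Polygon
from shapely import affinity

marker = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])                    # model contour (toy)
observed = Polygon([(0.2, 0.1), (1.3, 0.2), (1.2, 1.2), (0.1, 1.0)])  # extracted region (toy)

def alignment_cost(params):
    """1 - IoU between the transformed marker and the observed region."""
    dx, dy, angle, scale = params
    cand = affinity.scale(marker, xfact=scale, yfact=scale, origin=(0, 0))
    cand = affinity.rotate(cand, angle, origin=(0, 0))
    cand = affinity.translate(cand, dx, dy)
    inter = cand.intersection(observed).area
    return 1.0 - inter / cand.union(observed).area

result = minimize(alignment_cost, x0=[0.0, 0.0, 0.0, 1.0], method='Nelder-Mead')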

Fast Online Learning and Detection of Natural Landmarks for Autonomous Aerial Robots  
M.Villamizar, A.Sanfeliu and F.Moreno-Noguer 
International Conference on Robotics and Automation (ICRA), 2014

@inproceedings{Villamizar_icra2014,
title = {Fast Online Learning and Detection of Natural Landmarks for Autonomous Aerial Robots},
author = {M. Villamizar and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the International Conference on Robotics and Automation (ICRA)},
year = {2014}
}

We present a method for efficiently detecting natural landmarks that can handle scenes with highly repetitive patterns and targets that progressively change their appearance. At the core of our approach lies a Random Ferns classifier that models the posterior probabilities of different views of the target using multiple independent Ferns, each containing features at particular positions of the target. A Shannon entropy measure is used to pick the most informative locations of these features. This minimizes the number of Ferns while maximizing their discriminative power, thus allowing robust detections at low computational cost. In addition, after an offline initialization, the new incoming detections are used to update the posterior probabilities on the fly, adapting to changing appearances that can occur due to the presence of shadows or occluding objects. All these virtues make the proposed detector appropriate for UAV navigation. Besides synthetic experiments that demonstrate the theoretical benefits of our formulation, we show applications for detecting landing areas in regions with highly repetitive patterns, and specific objects under the presence of cast shadows or sudden camera motions.
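
A toy sketch of one such Fern is shown below: binary pixel-pair tests index a posterior table that is updated online, and candidate test locations can be ranked by the Shannon entropy of their responses so that only informative ones are kept. The patch representation and test positions are hypothetical.

import numpy as np

class Fern:
    """One random fern: binary pixel-pair tests index an online posterior table."""
    def __init__(self, tests, n_classes, prior=1.0):
        self.tests = tests                            # list of ((r1, c1), (r2, c2))
        self.counts = np.full((2 ** len(tests), n_classes), prior)

    def index(self, patch):
        bits = ["1" if patch[a] > patch[b] else "0" for a, b in self.tests]
        return int("".join(bits), 2)

    def update(self, patch, label):                   # online posterior update
        self.counts[self.index(patch), label] += 1

    def posterior(self, patch):
        row = self.counts[self.index(patch)]
        return row / row.sum()

def test_entropy(patches, test):
    """Shannon entropy of one binary test over a set of patches (higher = more informative)."""
    a, b = test
    p = np.clip(np.mean([patch[a] > patch[b] for patch in patches]), 1e-9, 1 - 1e-9)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# fern = Fern(tests=[((3, 4), (10, 12)), ((0, 0), (7, 7))], n_classes=2)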

Workshop

Efficient Monocular 3D Pose Estimation using Complex 3D Models   (Best Paper Award)
A.Rubio, M.Villamizar, L.Ferraz, A.Penate, A.Sanfeliu and F.Moreno-Noguer 
Jornadas de Automatica, 2014

@inproceedings{Rubio_jornadas2014,
title = {Efficient Monocular 3D Pose Estimation using Complex 3D Models},
author = {A. Rubio and M. Villamizar and L. Ferraz and A. Penate and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Jornadas de Automatica},
year = {2014}
}


2013

Journal

Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation 
A.Penate-Sanchez, J.Andrade-Cetto and F.Moreno-Noguer
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013

@article{Penate_pami2013,
title = {Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation},
author = {A. Penate-Sanchez and J. Andrade-Cetto and F. Moreno-Noguer},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {35},
number = {10},
issn = {0162-8828},
pages = {2387-2400},
doi = {10.1109/TPAMI.2013.36},
year = {2013},
month = {October}
}


Stochastic Exploration of Ambiguities for Nonrigid Shape Recovery 
F.Moreno-Noguer and P.Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2013

@article{Moreno_pami2013,
title = {Stochastic Exploration of Ambiguities for Nonrigid Shape Recovery},
author = {F. Moreno-Noguer and P. Fua},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
volume = {35},
number = {2},
issn = {0162-8828},
pages = {463-475},
doi = {10.1109/TPAMI.2012.102},
year = {2013},
month = {February}
}

Recovering the 3D shape of deformable surfaces from single images is known to be a highly ambiguous problem because many different shapes may have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce a stochastic sampling approach to efficiently explore the set of solutions of an objective function based on point correspondences. This allows us to propose a small set of ambiguous candidate 3D shapes and then use additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem. The advantages of our method are demonstrated on a variety of problems including both real and synthetic data.

Conference

Dense Segmentation-Aware Descriptors  
E.Trulls, I.Kokkinos, A.Sanfeliu and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2013

@inproceedings{Trulls_cvpr2013,
title = {Dense Segmentation-Aware Descriptors},
author = {E. Trulls and I. Kokkinos and A. Sanfeliu and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {2890-2897},
year = {2013}
}

In this work we exploit segmentation to construct appearance descriptors that can robustly deal with occlusion and background changes. For this, we downplay measurements coming from areas that are unlikely to belong to the same region as the descriptor’s center, as suggested by soft segmentation masks. Our treatment is applicable to any image point, i.e. dense, and its computational overhead is in the order of a few seconds. We integrate this idea with Dense SIFT, and also with Dense Scale and Rotation Invariant Descriptors (SID), delivering descriptors that are densely computable, invariant to scaling and rotation, and robust to background changes. We apply our approach to standard benchmarks on large displacement motion estimation using SIFT-flow and wide-baseline stereo, systematically demonstrating that the introduction of segmentation yields clear improvements.
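
A rough sketch of the weighting idea: a mask derived from SLIC superpixels states how likely each pixel is to belong to the same region as the descriptor centre, and is used to downweight SIFT-like cell measurements. The sketch uses a single hard segmentation instead of the paper's pool of soft masks, and gradient_magnitude is assumed to be precomputed.

import numpy as np
from skimage.data import astronaut
from skimage.segmentation import slic

image = astronaut()
segments = slic(image, n_segments=200, compactness=10)

def soft_mask(center, half=16):
    """Mask over a patch: 1 where the superpixel label matches the centre label."""
    r, c = center
    patch = segments[r - half:r + half, c - half:c + half]
    return (patch == segments[r, c]).astype(float)

def weighted_cells(gradient_magnitude, center, half=16, cells=4):
    """SIFT-like cell sums where background pixels are suppressed by the mask."""
    r, c = center
    patch = gradient_magnitude[r - half:r + half, c - half:c + half]
    w = patch * soft_mask(center, half)
    s = 2 * half // cells
    return w.reshape(cells, s, cells, s).sum(axis=(1, 3))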

A Joint Model for 2D and 3D Pose Estimation from a Single Image  
E.Simo-Serra, A.Quattoni, C.Torras and F.Moreno-Noguer 
Conference on Computer Vision and Pattern Recognition (CVPR), 2013

@inproceedings{Simo_cvpr2013,
title = {A Joint Model for 2D and 3D Pose Estimation from a Single Image},
author = {E. Simo-Serra and A. Quattoni and C. Torras and F. Moreno-Noguer},
booktitle = {Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)},
pages = {3634-3641},
year = {2013}
}

We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Real experimentation demonstrates competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimations even when the 2D detectors are inaccurate.

Simultaneous Pose, Focal Length and 2D-to-3D Correspondences from Noisy Observations  
A.Penate-Sanchez, E.Serradell, J.Andrade-Cetto and F.Moreno-Noguer 
British Machine Vision Conference (BMVC), 2013

@inproceedings{Penate_bmvc2013,
title = {Simultaneous Pose, Focal Length and 2D-to-3D Correspondences from Noisy Observations},
author = {A. Penate-Sanchez and E. Serradell and J. Andrade-Cetto and F. Moreno-Noguer},
booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
year = {2013}
}

Simultaneously recovering the camera pose and correspondences between a set of 2D-image and 3D-model points is a difficult problem, especially when the 2D-3D matches cannot be established based on appearance only. The problem becomes even more challenging when input images are acquired with an uncalibrated camera with varying zoom, which yields strong ambiguities between translation and focal length. We present a solution to this problem using only geometrical information. Our approach owes its robustness to an initial stage in which the joint pose and focal length solution space is split into several Gaussian regions. At runtime, each of these regions is explored using a hypothesize-and-test approach, in which the potential number of 2D-3D matches is progressively reduced using informed search through Kalman updates, iteratively refining the pose and focal length parameters. The technique is exhaustive but efficient, significantly improving previous methods in terms of robustness to outliers and noise.

Active Testing Search for Point Cloud Matching  
M.Pinheiro, R.Sznitman, E.Serradell, J.Kybic, F.Moreno-Noguer and P.Fua 
Information Processing in Medical Imaging (IPMI), 2013

@inproceedings{Pinheiro_ipmi2013,
title = {Active Testing Search for Point Cloud Matching},
author = {M. Pinheiro and R. Sznitman and E. Serradell and J. Kybic and F. Moreno-Noguer and P. Fua},
booktitle = {Proceedings of the Information Processing in Medical Imaging (IPMI)},
year = {2013}
}

We present a general approach for solving the point-cloud matching problem for the case of mildly nonlinear transformations. Our method quickly finds a coarse approximation of the solution by exploring a reduced set of partial matches using an approach to which we refer as Active Testing Search (ATS). We apply the method to the registration of graph structures by branching point matching. It is based solely on the geometric position of the points; no additional information is used, nor knowledge of an initial alignment. In the second stage, we use dynamic programming to refine the solution. We tested our algorithm on angiography, retinal fundus, and neuronal data gathered using electron and light microscopy. We show that our method solves cases not solved by most approaches, and is faster than the remaining ones.