PhD Thesis

Enhancing low-level features with mid-level cues

Information

  • Started: 03/03/2009
  • Finished: 20/02/2015

Description

Local features have become an essential tool in visual recognition. Much of the progress in computer vision over the past decade has built on simple, local representations such as SIFT or HOG. SIFT in particular shifted the paradigm in feature representation. Subsequent works have often focused on improving either computational efficiency or invariance properties.

This thesis arguably belongs to the latter group. Invariance is a particularly relevant aspect if we intend to work with dense features (extracted at every pixel of an image). The traditional approach to sparse matching relies on stable interest points, such as corners, where scale and orientation can be estimated reliably, thus enforcing invariance. This is not applicable to dense features, which must be computed at arbitrary points. Dense features have been shown to outperform sparse matching techniques in many recognition problems, and they form the bulk of our work.
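
As an illustration, dense descriptors can be obtained by simply skipping the detection stage. The following is a minimal sketch using OpenCV's SIFT; the filename, grid stride, and fixed keypoint scale are placeholder assumptions rather than the settings used in the thesis:

    import cv2

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder file
    sift = cv2.SIFT_create()
    step, size = 4, 8  # grid stride and fixed keypoint scale (assumptions)
    # No detection step: descriptors are computed on a fixed grid, with
    # scale and orientation held constant rather than estimated per point.
    grid = [cv2.KeyPoint(float(x), float(y), size)
            for y in range(0, img.shape[0], step)
            for x in range(0, img.shape[1], step)]
    grid, desc = sift.compute(img, grid)
    # desc holds one 128-D SIFT descriptor per grid location

Because no detector selects scale or orientation, any invariance must come from the descriptor itself, which motivates the cues explored below.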

In this thesis we present strategies to enhance low-level, local features with mid-level, global cues. We devise techniques to construct better features, and use them to handle complex ambiguities, occlusions, and background changes. To deal with ambiguities, we explore the use of motion, enforcing temporal consistency with optical flow priors. We also introduce a novel technique to exploit segmentation cues, and use it to extract features invariant to background variability. To do so, we downplay image measurements that most likely belong to a region other than the one where the descriptor is computed. In both cases we follow the same strategy: we incorporate mid-level, 'big picture' information into the construction of local features, and then use them in the same manner as we would the baseline features.
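
To make the segmentation idea concrete, the toy sketch below weights patch measurements by a segmentation mask before pooling. The function, its inputs, and the hard 0/1 mask are our simplifications for illustration, not the exact formulation used in the thesis:

    import numpy as np

    def segmentation_aware_descriptor(gradients, seg_labels, center, radius):
        """Toy sketch: pool patch measurements while downplaying pixels
        that likely belong to a different region than the center pixel.

        gradients  : (H, W) array of per-pixel gradient magnitudes
        seg_labels : (H, W) integer labels from a bottom-up segmentation
        center     : (row, col) the descriptor is anchored at; assumed to
                     lie at least `radius` pixels from the image border
        """
        r, c = center
        patch = gradients[r - radius:r + radius + 1, c - radius:c + radius + 1]
        labels = seg_labels[r - radius:r + radius + 1, c - radius:c + radius + 1]
        # Hard 0/1 mask for clarity; a real implementation would use soft
        # segmentation affinities instead of discarding measurements outright.
        mask = (labels == seg_labels[r, c]).astype(float)
        desc = (patch * mask).ravel()  # trivial pooling; SIFT/HOG would pool
        n = np.linalg.norm(desc)       # into oriented gradient histograms
        return desc / n if n > 0 else desc

In practice the mask and the descriptors are computed densely, so the enhanced features can be dropped into any pipeline that consumes the baseline ones.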

We apply these techniques to different feature representations, including SIFT and HOG, and use them to address canonical vision problems such as stereo and object detection, demonstrating that the introduction of global cues yields consistent improvements. We prioritize solutions that are simple, general, and efficient.

Our main contributions are as follows:
(a) An approach to dense stereo reconstruction with spatiotemporal features which, unlike existing works, remains applicable to wide baselines (see the sketch after this list).
(b) A technique to exploit segmentation cues to construct dense descriptors invariant to background variability, such as occlusions or background motion.
(c) A technique to integrate bottom-up segmentation with recognition efficiently, amenable to sliding window detectors.
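
As a rough illustration of contribution (a), the sketch below concatenates dense descriptors sampled along optical-flow trajectories. All names, shapes, and the flow convention are our assumptions, not the construction used in the thesis:

    import numpy as np

    def spatiotemporal_descriptor(descs, flows, t, pos, n_frames=2):
        """Hypothetical sketch: stack per-frame descriptors along an
        optical-flow trajectory to build a spatiotemporal descriptor.

        descs : list of (H, W, D) dense descriptor maps, one per frame
        flows : list of (H, W, 2) forward flow fields, frame i -> i + 1,
                stored as (dx, dy) displacements (an assumed convention)
        t     : reference frame index (t + n_frames must stay in range)
        pos   : (row, col) pixel in the reference frame
        """
        r, c = pos
        stacked = [descs[t][r, c]]
        for i in range(n_frames):
            dc, dr = flows[t + i][r, c]  # follow the flow forward one frame
            r = int(np.clip(round(r + dr), 0, descs[0].shape[0] - 1))
            c = int(np.clip(round(c + dc), 0, descs[0].shape[1] - 1))
            stacked.append(descs[t + i + 1][r, c])
        return np.concatenate(stacked)

Matching such stacked descriptors penalizes candidates that look plausible in a single frame but are inconsistent over time, which is how temporal consistency enters the stereo cost.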

This work falls within the scope of the following projects:

  • PAU+: Perception and Action in Robotics Problems with Large State Spaces
  • ARCAS: Aerial Robotics Cooperative Assembly System
  • ViSen: Visual Sense, Tagging visual data with semantic descriptions