Publication

Modeling long-term interactions to enhance action recognition

Conference Article

Conference

International Conference on Pattern Recognition (ICPR)

Edition

25th

Pages

10351-10358

Doc link

https://doi.org/10.1109/ICPR48806.2021.9412148

File

Download the PDF of the document

Authors

A. Cartas, P. Radeva and M. Dimiccoli

Abstract

In this paper, we propose a new approach to understanding actions in egocentric videos that exploits the semantics of object interactions at both the frame and temporal levels. At the frame level, we use a region-based approach that takes as input a primary region roughly corresponding to the user's hands and a set of secondary regions potentially corresponding to the interacting objects, and computes the action score through a CNN formulation. This information is then fed to a Hierarchical Long Short-Term Memory Network (HLSTM) that captures temporal dependencies between actions within and across shots. Ablation studies thoroughly validate the proposed approach, showing in particular that both levels of the HLSTM architecture contribute to performance improvement. Furthermore, quantitative comparisons show that the proposed approach outperforms the state of the art in action recognition on standard benchmarks, without relying on motion information.
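
To make the two-level temporal model concrete, the following is a minimal PyTorch sketch of a hierarchical LSTM of the kind the abstract describes: a lower LSTM aggregates per-frame CNN scores within a shot, and an upper LSTM models dependencies across shots. All module names, feature dimensions, and the class count here are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of a two-level (hierarchical) LSTM for action scoring.
    # Dimensions, names, and the number of action classes are assumptions
    # for illustration only; they do not reproduce the paper's model.
    import torch
    import torch.nn as nn

    class HierarchicalLSTM(nn.Module):
        def __init__(self, feat_dim=2048, hidden=512, num_actions=106):
            super().__init__()
            # Lower level: dependencies between frames within a shot.
            self.frame_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            # Upper level: dependencies between consecutive shots.
            self.shot_lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, num_actions)

        def forward(self, shots):
            # shots: (batch, num_shots, frames_per_shot, feat_dim),
            # i.e. per-frame CNN features (e.g. region-pooled descriptors).
            b, s, f, d = shots.shape
            frames = shots.reshape(b * s, f, d)
            _, (h_n, _) = self.frame_lstm(frames)   # last hidden state per shot
            shot_repr = h_n[-1].reshape(b, s, -1)   # (batch, num_shots, hidden)
            out, _ = self.shot_lstm(shot_repr)      # (batch, num_shots, hidden)
            return self.classifier(out)             # per-shot action scores

    model = HierarchicalLSTM()
    scores = model(torch.randn(2, 4, 8, 2048))  # 2 videos, 4 shots, 8 frames
    print(scores.shape)  # torch.Size([2, 4, 106])

Ablating either level here (e.g. mean-pooling frames instead of the frame LSTM) would correspond to the two-level ablation the abstract reports.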

Categories

Pattern recognition

Scientific reference

A. Cartas, P. Radeva and M. Dimiccoli. Modeling long-term interactions to enhance action recognition, 25th International Conference on Pattern Recognition (ICPR), 2021, Milan, Italy (Virtual), pp. 10351-10358.