Publication
Seeing and hearing egocentric actions: How much can we learn?
Conference Article
Conference
ICCV Workshop on Egocentric Perception, Interaction and Computing (EPIC)
Edition
2019
Pages
4470-4480
Doc link
https://doi.org/10.1109/ICCVW.2019.00548
Authors
- Cartas, Alejandro
- Luque, Jordi
- Radeva, Petia
- Segura, Carlos
- Dimiccoli, Mariella
Abstract
Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
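The abstract describes the model only at a high level. The following is a minimal, self-contained sketch of the two ideas it names, sparse temporal sampling and late fusion of per-stream class scores; it is not the paper's code. The stream networks are stood in for by random score generators, and all function names, seeds, and fusion weights are hypothetical placeholders chosen for illustration.

```python
# Minimal sketch: sparse temporal sampling + late fusion of three streams
# (spatial, temporal, audio). NumPy only; stream networks are placeholders.
import numpy as np

NUM_CLASSES = 125   # placeholder class count (e.g. verb classes)
NUM_SEGMENTS = 3    # sparse sampling: one snippet drawn per segment

def sample_snippets(num_frames: int, num_segments: int) -> list[int]:
    """Split the video into equal segments and draw one frame index
    at random from each segment (sparse temporal sampling)."""
    edges = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return [np.random.randint(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]

def stream_scores(snippet_idx: int, stream_seed: int) -> np.ndarray:
    """Placeholder for a per-stream network returning unnormalised
    class scores for one snippet. A real model would run a CNN here."""
    rng = np.random.default_rng(snippet_idx * 31 + stream_seed)
    return rng.standard_normal(NUM_CLASSES)

def video_scores(snippets: list[int], stream_seed: int) -> np.ndarray:
    """Segmental consensus: average snippet-level scores over the video."""
    return np.mean([stream_scores(i, stream_seed) for i in snippets], axis=0)

# Late fusion: each stream produces its own video-level class scores,
# which are then combined. The weights below are arbitrary placeholders.
snippets = sample_snippets(num_frames=900, num_segments=NUM_SEGMENTS)
spatial  = video_scores(snippets, stream_seed=0)
temporal = video_scores(snippets, stream_seed=1)
audio    = video_scores(snippets, stream_seed=2)
fused = 0.4 * spatial + 0.4 * temporal + 0.2 * audio
print("predicted class:", int(np.argmax(fused)))
```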
Categories
Pattern recognition
Scientific reference
A. Cartas, J. Luque, P. Radeva, C. Segura and M. Dimiccoli. Seeing and hearing egocentric actions: How much can we learn?, 2019 ICCV Workshop on Egocentric Perception, Interaction and Computing (EPIC), Seoul, South Korea, pp. 4470-4480.