Publication

Seeing and hearing egocentric actions: How much can we learn?

Conference Article

Conference

ICCV Workshop on Egocentric Perception, Interaction and Computing (EPIC)

Edition

2019

Pages

4470-4480

Doc link

https://doi.org/10.1109/ICCVW.2019.00548

Authors

A. Cartas, J. Luque, P. Radeva, C. Segura and M. Dimiccoli

Abstract

Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality; in particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
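To make the fusion scheme described in the abstract concrete, below is a minimal PyTorch sketch of late fusion over sparsely sampled segments. It assumes pre-extracted per-segment features for each stream (the actual model uses CNN backbones over RGB frames, optical flow, and audio); the feature dimension, class count, and equal-weight averaging are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LateFusionHead(nn.Module):
    """Toy late-fusion classifier over pre-extracted per-segment features.

    Each input has shape (batch, num_segments, feat_dim), mimicking sparse
    temporal sampling: a few segments are drawn per video and their scores
    are averaged (segment consensus) before fusing the three streams.
    """

    def __init__(self, feat_dim: int = 512, num_classes: int = 125):
        super().__init__()
        self.spatial_fc = nn.Linear(feat_dim, num_classes)   # RGB stream
        self.temporal_fc = nn.Linear(feat_dim, num_classes)  # optical-flow stream
        self.audio_fc = nn.Linear(feat_dim, num_classes)     # audio stream

    def forward(self, spatial, temporal, audio):
        # Segment consensus: average per-segment class scores within each stream.
        s = self.spatial_fc(spatial).mean(dim=1)
        t = self.temporal_fc(temporal).mean(dim=1)
        a = self.audio_fc(audio).mean(dim=1)
        # Late fusion: combine stream-level scores (equal weights, an assumption here).
        return (s + t + a) / 3.0


if __name__ == "__main__":
    batch, segments, feat_dim = 2, 3, 512
    model = LateFusionHead(feat_dim=feat_dim, num_classes=125)
    spatial = torch.randn(batch, segments, feat_dim)   # stand-in RGB features
    temporal = torch.randn(batch, segments, feat_dim)  # stand-in flow features
    audio = torch.randn(batch, segments, feat_dim)     # stand-in audio features
    scores = model(spatial, temporal, audio)
    print(scores.shape)  # torch.Size([2, 125])
```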

Categories

Pattern recognition

Scientific reference

A. Cartas, J. Luque, P. Radeva, C. Segura and M. Dimiccoli. Seeing and hearing egocentric actions: How much can we learn?, 2019 ICCV Workshop on Egocentric Perception, Interaction and Computing, Seoul, South Korea, pp. 4470-4480.