Publication

3M-Transformer: A multi-stage multi-stream multimodal transformer for embodied turn-taking prediction

Conference Article

Conference

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Edition

2024

Pages

8050-8054

Doc link

https://doi.org/10.1109/ICASSP48485.2024.10448136

File

Download the PDF of the document

Abstract

Predicting turn-taking in multiparty conversations has many practical applications in human-computer/robot interaction. However, the complexity of human communication makes it a challenging task. Recent advances have shown that synchronous multi-perspective egocentric data can significantly improve turn-taking prediction compared to asynchronous, single-perspective transcriptions. Building on this research, we propose a new multimodal transformer-based architecture for predicting turn-taking in embodied, synchronized multi-perspective data. Our experimental results on the recently introduced EgoCom dataset show a substantial performance improvement of up to 14.01% on average compared to existing baselines and alternative transformer-based approaches.
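The abstract and keywords describe a cross-modal transformer operating on synchronized audio, video, and text streams. As a rough illustration only, the following is a minimal, hypothetical PyTorch sketch of a three-stream encoder with pairwise cross-attention fusion; all class names, dimensions, and the fusion scheme are assumptions for illustration and do not reproduce the paper's actual 3M-Transformer architecture.

# Hypothetical sketch (not the paper's method): each modality (audio, video,
# text) is encoded by its own transformer stream, then fused with
# cross-attention before a binary turn-taking prediction head.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Toy three-stream encoder with pairwise cross-attention fusion."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        # One self-attention encoder layer per modality stream.
        self.streams = nn.ModuleDict({
            m: nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                          batch_first=True)
            for m in ("audio", "video", "text")
        })
        # Cross-attention: each stream attends to the other two (concatenated).
        self.cross = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, heads, batch_first=True)
            for m in ("audio", "video", "text")
        })
        self.head = nn.Linear(3 * dim, 1)  # turn-taking logit

    def forward(self, feats):
        # feats: dict of (batch, time, dim) tensors for audio/video/text.
        enc = {m: self.streams[m](x) for m, x in feats.items()}
        fused = []
        for m in enc:
            others = torch.cat([enc[o] for o in enc if o != m], dim=1)
            out, _ = self.cross[m](enc[m], others, others)
            fused.append(out.mean(dim=1))  # temporal pooling
        return self.head(torch.cat(fused, dim=-1))  # (batch, 1)


# Example: synchronized features for one speaker over 20 time steps.
feats = {m: torch.randn(2, 20, 256) for m in ("audio", "video", "text")}
print(CrossModalFusion()(feats).shape)  # torch.Size([2, 1])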

Categories

pattern recognition

Author keywords

cross-modal transformer, turn-taking prediction, embodied multi-perspective data, audio-video-text

Scientific reference

M. Fatan, E. Mincato, D. Pintzou and M. Dimiccoli. 3M-Transformer: A multi-stage multi-stream multimodal transformer for embodied turn-taking prediction. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, 2024, pp. 8050-8054.