Publication
3M-Transformer: A multi-stage multi-stream multimodal transformer for embodied turn-taking prediction
Conference Article
Conference
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Edition
2024
Pages
8050-8054
Doc link
https://doi.org/10.1109/ICASSP48485.2024.10448136
Abstract
Predicting turn-taking in multiparty conversations has many practical applications in human-computer/robot interaction. However, the complexity of human communication makes it a challenging task. Recent advances have shown that synchronous multi-perspective egocentric data can significantly improve turn-taking prediction compared to asynchronous, single-perspective transcriptions. Building on this research, we propose a new multimodal transformer-based architecture for predicting turn-taking in embodied, synchronized multi-perspective data. Our experimental results on the recently introduced EgoCom dataset show a substantial performance improvement of up to 14.01% on average compared to existing baselines and alternative transformer-based approaches.
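The abstract describes a transformer that fuses audio, video, and text streams from synchronized egocentric recordings. As a rough illustration of that idea, the sketch below shows a minimal multi-stream cross-modal fusion model with a binary turn-taking head. All module names, dimensions, the pairwise attention scheme, and the three-stage layout are assumptions made for exposition; this is not the authors' 3M-Transformer implementation.

```python
# Illustrative sketch only: a minimal multi-stream cross-modal fusion model
# for binary turn-taking prediction. Dimensions and the fusion scheme are
# assumptions for exposition, NOT the published 3M-Transformer architecture.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One stream attends to another via multi-head cross-attention."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query, context):
        # Residual cross-attention: `query` tokens attend to `context` tokens.
        attended, _ = self.attn(query, context, context)
        x = self.norm1(query + attended)
        return self.norm2(x + self.ff(x))

class TurnTakingFusion(nn.Module):
    """Fuses per-frame audio, video, and text features and emits a single
    binary logit for an upcoming turn change (hypothetical head)."""
    def __init__(self, audio_dim=128, video_dim=512, text_dim=768, dim=256):
        super().__init__()
        # Stage 1: project each modality stream into a shared embedding space.
        self.proj = nn.ModuleDict({
            "audio": nn.Linear(audio_dim, dim),
            "video": nn.Linear(video_dim, dim),
            "text": nn.Linear(text_dim, dim),
        })
        # Stage 2: cross-modal attention enriching the audio stream with
        # visual, then textual context (one of many possible pairings).
        self.audio_video = CrossModalBlock(dim)
        self.audio_text = CrossModalBlock(dim)
        # Stage 3: a shallow self-attention encoder over the fused sequence.
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)  # binary turn-change logit

    def forward(self, audio, video, text):
        a = self.proj["audio"](audio)
        v = self.proj["video"](video)
        t = self.proj["text"](text)
        fused = self.audio_text(self.audio_video(a, v), t)
        fused = self.encoder(fused)
        # Mean-pool over time, then predict the turn-taking logit.
        return self.head(fused.mean(dim=1)).squeeze(-1)

if __name__ == "__main__":
    model = TurnTakingFusion()
    # Toy batch: 2 clips, 50 time steps per modality stream.
    logits = model(torch.randn(2, 50, 128),
                   torch.randn(2, 50, 512),
                   torch.randn(2, 50, 768))
    print(logits.shape)  # torch.Size([2])
```

In this sketch the audio stream is treated as the query and the other modalities as context; the paper's multi-stage multi-stream design may combine the streams differently.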
Categories
Pattern recognition
Author keywords
cross-modal transformer, turn-taking prediction, embodied multi-perspective data, audio-video-text
Scientific reference
M. Fatan, E. Mincato, D. Pintzou and M. Dimiccoli. 3M-Transformer: A multi-stage multi-stream multimodal transformer for embodied turn-taking prediction. 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, 2024, pp. 8050-8054.