Master Thesis

Self-supervised deep learning of multimodal representations for turn-taking prediction




  • If you are interested in the proposal, please contact the supervisors.


Turn-taking prediction, understood as the task of predicting who is going to talk a few seconds ahead, is a fundamental task for conversational systems. It has numerous human-centered applications, such as early diagnosis and intervention for communication disorders like autism and human-robot communication, to name but a few.

Recently, supervised deep learning approaches have achieved impressive performance on a wide variety of AI tasks. However, these methods rely on the availability of large amounts of labeled data. In many applications, labeled data are scarce or difficult to acquire, for example because annotation requires expert knowledge.
To cope with this problem, self-supervised approaches have emerged as a new deep learning paradigm in which a model is trained on a proxy task with pseudo-labels that come for free from the data themselves, hence without requiring any manual annotation.
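
As an illustration of pseudo-labels that come for free from the data, the sketch below builds training pairs for one possible proxy task: deciding whether an audio clip and a video clip come from the same instant of a recording. This proxy task is an assumption chosen for illustration only; the proposal does not prescribe one. The function name make_sync_pairs and its parameters are hypothetical.

```python
import random


def make_sync_pairs(audio, video, min_shift=1, seed=0):
    """Build (audio_clip, video_clip, pseudo_label) triples from one recording.

    `audio` and `video` are equal-length sequences of time-aligned clips.
    Pseudo-label 1 means the two clips share the same time index;
    pseudo-label 0 means the video clip was taken at least `min_shift`
    steps away. No manual annotation is involved.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video must be time-aligned (same length)")
    if min_shift < 1:
        raise ValueError("min_shift must be at least 1")
    n = len(audio)
    if n <= 2 * min_shift:
        raise ValueError("recording too short for the requested min_shift")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pairs = []
    for t in range(n):
        # Positive pair: audio and video from the same instant.
        pairs.append((audio[t], video[t], 1))
        # Negative pair: same audio, video from a sufficiently distant instant.
        candidates = [s for s in range(n) if abs(s - t) >= min_shift]
        pairs.append((audio[t], video[rng.choice(candidates)], 0))
    return pairs


if __name__ == "__main__":
    # Placeholder clip names stand in for real audio and video segments.
    audio_clips = ["a0", "a1", "a2", "a3"]
    video_clips = ["v0", "v1", "v2", "v3"]
    for audio_clip, video_clip, label in make_sync_pairs(audio_clips, video_clips):
        print(audio_clip, video_clip, label)
```

A model trained to separate the two kinds of pairs must relate what is heard to what is seen, which is the sort of cross-modal structure a learned representation is expected to capture.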

In this project, the student will develop a multimodal self-supervised approach that learns an embedding space for multimodal data (audio, video and text). The learned representation will be validated on the challenging task of turn-taking prediction, which is traditionally approached in a fully supervised fashion.

Student profile:
The main requirement for this project is fluency in Python. Prior knowledge of deep learning is a plus. For any questions, please contact Mariella Dimiccoli.