Master's Thesis

Learning multisensory representations from unlabeled videos

  • Started: 17/09/2019


Self-supervised methods learn features without human supervision by training a model to solve a pretext task derived from the input data itself. These methods are particularly attractive because they do not require time-consuming and expensive data annotation.

In this project, the student will develop a self-supervised deep learning model that learns a multisensory representation, that is, a representation that fuses the visual and audio components of a video signal. The learned video representation will be evaluated quantitatively on tasks for which temporal encoding is crucial, such as temporal action segmentation and action prediction. Outstanding students may apply for a grant.
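To illustrate the kind of pretext task such a model could be trained on, a common choice in audio-visual self-supervision is correspondence verification: the model must decide whether an audio clip and a video clip come from the same moment, and the labels come for free from the data. The sketch below is purely illustrative (not the project's actual method): it uses synthetic "visual" and "audio" feature vectors that share a latent factor, fuses them by elementwise product, and trains a small logistic-regression head to tell aligned from misaligned pairs. All names, dimensions, and hyperparameters are assumptions for the toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired features: each "clip" has a visual and an audio vector
# drawn from a shared latent factor, so aligned pairs are correlated.
n, d = 400, 16
latent = rng.normal(size=(n, d))
visual = latent + 0.1 * rng.normal(size=(n, d))
audio = latent + 0.1 * rng.normal(size=(n, d))

# Pretext task (labels derived from the data itself): does this
# (visual, audio) pair come from the same clip (1) or not (0)?
perm = rng.permutation(n)
x_pos = visual * audio          # fused aligned pairs (elementwise product)
x_neg = visual * audio[perm]    # fused misaligned pairs
x = np.vstack([x_pos, x_neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Minimal logistic-regression head over the fused features,
# trained with plain batch gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    w -= 0.5 * (x.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(x @ w + b)))) > 0.5
acc = np.mean(pred == y)
print(f"correspondence accuracy: {acc:.2f}")
```

In the actual project the handcrafted product fusion would be replaced by learned visual and audio encoders (e.g. deep networks over frames and spectrograms), but the supervision signal has the same shape: positives are temporally aligned audio-visual pairs, negatives are misaligned ones.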

Student profile:
The main requirement for this project is fluency in Python. A strong background in machine learning is a plus. For any questions, please contact Mariella Dimiccoli.