Master Thesis

Self-supervised learning for action segmentation using a Transformer architecture


  • Started: 01/04/2023
  • Finished: 21/09/2023


This project addresses the problem of Temporal Action Segmentation (TAS), which consists of temporally segmenting and classifying fine-grained
actions in untrimmed videos. Improving this task is an important but difficult challenge: different actions can occur at different speeds or with
different durations, and some actions are ambiguous or overlap. Progress on this problem can yield substantial advances in many domains, including
robotics, medical support technologies, and surveillance.
Currently, the best-performing state-of-the-art methods are fully supervised.
Consequently, they incur a high annotation cost, do not scale well, and are poorly suited to applications where data collection is costly. To alleviate
this problem, we propose a self-supervised Transformer-based method for action segmentation that does not require action labels, and we demonstrate the
effectiveness of the learned weights in a weakly-supervised setting. Specifically, we built a Siamese architecture based on an improved version of an
existing Transformer architecture. To validate our approach, we performed an ablation
study and compared our results with the state of the art.
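
To give a rough idea of what a Siamese Transformer setup can look like, the following is a minimal PyTorch sketch, not the architecture developed in the thesis. The class names `FrameEncoder` and `SiameseNet`, the feature dimensions, the additive-noise "augmentation", and the cosine-similarity objective are all assumptions made for illustration; the actual encoder, augmentations, and loss are not described in this summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameEncoder(nn.Module):
    """Hypothetical Transformer encoder over a sequence of per-frame features."""

    def __init__(self, feat_dim=64, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        # x: (batch, frames, feat_dim) -> (batch, frames, d_model)
        return self.encoder(self.proj(x))


class SiameseNet(nn.Module):
    """Applies one encoder to two views, so both branches share weights."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, view_a, view_b):
        return self.encoder(view_a), self.encoder(view_b)


def similarity_loss(z_a, z_b):
    # Negative mean cosine similarity between corresponding frames
    # (an assumed objective, chosen only to keep the sketch short).
    return -F.cosine_similarity(z_a, z_b, dim=-1).mean()


torch.manual_seed(0)
model = SiameseNet(FrameEncoder())

# Two noisy views of the same unlabeled clip: 2 videos, 50 frames, 64-dim features.
video = torch.randn(2, 50, 64)
view_a = video + 0.1 * torch.randn_like(video)
view_b = video + 0.1 * torch.randn_like(video)

z_a, z_b = model(view_a, view_b)
loss = similarity_loss(z_a, z_b)
loss.backward()  # no action labels are involved at any point
```

Note that a plain similarity objective like this one can collapse to a constant embedding; practical self-supervised methods typically add a mechanism such as negative pairs or a stop-gradient to prevent that.
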
All the work is done using PyTorch as the deep learning framework. There are several reasons
for this choice, including array-based programming, automatic differentiation to automate the calculation of derivatives, an open-source ecosystem, and
strong libraries such as ’torchvision’ and ’torc
Advances in automatic action segmentation in untrimmed videos would benefit the understanding and contextualization of video content.
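
To make the task concrete: a segmentation model typically predicts one action label per frame, and consecutive frames with the same label are then grouped into segments. The sketch below shows that grouping step in plain Python; the function name `frames_to_segments` and the action labels are hypothetical and used only for illustration.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Group consecutive identical frame labels into (label, start, end) segments.

    `end` is exclusive, so a segment covers frames start .. end - 1.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical frame-wise predictions for a six-frame clip.
labels = ["pour", "pour", "pour", "stir", "stir", "pour"]
segments = frames_to_segments(labels)
# segments == [("pour", 0, 3), ("stir", 3, 5), ("pour", 5, 6)]
```

The same action can appear in several separate segments, as "pour" does here, which is one reason varying durations and ambiguous boundaries make the task difficult.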

The work is under the scope of the following projects:

  • GREAT: Beyond Graph Neural Networks: Joint graph topology learning and graph-based inference for computer vision (web)