IRI - AWESOME: Connecting past, present and future: Acquiring and leveraging prior knowledge to unveil the temporal structure of untrimmed videos

Research Project

AWESOME: Connecting past, present and future: Acquiring and leveraging prior knowledge to unveil the temporal structure of untrimmed videos

Type

National Project

Start Date

01/09/2024

End Date

31/08/2027

Project Code

PID2023-151351NB-I00

Staff

Dimiccoli, Mariella

Principal Investigator

Chen, Zhijin

PhD Student

Bueno Benito, Elena Belén

PhD Student

Project Description

Project PID2023-151351NB-I00 funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU

Temporal sequence encoding (TSE), the task of representing temporal sequences, is crucial for perception, action and learning, and is considered a hallmark of human cognitive abilities. A fundamental aspect of TSE is that we segment continuous temporal observations into discrete units, called events, and use the temporal organization of events accumulated in memory over a lifetime as prior knowledge to understand the present in the context of the past and to make inferences about the future. Consequently, performing TSE correctly requires a very high level of temporal abstraction, that is, the ability to represent and reason about events, actions and changes at different levels of granularity and duration. Currently, this represents a major challenge for AI-based systems.

The proposed methodological developments will be put to the test on real-world data in the challenging computer vision tasks of temporal action segmentation, localization, and anticipation in untrimmed videos, which require a high level of temporal abstraction and would largely benefit from the use of prior knowledge about the unfolding of events. Given that language, together with vision, is a fundamental modality through which human beings acquire knowledge about the world to guide their future actions, we will leverage both modalities to acquire prior knowledge.

The work plan of the project is structured around the following specific objectives:
1) develop a multiscale representation of videos able to capture temporal relations at different levels of granularity
2) learn from vision and language multiscale prior temporal knowledge about the unfolding of events
3) develop a framework for distilling prior knowledge learned from vision and language into hierarchical video representations at multiple scales

The application tasks will be developed in step with the methodology, feeding back into these three objectives. The research pursued by this project will constitute a significant theoretical advance in understanding how to make temporal sequence encoding predictive and inferential, so far a totally unexplored field. It will provide a novel set of advanced methodological and operational tools for hierarchical representation of temporal sequences, action segmentation, localization and anticipation that will open the door to a new generation of applications with high social impact in various fields.

Project Publications

Conference Publications

E.B. Bueno Benito and M. Dimiccoli. 2by2: weakly-supervised learning for global action segmentation, 27th International Conference on Pattern Recognition, 2024, Kolkata, in Pattern Recognition, Vol. 15315 of Lecture Notes in Computer Science, pp. 380-395, 2024, Cham.
