Research Project
AWESOME: Connecting past, present and future: Acquiring and leveraging prior knowledge to unveil the temporal structure of untrimmed videos
Type
National Project
Start Date
01/09/2024
End Date
31/08/2027
Project Code
PID2023-151351NB-I00
Staff
- Bueno Benito, Elena Belén (PhD Student)
- Blázquez, Xabier (Master Student)
- Maté, Alberto (Support)
Project Description
Project PID2023-151351NB-I00 funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU
Temporal sequence encoding (TSE), the task of representing temporal sequences, is crucial for perception, action and learning, and is considered a hallmark of human cognitive abilities. A fundamental aspect of TSE is that we segment continuous temporal observations into discrete units, called events, and we use the temporal organization of events in memory, accumulated over a lifetime, as prior knowledge to understand the present in the context of the past and to make inferences about the future. Consequently, performing TSE correctly requires a very high level of temporal abstraction, that is, the ability to represent and reason about events, actions and changes at different levels of granularity and duration. This currently represents a major challenge for AI-based systems.
The successor representation (SR) was originally introduced in the context of reinforcement learning as a representation that generalizes across states by the similarity of their successor states. Very recently, several works in both Machine Learning and Cognitive Science have independently pointed to SR as a natural substrate for discovering and using temporal abstractions. Supported by these results, as well as by findings in Cognitive Science indicating that TSE is predictive and inferential, the goal of this project is to define a novel framework for TSE that exhibits the temporal abstraction properties of SR and is also able to acquire prior temporal knowledge and leverage it at inference time.
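For a fixed policy with transition matrix P and discount factor γ, the SR has the standard closed form M = (I − γP)⁻¹: row M[s] gives the expected discounted future occupancy of every state when starting from s. The sketch below (a toy transition matrix, not project code or data) illustrates the property the project builds on, namely that states with similar SR rows share similar futures, so distances between SR rows recover event-like temporal structure:

```python
import numpy as np

# Successor representation for a fixed transition matrix P:
# M = (I - gamma * P)^(-1). Row M[s] is the expected discounted
# future occupancy of every state starting from s.
def successor_representation(P, gamma=0.9):
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)

# Toy chain with two "events": states 0-1 mix among themselves and
# slowly leak into states 2-3, which then mix only with each other.
P = np.array([
    [0.5, 0.4, 0.1, 0.0],
    [0.4, 0.5, 0.1, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])
M = successor_representation(P, gamma=0.9)

# States inside the same event have nearly identical successor
# profiles, so distances on SR rows reveal the event boundary.
d_within = np.linalg.norm(M[0] - M[1])  # same event
d_across = np.linalg.norm(M[0] - M[2])  # different events
print(d_within < d_across)  # True: 0 and 1 are closer than 0 and 2
```

Here the two blocks of the transition matrix play the role of discrete events; the SR groups their states without any explicit labels, which is the sense in which it serves as a substrate for temporal abstraction.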
The proposed methodological developments will be put to the test on real-world data in the challenging computer vision tasks of temporal action segmentation, localization, and anticipation in untrimmed videos, tasks that require a high level of temporal abstraction and would benefit greatly from prior knowledge about the unfolding of events. Given that Language, together with Vision, is a fundamental modality through which human beings acquire knowledge about the world to guide their future actions, we will leverage both modalities to acquire prior knowledge.
The work plan of the project is structured around the following specific objectives:
1) develop a multi-scale representation of videos able to capture temporal relations at different levels of granularity
2) learn from vision & language a multi-scale prior temporal knowledge about the unfolding of events
3) develop a framework for distilling prior knowledge learned from vision & language into hierarchical video representations at multiple scales
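As a purely hypothetical illustration of objective 1, a multi-scale video representation can be sketched as a temporal feature pyramid: per-frame features are average-pooled over windows of increasing size, so each level summarizes the video at a coarser granularity. The function name, scales, and pooling choice below are illustrative assumptions, not the project's actual architecture:

```python
import numpy as np

# Hypothetical sketch: build a multi-scale temporal representation by
# average-pooling frame features over windows of increasing size.
# Each level of the pyramid describes the video at a coarser
# temporal granularity.
def multiscale_pooling(features, scales=(1, 2, 4)):
    # features: (T, D) array of per-frame features
    T, D = features.shape
    pyramid = []
    for s in scales:
        n = T // s  # number of windows of length s
        pooled = features[: n * s].reshape(n, s, D).mean(axis=1)
        pyramid.append(pooled)  # (T // s, D) summary at stride s
    return pyramid

frames = np.random.rand(16, 8)     # toy video: 16 frames, 8-dim features
pyramid = multiscale_pooling(frames)
print([p.shape for p in pyramid])  # [(16, 8), (8, 8), (4, 8)]
```

Coarser levels of such a pyramid are natural hosts for prior knowledge about how events unfold, while finer levels retain the detail needed for frame-accurate segmentation.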
The application tasks will be developed in step with the methodology, providing feedback that informs these three objectives. The research pursued by this project will constitute a significant theoretical advance in understanding how to make temporal sequence encoding predictive and inferential, a direction so far largely unexplored. It will provide a novel set of advanced methodological and operational tools for hierarchical representation of temporal sequences, action segmentation, localization and anticipation, opening the door to a new generation of high-social-impact applications in various fields.