PhD Thesis

Human Intention Learning with Large Concept and Vision Models: A Multimodal Framework for Collaborative Robotics


Information

  • Started: 01/01/2025

Description

This thesis proposes a conceptual framework for learning and detecting human intention, drawing on the taxonomy presented in "The Human Intention. A taxonomy attempt and its applications to robotics". The approach integrates Large Concept Models (LCMs) and Large Vision Models (LVMs) to capture linguistic and contextual signals together with information derived from the visual environment. LCMs are conceived as models that represent semantic information at a level above words or tokens, thereby encompassing broader and more complex concepts. LVMs, in turn, infer contextual information from vision, improving the comprehension of the environmental and gestural factors that are pivotal for interpreting human intention.

The central focus of this study is the development of hybrid inference algorithms that combine deep learning with probabilistic reasoning, using multimodal strategies to unify verbal, nonverbal, and visual cues. The framework provides a robust architecture for detecting and anticipating intention in collaborative tasks, such as joint search and object transport, implemented on the ROS platform. By offering a more holistic view of human behaviour, this approach aims to advance safer, more reliable, and more empathetic robotic systems, while also prompting reflection on ethical considerations and future research directions in human-robot interaction.
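As an illustration of the hybrid inference idea, the sketch below shows a minimal Bayesian fusion of per-cue likelihoods over a discrete set of intention hypotheses. All intention labels, likelihood values, and function names are illustrative assumptions, not part of the thesis: in the envisioned framework, the likelihoods would come from LCM (verbal) and LVM (visual) outputs rather than hand-set numbers.

```python
# Hypothetical sketch: Bayesian fusion of verbal and visual cue likelihoods
# over a discrete set of intention hypotheses. Labels and numbers below are
# illustrative stand-ins for LCM/LVM outputs.

INTENTIONS = ["hand_over", "point_to", "search"]

def normalize(scores):
    """Rescale a score dict so its values sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

def bayes_update(prior, likelihood):
    """One Bayesian update: posterior is proportional to prior * likelihood."""
    return normalize({i: prior[i] * likelihood[i] for i in prior})

# Per-intention likelihoods for one observation (assumed model outputs).
verbal_likelihood = {"hand_over": 0.6, "point_to": 0.3, "search": 0.1}
visual_likelihood = {"hand_over": 0.5, "point_to": 0.1, "search": 0.4}

# Start from a uniform prior, then fuse each modality in turn.
prior = {i: 1.0 / len(INTENTIONS) for i in INTENTIONS}
posterior = bayes_update(prior, verbal_likelihood)      # fuse verbal cue
posterior = bayes_update(posterior, visual_likelihood)  # fuse visual cue

print(max(posterior, key=posterior.get))  # most probable intention
```

In a full system, each `bayes_update` call would run per observation step, letting the posterior sharpen as verbal and visual evidence accumulates during the collaborative task.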