Publication

Beyond static perception: Integrating temporal context into VLMs for cloth folding

Conference Article

Conference

ICRA Workshop on Representing and Manipulating Deformable Objects (RMDO)

Edition

2025

Pages

1-4

Doc link

https://deformable-workshop.github.io/icra2025/spotlight/01_04_13_Barbany_beyond.pdf

Abstract

Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.

Categories

artificial intelligence, computer vision.

Author keywords

VLM, Robotic Cloth Folding, Deformable Object Manipulation, LoRA Fine-Tuning, Temporal Consistency

Scientific reference

O. Barbany, A. Colomé and C. Torras. Beyond static perception: Integrating temporal context into VLMs for cloth folding, 2025 ICRA Workshop on Representing and Manipulating Deformable Objects, 2025, pp. 1-4.