SAGE: A Unified Framework for Generalizable Object State Recognition with State-Action Graph Embedding

Brown University      Samsung Electronics
arXiv (Coming soon) Code (Coming soon)

Abstract

Recognizing the physical states of objects and their transformations within videos is crucial for structured video understanding and for enabling robust real-world applications, such as robotic manipulation. However, pretrained vision-language models often struggle to capture these nuanced dynamics and their temporal context, and specialized object state recognition frameworks struggle to generalize to unseen actions or objects. We introduce SAGE (State-Action Graph Embeddings), a novel framework that offers a unified model of physical state transitions by decomposing states into fine-grained, language-described visual concepts that are shareable across different objects and actions. SAGE initially leverages Large Language Models to construct a State-Action Graph, which is then multimodally refined using Vision-Language Models. Extensive experimental results show that our method significantly outperforms existing baselines and generalizes effectively to unseen objects and actions in open-world settings. Our method improves the prior state-of-the-art by as much as 14.6% on novel state recognition with less than 5% of its inference time. Our code and data will be publicly released.

SAGE: State-Action Graph Embeddings


We introduce State-Action Graph Embeddings (SAGE), a novel framework for capturing the physical state transitions of objects in videos. SAGE leverages the knowledge of LLMs to identify a set of object states under an action and further decomposes them into fine-grained visual concepts. The visual concepts of each state are encoded and combined to serve as an informative object state embedding.
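As a minimal sketch (assuming a CLIP-style text encoder; the concept phrases and mean-pooling scheme below are illustrative placeholders, not the exact prompts or aggregation used by SAGE), each state embedding can be built by encoding its LLM-generated concept phrases and pooling them:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer

# Illustrative concept decomposition for states of "egg" under the action "frying".
# The actual concepts are produced by an LLM; these strings are placeholders.
concepts_by_state = {
    "raw":    ["translucent egg white", "intact yolk", "glossy liquid surface"],
    "frying": ["bubbling egg white", "partially opaque white", "oil sheen in pan"],
    "cooked": ["fully opaque white", "firm yolk", "browned edges"],
}

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def state_embedding(concepts):
    """Encode each concept phrase and pool the results into one state embedding."""
    tokens = tokenizer(concepts, padding=True, return_tensors="pt")
    with torch.no_grad():
        concept_emb = text_encoder(**tokens).pooler_output   # (num_concepts, d)
    concept_emb = F.normalize(concept_emb, dim=-1)
    return F.normalize(concept_emb.mean(dim=0), dim=-1)      # (d,)

state_embeddings = {s: state_embedding(c) for s, c in concepts_by_state.items()}
```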



With the help of the LLM, we construct a graph over the visual concepts, object states, and actions. This graph can also be refined by the VLMs after the first round of training. Because the visual concepts can be shared across different object states, they enable SAGE to generalize to novel objects and actions that are unseen during training.
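The sketch below shows one possible, simplified layout of such a State-Action Graph as nested dictionaries; all node names are placeholders rather than actual LLM output, and the VLM-based refinement (e.g., pruning concepts that match the video poorly after the first training round) is mentioned only as an assumption in the comments:

```python
# Minimal sketch of a State-Action Graph; every node name below is illustrative.
state_action_graph = {
    # action -> temporally ordered object states
    "actions": {
        "frying egg":     ["raw egg", "frying egg", "fried egg"],
        "melting butter": ["solid butter", "melting butter", "melted butter"],
    },
    # object state -> fine-grained visual concepts (concept nodes are shared)
    "states": {
        "raw egg":        ["translucent egg white", "glossy liquid surface"],
        "fried egg":      ["fully opaque white", "browned edges"],
        "melting butter": ["glossy liquid surface", "softened block"],
        "melted butter":  ["glossy liquid surface", "fully liquid pool"],
    },
}

# "glossy liquid surface" is shared by states of different objects and actions;
# this sharing is what lets learned concept representations transfer to objects
# and actions unseen during training. After the first training round, a VLM
# could score each concept against the video frames and prune poorly matching
# concept edges (refinement step, described here only as an assumption).
```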



Model Architecture


To recognize frame-wise object states in videos, we train a Transformer to produce contextualized visual embeddings for the video frames. The visual embeddings are then matched with the object state embeddings provided by SAGE. We perform Viterbi Decoding, which leverages the temporal constraints between the object states, to derive frame-level object state predictions.
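A minimal sketch of this decoding step, assuming per-frame log-scores obtained by matching the contextualized frame embeddings against the SAGE state embeddings; the transition matrix here simply forbids moving backward through the ordered states and is illustrative rather than the exact constraint set used in our model:

```python
import numpy as np

def viterbi_decode(frame_scores, transition):
    """
    frame_scores: (T, S) per-frame log-scores for each object state
    transition:   (S, S) log transition scores encoding temporal constraints
    Returns the most likely state index for each frame.
    """
    T, S = frame_scores.shape
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0] = frame_scores[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + transition      # (prev state, current state)
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + frame_scores[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Example: 3 ordered states; allow staying in place or moving forward, forbid going back.
S = 3
transition = np.full((S, S), -np.inf)
for i in range(S):
    transition[i, i] = 0.0
    if i + 1 < S:
        transition[i, i + 1] = 0.0

frame_scores = np.log(np.random.dirichlet(np.ones(S), size=10))  # dummy (T=10, S=3) scores
print(viterbi_decode(frame_scores, transition))
```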



Experiment Results


We evaluate SAGE and baseline methods in three settings: a) with action and object labels provided, b) with only the action label provided, and c) with no action or object labels provided. For a fair comparison, we use CLIP as the VLM encoder for all of LFC, VidOSC, MTC, and SAGE. Our method outperforms the baselines in all settings.



We further evaluate generalization to objects and actions that are unseen during training. SAGE shows superior generalizability compared to the baseline models.



When using VideoCLIP as the VLM encoder, SAGE achieves state-of-the-art performance on the object state recognition task. In particular, SAGE outperforms the previous state of the art in novel state recognition on ChangeIt (Open-world) by a large margin of 14.6%.



Qualitative Results


We show the visual concepts that our model aligns to the video frames. The blue and green concepts are correctly aligned to the frame, with the green ones shared between different object states; the red concepts are wrongly aligned. Our model aligns the video frames with most of their correct visual concepts, and this capability generalizes to novel objects and actions.



BibTeX

TBD