Vamos: Versatile Action Models for Video Understanding

Brown University      Honda Research Institute

Abstract

What makes good video representations for video understanding tasks such as anticipating future activities or answering video-conditioned questions? While earlier approaches focus on end-to-end learning directly from video pixels, we propose to revisit text-based representations, such as discrete action labels or free-form video captions, which are interpretable and can be directly consumed by large language models (LLMs). Intuitively, different video understanding tasks may require representations that are complementary and at different granularities. To this end, we propose versatile action models (Vamos), a learning framework powered by a large language model serving as the "reasoner", which can flexibly leverage visual embeddings, action labels, and free-form descriptions extracted from videos as its input. We evaluate Vamos on four complementary video understanding benchmarks, Ego4D, NeXT-QA, IntentQA, and EgoSchema, assessing its capability to model temporal dynamics, encode visual history, and perform reasoning. Surprisingly, we observe that text-based representations consistently achieve competitive performance on all benchmarks, and that visual embeddings provide marginal or no performance improvement, demonstrating the effectiveness of text-based video representations in the LLM era. We perform extensive ablation studies and qualitative analysis to support our observations, and achieve state-of-the-art performance on three benchmarks.

Vamos: Versatile Action Models

We introduce Vamos, a simple yet effective framework that uses LLMs to unify video dynamics modeling tasks, including comprehending historical content (video question answering, VQA) and predicting the future (long-term action anticipation, LTA). Vamos flexibly combines distributed visual embeddings with textual video representations, including discrete action labels and free-form video captions.
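As a rough illustration of this input composition, the sketch below shows one way the three input types could be assembled into a single LLM input sequence: visual features are projected into the LLM's token-embedding space and prepended to the embedded text (action labels, captions, and the task prompt). Names such as VisualProjector and build_llm_inputs are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Maps frozen video features into the LLM token-embedding space (hypothetical module)."""

    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (num_visual_tokens, vis_dim)
        return self.proj(vis_feats)


def build_llm_inputs(
    vis_feats: torch.Tensor,          # (Nv, vis_dim) visual embeddings, optional modality
    text_token_embeds: torch.Tensor,  # (Nt, llm_dim) embedded action labels / captions / question
    projector: VisualProjector,
) -> torch.Tensor:
    """Concatenate projected visual tokens with text token embeddings."""
    vis_tokens = projector(vis_feats)                 # (Nv, llm_dim)
    return torch.cat([vis_tokens, text_token_embeds], dim=0)


if __name__ == "__main__":
    vis_dim, llm_dim = 768, 4096
    projector = VisualProjector(vis_dim, llm_dim)
    vis = torch.randn(8, vis_dim)        # e.g. 8 pooled frame features
    txt = torch.randn(120, llm_dim)      # embedded captions plus the task prompt
    inputs = build_llm_inputs(vis, txt, projector)
    print(inputs.shape)                  # torch.Size([128, 4096])

Dropping the visual branch reduces the input to the purely text-based representation that the experiments find to be competitive on its own.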

Vamos Model

Lightweight Token Selector

To investigate the potential of compressing long input sequences and extracting the key elements of the generated textual representations for downstream video understanding tasks, we introduce a lightweight token selector as an add-on module of Vamos. It picks a single token from each video segment's textual representation, conditioned on the task sequence. Applied to k segments of the textual video representation, the selector yields k tokens that serve as a compact input to the LLM for downstream tasks, as sketched below.
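The following is a minimal sketch of such a selector, assuming a simple cross-attention scoring scheme with hard Gumbel-softmax selection; the exact selector design and the TokenSelector name here are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSelector(nn.Module):
    """Selects one token per caption segment, conditioned on the task tokens (illustrative sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # task (e.g. question) side
        self.key = nn.Linear(dim, dim)     # caption-token side

    def forward(self, segment_tokens: torch.Tensor, task_tokens: torch.Tensor) -> torch.Tensor:
        # segment_tokens: (k, n, dim)  k segments with n text tokens each
        # task_tokens:    (m, dim)     tokens of the task sequence
        q = self.query(task_tokens).mean(dim=0)                  # (dim,) pooled task query
        keys = self.key(segment_tokens)                          # (k, n, dim)
        scores = keys @ q / keys.shape[-1] ** 0.5                # (k, n) relevance of each token
        # Hard selection kept differentiable via the straight-through Gumbel-softmax trick.
        one_hot = F.gumbel_softmax(scores, tau=1.0, hard=True)   # (k, n) one selected token per segment
        return torch.einsum("kn,knd->kd", one_hot, segment_tokens)  # (k, dim) compact tokens

if __name__ == "__main__":
    k, n, m, dim = 4, 32, 16, 512
    selector = TokenSelector(dim)
    selected = selector(torch.randn(k, n, dim), torch.randn(m, dim))
    print(selected.shape)  # torch.Size([4, 512]) -> k compact tokens fed to the LLM

The selected k tokens replace the full caption sequence as LLM input, which is what makes the representation compact while remaining conditioned on the task.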

Vamos selector

Visualization of Vamos Prediction and Manual Intervention

Benchmark Results

We compare Vamos with other state-of-the-art models on the four benchmarks: EgoSchema, NeXT-QA, IntentQA, and Ego4D LTA.

Acknowledgements

This work is supported by Honda Research Institute USA and Samsung Advanced Institute of Technology. We would like to thank Apoorv Khandelwal, Calvin Luo, David Isele, Songpo Li, and Tian Yun for their generous feedback on this work. Our research was conducted using computational resources at the Center for Computation and Visualization, Brown University.

BibTeX

@misc{wang2023vamos,
  title={Vamos: Versatile Action Models for Video Understanding},
  author={Shijie Wang and Qi Zhao and Minh Quan Do and Nakul Agarwal and Kwonjoon Lee and Chen Sun},
  year={2023},
  eprint={2311.13627},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}