AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

Qi Zhao*1, Ce Zhang*1, Shijie Wang1, Changcheng Fu1, Nakul Agarwal2, Kwonjoon Lee2, Chen Sun1

1Brown University, 2Honda Research Institute

Abstract

Can we better anticipate an actor’s future actions (e.g. mix eggs) by knowing what commonly happens after the current action (e.g. crack eggs)? What if we also know the longer-term goal of the actor (e.g. making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor’s future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics, and a top-down approach that infers the goal of the actor and “plans” the procedure needed to accomplish the goal. We hypothesize that large language models (LLMs), which have been pretrained on procedure text data (e.g. recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge on likely next actions, and they can infer the goal given the observed part of a procedure. To leverage LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos, and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure via chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of our approach. AntGPT achieves state-of-the-art performance on all of these benchmarks, and our qualitative analysis shows that it can successfully infer goals and perform goal-conditioned “counterfactual” prediction.

AntGPT and its novel capabilities

Figure: An illustration of how AntGPT utilizes language models for LTA.
Figure: An illustration of AntGPT's top-down framework.

AntGPT is a vision-language framework that explores how to incorporate the emergent capabilities of large language models into video long-term action anticipation (LTA). Given a video observation of human actions, the LTA task is to anticipate the actor's future actions. To represent video information for LLMs, we use action recognition models to convert video observations into discrete action labels. These labels bridge visual information and language, allowing LLMs to perform downstream reasoning tasks. We first query the LLM to infer the goal behind the observed actions, then incorporate the goal information into a vision-only pipeline to test whether such goal-conditioned prediction is helpful. We also use the LLM to directly model the temporal dynamics of human activity, testing whether LLMs have built-in knowledge of action priors. Last but not least, we test the ability of LLMs to perform the prediction task in a few-shot setting, using popular prompting strategies such as chain-of-thought.
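As a rough sketch of this two-stage pipeline, the snippet below serializes recognized action segments into a text prompt and queries an LLM. The names recognize_actions and llm_complete are hypothetical stand-ins for the recognition model and the language model API, not the exact interfaces used in the paper.

from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (verb, noun) pair, e.g. ("crack", "egg")

def actions_to_text(actions: List[Action]) -> str:
    # Serialize recognized (verb, noun) segments into a comma-separated string.
    return ", ".join(f"{verb} {noun}" for verb, noun in actions)

def build_lta_prompt(observed: List[Action], num_future: int) -> str:
    # Top-down prompt: infer the goal first, then plan the future actions.
    return (
        f"Observed actions: {actions_to_text(observed)}.\n"
        f"First infer the actor's goal, then predict the next {num_future} "
        f"actions as 'verb noun' pairs, one per line."
    )

def anticipate(observed: List[Action], num_future: int,
               llm_complete: Callable[[str], str]) -> List[str]:
    # Stage 2: query the LLM and parse its answer into future action strings.
    answer = llm_complete(build_lta_prompt(observed, num_future))
    lines = [line.strip() for line in answer.splitlines() if line.strip()]
    return lines[-num_future:]  # keep the last num_future lines as predictions

# Usage: observed = recognize_actions(video)   # stage 1 (vision model), then
#        future = anticipate(observed, 20, llm_complete)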

AntGPT shows novel capabilities:

  • Predicting goals from few-shot observations: We observe that LLMs are very capable of inferring the actor's goal, even from imperfect recognized action sequences. In the prompt context, we demonstrate a few successful examples in which the correct actions and goals are given; in the query, we input the observed action sequence and ask the LLM to output the goal (see the prompt sketch after this list).
  • Augmenting vision frameworks with goal information: To test whether the inferred goals help the LTA task, we encode the goal information into textual features and incorporate them into a vision framework to perform "goal-conditioned" future prediction, observing improvements over the state of the art.
  • Modeling temporal action dynamics: We then explore whether LLMs can directly act as a reasoning backbone to model temporal action dynamics. To this end, we fine-tune LLMs on in-domain action sequences and observe that they bring additional improvement over a transformer trained from scratch.
  • Predicting future actions in a few-shot setting: We further explore how LLMs perform on the LTA task in a few-shot setting. With only a few examples demonstrated in the context, LLMs can still predict future action sequences. We also experiment with popular prompting strategies.
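The prompt sketch below illustrates the few-shot (in-context) goal-inference setup described above: a few (actions, goal) demonstrations are concatenated before the query. The example sequences and goals here are illustrative, not taken from Ego4D.

# Illustrative (actions -> goal) demonstrations for in-context learning.
EXAMPLES = [
    ("crack egg, whisk egg, heat pan, pour egg", "make scrambled eggs"),
    ("open hood, loosen bolt, remove filter, insert filter", "replace air filter"),
]

def build_goal_prompt(observed_actions: str) -> str:
    # Concatenate the demonstrations, then append the query with an empty goal.
    demos = [f"Observed actions: {a}\nGoal: {g}" for a, g in EXAMPLES]
    demos.append(f"Observed actions: {observed_actions}\nGoal:")
    return "\n\n".join(demos)

print(build_goal_prompt("turn on machine, read manual, tighten screw"))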
Key Research Questions We Explore

Figure: Examples of the fine-tuned model outputs.
Figure: Examples of in-context learning outputs.

In the course of our research, we investigate the following key questions about bridging LLMs with long-term action anticipation.

  • What is a good interface between videos and LLMs for the LTA task? We experiment with various preprocessing techniques and find that representing video segments as discrete action labels is an effective interface to the LLMs, allowing them to perform downstream reasoning from video observations.
  • Can LLMs infer goals, and are they helpful for top-down LTA? We conclude that LLMs can infer goals and that the goals are particularly helpful for goal-conditioned top-down LTA: in our experiments, the goal-conditioned framework consistently outperforms its vision-only counterparts (see the fusion sketch after this list).
  • Do LLMs capture prior knowledge about temporal dynamics that helps bottom-up LTA? We find that fine-tuned LLMs are better reasoning backbones than comparable transformers trained from scratch. Even with imperfect output structure and rough post-processing, LLMs still outperform their transformer peers.
  • Would knowing the goals affect the actions predicted by LLMs in a few-shot setting? We observe that all LLM-based methods perform much better than transformers under few-shot settings, especially on noun prediction, indicating the effectiveness of the prior knowledge encoded in LLMs for the LTA task.
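Below is a minimal sketch of how goal-conditioned fusion could look, assuming a frozen text encoder produces a goal embedding that is fused with per-segment visual features before a small transformer prediction head. All dimensions and vocabulary sizes are placeholders; the exact fusion architecture in AntGPT may differ.

import torch
import torch.nn as nn

class GoalConditionedHead(nn.Module):
    # Fuse a goal text embedding with per-segment visual features, then predict
    # future verb/noun logits with a small transformer. Sizes are placeholders.
    def __init__(self, vis_dim=768, goal_dim=512, d_model=512,
                 num_verbs=115, num_nouns=478, num_future=20):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.goal_proj = nn.Linear(goal_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(num_future, d_model))  # future slots
        self.verb_head = nn.Linear(d_model, num_verbs)
        self.noun_head = nn.Linear(d_model, num_nouns)

    def forward(self, vis_feats, goal_emb):
        # vis_feats: (B, T, vis_dim) segment features; goal_emb: (B, goal_dim),
        # e.g. a frozen text encoder applied to the inferred goal string.
        goal_tok = self.goal_proj(goal_emb).unsqueeze(1)               # (B, 1, d)
        tokens = torch.cat([goal_tok, self.vis_proj(vis_feats)], dim=1)
        queries = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        hidden = self.encoder(torch.cat([tokens, queries], dim=1))
        hidden = hidden[:, -self.queries.size(0):]                     # future slots only
        return self.verb_head(hidden), self.noun_head(hidden)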
Goal-conditioned counterfactual predictions

Figure: Examples of counterfactual predictions.

Since we conclude that the goals inferred by LLMs are useful and affect the LLM's predictions during in-context learning, an interesting qualitative experiment is to ask: if we provide an alternative goal instead of the actually inferred one, how is the LLM's output affected? We observe that LLMs do respond according to the given goal. For example, when we switch the inferred goal "fix machines" to "examine machines", the LLM predicts actions exclusively related to "examine machines", such as "read gauge" and "record data".
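This counterfactual probe can be expressed as a simple prompt edit: keep the observed actions fixed and substitute only the goal string. A minimal sketch under that assumption (the resulting prompts would be fed to the same LLM for comparison):

def counterfactual_prompt(observed: str, goal: str, num_future: int = 5) -> str:
    # Same observation, different goal: only the goal line changes.
    return (
        f"Goal: {goal}\n"
        f"Observed actions: {observed}\n"
        f"Predict the next {num_future} actions, one 'verb noun' pair per line."
    )

observed = "turn on machine, inspect panel, open valve"
for goal in ["fix machines", "examine machines"]:  # inferred vs. alternative goal
    print(counterfactual_prompt(observed, goal))
    # feed each prompt to the same LLM and compare the predicted actions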

BibTeX

@article{zhao2023antgpt,
  title={AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?},
  author={Qi Zhao and Ce Zhang and Shijie Wang and Changcheng Fu and Nakul Agarwal and Kwonjoon Lee and Chen Sun},
  journal={arXiv preprint arXiv:2307.16368},
  year={2023}
}