Spacewalk-18:
A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains

Zitian Tang^*, Rohan Myer Krishnan^*, Zhiqiu Yu, Chen Sun

Brown University
^*Indicates Equal Contribution

Abstract

Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do this, video understanding models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize these understandings to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to (1) generalize to novel domains and (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to a significant performance improvement without model fine-tuning.

Spacewalk-18 Dataset

annotation

In spacewalk missions, astronauts leave the space station to perform various tasks (e.g., equipment repairs). While a spacewalk mission often lasts a few hours, it follows a fairly rigid agenda, which is reviewed step by step in a short animated video. For Spacewalk-18, we collected 18 spacewalk recordings from the web. As illustrated above, the steps of each spacewalk mission are manually localized in the videos. In the following video, we showcase a few annotated steps, each including its animation and several corresponding clips from the spacewalk videos.

Statistics

Spacewalk-18 includes 18 spacewalk videos from 2019 to 2023 with a total length of 96 hours. There are 455 step labels across the dataset. On average, each spacewalk mission consists of 25 steps. More statistics are in the following figures.

Distributions of (top) the duration of video clips, (middle) the duration of steps, and (bottom) the number of clips per step. A clip is a contiguous video segment that corresponds to a single step.
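The three quantities in the caption above can be derived from clip-level annotations. The following is a minimal sketch, assuming a hypothetical annotation format in which each clip is a `(step_id, start_sec, end_sec)` tuple and a step's duration is the sum of its clips' durations; the released Spacewalk-18 files may be organized differently.

```python
from collections import defaultdict


def clip_statistics(clips):
    """Summarize clips given as (step_id, start_sec, end_sec) tuples.

    The tuple layout is a hypothetical format used for illustration only.
    Returns (clip_durations, step_durations, clips_per_step).
    """
    clip_durations = []                 # one entry per clip, in input order
    step_durations = defaultdict(float)  # step_id -> total seconds over its clips
    clips_per_step = defaultdict(int)    # step_id -> number of clips
    for step_id, start, end in clips:
        if end <= start:
            raise ValueError(f"clip for step {step_id} has non-positive length")
        duration = end - start
        clip_durations.append(duration)
        step_durations[step_id] += duration
        clips_per_step[step_id] += 1
    return clip_durations, dict(step_durations), dict(clips_per_step)


# Made-up example: step 1 is interrupted by step 2, so it has two clips.
example_clips = [(1, 0.0, 120.0), (2, 120.0, 300.0), (1, 300.0, 360.0)]
durations, per_step, counts = clip_statistics(example_clips)
```

In this toy example, step 1 is split into two clips because step 2 interrupts it, which is the situation the "clips per step" panel measures.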

Distribution of objects and actions. The step labels contain 51 distinct objects and 47 atomic actions. This figure shows how frequently each object and action occurs across all steps.

Tasks

We propose two long-form video understanding tasks on Spacewalk-18: step recognition and video question answering.

Step Recognition

In step recognition, a model is given a spacewalk video clip and must recognize the step taking place at the middle of the clip.
Examples:

Step label: EV1 & EV2 exit airlock.

Step label: Bob installs adapter plate in open slot 2.
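Because the target is the step at the middle of the query clip, the ground-truth label can be looked up from the step annotations by the clip's midpoint. The sketch below illustrates one way to do this, again assuming a hypothetical `(step_id, start_sec, end_sec)` annotation format; it is not the benchmark's official tooling.

```python
def step_at_midpoint(query_start, query_end, clips):
    """Return the step_id whose annotated clip covers the query midpoint.

    `clips` is a list of (step_id, start_sec, end_sec) tuples, a hypothetical
    format for illustration. Returns None if the midpoint is unannotated.
    """
    if query_end <= query_start:
        raise ValueError("query clip must have positive length")
    midpoint = (query_start + query_end) / 2.0
    for step_id, start, end in clips:
        # Half-open interval so adjacent clips never both match.
        if start <= midpoint < end:
            return step_id
    return None


# Made-up annotations: step 1 is interrupted by step 2.
annotations = [(1, 0.0, 120.0), (2, 120.0, 300.0), (1, 300.0, 360.0)]
label = step_at_midpoint(100.0, 200.0, annotations)  # midpoint is 150 s
```

Here the query spans two steps, but only the step covering the 150-second midpoint (step 2) counts as the label.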

Video Question Answering

Video question answering includes 376 multiple-choice questions with hour-long query videos. Some questions are automatically generated from the step annotations; the others are manually annotated.
Examples:
(Since each video is an hour long, we show only a few key clips here.)

0th minute

6th minute

18th minute

42nd minute

60th minute

Question: Which of the following tasks happens after the astronaut leaves from the robotic arm?
(A) Robotic arm takes Luca to AMS
(B) Luca & Drew connect power and data cables
(C) Drew move to ELC 2
(D) Drew hands Luca the pump system

0th minute

18th minute

36th minute

48th minute

60th minute

Question: In which part of the video does the task that EV1 & EV2 install respective bags on worksites happen?
(A) The task does not happen in the video
(B) The first third of the video
(C) The last third of the video
(D) The middle third of the video
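Multiple-choice questions like the two above are naturally scored by accuracy. The following is a minimal sketch of such a scorer, assuming predictions and ground-truth answers are given as option letters; the function name and input format are illustrative and not taken from the benchmark's code. For four-option questions like these examples, uniform random guessing would score 25% in expectation.

```python
def multiple_choice_accuracy(predictions, answers):
    """Fraction of questions where the predicted option letter is correct.

    Both arguments are equal-length sequences of option letters such as "A".
    Comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Made-up predictions for four questions; three match the answer key.
accuracy = multiple_choice_accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])
```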

Evaluation Results

We evaluate several vision-language models and multimodal large language models. All of them perform poorly on both tasks, and significantly worse than humans.

Evaluation on the step recognition task

Evaluation on the video question answering task

BibTeX

@misc{krishnan2023spacewalk18,
  title={Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains},
  author={Zitian Tang and Rohan Myer Krishnan and Zhiqiu Yu and Chen Sun},
  year={2023},
  eprint={2311.18773},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}