Learning from (procedural) videos has increasingly served as a pathway for embodied agents to acquire skills from human demonstrations. To do so, video understanding models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into a sequence of actions and skills, and to generalize these understandings to novel environments, tasks, and problem domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) video question answering, over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to (1) generalize to novel domains and (2) utilize long temporal context and multimodal (e.g., visual and speech) information. Our extensive experimental analysis highlights the challenges of Spacewalk-18, but also suggests best practices for domain generalization and long-form understanding. Notably, we discover a promising adaptation-via-summarization technique that leads to significant performance improvements without model fine-tuning.
In spacewalk missions, astronauts leave the space station to perform various tasks (e.g., equipment repairs). Each spacewalk mission typically lasts a few hours but follows a fairly rigid agenda, which is reviewed step by step in a short animated video. For Spacewalk-18, we collected 18 spacewalk recordings from the web. As illustrated above, the steps of each spacewalk mission are manually localized in the videos. In the following video, we showcase a few annotated steps, each including its animation and several corresponding clips from the spacewalk videos.
Spacewalk-18 includes 18 spacewalk videos recorded between 2019 and 2023, with a total length of 96 hours. The dataset contains 455 step labels; on average, each spacewalk mission consists of 25 steps. More statistics are shown in the following figures.
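To make the annotations concrete, below is a minimal sketch of how the step annotations could be organized and how the statistics above might be computed. The record fields and layout are hypothetical placeholders, not the released annotation format.

```python
# Minimal sketch: aggregate statistics over hypothetical step-annotation records.
# The field names ("video_id", "start", "end", "description") are illustrative
# placeholders and do not reflect the released annotation schema.
from collections import defaultdict

annotations = [
    # {"video_id": "spacewalk_2019_xx", "start": 1234.0, "end": 1890.5,
    #  "description": "EV1 moves to the worksite"},
    # ... one record per localized step (455 in total across 18 videos)
]

steps_per_video = defaultdict(int)
for ann in annotations:
    steps_per_video[ann["video_id"]] += 1

if steps_per_video:
    avg_steps = sum(steps_per_video.values()) / len(steps_per_video)
    print(f"{len(annotations)} steps across {len(steps_per_video)} videos "
          f"(~{avg_steps:.1f} steps per mission)")
```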
We propose two long-form video understanding tasks on Spacewalk-18: step recognition and video question answering.
In step recognition, a model is given a spacewalk video clip and must recognize the step being performed at the middle timestamp of the clip.
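As a rough illustration of the evaluation protocol, the sketch below scores a model's predictions against the ground-truth step at each clip's midpoint. The `model.predict_step` interface and the clip dictionary fields are hypothetical, not the benchmark's released API.

```python
# Minimal sketch of a step-recognition evaluation loop. The model interface
# (`predict_step`) and the clip fields are hypothetical placeholders.
def evaluate_step_recognition(model, clips):
    """clips: iterable of dicts holding sampled frames, an optional speech
    transcript, the candidate step descriptions of the mission, and the
    ground-truth step index at the clip's middle timestamp ("label")."""
    correct, total = 0, 0
    for clip in clips:
        pred = model.predict_step(
            frames=clip["frames"],                 # sampled video frames
            transcript=clip["transcript"],         # speech/text context, if used
            candidates=clip["step_descriptions"],  # textual step descriptions
        )
        correct += int(pred == clip["label"])
        total += 1
    return correct / max(total, 1)  # accuracy over all clips
```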
Examples:
Video question answering includes 376 multiple-choice questions paired with hour-long query videos. The questions are either automatically generated from the step annotations or manually written.
Examples:
(Since each query video is an hour long, we only show a few key clips here.)
Question: Which of the following tasks happens after the astronaut leaves from the robotic arm?
(A) Robotic arm takes Luca to AMS
(B) Luca & Drew connect power and data cables
(C) Drew move to ELC 2
(D) Drew hands Luca the pump system
Question: In which part of the video does the task that EV1 & EV2 install respective bags on worksites happen?
(A) The task does not happen in the video
(B) The first third of the video
(C) The last third of the video
(D) The middle third of the video
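To make the question format concrete, here is a minimal sketch of how a multiple-choice question like the ones above might be represented and scored. The schema, the answer index, and the `answer_question` model interface are illustrative assumptions rather than the released format.

```python
# Minimal sketch of a multiple-choice VideoQA record and an accuracy metric.
# The schema and the model interface are hypothetical; the "answer" index
# below is an illustrative placeholder, not the dataset's ground truth.
question = {
    "video_id": "example_spacewalk",
    "question": "Which of the following tasks happens after the astronaut "
                "leaves from the robotic arm?",
    "options": [
        "Robotic arm takes Luca to AMS",
        "Luca & Drew connect power and data cables",
        "Drew move to ELC 2",
        "Drew hands Luca the pump system",
    ],
    "answer": 0,  # placeholder index of the correct option
}

def qa_accuracy(model, questions):
    """Fraction of questions for which the model selects the correct option."""
    hits = [
        model.answer_question(q["video_id"], q["question"], q["options"]) == q["answer"]
        for q in questions
    ]
    return sum(hits) / max(len(hits), 1)
```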
We evaluate several vision-language models and multimodal large language models. All of them perform poorly on our tasks, and significantly worse than humans.
@misc{krishnan2023spacewalk18,
      title={Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains},
      author={Zitian Tang and Rohan Myer Krishnan and Zhiqiu Yu and Chen Sun},
      year={2023},
      eprint={2311.18773},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}