Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding

Brown University
*Indicates Equal Contribution


Learning from videos is an emerging research area that enables robots to acquire skills from human demonstrations, such as procedural videos. To do this, video-language models must be able to acquire structured understanding, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize that understanding to novel domains. In pursuit of this goal, we introduce Spacewalk-18, a benchmark containing two tasks: (1) step recognition and (2) intra-video retrieval over a dataset of temporally segmented and labeled tasks in International Space Station spacewalk recordings. In tandem, the two tasks quantify a model's ability to make use of: (1) out-of-domain visual information; (2) a long temporal context window; and (3) multiple modalities (e.g., vision and speech). This departs from existing benchmarks for procedural video understanding, which typically deal with short context lengths and can be solved with a single modality. Spacewalk-18, with its inherent multimodal and long-form complexity, exposes how difficult task recognition and segmentation are. We find that state-of-the-art methods perform poorly on our benchmark, but improvements can be obtained by incorporating information from longer-range temporal context across different modalities. Our experiments underscore the need to develop new approaches to these tasks.

Spacewalk-18 Dataset



Spacewalk-18 annotates 18 spacewalk videos recorded between 2019 and 2023, with a total length of 96 hours. On average, each spacewalk consists of 25 steps, and each step spans about 12 minutes of video.


Two multimodal long-form video understanding tasks are defined on Spacewalk-18: step recognition and intra-video retrieval.

Step recognition. Given a timestamp and a context window length, the goal is to identify the task step to which the timestamp belongs.
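To make the input and target of step recognition concrete, here is a minimal Python sketch of how a ground-truth label could be looked up from temporal segment annotations. The `Segment` record, the helper names, and the choice to center the context window on the query timestamp are illustrative assumptions, not the benchmark's actual data format or windowing rule.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical annotation record: one labeled step interval, in seconds."""
    start: float
    end: float
    step: int


def step_at(segments: List[Segment], t: float) -> Optional[int]:
    """Return the step whose interval [start, end) contains t, else None.

    Assumes `segments` is sorted by start time and non-overlapping.
    """
    starts = [s.start for s in segments]
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < segments[i].end:
        return segments[i].step
    return None


def context_window(t: float, length: float, duration: float) -> Tuple[float, float]:
    """Clip a window of `length` seconds, centered on t (an assumption), to the video."""
    half = length / 2
    return max(0.0, t - half), min(duration, t + half)


# Toy annotations for one video (made-up numbers, not taken from the dataset).
segments = [Segment(0, 600, 1), Segment(600, 1500, 2), Segment(1500, 2100, 3)]
label = step_at(segments, 700)              # ground-truth step for the query
window = context_window(700, 300, 2100)     # span of video the model may see
```

A model would receive the frames and speech inside `window` and be scored on whether its prediction matches `label`.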

Intra-video retrieval. Given a query timestamp, two candidate timestamps at the same temporal distance from the query, and a context window length, the goal is to determine which candidate belongs to the same task step as the query.
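The retrieval setup can be sketched in a few lines of Python. Two distinct timestamps at the same distance from a query necessarily lie one before and one after it, so the sketch places the candidates at `query - distance` and `query + distance`. The tuple-based segment format, the function names, and the rule of skipping queries where both or neither candidate shares the query's step are assumptions made for illustration, not the benchmark's documented sampling procedure.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical annotation format: (start_sec, end_sec, step_id).
Segment = Tuple[float, float, int]


def step_at(segments: Sequence[Segment], t: float) -> Optional[int]:
    """Return the step whose interval [start, end) contains t, else None."""
    for start, end, step in segments:
        if start <= t < end:
            return step
    return None


def retrieval_label(segments: Sequence[Segment], query: float,
                    distance: float) -> Optional[int]:
    """Return 0 if the earlier candidate shares the query's step, 1 if the later one does.

    Returns None when the sample is ambiguous (both or neither candidate
    matches) or the query itself is unlabeled.
    """
    query_step = step_at(segments, query)
    if query_step is None:
        return None
    candidates = (query - distance, query + distance)
    matches = [step_at(segments, c) == query_step for c in candidates]
    if sum(matches) != 1:
        return None
    return matches.index(True)


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Fraction of samples where the chosen candidate is the correct one."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy annotations for one video (made-up numbers, not taken from the dataset).
segments = [(0, 600, 1), (600, 1500, 2), (1500, 2100, 3)]
label = retrieval_label(segments, query=700, distance=200)
```

Because each sample has exactly two candidates, random guessing scores 50% under this formulation, which gives a natural floor for interpreting model accuracy.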


Evaluation Results

We evaluate several pretrained models in both zero-shot and fine-tuned settings. All of them perform poorly on both tasks, significantly worse than humans.


        title={Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains}, 
        author={Rohan Myer Krishnan and Zitian Tang and Zhiqiu Yu and Chen Sun},