How multimodal large language models (MLLMs) perceive the visual world remains a mystery. At one extreme, object and relation modeling may be implemented implicitly through inductive biases, for example by treating objects as tokens. At the other extreme, empirical results reveal the surprising finding that simply performing visual captioning, which tends to ignore the spatial configuration of objects, serves as a strong baseline for video understanding. We aim to answer the question: how can objects help video-language understanding in MLLMs? We tackle the question from the object representation and adaptation perspectives. Specifically, we investigate the trade-off between representation expressiveness (e.g., distributed versus symbolic) and integration difficulty (e.g., data efficiency when learning the adapters). Through extensive evaluations on five video question answering datasets, we confirm that explicit integration of object-centric representations remains necessary, and that symbolic objects are the easiest to integrate while remaining performant for question answering. We hope our findings encourage the community to explore the explicit integration of perception modules into MLLM design.
In the era of MLLMs, anything that can be tokenized has the potential to serve as a valid representation. Along this spectrum of representations are two extremes: approaches that project arbitrary distributed representations into the input space of LLMs, and approaches that model the visual world as interpretable concepts or captions. It is open to debate whether either can effectively convey the complexity of the visual world to an LLM reasoner. For example, video captions may struggle to describe the spatial and temporal configurations of objects. We hypothesize that explicit object-centric recognition and modeling remains essential to the success of MLLMs. We seek to answer the question of how objects can help video-language understanding in MLLMs from two perspectives: representation and adaptation.
We propose ObjectMLLM, a multimodal framework integrating distributed visual embeddings, video frame captions, and object bounding boxes into one MLLM.
With external object detection and tracking models, we capture object bounding boxes from videos.
Then we feed the object labels, timestamps, and bounding boxes to the model. The model backbone can be either an LLM (e.g., LLaMA3) or MLLM (e.g., VideoLLaMA2).
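A minimal sketch of this extraction step is shown below. The detect_and_track callable is a hypothetical stand-in for the external detection and tracking models, and the BoxObservation record layout (label, timestamp, box) mirrors the inputs described above rather than the paper's exact interface.

    from dataclasses import dataclass
    from typing import Dict, List, Tuple

    @dataclass
    class BoxObservation:
        label: str                               # object category from the detector
        timestamp: float                         # frame time in seconds
        box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels

    def extract_object_tracks(video_frames, fps, detect_and_track) -> List[List[BoxObservation]]:
        """Group per-frame detections into one track per object instance."""
        tracks: Dict[int, List[BoxObservation]] = {}
        for frame_idx, frame in enumerate(video_frames):
            # detect_and_track(frame) is assumed to yield (track_id, label, box) tuples
            for track_id, label, box in detect_and_track(frame):
                tracks.setdefault(track_id, []).append(
                    BoxObservation(label=label, timestamp=frame_idx / fps, box=box)
                )
        return list(tracks.values())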
We explore two choices of bounding box adapters.
The language-based representation naturally expresses the bounding boxes as text, while the embedding projector learns a projection from the bounding box space to the LLM input space.
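The two adapter choices can be sketched as follows, reusing the BoxObservation records from the sketch above. Both the text template and the two-layer MLP projector are illustrative assumptions rather than the paper's exact design.

    import torch
    import torch.nn as nn

    def boxes_to_text(track, frame_w, frame_h) -> str:
        """Language-based adapter: render one object track as plain text.
        Coordinates are normalized to integers in [0, 100]; the template is
        an assumed format, not the paper's verbatim prompt."""
        parts = []
        for obs in track:
            x1, y1, x2, y2 = obs.box
            coords = [round(100 * x1 / frame_w), round(100 * y1 / frame_h),
                      round(100 * x2 / frame_w), round(100 * y2 / frame_h)]
            parts.append(f"{obs.label} at {obs.timestamp:.1f}s: {coords}")
        return "; ".join(parts)

    class BoxProjector(nn.Module):
        """Embedding projector: map each (x1, y1, x2, y2) box to one token
        embedding in the LLM input space. The two-layer MLP is an
        illustrative choice."""
        def __init__(self, llm_dim: int, hidden_dim: int = 256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(4, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, llm_dim),
            )

        def forward(self, boxes: torch.Tensor) -> torch.Tensor:
            # boxes: (num_boxes, 4) normalized coordinates -> (num_boxes, llm_dim)
            return self.mlp(boxes)

The language-based route adds no trainable parameters and lets the LLM reuse what it already knows about numbers in text, whereas the projector introduces new weights that must be learned from the downstream data.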
Language-based representation adapts bounding boxes better. We evaluate the two choices of object bounding box adapter across varying amounts of training data. The language-based representation consistently outperforms the embedding projector, indicating its effectiveness and data efficiency.
Object bounding boxes improve spatio-temporal understanding. We evaluate ObjectMLLM on five Video QA benchmarks with different combinations of representations. The object bounding boxes especially improve the model performance on CLEVRER-MC, Perception Test, and STAR, which have questions related to spatio-temporal object configurations.
Main results. ObjectMLLM outperforms previous works by a large margin on CLEVRER-MC and Perception Test, and achieves comparable performance on the other benchmarks.
Qualitative examples.
Question: How many moving metal objects are there?
(A) 2
(B) 1
(C) 3
(D) 4 ✓
Question: What happened once the person removed an object from the tabletop?
(A) The launched object fell off the table.
(B) The launched object did not fall off the table. ✓
(C) No object was removed from the tabletop.
Question: Is the camera moving or static?
(A) Moving. ✓
(B) Static or shaking.
(C) I don't know.
Question: Is the configuration of objects likely to be stable after placing the last object?
(A) One cannot judge the stability of this configuration.
(B) The configuration is likely to be stable. ✓
(C) The configuration is likely to be unstable.
Question: What object does the person use to hit other objects?
(A) pen
(B) fork
(C) spoon ✓
Question: Is there something unusual about the way the person ties the shoe laces?
(A) The person ties correctly the left shoe lace, but not the right shoe lace.
(B) The person ties the shoe laces normally. ❌
(C) The person ties the lace of the left shoe to the lace of the right shoe.
Question: Why is the man kneeling down on the floor?
(A) Feed the dog. ❌
(B) Crawling around.
(C) Let kids walk through.
(D) Fell down.
(E) Pet the dog.
@misc{tang2025objectmllm,
      title={How Can Objects Help Video-Language Understanding?},
      author={Zitian Tang and Shijie Wang and Junho Cho and Jaewook Yoo and Chen Sun},
      year={2025},
      eprint={2504.07454},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}