Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

Published in EMNLP 2025 Main Conference, 2025

Video large language models (VideoLLMs) excel at video understanding but face efficiency challenges because attention cost grows quadratically with the large number of visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: existing methods (i) overlook distinctive visual signals across frames, leading to information loss; and (ii) suffer from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework, “Video Compression Commander” (VidCom2).
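
To make the efficiency argument concrete, the sketch below counts the query-key pairs that full self-attention scores, and shows how that count shrinks when only a fraction of the tokens is kept. It is a back-of-the-envelope illustration, not the VidCom2 algorithm: the function names, the 32-frame by 196-token workload, and the retention ratios are all assumptions chosen for the example, and the count ignores text tokens and the linear-cost parts of the model.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def compressed_cost_ratio(num_tokens: int, retention: float) -> float:
    """Fraction of attention pairs left after keeping `retention` of the tokens."""
    kept = max(1, round(num_tokens * retention))
    return attention_pairs(kept) / attention_pairs(num_tokens)


# Hypothetical workload (an assumption, not from the paper):
# 32 frames with 196 visual tokens each.
frames, tokens_per_frame = 32, 196
total = frames * tokens_per_frame

for retention in (1.0, 0.5, 0.25):
    ratio = compressed_cost_ratio(total, retention)
    print(f"keep {retention:.0%} of tokens -> {ratio:.2%} of attention pairs")
```

Under these assumptions, keeping half of the visual tokens leaves a quarter of the attention pairs, and keeping a quarter leaves one sixteenth, which is why token compression is an attractive lever for inference cost.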

Links: Paper Code

Recommended citation: Xuyang Liu, Yiyu Wang, Junpeng Ma, and Linfeng Zhang. (2025). "Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models." Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

@inproceedings{liu2025vidcom2,
  title={Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models},
  author={Liu, Xuyang and Wang, Yiyu and Ma, Junpeng and Zhang, Linfeng},
  booktitle={Proceedings of the Conference on Empirical Methods in Natural Language Processing},
  year={2025}
}
