Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
Published in CVPR 2026, 2025
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the growing demand for high-resolution image and long-video understanding produces very large token counts, which reduces inference efficiency. We identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators. This paper presents the first approach that takes a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose V2Drop, which progressively removes the visual tokens with minimal variation during LVLM inference. It maintains 94.0% and 98.6% of the original performance on image and video understanding tasks, respectively, while reducing LLM generation latency by 31.5% and 74.2%.
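The core idea in the abstract, dropping the visual tokens whose representations change least, can be illustrated with a minimal sketch. This is not the V2Drop implementation. It assumes that "variation" means the L2 distance between a token's hidden state at two consecutive layers, and the function names, the `keep_ratio` parameter, and the list-of-lists data layout are all hypothetical.

```python
import math


def token_variation(prev_state, curr_state):
    """Assumed variation measure: L2 distance between two hidden states."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev_state, curr_state)))


def drop_low_variation_tokens(prev_hidden, curr_hidden, visual_positions, keep_ratio):
    """Return the sequence positions to keep after one dropping step.

    prev_hidden, curr_hidden: per-token hidden states (lists of float lists)
        from two consecutive layers, same length and token order.
    visual_positions: positions in the sequence that hold visual tokens.
    keep_ratio: fraction of visual tokens to keep (hypothetical parameter).
    """
    n_keep = max(1, round(keep_ratio * len(visual_positions)))

    # Rank visual tokens from highest to lowest variation.
    ranked = sorted(
        visual_positions,
        key=lambda i: token_variation(prev_hidden[i], curr_hidden[i]),
        reverse=True,
    )
    kept_visual = set(ranked[:n_keep])
    visual = set(visual_positions)

    # Non-visual (e.g. text) tokens are always kept; original order is preserved.
    return [
        i for i in range(len(curr_hidden))
        if i not in visual or i in kept_visual
    ]


# Toy example: positions 1-4 are visual tokens, 0 and 5 are text tokens.
prev = [[0.0, 0.0] for _ in range(6)]
curr = [[1.0, 1.0], [0.1, 0.0], [5.0, 0.0], [0.0, 0.2], [3.0, 0.0], [1.0, 1.0]]
kept = drop_low_variation_tokens(prev, curr, [1, 2, 3, 4], keep_ratio=0.5)
```

In this toy input, the visual tokens at positions 2 and 4 change the most, so they survive alongside the two text tokens. Applying such a step at several layers, each time on the already-reduced sequence, would correspond to the "progressive" removal the abstract describes. The choice of layers and ratios is left open here.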
@inproceedings{chen2026v2drop,
title={Variation-aware Vision Token Dropping for Faster Large Vision-Language Models},
author={Chen, Junjie and Liu, Xuyang and Wen, Zichen and Wang, Yiyu and Huang, Siteng and Chen, Honggang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}