Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Published in CVPR 2026, 2025

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. We identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose V2Drop, which progressively removes visual tokens with minimal variation during LVLM inference, maintaining 94.0% and 98.6% of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%.
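The core idea of dropping the visual tokens that change least can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how variation is measured or at which layers tokens are dropped. The sketch assumes variation is the L2 distance between a token's hidden state before and after a layer. The function name `drop_low_variation_tokens` and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np


def drop_low_variation_tokens(prev_hidden, curr_hidden, is_visual, keep_ratio):
    """Return sorted indices of the tokens to keep after one dropping step.

    prev_hidden, curr_hidden: arrays of shape (num_tokens, hidden_dim) holding
        each token's hidden state before and after a layer.
    is_visual: boolean array of shape (num_tokens,), True for visual tokens.
    keep_ratio: fraction of visual tokens to retain (hypothetical parameter).
    """
    is_visual = np.asarray(is_visual, dtype=bool)

    # Assumed variation metric: L2 norm of the per-token hidden-state change.
    variation = np.linalg.norm(curr_hidden - prev_hidden, axis=-1)

    visual_idx = np.flatnonzero(is_visual)
    text_idx = np.flatnonzero(~is_visual)

    # Keep the visual tokens with the largest variation; text tokens always stay.
    n_keep = max(1, int(round(keep_ratio * visual_idx.size)))
    order = np.argsort(variation[visual_idx])[::-1]
    kept_visual = visual_idx[order[:n_keep]]

    # Sort so the surviving tokens preserve their original sequence order.
    return np.sort(np.concatenate([text_idx, kept_visual]))
```

Calling this at several layers in turn, each time on the tokens that survived the previous call, gives a progressive schedule of the kind the abstract describes. Because the score here uses only hidden states and no attention weights, a step like this does not need access to attention maps; whether that matches the paper's exact design is an assumption of the sketch.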

Recommended citation: Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, and Honggang Chen. (2026). "Variation-aware Vision Token Dropping for Faster Large Vision-Language Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

BibTeX:
@inproceedings{chen2026v2drop,
  title={Variation-aware Vision Token Dropping for Faster Large Vision-Language Models},
  author={Chen, Junjie and Liu, Xuyang and Wen, Zichen and Wang, Yiyu and Huang, Siteng and Chen, Honggang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}