V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models

Published in ECCV 2026 (Under Review), 2026

Video large language models (VideoLLMs) show strong capability in video understanding, yet the cost of long-context inference is still dominated by the large number of redundant visual tokens processed in the prefill stage. We revisit token compression for VideoLLMs under a tight token budget and identify a key bottleneck: insufficient spatio-temporal information coverage. We propose V-CAST (Video Curvature-Aware Spatio-Temporal Pruning), a training-free, plug-and-play pruning policy for long-context video inference. V-CAST casts token compression as a trajectory approximation problem and introduces a curvature-guided temporal allocation module that routes per-frame token budgets to semantic turns and event boundaries. Extensive experiments demonstrate that V-CAST retains 98.6% of the original performance, outperforms the second-best method by +1.1% on average, and reduces peak memory and total latency to 86.7% and 86.4%, respectively, of those of vanilla Qwen3-VL-8B-Instruct.
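To make the idea of curvature-guided temporal allocation more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each frame is summarized by one feature vector, uses the turning angle between consecutive feature displacements as a stand-in for trajectory curvature, and splits a total token budget across frames in proportion to that angle. The function names (`turning_angles`, `allocate_budget`), the `floor` parameter, and the proportional-split rule are all hypothetical choices made for illustration; the abstract does not specify how V-CAST computes curvature or distributes tokens.

```python
import math


def turning_angles(frames):
    """Hypothetical curvature proxy: the angle (in radians) between the
    displacement into frame t and the displacement out of frame t.
    The first and last frames have no two-sided neighborhood and get 0."""
    n = len(frames)
    angles = [0.0] * n
    for t in range(1, n - 1):
        d_prev = [b - a for a, b in zip(frames[t - 1], frames[t])]
        d_next = [b - a for a, b in zip(frames[t], frames[t + 1])]
        norm = math.hypot(*d_prev) * math.hypot(*d_next)
        if norm > 0.0:
            cos = sum(p * q for p, q in zip(d_prev, d_next)) / norm
            # Clamp to guard against rounding slightly outside [-1, 1].
            angles[t] = math.acos(max(-1.0, min(1.0, cos)))
    return angles


def allocate_budget(frames, total_budget, floor=1):
    """Assumed allocation rule: every frame keeps `floor` tokens, and the
    remaining tokens are shared in proportion to the turning angle."""
    n = len(frames)
    if total_budget < floor * n:
        raise ValueError("total_budget is too small for the per-frame floor")
    scores = turning_angles(frames)
    spare = total_budget - floor * n
    total = sum(scores)
    if total == 0.0:
        # Straight (or static) trajectory: fall back to a uniform split.
        shares = [spare / n] * n
    else:
        shares = [spare * s / total for s in scores]
    budgets = [floor + int(s) for s in shares]
    # Hand out tokens lost to truncation, largest fractional share first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Toy 2-D "feature trajectory" with a single sharp turn at the third frame.
frames = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
budgets = allocate_budget(frames, total_budget=10, floor=1)
```

In this toy trajectory only the third frame sits at a turn, so it receives all of the spare tokens while the frames on straight segments keep only the floor. A real system would operate on high-dimensional visual features and would still need a spatial selection step within each frame, which this sketch leaves out.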

Recommended citation: Xinying Lin, Xuyang Liu, Yiyu Wang, Teng Ma, and Wenqi Ren. (2026). "V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models." arXiv preprint arXiv:2603.27650.
@article{lin2026vcast,
  title={V-CAST: Video Curvature-Aware Spatio-Temporal Pruning for Efficient Video Large Language Models},
  author={Lin, Xinying and Liu, Xuyang and Wang, Yiyu and Ma, Teng and Ren, Wenqi},
  journal={arXiv preprint arXiv:2603.27650},
  year={2026}
}