Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping

ICCV 2025
Weili Zeng¹, Ziyuan Huang², Kaixiang Ji², Yichao Yan¹
¹Shanghai Jiao Tong University  ²Ant Group

The Skip-Vision framework. (a) While visual scaling enriches visual information, it also increases computational overhead. During training, Skip-Vision applies a skip-FFN strategy to reduce redundant computation for visual tokens: the many skipped tokens participate only in the attention layers and bypass the FFN. (b) At the start of inference, Skip-Vision removes skip-FFN visual tokens from the initial KV-cache, improving efficiency. (c) During generation, skip attention leverages the skip KV-cache to accelerate decoding.

Abstract

Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance.

Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%, while achieving comparable or superior performance to existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.
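To make the skip-FFN idea above concrete, here is a minimal PyTorch-style sketch, not the released implementation; the module name SkipFFNBlock and the skip_mask argument are our own illustrative choices. Skipped visual tokens still take part in self-attention, but only retained tokens are routed through the FFN:

import torch
import torch.nn as nn

class SkipFFNBlock(nn.Module):
    """Toy transformer block: skipped visual tokens attend but bypass the FFN."""

    def __init__(self, dim: int, num_heads: int, ffn_mult: int = 4):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x: torch.Tensor, skip_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); skip_mask: (batch, seq_len) bool, True = bypass the FFN.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out  # every token, skipped or not, still goes through attention

        # Apply the FFN only to retained tokens; skipped tokens keep their post-attention value.
        retained = ~skip_mask                     # (batch, seq_len)
        ffn_out = torch.zeros_like(x)
        ffn_out[retained] = self.ffn(self.ffn_norm(x[retained]))
        return x + ffn_out

In this sketch the FFN matrix multiplications scale with the number of retained tokens rather than the full sequence length, which is where the training-time savings come from; the actual Skip-Vision implementation may organize the computation differently.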

Highlights

Skip-Vision integrates seamlessly into the standard SFT pipeline of MLLMs without additional re-training or decoupled modules. It directly modifies the transformer’s computation flow during training (skip-FFN, token merge, and adaptive summary tokens) and during inference (skip KV-cache), offering a practical and theoretically grounded solution that accelerates MLLM training and inference jointly.

🚀Training speedup. We introduce the novel and efficient Skip-Vision framework, which uses token merging and a skip-FFN strategy during training to reduce redundant computation for visual tokens.

🚀Inference speedup. During inference, Skip-Vision employs a skip KV-cache mechanism that removes skip-FFN visual tokens from the KV-cache, enhancing efficiency (see the sketch after the highlights).

😊Performance. Experiments show Skip-Vision’s superior efficiency, effective data scaling, and performance on par with state-of-the-art models of similar scale.
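As a rough sketch of the skip KV-cache mechanism described above, again with hypothetical names and assuming a standard per-layer key/value cache laid out as (batch, num_heads, seq_len, head_dim): after the prefill pass, the cached entries of skip-FFN visual tokens are dropped, so every subsequent decoding step attends over a shorter cache.

import torch

def prune_skip_tokens_from_kv_cache(kv_cache, skip_mask):
    """Drop cached keys/values of skip-FFN visual tokens before decoding.

    kv_cache:  list of (key, value) pairs, one per layer,
               each tensor shaped (batch, num_heads, seq_len, head_dim).
    skip_mask: bool tensor of shape (seq_len,), True for tokens whose cache entries are removed.
    """
    keep = ~skip_mask  # positions retained in the cache
    pruned = []
    for key, value in kv_cache:
        pruned.append((key[:, :, keep, :], value[:, :, keep, :]))
    return pruned

# Toy usage: a 2-layer cache over 1,200 prefill tokens, pruning 1,000 skip-FFN visual tokens.
seq_len, n_skip = 1200, 1000
cache = [(torch.randn(1, 32, seq_len, 128), torch.randn(1, 32, seq_len, 128)) for _ in range(2)]
skip_mask = torch.zeros(seq_len, dtype=torch.bool)
skip_mask[200:200 + n_skip] = True
short_cache = prune_skip_tokens_from_kv_cache(cache, skip_mask)
print(short_cache[0][0].shape)  # torch.Size([1, 32, 200, 128])

After pruning, each decoding step attends over the shortened cache, so attention cost scales with the number of retained tokens rather than the full prefill length.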


Performance-efficiency trade-off curve. Each circle denotes a model configuration; our models use the Skip-Vision framework, with CoS, LLaVA-HD, and LLaVA serving as baselines. Circle sizes reflect the inference FLOPs ratio. Skip-Vision demonstrates superior performance, scaling effectively with increased FLOPs and data while achieving higher inference efficiency than the baselines and other efficient MLLM methods. All methods use LLaMA3 8B as the underlying large language model.

Performance

MMVet, MMStar, and MMBench highlight Skip-Vision’s strength in capturing causal and global information. These benchmarks emphasize high-level reasoning and abstraction, which benefit from Skip-Vision’s ability to preserve essential information flow while reducing redundant computation. By skipping the FFN and KV-cache for less informative tokens, the model amplifies the signal from key visual cues and enhances causal token interactions. While this comes with a slight trade-off on fine-grained tasks (OCR, TextVQA), it reflects a deliberate balance between perception and reasoning, favoring tasks that rely on semantic integration over detail fidelity.

We evaluate Skip-Vision on LLaVA, LLaVA-HD, and CoS, and compare it with state-of-the-art efficiency-oriented models under the LLaVA setting (LLaMA3 8B). Nr and Ns denote the number of retained and skipped tokens, respectively.


Following the LLaVA-1.5-7B training setup, we conducted additional comparisons between Skip-Vision and several recent works.


We present the performance of SV-CoS on SV-9M, comparing it against the current SOTA models of a similar scale.

BibTeX

@misc{zeng2025skipvisionefficientscalableacceleration,
      title={Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping}, 
      author={Weili Zeng and Ziyuan Huang and Kaixiang Ji and Yichao Yan},
      year={2025},
      eprint={2503.21817},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.21817}, 
}