Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance.
Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%, while achieving performance comparable to or better than existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.

🚀Training speedup. We introduce the novel and efficient Skip-Vision framework, which uses token merging and a skip-FFN strategy during training to reduce redundant computation on visual tokens (see the first sketch after these highlights).
🚀Inference speedup. At inference time, Skip-Vision employs a skip-KV-cache mechanism that removes skip-FFN visual tokens from the KV-cache, further improving efficiency (see the second sketch below).
😊Performance. Experiments show Skip-Vision’s superior efficiency, effective data scaling, and performance on par with state-of-the-art models of similar scale.
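To make the training-time idea concrete, here is a minimal PyTorch sketch of a transformer block with skip-FFN behavior as described above: attention runs over all tokens, while the FFN is applied only to tokens that are not marked as skipped (e.g., redundant visual tokens). The module layout and the `skip_mask` argument are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a transformer block with skip-FFN (illustrative, not the
# official Skip-Vision code). Attention sees every token; the FFN is applied
# only where skip_mask is False, so skipped visual tokens flow through the
# residual path unchanged.
import torch
import torch.nn as nn


class SkipFFNBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, x: torch.Tensor, skip_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); skip_mask: (batch, seq_len) bool,
        # True where the FFN should be bypassed.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out

        # Run the FFN only on retained tokens; skipped tokens keep their
        # post-attention value via the residual connection.
        keep = ~skip_mask
        ffn_out = torch.zeros_like(x)
        ffn_out[keep] = self.ffn(self.norm2(x[keep]))
        return x + ffn_out
```

Since the FFN typically dominates per-token FLOPs in an LLM layer, bypassing it for redundant visual tokens is where the training-time savings come from.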
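And a hedged sketch of the inference-side mechanism: after the prefill pass, the key/value entries belonging to skip-FFN visual tokens are dropped from the KV-cache, so subsequent decoding steps attend only to retained positions. The per-layer `(key, value)` cache layout and the `skip_mask` argument are assumptions for illustration.

```python
# Illustrative skip-KV-cache pruning (not the official implementation).
# Removes cache entries at positions marked in skip_mask after prefill,
# assuming each sequence in the batch skips the same number of tokens so the
# pruned cache stays rectangular.
import torch


def prune_kv_cache(past_key_values, skip_mask):
    """past_key_values: iterable of (key, value) tensors shaped
    (batch, num_heads, seq_len, head_dim).
    skip_mask: (batch, seq_len) bool, True for positions to remove."""
    keep = ~skip_mask
    new_len = int(keep[0].sum())  # assumes equal skip counts across the batch
    pruned = []
    for key, value in past_key_values:
        b, h, s, d = key.shape
        idx = keep.view(b, 1, s, 1).expand(b, h, s, d)
        pruned.append((
            key[idx].view(b, h, new_len, d),
            value[idx].view(b, h, new_len, d),
        ))
    return pruned
```

In this sketch the pruning would be applied once after prefill, before autoregressive decoding begins, shrinking every subsequent attention step over the cache.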
Performance-efficiency trade-off curve. Each circle denotes a model configuration; our models use the Skip-Vision framework, with CoS, LLaVA-HD, and LLaVA serving as baselines. Circle sizes reflect the inference FLOPs ratio. Skip-Vision demonstrates superior performance, scaling effectively with increased FLOPs and data while achieving higher inference efficiency than the baselines and other efficient MLLM methods. All methods use LLaMA3 8B as the foundational large language model.
We evaluate Skip-Vision on LLaVA, LLaVA-HD, and CoS, and compare it with state-of-the-art efficiency-optimization models under the LLaVA setting (LLaMA3 8B). Nr and Ns denote the numbers of retained and skipped tokens, respectively.
Following the LLaVA-1.5-7B training setup, we further compare Skip-Vision with several recent methods.
We present the performance of SV-CoS on SV-9M, comparing it against the current SOTA models of a similar scale.
@misc{zeng2025skipvisionefficientscalableacceleration,
title={Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping},
author={Weili Zeng and Ziyuan Huang and Kaixiang Ji and Yichao Yan},
year={2025},
eprint={2503.21817},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.21817},
}