Table of Contents
- 1. Introduction
- 2. Methodology
- 3. Experiments and Results
- 4. Analysis Framework
- 5. Future Applications
- 6. References
1. Introduction
Vision Transformers (ViTs) have revolutionized computer vision tasks but suffer from quadratic computational complexity due to self-attention mechanisms. Existing token pruning methods focus primarily on token importance, preserving "attentive" tokens while discarding "inattentive" ones. However, this approach overlooks global token diversity, which is crucial for model expressivity. This paper introduces a novel token decoupling and merging method that jointly optimizes for both token importance and diversity.
Key Performance Metrics
DeiT-S: 35% FLOPs reduction with only 0.2% accuracy drop
DeiT-T: 40% FLOPs reduction with 0.1% accuracy improvement
2. Methodology
2.1 Token Decoupling
Based on class token attention scores, we separate tokens into attentive and inattentive groups. The attention score for token $i$ is computed as $A_i = \text{softmax}\left(\frac{Q_{cls}K^\top}{\sqrt{d}}\right)_i$, where $Q_{cls}$ is the class token query, $K$ stacks the keys of all patch tokens, and the softmax is taken over all tokens.
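A minimal NumPy sketch of this decoupling step, assuming a single attention head; the `keep_rate` value and function names are illustrative choices, not the paper's implementation.

```python
import numpy as np

def decouple_tokens(q_cls, K, keep_rate=0.7):
    """Split patch tokens into attentive / inattentive groups.

    q_cls: (d,) class-token query; K: (n, d) keys of the n patch tokens.
    Returns index arrays for the attentive and inattentive groups.
    """
    d = K.shape[1]
    logits = K @ q_cls / np.sqrt(d)          # (n,) attention logits
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                   # softmax over all patch tokens
    k = max(1, int(round(keep_rate * len(scores))))
    order = np.argsort(scores)[::-1]         # most attentive first
    return order[:k], order[k:]

rng = np.random.default_rng(0)
attn_idx, inattn_idx = decouple_tokens(rng.normal(size=16),
                                       rng.normal(size=(196, 16)))
```

With 196 patch tokens (a 14x14 grid) and a keep rate of 0.7, the attentive group holds 137 tokens and the remaining 59 are routed to the merging step.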
2.2 Token Merging
We preserve the most discriminative local tokens from the attentive group while merging similar inattentive tokens using clustering algorithms. The merging process minimizes information loss while maximizing token diversity.
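The merging step could be sketched as follows, where plain k-means stands in for whatever clustering algorithm the paper actually uses (an assumption on our part); each cluster of similar inattentive tokens is replaced by its centroid.

```python
import numpy as np

def merge_inattentive(tokens, num_merged, iters=10, seed=0):
    """Cluster the (m, d) inattentive tokens and return num_merged merged tokens.

    Simple k-means: similar tokens collapse into one centroid, so redundant
    background content is summarized rather than kept token-by-token.
    """
    rng = np.random.default_rng(seed)
    centers = tokens[rng.choice(len(tokens), num_merged, replace=False)]
    for _ in range(iters):
        # assign each token to its nearest centroid
        d2 = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for c in range(num_merged):          # recompute centroids
            members = tokens[assign == c]
            if len(members):
                centers[c] = members.mean(0)
    return centers

merged = merge_inattentive(np.random.default_rng(1).normal(size=(59, 16)),
                           num_merged=8)
```

Here 59 inattentive tokens are compressed into 8 merged tokens, so the sequence length, and hence the quadratic attention cost, drops while some background information survives.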
2.3 Mathematical Formulation
The overall objective function combines importance preservation and diversity maximization: $L = \alpha L_{imp} + \beta L_{div}$, where $L_{imp}$ ensures important tokens are preserved and $L_{div}$ promotes diversity through clustering regularization.
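The text does not specify the form of $L_{div}$; one plausible illustrative choice (our assumption, not the paper's definition) is a pairwise cosine-similarity penalty on the kept tokens, so that minimizing it pushes the kept set to span more of the representation space.

```python
import numpy as np

def diversity_penalty(tokens, eps=1e-8):
    """Mean off-diagonal cosine similarity of (n, d) tokens; lower = more diverse."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, eps)
    sim = unit @ unit.T
    n = len(tokens)
    off_diag = sim.sum() - np.trace(sim)     # exclude self-similarity
    return off_diag / (n * (n - 1))

def total_loss(l_imp, tokens, alpha=1.0, beta=0.1):
    """L = alpha * L_imp + beta * L_div, matching the objective above."""
    return alpha * l_imp + beta * diversity_penalty(tokens)
```

Orthogonal tokens score 0 (maximally diverse) and identical tokens score 1, so the $\beta$ term penalizes the "echo chamber" of near-duplicate high-attention tokens that purely importance-based pruning can produce.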
3. Experiments and Results
3.1 Experimental Setup
We evaluate our method on ImageNet-1K using DeiT-S and DeiT-T architectures. Comparison methods include DynamicViT and EViT for importance-based pruning and naive clustering for diversity-based approaches.
3.2 Performance Comparison
Our method achieves state-of-the-art performance across different keep rates. On DeiT-S, we reduce FLOPs by 35% with only 0.2% accuracy drop, outperforming pure importance-based methods which suffer significant accuracy degradation at low keep rates.
3.3 Ablation Studies
Experiments confirm that both importance and diversity components are essential. Removing either component leads to performance degradation, with diversity being particularly crucial at low keep rates.
4. Analysis Framework
Core Insight
The fundamental breakthrough here is recognizing that token diversity isn't just nice-to-have—it's non-negotiable for maintaining model expressivity during pruning. While everyone was chasing attention scores, this research exposes the critical flaw in purely importance-based approaches: they create echo chambers of similar high-attention tokens.
Logical Flow
The methodology follows an elegant three-step process: decouple based on attention, preserve critical local features, then strategically merge to maintain global context. This isn't incremental improvement—it's architectural rethinking that addresses the core tension between efficiency and representation capacity.
Strengths & Flaws
Strengths: The dual optimization objective is mathematically sound, the empirical results are compelling across architectures, and the approach elegantly bridges theoretical understanding with practical implementation. The fact that DeiT-T actually improves accuracy while reducing computation is remarkable.
Flaws: The clustering overhead is not trivial, and the method assumes static importance scores, which may not hold in dynamic inference scenarios. Compared to dynamic token selection methods such as DynamicViT, there are potential latency trade-offs that need addressing.
Actionable Insights
For practitioners: this approach is worth evaluating for any ViT deployment where computational budget matters. For researchers: the diversity-preservation principle should become standard in efficient transformer research; it could be the missing piece for making ViTs truly scalable.
5. Future Applications
This approach has significant implications for real-time vision applications, edge computing, and large-scale vision systems. The principles can extend beyond classification to object detection, segmentation, and video understanding tasks where computational efficiency is critical.
6. References
- Vaswani et al., "Attention Is All You Need" (2017)
- Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020)
- Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (2021)
- Wang et al., "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions" (2021)