Processing 200,000 tokens through a large language model is expensive and slow — but a new technique from researchers at Tsinghua University and Z.ai could change that calculus dramatically. Meet IndexCache, a sparse attention optimizer that delivers up to 1.82× faster inference on long-context AI models while eliminating 75% of the model's redundant indexer computation.
The DSA Bottleneck Problem
Large language models rely on self-attention, a mechanism that computes the relationship between every token and all preceding ones to predict the next token. The problem: its computational complexity scales quadratically with sequence length. For applications requiring extended context windows — think large document processing, multi-step agentic workflows, or long chain-of-thought reasoning — this quadratic scaling leads to sluggish inference speeds and soaring compute costs.
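A quick back-of-envelope calculation (context sizes chosen for illustration) shows why that quadratic scaling hurts at long context:

```python
# Back-of-envelope: self-attention compares every token with all
# preceding ones, so total comparisons grow as n*(n+1)/2, i.e. O(n^2).
def attention_pairs(n_tokens: int) -> int:
    return n_tokens * (n_tokens + 1) // 2

short = attention_pairs(8_000)
long = attention_pairs(200_000)
print(f"8K context:   {short:,} token pairs")
print(f"200K context: {long:,} token pairs")
print(f"25x more tokens -> {long / short:.0f}x more attention work")  # ~625x
```

A 25× longer context costs roughly 25² = 625× more attention compute, which is exactly the scaling that sparse attention aims to break.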
Sparse attention offers a principled solution. Instead of calculating every token relationship, sparse attention has each query select and attend to only the most relevant subset of tokens. DeepSeek Sparse Attention (DSA), introduced in DeepSeek-V3.2, implements this efficiently using a lightweight “lightning indexer module” at every model layer that scores all preceding tokens and passes only a small, high-scoring subset to the main attention mechanism.
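The indexer's select-and-attend step amounts to a top-k pick over relevance scores. A minimal sketch (function name and toy scores are hypothetical, not the paper's implementation):

```python
import heapq

def indexer_select(scores: list[float], k: int) -> list[int]:
    """Pick indices of the k highest-scoring preceding tokens.

    `scores` stands in for the lightweight indexer's relevance score of
    each preceding token for the current query (illustrative sketch only).
    """
    if len(scores) <= k:
        return list(range(len(scores)))
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))

# Toy example: 8 preceding tokens, keep the top 3 for main attention.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.4]
print(indexer_select(scores, k=3))  # → [1, 3, 5]
```

Main attention then runs only over those k tokens, so its cost depends on k rather than the full sequence length — but scoring every preceding token still leaves the indexer itself quadratic.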
But the researchers identified a lingering flaw: the DSA indexer itself still operates at quadratic complexity at every single layer. At 200K context length, the indexer consumes a staggering 81% of prefill time.
IndexCache: Caching Attention Across Layers
To solve this, the research team discovered a crucial characteristic: the subset of important tokens an indexer selects remains remarkably stable as data moves through consecutive transformer layers. Empirical tests revealed that adjacent layers share between 70% and 100% of their selected tokens.
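That stability can be quantified as the fraction of one layer's selected indices that reappear in the next layer's selection. A small sketch with made-up index sets:

```python
def index_overlap(layer_a: set[int], layer_b: set[int]) -> float:
    """Fraction of layer_a's selected token indices also selected by layer_b."""
    return len(layer_a & layer_b) / len(layer_a)

# Toy selections from two adjacent layers (illustrative values only).
layer_3 = {2, 7, 15, 40, 88, 101, 130, 199}
layer_4 = {2, 7, 15, 40, 88, 101, 150, 199}
print(f"{index_overlap(layer_3, layer_4):.0%} shared")  # prints "88% shared"
```

When this overlap sits in the 70–100% range the paper reports, recomputing the selection at every layer is mostly wasted work.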
IndexCache partitions the model’s layers into two categories: Full (F) layers retain their indexers and actively score tokens, while Shared (S) layers skip the math entirely and reuse cached indices from the nearest preceding F layer. During inference, the model simply checks the layer type — if it reaches an F layer, it calculates fresh indices; if it is an S layer, it copies the cached data.
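The per-layer dispatch described above can be sketched in a few lines (names and the `score_fn` callback are hypothetical stand-ins, not the actual SGLang/vLLM patch):

```python
from enum import Enum

class LayerType(Enum):
    F = "full"    # runs its own indexer
    S = "shared"  # reuses cached indices from the nearest preceding F layer

def run_indexers(layer_types, score_fn):
    """Sketch of IndexCache's per-layer dispatch (assumes layer 0 is F).

    F layers call `score_fn(layer)` to compute fresh top-k indices;
    S layers copy the most recently cached result instead.
    """
    cached = None
    selected = []
    for layer, kind in enumerate(layer_types):
        if kind is LayerType.F:
            cached = score_fn(layer)  # fresh (quadratic) indexer pass
        selected.append(cached)       # S layers reuse `cached` as-is
    return selected

# Toy model: 6 layers, indexers kept only at layers 0 and 3.
types = [LayerType.F, LayerType.S, LayerType.S,
         LayerType.F, LayerType.S, LayerType.S]
out = run_indexers(types, score_fn=lambda l: f"indices@{l}")
print(out)  # ['indices@0', 'indices@0', 'indices@0', 'indices@3', 'indices@3', 'indices@3']
```

With four of six layers marked S, two-thirds of the indexer passes in this toy model are replaced by a cache copy — the same mechanism that removes 75% of indexers in the real deployment.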
Real-World Performance Numbers
Applied to the 30-billion-parameter GLM-4.7 Flash model at 200K context length:
- Prefill latency: 19.5s → 10.7s (1.82× speedup)
- Decode throughput: 58 tokens/sec → 86 tokens/sec (1.48× speedup)
- Memory-saturated serving: up to 51% improvement in total decode throughput
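The headline multipliers follow directly from the raw figures:

```python
prefill_before, prefill_after = 19.5, 10.7  # seconds at 200K context
decode_before, decode_after = 58, 86        # tokens/sec

print(f"Prefill speedup: {prefill_before / prefill_after:.2f}x")  # 1.82x
print(f"Decode speedup:  {decode_after / decode_before:.2f}x")    # 1.48x
```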
Remarkably, these efficiency gains don’t compromise reasoning capabilities. Using the training-free approach to eliminate 75% of indexers, the 30B model matched the original baseline on long-context benchmarks (49.9 vs 50.2). On AIME 2025 math reasoning, the optimized model actually outperformed the original: 92.6 vs 91.0.
Two Deployment Approaches
The researchers developed two approaches for implementing IndexCache:
Training-free (available now): A greedy layer selection algorithm automatically determines the optimal placement of F and S layers using a small calibration dataset. No weight updates required — patches are already available for SGLang and vLLM on GitHub.
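One plausible shape for such a greedy selection — start with every layer Full, and repeatedly convert whichever layer costs the least calibration quality until an indexer budget is met — can be sketched as follows. The `quality_drop` callback and the selection criterion are hypothetical; the paper's exact algorithm may differ:

```python
def greedy_layer_selection(n_layers, quality_drop, budget):
    """Greedily convert F layers to S until only `budget` indexers remain.

    `quality_drop(shared_set)` is a stand-in for evaluating the model on a
    small calibration set with the given layers marked Shared (hypothetical).
    Layer 0 must stay Full: an S layer needs a preceding F layer to copy from.
    """
    shared = set()
    while n_layers - len(shared) > budget:
        candidates = [l for l in range(1, n_layers) if l not in shared]
        best = min(candidates, key=lambda l: quality_drop(shared | {l}))
        shared.add(best)
    return ["S" if l in shared else "F" for l in range(n_layers)]

# Toy calibration: pretend sharing even-numbered layers hurts quality more.
drop = lambda s: sum(0.5 if l % 2 == 0 else 0.1 for l in s)
print(greedy_layer_selection(n_layers=6, quality_drop=drop, budget=3))
# → ['F', 'S', 'F', 'S', 'F', 'S']
```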
Training-aware: For teams pre-training foundation models, a multi-layer distillation loss optimizes network parameters to natively support cross-layer sharing.
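As a rough illustration of what such an objective could look like — not the paper's actual loss — one might sum a per-layer KL divergence between a teacher's indexer-score distributions and the student's, pushing the student's layers toward selections that survive cross-layer sharing:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def multilayer_distill_loss(teacher_scores, student_scores):
    """Hypothetical sketch of a multi-layer distillation objective.

    Sums KL(teacher || student) over each layer's indexer-score
    distribution. The actual loss in the paper may be formulated
    differently; this only conveys the general shape.
    """
    loss = 0.0
    for t, s in zip(teacher_scores, student_scores):
        p, q = softmax(t), softmax(s)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss

# Toy per-layer scores over 3 tokens for a 2-layer teacher/student.
teacher = [[2.0, 0.1, 1.5], [1.9, 0.2, 1.4]]
student = [[1.8, 0.3, 1.6], [1.8, 0.3, 1.6]]
print(f"{multilayer_distill_loss(teacher, student):.4f}")
```

The loss is zero when the student reproduces the teacher's score distributions exactly, and grows as the layers' selections diverge.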
Enterprise Impact
For enterprise teams deploying long-context AI, IndexCache translates directly to cost savings. “In terms of ROI, the gains are most noticeable in long-context workloads such as RAG, document analysis, and agentic pipelines,” said Yushi Bai, co-author of the paper. The team observed at least a 20% reduction in deployment cost and a comparable improvement in user-perceived latency.
The approach works with DeepSeek-V3.2 and GLM-5 models out of the box. Within 24 hours of the paper’s release, community members began porting it to popular local AI libraries. The technique is complementary to existing KV cache compression approaches — IndexCache eliminates computation redundancy rather than just shrinking memory footprint.
“Future foundation models will likely be architected with downstream inference constraints in mind from the beginning. This means designs that are not only scalable in terms of model size, but also optimized for real-world throughput and latency.” — Yushi Bai, co-author
As AI deployments scale to handle longer contexts and more complex reasoning tasks, IndexCache represents a principled way to reclaim wasted compute — without sacrificing accuracy, and without requiring teams to retrain their models from scratch.