IndexCache: The Sparse Attention Breakthrough Delivering 1.82x Faster AI Inference
Processing long contexts remains one of the most computationally expensive operations in modern AI. A 200,000-token input can take nearly twenty seconds to process through a large language model: unacceptable latency for production applications. Researchers at Tsinghua University and Z.ai have developed IndexCache, a technique that cuts this processing time nearly in half while maintaining output quality.

The Self-Attention Bottleneck

Large language models rely on self-attention mechanisms to understand relationships between tokens in their input. This process scales quadratically with sequence length: double the context, quadruple the computation. For applications requiring extended context windows, such as document analysis, agentic workflows, and complex reasoning, the costs spiral rapidly.
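That quadratic relationship is easy to verify with a back-of-the-envelope count of query-key pairs in causal attention (a sketch of the scaling argument only; real FLOP counts also depend on model dimensions):

```python
def attention_pair_count(seq_len: int) -> int:
    """Number of query-key pairs in causal self-attention:
    each token attends to itself and every preceding token."""
    return seq_len * (seq_len + 1) // 2

# Doubling the context roughly quadruples the attention work.
short = attention_pair_count(100_000)
long = attention_pair_count(200_000)
print(long / short)  # ≈ 4.0
```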

Sparse attention offers a principled solution. Instead of calculating relationships between every token and all preceding ones, sparse attention mechanisms select and attend only to the most relevant subset of tokens. DeepSeek Sparse Attention (DSA) implements this through a lightweight “lightning indexer module” at each layer, dramatically reducing the heavy attention computation.
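The core mechanic is that a cheap scoring pass picks which keys each query may attend to, and only that subset enters the expensive attention computation. A minimal NumPy sketch of the selection step (shapes and the scoring input are illustrative assumptions; the actual DSA indexer is a learned per-layer module):

```python
import numpy as np

def sparse_attention_indices(index_scores: np.ndarray, k: int) -> np.ndarray:
    """Pick the top-k key positions per query from cheap indexer scores.

    index_scores: (num_queries, num_keys) relevance estimates."""
    k = min(k, index_scores.shape[1])
    # argpartition is O(n) per row, versus O(n log n) for a full sort
    return np.argpartition(index_scores, -k, axis=1)[:, -k:]

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 1024))          # indexer output for 4 queries
keep = sparse_attention_indices(scores, 64)  # attend to 64 of 1024 keys
```

Heavy attention then runs only over the `keep` positions, which is where the savings come from.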

Yet the researchers identified a critical inefficiency: the DSA indexer itself still operates at quadratic complexity at every layer. As context lengths grow, the cumulative time spent running these indexers creates a significant bottleneck, especially during the initial prefill stage.

IndexCache: Caching Attention Indices Across Layers

The solution emerged from an elegant observation about how DSA models process data. When examining adjacent transformer layers, the researchers discovered that token selections remain remarkably stable, sharing between 70% and 100% of selected tokens across consecutive layers.

This cross-layer redundancy presented an opportunity. IndexCache partitions layers into two categories: full (F) layers retain their indexers and actively compute token selections, while shared (S) layers skip indexing entirely and reuse cached indices from the nearest preceding F layer.
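A minimal sketch of that control flow (illustrative only; the layer roles, cache layout, and scoring interface here are assumptions, not the released implementation):

```python
import numpy as np

def run_indexers(layer_roles, score_fns, k):
    """layer_roles: list of 'F' (compute indices) or 'S' (reuse cache).
    score_fns: one callable per layer returning that layer's indexer scores."""
    cached = None
    selections = []
    for role, score_fn in zip(layer_roles, score_fns):
        if role == "F" or cached is None:
            scores = score_fn()  # the quadratic-cost indexer runs here only
            cached = np.argsort(scores)[-k:]
        # 'S' layers skip indexing and reuse the nearest preceding F layer's indices
        selections.append(cached)
    return selections

counter = {"runs": 0}
def score_fn():
    counter["runs"] += 1
    return np.random.default_rng(counter["runs"]).normal(size=1024)

roles = ["F", "S", "S", "F", "S"]   # 2 of 5 layers keep their indexer
sel = run_indexers(roles, [score_fn] * 5, k=64)
print(counter["runs"])  # 2 indexer runs instead of 5
```

The more layers are marked S, the less indexer work remains, at the cost of slightly staler token selections.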

Impressive Performance Gains

Testing on the 30-billion-parameter GLM-4.7 Flash model at 200,000-token context length:

  • Prefill latency: Reduced from 19.5 seconds to 10.7 seconds (1.82x speedup)
  • Decoding throughput: Increased from 58 to 86 tokens per second (1.48x speedup)
  • Server throughput: Up to 51% improvement when memory is saturated

These gains come while maintaining accuracy. Using the training-free approach to eliminate 75% of indexers, the optimized model scored 49.9 on long-context benchmarks versus the original 50.2. On the challenging AIME 2025 math reasoning benchmark, it actually outperformed the original: 92.6 versus 91.0.

Deployment Flexibility

IndexCache offers two deployment approaches. For teams using off-the-shelf DSA models, the training-free method uses a “greedy layer selection” algorithm that automatically determines optimal layer placement. For organizations pre-training foundation models, the training-aware approach introduces a multi-layer distillation loss.
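The paper's greedy algorithm is not spelled out here, but one plausible shape for it is to walk the layers in order and mark a layer as shared whenever its measured index overlap with the previous layer clears a threshold (a hypothetical reconstruction for intuition, not the published algorithm; the 0.7 threshold echoes the 70% overlap floor reported above):

```python
def greedy_layer_roles(overlaps, threshold=0.7):
    """overlaps[i]: measured fraction of selected tokens that layer i
    shares with layer i-1. Layer 0 is always 'F': nothing to reuse yet."""
    roles = ["F"]
    for ov in overlaps[1:]:
        roles.append("S" if ov >= threshold else "F")
    return roles

# e.g. per-layer index overlap measured on a calibration set
roles = greedy_layer_roles([1.0, 0.92, 0.85, 0.6, 0.95])
print(roles)  # ['F', 'S', 'S', 'F', 'S']
```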

Open-source patches are already available on GitHub for major serving engines including vLLM and SGLang.

Enterprise Impact

“In terms of ROI, IndexCache provides consistent benefits across scenarios, but the gains are most noticeable in long-context workloads such as RAG, document analysis, and agentic pipelines,” said Yushi Bai, co-author of the research paper. “In these cases, we observe at least an approximate 20% reduction in deployment cost.”

Future Directions

IndexCache represents the kind of innovation the industry needs: practical improvements that work with existing infrastructure while pointing toward more efficient AI systems ahead.

Featured image: VentureBeat coverage of IndexCache research from Tsinghua University and Z.ai
