Researchers at Tsinghua University and Z.ai have developed a new technique called IndexCache that eliminates up to 75% of redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput for long-context AI workloads.
The technique specifically targets models using DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. For enterprises running production-scale long-context models, IndexCache offers a path to significantly faster user experiences without requiring new hardware investments.
The Self-Attention Bottleneck
Large language models rely on the self-attention mechanism to predict the next token by computing relationships between each token in the context and all of the tokens that precede it. However, this approach has a fundamental limitation: computational complexity scales quadratically with sequence length.
For applications requiring extended context windows, such as large document processing, multi-step agentic workflows, or long chain-of-thought reasoning, this quadratic scaling leads to sluggish inference speeds and significant compute and memory costs.
Sparse attention offers a principled solution. Instead of calculating relationships between every token and all preceding ones, sparse attention optimizes the process by having each query select and attend to only the most relevant subset of tokens.
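The idea can be sketched in a few lines. The toy function below scores all preceding tokens for a single query, keeps only the top-k, and runs softmax attention over that subset alone; it is an illustration of top-k sparse attention in general, not the DeepSeek implementation.

```python
import numpy as np

def sparse_attention(q, keys, values, k):
    """Toy sparse attention for one query: score every preceding token,
    keep only the top-k, and attend over that subset (illustrative only)."""
    scores = keys @ q / np.sqrt(q.shape[0])  # one relevance score per preceding token
    top = np.argsort(scores)[-k:]            # indices of the k most relevant tokens
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over the selected subset only
    return w @ values[top]                   # weighted sum over k values, not all n

rng = np.random.default_rng(0)
n, d = 1024, 64
keys, values = rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = sparse_attention(rng.normal(size=d), keys, values, k=32)
print(out.shape)  # output keeps the usual head dimension
```

The per-query attention cost drops from O(n) to O(k), which is what makes the overall mechanism linear rather than quadratic in sequence length.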
DeepSeek Sparse Attention Architecture
DeepSeek Sparse Attention, first introduced in DeepSeek-V3.2, is a highly efficient implementation of sparse attention. To determine which tokens matter most, DSA introduces a lightweight lightning indexer module at every layer of the model. This indexer scores all preceding tokens and selects a small batch for the main core attention mechanism to process.
By doing this, DSA slashes the heavy core attention computation from quadratic to linear, dramatically speeding up the model while preserving output quality. But the researchers identified a lingering flaw: the DSA indexer itself still operates at quadratic complexity at every single layer.
Even though the indexer is computationally cheaper than the main attention process, as context lengths grow, the time the model spends running these indexers skyrockets. This severely slows down the model, especially during the initial prefill stage where the prompt is first processed.
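A back-of-the-envelope cost model shows why the indexer comes to dominate. The constants below (layer count, tokens selected per query) are hypothetical stand-ins, but the asymptotics match the article: core attention is linear in context length n once each query attends to only k tokens, while the indexer still scores all n preceding tokens at every layer.

```python
def dsa_cost(n, layers=48, k=2048):
    """Rough operation-count sketch with hypothetical constants:
    core attention is linear in n, the indexer stays quadratic."""
    core = layers * n * k     # each of n queries attends to k selected tokens
    indexer = layers * n * n  # each of n queries is scored against all n tokens
    return core, indexer

for n in (8_000, 64_000, 200_000):
    core, idx = dsa_cost(n)
    print(n, idx / core)  # the indexer's share grows linearly with context length
```

At short contexts the indexer is a modest overhead; by 200K tokens it accounts for the overwhelming majority of the work in this model, which is exactly the prefill bottleneck described above.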
How IndexCache Works
To solve the indexer bottleneck, the research team discovered a crucial characteristic of how DSA models process data: the subset of important tokens an indexer selects remains remarkably stable as data moves through consecutive transformer layers. Empirical tests revealed that adjacent layers share between 70% and 100% of their selected tokens.
To capitalize on this cross-layer redundancy, the researchers developed IndexCache. The technique partitions the model’s layers into two categories. A small number of full (F) layers retain their indexers, actively scoring tokens and choosing the most important ones to cache. The rest of the layers become shared (S), performing no indexing and reusing the cached indices from the nearest preceding F layer.
During inference, the model simply checks the layer type. If it reaches an F layer, it calculates and caches fresh indices. If it is an S layer, it skips the computation and copies the cached data.
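The dispatch logic described above can be sketched as follows. The `compute_indices` callable is a hypothetical stand-in for the lightning indexer; the point is that only F layers invoke it, while S layers copy the most recent cached result.

```python
def run_layers(layer_types, compute_indices):
    """IndexCache dispatch sketch: F layers recompute and cache token indices,
    S layers reuse the cache from the nearest preceding F layer."""
    cached = None
    used = []
    for layer, kind in enumerate(layer_types):
        if kind == "F":
            cached = compute_indices(layer)  # fresh indexer pass, result cached
        # S layers skip the indexer entirely and reuse the cached indices
        used.append(cached)
    return used

# One F layer followed by three S layers: 75% of indexer passes eliminated
types = ["F", "S", "S", "S", "F", "S", "S", "S"]
indices = run_layers(types, compute_indices=lambda layer: {layer, layer + 1})
print(indices[1] == indices[0])  # an S layer reuses the preceding F layer's indices: True
```

With this pattern, a 75% reduction in indexer passes corresponds to one F layer per group of four, matching the configuration evaluated in the paper.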
Real-World Performance
To test the impact of IndexCache, the researchers applied it to the 30-billion-parameter GLM-4.7 Flash model and compared it against the standard baseline. At a 200K context length, removing 75% of the indexers slashed the prefill latency from 19.5 seconds down to just 10.7 seconds, delivering a 1.82x speedup.
During the decoding phase, where the model generates its response, IndexCache boosted per-request throughput from 58 tokens per second to 86 tokens per second at the 200K context mark, yielding a 1.48x speedup. When the server’s memory is fully saturated with requests, total decode throughput jumped by up to 51%.
For enterprise teams, these efficiency gains translate directly into cost savings. "In terms of ROI, IndexCache provides consistent benefits across scenarios, but the gains are most noticeable in long-context workloads such as RAG, document analysis, and agentic pipelines," said Yushi Bai, co-author of the paper. "In these cases, we observe at least an approximate 20% reduction in deployment cost and similar improvements in user-perceived latency."
Accuracy Preservation
Remarkably, these efficiency gains did not compromise reasoning capabilities. Using the training-free approach to eliminate 75% of indexers, the 30B model matched the original baseline’s average score on long-context benchmarks, scoring 49.9 against the original 50.2. On the highly complex AIME 2025 math reasoning benchmark, the optimized model actually outperformed the original baseline, scoring 92.6 compared to 91.0.
The team also ran preliminary experiments on the production-scale 744-billion-parameter GLM-5 model. They found that eliminating 75% of its indexers with the training-free method yielded at least a 1.3x speedup on contexts over 100K tokens, while maintaining nearly identical average quality on long-context tasks.
Getting Started
For development teams wanting to implement the training-free approach today, the process is straightforward but requires careful setup. While the greedy search algorithm automatically finds the optimal layer configuration, the quality of that configuration depends on the data it processes.
"We recommend using domain-specific data as a calibration set so that the discovered layer-sharing pattern aligns with real workloads," Bai advised.
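One way such a greedy search could look is sketched below. This is a hedged illustration, not the authors' published algorithm: it starts with every layer keeping its indexer, then repeatedly demotes to S the layer whose demotion hurts a calibration metric least, until the indexer budget is met. The `score` callable is assumed to evaluate a candidate F-layer set on domain-specific calibration data; layer 0 is always kept F, since S layers need a preceding F layer to copy from.

```python
def greedy_f_layers(num_layers, budget, score):
    """Greedy layer-search sketch (hypothetical, not the paper's exact method):
    demote the least harmful F layer to S until only `budget` indexers remain."""
    f_layers = set(range(num_layers))
    while len(f_layers) > budget:
        # try demoting each remaining F layer except layer 0; keep the demotion
        # that leaves the highest calibration score
        best = max(f_layers - {0}, key=lambda l: score(f_layers - {l}))
        f_layers.remove(best)
    return sorted(f_layers)

# Toy calibration metric that pretends later indexers matter less; a real
# metric would run held-out domain data through the model
kept = greedy_f_layers(num_layers=8, budget=2, score=lambda fs: -sum(fs))
print(kept)  # [0, 1]
```

Because the search only ever queries `score`, swapping in a calibration set drawn from real workloads, as Bai recommends, changes which layers survive without changing the algorithm.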
Open-source patches are already available on GitHub for major serving engines. Integration is relatively straightforward: developers can apply the patch to existing inference stacks such as vLLM or SGLang and enable IndexCache with minimal configuration changes.
IndexCache represents an immediate solution for today's compute bottlenecks, but its underlying philosophy points to a broader shift in how the AI industry will approach model design. "Future foundation models will likely be architected with downstream inference constraints in mind from the beginning," Bai concluded. "This means designs that are not only scalable in terms of model size, but also optimized for real-world throughput and latency."