IndexCache: A New Sparse Attention Technique That Makes Long-Context AI Inference 1.82x Faster
A new sparse attention technique called IndexCache promises 1.82x faster inference on long-context AI models by caching repeated token selections across adjacent layers — and the community is already porting it to…