Ask any engineer running large language models what their biggest operational headache is, and memory will come up fast. The key-value cache, the fast-access memory structure that lets LLMs recall what they have already processed, grows with every token of context. Feed a model a million-token document, and you need enough memory to cache a million tokens' worth of attention states. That is expensive, slow, and increasingly a bottleneck as models are asked to handle longer and longer contexts.
Google thinks it has a solution. A team from Google Research and Google DeepMind has published TurboQuant, a compression algorithm that shrinks key-value cache memory by at least six times, with zero accuracy loss. It works on standard hardware, requires no fine-tuning, and in some configurations runs faster than the uncompressed version. The paper is being presented at ICLR 2026.
The Problem With Quantization Overhead
Vector quantization is a classical technique for compressing high-dimensional data. The idea is straightforward: map a large set of continuous values to a smaller discrete set, dramatically reducing the memory needed to store them. In AI, it is used to compress the key-value pairs that models use during attention.
The catch is that most quantization methods introduce their own memory overhead. To reconstruct data from compressed form, you need to store quantization constants: parameters that describe the mapping from compressed values back to their original precision. For every small block of data, you need these constants in full precision. That overhead can add one or two extra bits per number, partially defeating the purpose of compression.
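To see concretely where that overhead comes from, here is a minimal sketch of conventional per-block affine quantization. The function names are illustrative, not TurboQuant's code: each block of low-bit codes must carry full-precision constants (a scale and a zero-point), which amortizes to extra bits per stored value.

```python
import numpy as np

def quantize_block(x, bits=4):
    # Per-block affine quantization: low-bit codes plus two
    # full-precision constants (scale and zero-point) per block.
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1) or 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    # Reconstruction needs both constants in full precision.
    return codes.astype(np.float32) * scale + lo

def overhead_bits_per_value(block_size=32, const_bits=32, n_consts=2):
    # Two fp32 constants per block, amortized over the block.
    return n_consts * const_bits / block_size

rng = np.random.default_rng(0)
block = rng.standard_normal(32).astype(np.float32)
codes, scale, lo = quantize_block(block)
recon = dequantize_block(codes, scale, lo)
print(overhead_bits_per_value())  # 2.0 extra bits per 4-bit value
```

With 32-value blocks, the two fp32 constants add 2 bits per value on top of the 4-bit codes, exactly the "one or two extra bits" that partially defeats the compression.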
TurboQuant solves this with a two-stage approach. The first stage uses a method called PolarQuant to handle most of the compression. The second stage uses a one-bit method called QJL, a quantized Johnson-Lindenstrauss transform, to correct the residual error left by the first stage. The result is high-quality compression with no overhead.
How TurboQuant Works: PolarQuant and QJL
The PolarQuant stage starts by randomly rotating the data vectors. This rotation simplifies the geometry of the data in a way that makes a standard quantizer apply cleanly to each component. Think of it like switching from Cartesian coordinates, which describe a point as a distance along each of several axes, to polar coordinates, which describe it as a radius and an angle. In polar form, the data structure becomes predictable and concentrated, which means the quantizer does not need to carry extra normalization constants for every block. Those constants are what cause the memory overhead in traditional methods. PolarQuant eliminates them by exploiting the regular geometry of the rotated space.
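As a rough illustration of that geometry, and emphatically not the paper's exact algorithm, the sketch below applies a random rotation, groups coordinates into 2-D pairs, and quantizes each pair in polar form as a radius plus a low-bit angle code:

```python
import numpy as np

def random_rotation(d, seed=0):
    # Haar-random orthogonal matrix via QR decomposition
    # of a Gaussian matrix (sign-corrected for uniformity).
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def polar_quantize(v, rot, angle_bits=4):
    # Rotate, pair up coordinates, and keep a radius plus a
    # low-bit angle code for each 2-D pair.
    x = rot @ v
    pairs = x.reshape(-1, 2)
    radii = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle in (-pi, pi]
    levels = 2**angle_bits
    codes = np.round((theta + np.pi) / (2 * np.pi) * levels) % levels
    return radii, codes.astype(np.uint8)

rng = np.random.default_rng(1)
v = rng.standard_normal(8)
rot = random_rotation(8)
radii, angle_codes = polar_quantize(v, rot)
```

The rotation preserves norms exactly, and after a random rotation the per-pair radii concentrate around a predictable value, which is the property that lets the real method drop per-block normalization constants.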
The QJL stage handles what is left over: the small residual error from the first stage. QJL reduces each remaining number to a single sign bit using a Johnson-Lindenstrauss transform, which preserves the essential distance relationships between data points even at extreme compression. Crucially, the construction yields an unbiased estimate of the attention scores, so the one-bit residual does not systematically skew the computation.
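The one-bit idea can be sketched with sign random projections. The estimator below is a simplified version of the quantized Johnson-Lindenstrauss construction (TurboQuant's exact recipe may differ): a key is stored as sign bits of its random projections plus its norm, and inner products with a query are recovered without bias.

```python
import numpy as np

def qjl_encode(k, S):
    # Keep only one sign bit per projection, plus the key's norm.
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, sign_bits, k_norm, S):
    # Unbiased estimate of <q, k>: for a Gaussian row s of S,
    # E[sign(<s, k>) * <s, q>] = sqrt(2/pi) * <q, k> / ||k||.
    return np.sqrt(np.pi / 2) * k_norm * np.mean(sign_bits * (S @ q))

rng = np.random.default_rng(0)
d, m = 64, 4096
S = rng.standard_normal((m, d))       # shared random projection
q, k = rng.standard_normal(d), rng.standard_normal(d)
sign_bits, k_norm = qjl_encode(k, S)
est = qjl_inner_product(q, sign_bits, k_norm, S)
# est approximates q @ k, with error shrinking as 1/sqrt(m)
```

In attention, the inner product between a query and a cached key is exactly the quantity being estimated, which is why an unbiased one-bit residual is enough to avoid degrading the scores.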
Together, these two stages allow TurboQuant to compress key-value cache entries to just 3 bits per value with no accuracy degradation.
Real Benchmarks, Real Gains
Google tested TurboQuant across standard long-context benchmarks including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using open-source models Gemma and Mistral. The results are consistent: TurboQuant matches or exceeds the accuracy of uncompressed baseline models across every benchmark.
On the needle-in-a-haystack test, which asks a model to find a single specific piece of information buried in a massive context window, TurboQuant achieves perfect downstream accuracy at a 6x compression factor. The same result holds across tasks including question answering, code generation, and summarization.
The speed numbers are equally striking. On NVIDIA H100 GPUs, 4-bit TurboQuant achieves up to 8x faster attention computation than 32-bit uncompressed keys. The algorithm is efficient to implement and introduces negligible runtime overhead; in many cases the compressed computation is faster than the original simply because there is less data to move around.
What 6x Compression Actually Means in Practice
To appreciate the significance, consider what happens when you deploy a model with a long context window. A 70-billion-parameter model processing a 128,000-token context needs to store attention states for all 128,000 tokens. In 32-bit floating point, that is a substantial amount of memory, often tens of gigabytes for the KV cache alone. That is why long-context models often hit memory walls before they hit compute walls.
At 6x compression, the same cache fits in roughly one-sixth the memory. A deployment that previously needed 60GB of KV cache memory now needs 10GB. That is the difference between a model that runs on a single high-end GPU and one that requires a multi-GPU server. It is also the difference between a system that can maintain a long context in memory and one that has to chunk and re-load, losing the coherence that long contexts are supposed to provide.
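The back-of-envelope arithmetic is easy to reproduce. The configuration below is hypothetical (an 80-layer model with grouped-query attention, 8 KV heads of dimension 128, chosen only to show the order of magnitude for a 70B-class deployment), not the specs of any particular model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    # Factor of 2 covers both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 70B-class configuration, fp32 cache, 128k tokens:
base = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      seq_len=128_000, bytes_per_value=4)
compressed = base / 6                 # ~6x TurboQuant compression
print(f"{base / 2**30:.1f} GiB -> {compressed / 2**30:.1f} GiB")
# prints "78.1 GiB -> 13.0 GiB"
```

Under these assumptions the cache drops from a multi-GPU burden to something a single high-end accelerator can hold, the same order-of-magnitude shift as the 60GB-to-10GB example above.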
For Google, this matters directly for Gemini. The company is already deploying techniques like TurboQuant to make its production models more efficient. For the broader AI community, the algorithm is available as a research paper and is already finding its way into open-source libraries: the team notes that community members ported it to popular local AI tools, including MLX for Apple Silicon and llama.cpp, within 24 hours of the paper's release.
Beyond KV Cache: The Vector Search Angle
TurboQuant, QJL, and PolarQuant are not just about making LLMs faster. They are fundamental algorithmic contributions to the problem of approximate nearest neighbor search 鈥?the core operation behind semantic search, recommendation systems, and retrieval-augmented generation.
Modern search is evolving beyond keyword matching toward understanding intent and meaning. That requires vector search: finding the most semantically similar items in a database of billions of vectors. TurboQuant achieves this data-obliviously, meaning it works without knowing anything about the specific dataset being indexed, making it broadly applicable to any high-dimensional search task.
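A data-oblivious binary sketch for nearest-neighbor search can be illustrated in a few lines: project every vector through the same random matrix, keep only sign bits, and rank candidates by Hamming distance. This is a generic sign-projection sketch in the same family of techniques, not TurboQuant itself:

```python
import numpy as np

def binary_sketch(X, S):
    # Data-oblivious one-bit sketch: sign of random projections.
    # Nothing about the dataset is needed to build the index.
    return X @ S.T > 0

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 128))   # toy vector database
S = rng.standard_normal((256, 128))       # 256 bits per vector
codes = binary_sketch(db, S)

def search(q, topk=5):
    # Rank candidates by Hamming distance between sign sketches.
    qc = binary_sketch(q[None, :], S)[0]
    ham = np.count_nonzero(codes != qc, axis=1)
    return np.argsort(ham)[:topk]
```

Because the projection matrix is fixed in advance, the same index-building code works for any embedding model or dataset, which is what "data-oblivious" buys in practice.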
The Bottom Line
Memory bottlenecks are one of the defining constraints of the current AI era. As models grow, as context windows expand, and as deployment scenarios multiply, the ability to store and retrieve more information in less memory is not a nice-to-have; it is foundational infrastructure. TurboQuant demonstrates that extreme compression with zero accuracy loss is achievable with the right mathematical approach. The open-source community is already running with it. Expect to see it embedded in AI frameworks everywhere within months.