At ICLR 2026, Google Research unveiled TurboQuant, a memory compression breakthrough that could fundamentally change how large language models handle long context windows. The algorithm achieves at least a 6x reduction in KV cache memory with near-zero accuracy loss, and it requires no retraining whatsoever.
The KV Cache Problem
Every time a large language model processes a long conversation or document, it builds up a Key-Value (KV) cache — a growing store of intermediate computations that the model needs to reference. For models with massive context windows (100K+ tokens), this cache can consume tens of gigabytes of GPU memory, creating a hard ceiling on what the model can process.
TurboQuant directly attacks this bottleneck by compressing KV cache entries from 16-bit or 32-bit representations down to approximately 3 bits per value.
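To put those numbers in perspective, here is a back-of-the-envelope sizing in Python. The model shape (80 layers, 8 KV heads, head dimension 128, a 128K-token context) is an assumed 70B-class configuration for illustration, not a figure from the paper:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Rough KV cache size in GB: keys and values for every layer, head, and token."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len   # factor 2 = keys + values
    return values * bits_per_value / 8 / 1e9

shape = dict(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"16-bit cache : {kv_cache_gb(**shape, bits_per_value=16):.1f} GB")   # ~41.9 GB
print(f"~3-bit cache : {kv_cache_gb(**shape, bits_per_value=3):.1f} GB")    # ~7.9 GB
```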
How It Works
TurboQuant uses an elegant two-stage approach:
- PolarQuant — Applies a randomized Hadamard transform to spread vector energy uniformly across all coordinates, making them easier to quantize. This produces a predictable distribution that can be compressed efficiently (see the sketch below).
- QJL Correction — A 1-bit Quantized Johnson-Lindenstrauss transform acts as an error-correction layer, ensuring that inner product estimates remain accurate and unbiased after compression.
The result is a method backed by theoretical proofs showing it operates near fundamental lower bounds for quantization distortion.
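To make the two stages concrete, here is a minimal NumPy sketch. The two ingredients are shown independently rather than as the paper's exact pipeline, and the 3-bit codebook, sketch width, and toy vectors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = v.astype(np.float64).copy()
    h, n = 1, v.size
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return v / np.sqrt(n)

d = 128                                   # head dimension (power of two)
k = rng.standard_normal(d) * 0.5 + 0.3    # toy key vector with a skewed distribution
q = rng.standard_normal(d)                # toy query vector

# Stage 1: a randomized Hadamard rotation spreads the energy of k evenly, so each
# rotated coordinate is roughly N(0, ||k||^2 / d) -- a predictable distribution.
signs = rng.choice([-1.0, 1.0], size=d)   # random diagonal +/-1 matrix D
rk = fwht(signs * k)                      # (1/sqrt(d)) * H * D * k, same norm as k

# Quantize the rotated coordinates against a fixed 3-bit grid (illustrative
# codebook; the paper's construction may differ).
sigma = np.linalg.norm(k) / np.sqrt(d)
grid = np.linspace(-2.5 * sigma, 2.5 * sigma, 2 ** 3)
codes = np.abs(rk[:, None] - grid[None, :]).argmin(axis=1).astype(np.uint8)
rk_hat = grid[codes]

# Stage 2: a 1-bit JL sketch of k yields an unbiased estimate of <q, k>, the kind
# of inner product attention needs (standard sign-sketch estimator; the paper's
# QJL corrector may differ in detail).
m = 512
S = rng.standard_normal((m, d))
k_bits = np.sign(S @ k)                   # one bit per random projection
est = np.sqrt(np.pi / 2) / m * np.linalg.norm(k) * ((S @ q) @ k_bits)

print("true <q, k>        :", q @ k)
print("1-bit JL estimate  :", est)
print("3-bit relative err :", np.linalg.norm(rk - rk_hat) / np.linalg.norm(rk))
```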
Performance Numbers
On NVIDIA H100 GPUs, TurboQuant delivers:
- 6x memory reduction for KV cache storage
- 8x speedup in computing attention logits (4-bit mode vs. uncompressed 32-bit)
- Zero additional memory overhead — unlike traditional vector quantization, no extra bits are needed for quantization constants
- Training-free deployment — works on any transformer architecture without fine-tuning or calibration datasets
Practical Impact
The implications are significant for anyone running LLM inference at scale:
- Models can handle much longer context windows without hitting memory walls
- Inference costs drop substantially as memory-bound operations become faster
- Existing models can be compressed and deployed without any retraining investment
For cloud providers and enterprises running millions of inference requests per day, this translates directly into lower hardware costs and higher throughput.
Beyond LLMs
TurboQuant also applies to vector search engines, where it can compress high-dimensional embeddings for faster similarity lookups and reduced index sizes. This makes it relevant for retrieval-augmented generation (RAG) systems and recommendation engines.
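As a rough illustration of that use case, the snippet below compresses embeddings into 1-bit sign sketches and ranks candidates by Hamming distance. This is the standard sign-sketch (SimHash-style) trick rather than TurboQuant's actual index format, and the dimensions and brute-force scan are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 768, 10_000, 256                 # embedding dim, corpus size, sketch bits

corpus = rng.standard_normal((n, d))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# 32 bytes per vector (256 bits) instead of 3 KB of float32 -- roughly 96x smaller.
S = rng.standard_normal((m, d))
codes = np.packbits(corpus @ S.T > 0, axis=1)          # shape (n, m // 8)

def search(query, topk=5):
    q_code = np.packbits(S @ query > 0)
    # Hamming distance between sign sketches tracks the angle between vectors.
    hamming = np.unpackbits(codes ^ q_code, axis=1).sum(axis=1)
    return np.argsort(hamming)[:topk]

query = corpus[42] + 0.1 * rng.standard_normal(d)      # a noisy copy of item 42
print(search(query))                                   # item 42 should rank near the top
```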
Source: research.google, iclr.cc, helpnetsecurity.com