KV Cache
5 mentions across all digests
KV Cache is an inference optimization that stores the key and value tensors produced by attention layers so language models avoid recomputing them at every token generation step; recent research also extends the cache to zero-token knowledge injection.
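A minimal sketch of the mechanism, assuming single-head attention and toy dimensions: keys and values for past tokens are kept in a cache, so each decode step only projects the newest token and attends over what is already stored.

```python
# Minimal sketch (single head, toy sizes) of how a KV cache avoids recomputing
# keys/values for past tokens at each decode step.
import numpy as np

d = 8                                   # head dimension
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

k_cache, v_cache = [], []               # grows by one entry per generated token

def decode_step(x_t):
    """Attend from the newest token; reuse cached K/V for all earlier tokens."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)           # compute K/V only for the new token
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)               # (t, d) cached keys
    V = np.stack(v_cache)               # (t, d) cached values
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V

for _ in range(5):                      # one forward pass per new token
    out = decode_step(np.random.randn(d))
print(out.shape)                        # (8,)
```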
TurboQuant: A First-Principles Walkthrough
TurboQuant compresses LLM KV caches to 2–4 bits per coordinate using training-free random rotation, enabling practical memory efficiency gains without calibration overhead.
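A hedged sketch of the rotate-then-quantize idea, not TurboQuant's actual quantizer: a random orthogonal rotation requires no training or calibration data, and a plain per-vector uniform quantizer stands in for the paper's 2–4-bit scheme.

```python
# Illustrative rotate-then-quantize sketch. The quantizer and bit allocation
# below are assumptions for demonstration, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)
d, bits = 64, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal rotation

def quantize(x, bits):
    """Per-vector uniform quantizer: store a scale, an offset, and integer codes."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return codes * scale + lo

kv = rng.standard_normal(d)                        # one cached key/value vector
rotated = Q @ kv                                   # rotation spreads out outliers
codes, lo, scale = quantize(rotated, bits)
recovered = Q.T @ dequantize(codes, lo, scale)     # inverse rotation (Q is orthogonal)
print(np.abs(recovered - kv).max())                # reconstruction error at 4 bits
```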
High-Fidelity KV Cache Summarization Using Entropy and Low-Rank Reconstruction
Entropy-aware KV cache summarization reduces VRAM overhead for million-token LLM contexts while preserving semantic fidelity through low-rank reconstruction, enabling longer context windows without pruning.
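A sketch of the low-rank reconstruction piece only, under assumed toy dimensions; the entropy-aware token scoring described in the paper is not reproduced here. Real key/value matrices tend to have low effective rank, which is what makes this kind of factorization pay off.

```python
# Store a cached key matrix as two thin SVD factors and rebuild it on demand.
# Random data is used only to show the mechanics and the storage savings.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, rank = 1024, 64, 16
K = rng.standard_normal((seq_len, d))               # cached keys for one head

U, S, Vt = np.linalg.svd(K, full_matrices=False)
K_compressed = (U[:, :rank] * S[:rank], Vt[:rank])  # keep two thin factors

def reconstruct(factors):
    US, Vt_r = factors
    return US @ Vt_r                                # (seq_len, d) approximation

K_hat = reconstruct(K_compressed)
orig_floats = K.size
stored_floats = K_compressed[0].size + K_compressed[1].size
print(f"compression ratio: {orig_floats / stored_floats:.1f}x")   # ~3.8x here
```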
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Researchers propose probabilistic language tries for KV cache compression that exceed theoretical per-vector limits, potentially reducing inference memory footprint and compute costs for LLM deployment.
Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
Knowledge Packs inject external knowledge into language models through KV cache without consuming tokens, reducing inference costs for knowledge-augmented tasks.
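A hedged sketch of the injection idea: precompute key/value tensors for a knowledge passage offline, then prepend them to the live cache so the model can attend to that knowledge without spending prompt tokens on it. The function names and shapes below are illustrative assumptions, not the paper's Knowledge Pack format.

```python
# Toy illustration of prepending precomputed K/V to a live cache.
import numpy as np

d = 8
W_k, W_v = np.random.randn(d, d), np.random.randn(d, d)

def build_knowledge_pack(passage_embeddings):
    """Offline: turn passage token embeddings into cached K/V tensors."""
    return passage_embeddings @ W_k, passage_embeddings @ W_v

def inject(pack, live_k, live_v):
    """Online: prepend the pack so attention treats it as already-processed context."""
    pack_k, pack_v = pack
    return np.concatenate([pack_k, live_k]), np.concatenate([pack_v, live_v])

pack = build_knowledge_pack(np.random.randn(20, d))     # 20 "knowledge" positions
live_k, live_v = np.random.randn(3, d), np.random.randn(3, d)
K, V = inject(pack, live_k, live_v)
print(K.shape, V.shape)                                 # (23, 8) (23, 8)
```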
Understanding and Coding the KV Cache in LLMs from Scratch
KV caches explained: the memory-vs-latency tradeoff that powers efficient LLM inference, from conceptual foundations to working Python code.
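A back-of-envelope helper for the memory side of that tradeoff; the model shape below is an assumption (roughly 7B-scale in fp16), not taken from the article.

```python
# How much memory a full KV cache occupies: 2x (keys and values) per layer,
# per KV head, per position.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=1) / 1e9
print(f"{gb:.1f} GB")   # ~2.1 GB at 4k context in fp16 for this assumed config
```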