Infrastructure

Building the foundation for running extra-large language models

Cloudflare demonstrates a 3x performance gain for LLM inference by disaggregating the prefill and decode compute stages and optimizing KV cache management with prompt caching, enabling efficient multi-GPU scaling on Workers AI.

Thursday, April 16, 2026 12:00 PM UTC /// 2 MIN READ /// SOURCE: Cloudflare Blog /// BY sys://pipeline

Cloudflare details the infrastructure engineering behind hosting large language models on Workers AI, achieving a 3x inference speedup for Moonshot's Kimi K2.5. The post covers prefill-decode disaggregation for efficient GPU utilization, KV cache optimization with prompt caching for agent workloads, and integration with Moonshot's Mooncake framework for multi-GPU cache sharing.
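
Why disaggregate? Prefill (processing the whole prompt in one parallel pass) is compute-bound, while decode (generating tokens one at a time) is bound by memory bandwidth, so the two stages want different hardware and batching strategies. The sketch below illustrates the split in plain Python; all names (PrefillWorker, DecodeWorker, KVCache) are hypothetical stand-ins, not Cloudflare's implementation, and the "tensors" are placeholder strings.

```python
# A minimal sketch of prefill-decode disaggregation. Hypothetical names
# throughout; a real system runs the two stages on separate GPU pools.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    # Per-position key/value state built during prefill; here just a
    # placeholder dict keyed by token position.
    entries: dict = field(default_factory=dict)

class PrefillWorker:
    """Processes the full prompt in one parallel pass (compute-bound)."""
    def run(self, prompt_tokens: list[int]) -> KVCache:
        cache = KVCache()
        for pos, tok in enumerate(prompt_tokens):
            cache.entries[pos] = f"kv({tok})"  # stand-in for attention K/V
        return cache

class DecodeWorker:
    """Generates tokens one at a time against an existing cache
    (memory-bandwidth-bound), so it scales independently of prefill."""
    def run(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        out = []
        for _ in range(max_new_tokens):
            nxt = len(cache.entries) % 50257   # stand-in for a forward pass
            cache.entries[len(cache.entries)] = f"kv({nxt})"
            out.append(nxt)
        return out

# Only the KV cache crosses the boundary between the two stages.
cache = PrefillWorker().run(prompt_tokens=[101, 2023, 318])
print(DecodeWorker().run(cache, max_new_tokens=4))
```

Because only the KV cache moves between stages, each pool can be sized independently: more prefill workers for long-prompt traffic, more decode workers for many concurrent generation streams.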
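Prompt caching targets the agent workloads the post mentions, where every turn resends a long shared prefix (system prompt, tool schemas). Here is a minimal sketch of that idea, again with hypothetical names and placeholder KV blocks rather than the actual Workers AI or Mooncake APIs: keying stored KV blocks by a hash of the token prefix lets repeated turns skip prefill for the shared part.

```python
# A minimal sketch of prompt caching: KV blocks for a shared prompt
# prefix are stored under a content hash so repeated agent turns only
# pay prefill for the new tokens. Not a real serving-stack API.
import hashlib

def prefill(tokens: list[int]) -> dict:
    # Stand-in for the expensive attention prefill over `tokens`.
    return {pos: f"kv({tok})" for pos, tok in enumerate(tokens)}

class PromptCache:
    """Maps a hash of a token prefix to its precomputed KV blocks."""
    def __init__(self):
        self._store: dict[str, dict] = {}

    @staticmethod
    def _key(prefix: list[int]) -> str:
        return hashlib.sha256(repr(prefix).encode()).hexdigest()

    def get_or_compute(self, prefix: list[int]) -> dict:
        key = self._key(prefix)
        if key not in self._store:        # miss: pay full prefill once
            self._store[key] = prefill(prefix)
        return self._store[key]           # hit: reuse across turns

cache = PromptCache()
shared_prefix = [101, 7592, 2088]         # e.g. system prompt + tool schemas

# Each agent turn reuses the prefix's KV blocks and only prefills the
# user tokens appended after it.
for user_turn in ([4937, 345], [2061, 640]):
    kv = dict(cache.get_or_compute(shared_prefix))  # copy shared blocks
    kv.update({len(kv) + i: f"kv({t})" for i, t in enumerate(user_turn)})
    print(len(kv), "cached positions after this turn")
```

Per the article, integrating Moonshot's Mooncake framework extends this reuse beyond a single device by sharing such cache blocks across multiple GPUs.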

Tags
infrastructure