SGLang
5 mentions across all digests
SGLang is an open-source LLM inference framework supported by NVIDIA Dynamo and compatible with the Hugging Face Transformers v5 ecosystem, used for high-performance model serving alongside vLLM and TRT-LLM.
Building the foundation for running extra-large language models
Cloudflare demonstrates 3x performance gains for LLM inference by disaggregating prefill and decode compute stages and optimizing KV cache management with prompt caching, enabling efficient multi-GPU scaling on Workers AI.
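The disaggregation pattern described here can be sketched in miniature: prefill (compute-bound, runs once over the full prompt) and decode (memory-bound, runs once per generated token) are split onto separate workers, with the KV cache handed over between them and a prompt cache skipping redundant prefills. The `PrefillWorker`/`DecodeWorker` names and the toy KV-block representation below are illustrative assumptions, not Cloudflare's actual implementation.

```python
import hashlib

class PrefillWorker:
    """Compute-bound prefill stage; caches KV blocks keyed by prompt hash."""
    def __init__(self):
        self.prompt_cache = {}  # prompt hash -> simulated KV cache
        self.cache_hits = 0

    def prefill(self, prompt_tokens):
        key = hashlib.sha256(bytes(prompt_tokens)).hexdigest()
        if key in self.prompt_cache:  # prompt caching: skip recompute
            self.cache_hits += 1
            return self.prompt_cache[key]
        # Stand-in for attention over the full prompt: one "KV block" per token.
        kv_cache = [("k%d" % t, "v%d" % t) for t in prompt_tokens]
        self.prompt_cache[key] = kv_cache
        return kv_cache

class DecodeWorker:
    """Memory-bound decode stage, consuming a KV cache handed over from prefill."""
    def decode(self, kv_cache, max_new_tokens):
        out = []
        for _ in range(max_new_tokens):
            # Stand-in for one autoregressive step attending over kv_cache.
            tok = len(kv_cache) % 101
            out.append(tok)
            kv_cache = kv_cache + [("k%d" % tok, "v%d" % tok)]
        return out

prefill, decode = PrefillWorker(), DecodeWorker()
prompt = [3, 1, 4, 1, 5]
kv = prefill.prefill(prompt)    # would run on the prefill GPU pool
tokens = decode.decode(kv, 4)   # would run on a separate decode GPU pool
prefill.prefill(prompt)         # repeated prompt -> prompt-cache hit
```

In a real deployment the handoff is a KV-cache transfer between GPU pools sized independently for the two stages, which is where the multi-GPU scaling gains come from.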
The M×N problem of tool calling and open-source models
Each of M open-source inference frameworks (vLLM, SGLang, TensorRT-LLM) must independently reverse-engineer and maintain tool-calling parsers for N incompatible model formats, creating an unsustainable M×N maintenance burden that standardized declarative specs could eliminate.
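A declarative spec would collapse M×N to M+N: each model family publishes a small description of its tool-call wire format, and each framework ships one generic parser driven by that spec. The spec schema and marker strings below are hypothetical illustrations, not an existing standard.

```python
import json
import re

# Hypothetical declarative specs: each model family would ship one of these
# instead of every inference framework hand-writing a bespoke parser.
TOOL_CALL_SPECS = {
    "hermes":  {"start": "<tool_call>", "end": "</tool_call>", "payload": "json"},
    "mistral": {"start": "[TOOL_CALLS]", "end": "", "payload": "json"},
}

def parse_tool_calls(text, spec):
    """One generic parser driven by a spec: M frameworks x N formats -> M + N."""
    start = re.escape(spec["start"])
    end = re.escape(spec["end"]) if spec["end"] else r"$"
    pattern = start + r"(.*?)" + end
    calls = []
    for match in re.finditer(pattern, text, re.DOTALL):
        calls.append(json.loads(match.group(1)))  # spec says payload is JSON
    return calls

out = parse_tool_calls(
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>',
    TOOL_CALL_SPECS["hermes"],
)
```

Supporting a new model format then means adding one spec entry rather than patching N parsers across frameworks.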
Introspective Diffusion Language Models
Introspective Diffusion Language Models enable parallel token generation with a 2.9-4.1x speedup; an 8B model beats a 16B baseline by 26 points on AIME-24 without custom serving changes.
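The parallel-generation idea behind diffusion language models can be sketched as iterative unmasking: start from a fully masked sequence and, at each step, commit the k most confident predictions in parallel, finishing in length/k steps instead of one step per token. This toy denoiser and decoding loop are a generic masked-diffusion sketch, not the actual Introspective Diffusion algorithm.

```python
import random

MASK = -1

def toy_denoiser(seq):
    """Stand-in for the diffusion LM: a (token, confidence) proposal per position."""
    return [(i % 50, random.random()) for i in range(len(seq))]

def diffusion_decode(length, tokens_per_step):
    """Iterative parallel decoding: unmask the k most confident positions per step."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq:
        proposals = toy_denoiser(seq)
        masked = [i for i, t in enumerate(seq) if t == MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:tokens_per_step]:  # commit k tokens in parallel
            seq[i] = proposals[i][0]
        steps += 1
    return seq, steps

random.seed(0)
seq, steps = diffusion_decode(length=16, tokens_per_step=4)  # 4 steps, not 16
```

Autoregressive decoding would take 16 forward passes here; committing 4 tokens per pass takes 4, which is the mechanism behind speedups in the reported 2.9-4.1x range.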
NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)
Transformers v5: Simple model definitions powering the AI ecosystem