Research paper characterizing WebGPU dispatch overhead for LLM inference across four GPU vendors and multiple browsers. Reveals that dispatch overhead (24–71 μs, depending on backend) is the primary bottleneck, introduces torch-webgpu (a PyTorch backend for WebGPU), and provides an open-source benchmarking methodology that corrects prior overestimates by ~20×.
Research
Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
WebGPU dispatch overhead (24–71 μs) is the true LLM inference bottleneck in browsers, not compute. torch-webgpu provides a PyTorch backend for WebGPU, and the paper's open-source benchmarking methodology shows that prior benchmarks overestimated dispatch costs by roughly 20×.
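The paper's actual harness is open source and not reproduced in this summary. As a rough illustration only, below is a minimal TypeScript sketch of one common way to probe per-dispatch overhead in a browser: submit many no-op compute dispatches, each in its own command buffer, and average the wall-clock time. The no-op shader, dispatch count, and one-submit-per-dispatch strategy are all illustrative assumptions, not the paper's methodology.

```ts
// Hypothetical sketch (not the paper's harness): estimate average per-dispatch
// cost by timing n no-op compute dispatches, each submitted individually.
async function estimateDispatchOverheadUs(n = 1000): Promise<number> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");
  const device = await adapter.requestDevice();

  // No-op shader: measured time is dominated by dispatch cost, not compute.
  const module = device.createShaderModule({
    code: "@compute @workgroup_size(1) fn main() {}",
  });
  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module, entryPoint: "main" },
  });

  const runOnce = () => {
    const encoder = device.createCommandEncoder();
    const pass = encoder.beginComputePass();
    pass.setPipeline(pipeline);
    pass.dispatchWorkgroups(1);
    pass.end();
    device.queue.submit([encoder.finish()]);
  };

  // Warm up once so pipeline compilation is excluded from the timed region.
  runOnce();
  await device.queue.onSubmittedWorkDone();

  const t0 = performance.now();
  for (let i = 0; i < n; i++) runOnce();
  await device.queue.onSubmittedWorkDone();
  const elapsedMs = performance.now() - t0;

  return (elapsedMs * 1000) / n; // average microseconds per dispatch
}
```

Note that this wall-clock approach conflates CPU-side encoding, queue submission, and GPU execution; the paper's methodology, which corrects the ~20× overestimates in prior work, presumably separates these components, so treat this sketch as an order-of-magnitude probe at best.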
Monday, April 6, 2026, 12:00 PM UTC · 2 min read · Source: arXiv CS.LG (Machine Learning)
Tags
research