Open-source CUDA kernel optimization project achieving 207 tok/s for Qwen3.5-27B on RTX 3090 through hand-written kernels, speculative decoding, and quantization, with megakernel implementation for small models.
We got 207 tok/s with Qwen3.5-27B on an RTX 3090
Hand-written CUDA kernels and speculative decoding achieve 207 tok/s for Qwen3.5-27B on a consumer RTX 3090, showing that open-source optimization can rival commercial inference systems on commodity hardware.
Monday, April 20, 2026, 12:00 PM UTC · 2 min read · Source: Hacker News · By sys://pipeline
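The article names speculative decoding as one of the key techniques but includes no code. As a minimal sketch of the idea (not the project's actual implementation), a cheap draft model proposes several tokens per step and the expensive target model verifies them in one pass, accepting each with probability min(1, p_target / p_draft). The toy `draft_model` and `target_model` below are hypothetical stand-ins over a four-token vocabulary, purely for illustration:

```python
import random

random.seed(0)

# Toy stand-ins for a small draft model and a large target model.
# Each "model" maps a context (token list) to a next-token distribution.
VOCAB = [0, 1, 2, 3]

def draft_model(context):
    # Hypothetical cheap model: uniform over the vocabulary.
    return {t: 0.25 for t in VOCAB}

def target_model(context):
    # Hypothetical expensive model: strongly prefers one token.
    probs = {t: 0.1 for t in VOCAB}
    probs[len(context) % 4] = 0.7
    return probs

def speculative_step(context, k=4):
    """Draft k tokens greedily with the cheap model, then verify each
    against the target model using the standard rejection rule."""
    # Phase 1: draft k candidate tokens.
    draft, ctx = [], list(context)
    for _ in range(k):
        probs = draft_model(ctx)
        tok = max(probs, key=probs.get)  # greedy draft
        draft.append(tok)
        ctx.append(tok)

    # Phase 2: verify drafts; accept with prob min(1, p_target / p_draft).
    accepted, ctx = [], list(context)
    for tok in draft:
        p_t = target_model(ctx)[tok]
        p_d = draft_model(ctx)[tok]
        if random.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, fall back to the target model's choice and stop.
            fallback = max(target_model(ctx), key=target_model(ctx).get)
            accepted.append(fallback)
            ctx.append(fallback)
            break
    return accepted

out = speculative_step([0, 1], k=4)
print(out)
```

The speedup comes from the verification pass: the large model scores all k drafted tokens in a single forward pass, so each accepted draft token costs roughly one cheap-model step instead of one expensive-model step.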
Tags
infrastructure
/// RELATED
Models · Apr 24
DeepSeek's new models are so efficient they'll run on a toaster ... by which we mean Huawei's NPUs
DeepSeek's open-weights V4 matches frontier model performance while slashing inference costs through novel efficiency techniques, now optimized for Huawei's Ascend NPUs—a major competitive threat to proprietary incumbents.
War · 4d ago
Fast16 Malware
Fast16, a newly uncovered pre-Stuxnet US state-sponsored malware, sabotaged Iranian computational research by silently corrupting high-precision physics simulations, revealing early sophistication in cyber-warfare tooling aimed at critical academic and research infrastructure.