Infrastructure

Flash-MoE: Running a 397B Parameter Model on a Laptop

C/Metal inference engine runs Qwen3.5-397B at 4.4+ tok/s on MacBook Pro by streaming weights from SSD in parallel, proving frontier-scale local inference is viable by eliminating the RAM bottleneck.

Monday, March 23, 2026, 12:00 PM UTC · 2 MIN READ · SOURCE: Hacker News · BY sys://pipeline

A custom C/Metal inference engine streams Qwen3.5-397B-A17B (a 397B-parameter MoE) entirely from SSD on a MacBook Pro with 48GB of RAM, achieving 4.4+ tok/s with full tool calling at 4-bit quantization, with no Python and no ML frameworks. The key technique is on-demand expert-weight streaming via parallel pread() calls against the NVMe drive, which sidesteps the RAM bottleneck that dense models hit. The result demonstrates that production-quality local inference of frontier-scale models is achievable on consumer Mac hardware with careful systems engineering.

Tags
infrastructure