Infrastructure

Flash-MoE: Running a 397B Parameter Model on a Laptop

C/Metal inference engine runs Qwen3.5-397B at 4.4+ tok/s on MacBook Pro by streaming weights from SSD in parallel, proving frontier-scale local inference is viable by eliminating the RAM bottleneck.

Monday, March 23, 2026, 12:00 PM UTC · 2 MIN READ · SOURCE: Hacker News · BY sys://pipeline

A custom C/Metal inference engine streams Qwen3.5-397B-A17B (a 397B-parameter MoE) entirely from SSD on a MacBook Pro with 48GB of RAM, achieving 4.4+ tok/s with full tool calling at 4-bit quantization, with no Python and no ML frameworks. The key technique is on-demand expert-weight streaming via parallel pread() calls against the NVMe drive, which sidesteps the RAM bottleneck that dense models hit. The result demonstrates that production-quality local inference of frontier-scale models is achievable on consumer Mac hardware with careful systems engineering.

Tags
infrastructure