Dan Woods ran the 209GB Qwen3.5-397B MoE model at 5.5+ tokens/second on a 48GB MacBook M3 Max by streaming expert weights from SSD using Apple's "LLM in a Flash" technique. He used Claude Code with Karpathy's autoresearch pattern to autonomously run 90 experiments and generate optimized MLX, Objective-C, and Metal code, with Claude Opus 4.6 authoring most of the resulting paper. The final setup uses 4-bit expert quantization with only 5.5GB resident in RAM — a compelling proof of concept for agentic AI-driven systems research.
Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally
Apple's LLM in a Flash technique enables a 397B-parameter Qwen model to run on a MacBook M3 Max at 5.5 tokens/sec by streaming 4-bit quantized weights from SSD, leaving only 5.5GB resident in RAM.
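The core idea — keep only the routed experts' weights resident and page the rest in from flash on demand — can be illustrated with a minimal sketch. This is a hypothetical NumPy simulation of MoE expert streaming via a memory-mapped weight file, not Woods's actual MLX/Metal implementation; all names, shapes, and the routing logic here are illustrative assumptions.

```python
# Hypothetical sketch: experts live in a memory-mapped file on disk,
# and only the TOP_K experts a token is routed to are ever read,
# so resident memory stays a small fraction of total model size.
import numpy as np
import os
import tempfile

NUM_EXPERTS, D_IN, D_OUT, TOP_K = 8, 16, 16, 2  # toy sizes, not Qwen's

# Write dummy expert weights to disk to stand in for the model file.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
np.random.randn(NUM_EXPERTS, D_IN, D_OUT).astype(np.float16).tofile(path)

# Memory-map the file: the OS pages in only the experts we touch,
# which is the essence of "LLM in a Flash"-style SSD streaming.
experts = np.memmap(path, dtype=np.float16, mode="r",
                    shape=(NUM_EXPERTS, D_IN, D_OUT))

def moe_forward(x, router_logits):
    """Route a token to its TOP_K experts, reading only their weights."""
    top = np.argsort(router_logits)[-TOP_K:]       # chosen expert indices
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                            # softmax over chosen
    out = np.zeros(D_OUT, dtype=np.float32)
    for gate, idx in zip(gates, top):
        w = np.asarray(experts[idx], dtype=np.float32)  # paged in from SSD
        out += gate * (x @ w)
    return out

x = np.random.randn(D_IN).astype(np.float32)
y = moe_forward(x, np.random.randn(NUM_EXPERTS))
print(y.shape)  # (16,)
```

A real implementation would additionally store the experts 4-bit quantized and dequantize on load, and would overlap SSD reads with compute; this sketch only shows why per-token residency can stay far below total model size.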
Thursday, March 19, 2026, 12:00 PM UTC · 2 min read · Source: Simon Willison · By sys://pipeline
Tags
models