Dan Woods ran the 209GB Qwen3.5-397B MoE model at 5.5+ tokens/second on a 48GB MacBook M3 Max by streaming expert weights from SSD using Apple's "LLM in a Flash" technique. He used Claude Code with Karpathy's autoresearch pattern to autonomously run 90 experiments and generate optimized MLX, Objective-C, and Metal code, with Claude Opus 4.6 authoring most of the resulting paper. The final setup uses 4-bit expert quantization with only 5.5GB resident in RAM — a compelling proof of concept for agentic AI-driven systems research.
Autoresearching Apple's "LLM in a Flash" to run Qwen 397B locally
Apple's LLM in a Flash technique enables a 397B-parameter Qwen model to run on a MacBook M3 Max at 5.5 tokens/sec by streaming 4-bit quantized weights from SSD, leaving only 5.5GB resident in RAM.
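The core idea — keep only the routed experts' weights resident and page the rest in from flash on demand — can be illustrated with a minimal sketch. This is a hypothetical NumPy simulation of MoE expert streaming via a memory-mapped weight file, not Woods's actual MLX/Metal implementation; all names, shapes, and the routing logic here are illustrative assumptions.

```python
# Hypothetical sketch: experts live in a memory-mapped file on disk,
# and only the TOP_K experts a token is routed to are ever read,
# so resident memory stays a small fraction of total model size.
import numpy as np
import os
import tempfile

NUM_EXPERTS, D_IN, D_OUT, TOP_K = 8, 16, 16, 2  # toy sizes, not Qwen's

# Write dummy expert weights to disk to stand in for the model file.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
np.random.randn(NUM_EXPERTS, D_IN, D_OUT).astype(np.float16).tofile(path)

# Memory-map the file: the OS pages in only the experts we touch,
# which is the essence of "LLM in a Flash"-style SSD streaming.
experts = np.memmap(path, dtype=np.float16, mode="r",
                    shape=(NUM_EXPERTS, D_IN, D_OUT))

def moe_forward(x, router_logits):
    """Route a token to its TOP_K experts, reading only their weights."""
    top = np.argsort(router_logits)[-TOP_K:]       # chosen expert indices
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                            # softmax over chosen
    out = np.zeros(D_OUT, dtype=np.float32)
    for gate, idx in zip(gates, top):
        w = np.asarray(experts[idx], dtype=np.float32)  # paged in from SSD
        out += gate * (x @ w)
    return out

x = np.random.randn(D_IN).astype(np.float32)
y = moe_forward(x, np.random.randn(NUM_EXPERTS))
print(y.shape)  # (16,)
```

A real implementation would additionally store the experts 4-bit quantized and dequantize on load, and would overlap SSD reads with compute; this sketch only shows why per-token residency can stay far below total model size.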
Thursday, March 19, 2026, 12:00 PM UTC · 2 min read · Source: Simon Willison · By sys://pipeline
Tags
models