Qwen 3.5-9B at Q4_K_S quant runs at ~40 tokens/sec with thinking, tool use, and 128K context on a 24GB M4 MacBook Pro via LM Studio.
Key Takeaways
Larger models such as Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B technically fit in 24GB but were unusable in practice; Qwen 3.5-9B Q4 is the sweet spot.
LM Studio requires manually injecting {%- set enable_thinking = true %} into the prompt template to enable Qwen thinking mode.
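As a sketch of where that line goes: in LM Studio's per-model Jinja prompt template editor, the flag is set before the template first tests it. The placement shown here is an assumption; the exact surrounding lines vary by model and template version.

```jinja
{#- Hypothetical placement: add this near the top of the model's Jinja
    prompt template, before any {%- if enable_thinking %} check.
    Everything else in the template stays unchanged. -#}
{%- set enable_thinking = true %}
```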
Recommended inference params for coding: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, repetition_penalty=1.0.
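The recommended parameters map onto a standard chat-completions request body. A minimal sketch, assuming LM Studio's OpenAI-compatible server; the model identifier is illustrative, and `top_k`, `min_p`, and `repetition_penalty` are non-standard OpenAI fields that LM Studio-style local servers commonly accept as pass-through:

```python
import json

# Recommended coding parameters from the post; the model name is a
# placeholder -- use whatever identifier LM Studio reports for the
# loaded model.
payload = {
    "model": "qwen3.5-9b",  # illustrative, not a confirmed model ID
    "messages": [
        {"role": "user", "content": "Write a binary search in Python."}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,                # non-standard OpenAI field
    "min_p": 0.0,               # non-standard OpenAI field
    "repetition_penalty": 1.0,  # non-standard OpenAI field
}

# Sending it requires LM Studio running locally, e.g. a POST to
# http://localhost:1234/v1/chat/completions with this JSON body.
body = json.dumps(payload)
```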
Works with both pi and OpenCode via LM Studio’s OpenAI-compatible endpoint at localhost:1234; OpenCode config sets context_length=131072, max_tokens=32768.
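A config fragment along these lines wires OpenCode up to the local endpoint. Only the base URL and the context_length/max_tokens values come from the post; the nesting and key names here are assumptions, so check OpenCode's current config documentation before copying:

```json
{
  "provider": {
    "lmstudio": {
      "options": { "baseURL": "http://localhost:1234/v1" },
      "models": {
        "qwen3.5-9b": {
          "context_length": 131072,
          "max_tokens": 32768
        }
      }
    }
  }
}
```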
Best used interactively with step-by-step guidance; one-shot complex app generation fails and pegs the CPU without useful output.
Hacker News Comment Review
Confusion arose over whether a 24GB M4 exists; confirmed it does via Apple’s spec page, though the base M4 MacBook Pro ships with less.
The article already states the ~40 tokens/sec figure, so the commenter's request for token-speed data was answered in the post itself.