Run models that don't fit in RAM on your Mac. $0/month.
| Your Mac | RAM | What you can run | Speed |
|---|---|---|---|
| Any Mac | 8 GB | Qwen3.5-9B (Q4_K_M, 5.3 GB), 4K context | 16-20 tok/s |
| Any Mac | 16 GB | Qwen3.5-9B (Q4_K_M, 5.3 GB), 64K context | 16-20 tok/s |
| Mac mini M4 | 16 GB | Qwen3.5-35B-A3B (IQ2_M, 10.6 GB) | 30 tok/s |
| Mac mini M4 | 16 GB | Qwen3-30B-A3B Q4 (17.2 GB) via Expert Sniper | 4.33 tok/s |
| Mac mini M4 | 16 GB | Qwen3.5-35B-A3B Q4_K_M (22 GB) via Flash Streaming | 1.54 tok/s |
| Mac mini M4 | 16 GB | Qwen3.5-27B (16.1 GB) via Flash Streaming | 0.18 tok/s |
| Mac mini M4 Pro | 48 GB | 35B at full Q4 in RAM | 30+ tok/s |
"I wanted to run the Qwen 27B on my M2 16GB but failed. That's not possible, right?"
It is possible. We stream FFN weights from SSD — only 5.5 GB stays in RAM. The output is coherent, full 4-bit quality. It's slow (0.18 tok/s on a Mac mini M4) but the method works on any 16 GB Apple Silicon Mac. No 2-bit compression, no mmap thrashing, no swap death. See how it works.
The fastest option. Uses llama.cpp with a 2-bit quantization (IQ2_M) that fits entirely in RAM.
brew install llama.cpp
pip3 install rich ddgs --break-system-packages
# Download model (10.6 GB)
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('unsloth/Qwen3.5-35B-A3B-GGUF',
'Qwen3.5-35B-A3B-UD-IQ2_M.gguf', local_dir='$HOME/models/')
"
# Start server + agent
llama-server \
--model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 12288 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -np 1 -t 4
python3 agent.pypython3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('unsloth/Qwen3.5-9B-GGUF',
'Qwen3.5-9B-Q4_K_M.gguf', local_dir='$HOME/models/')
"
llama-server \
--model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
--port 8000 --host 127.0.0.1 \
--flash-attn on --ctx-size 65536 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--n-gpu-layers 99 --reasoning off -t 4
python3 agent.pyRun models that genuinely don't fit in RAM at full quality. No 2-bit compression. No mmap thrashing.
Every number below was measured on a 16 GB Mac mini M4. Nothing estimated.
| Model | Total Size | RAM Used | Speed | Quality |
|---|---|---|---|---|
| Qwen3-32B (dense) | 18.4 GB | 4.5 GB | 0.15 tok/s | Full 4-bit |
| Qwen3.5-27B (dense hybrid) | 16.1 GB | 5.5 GB | 0.18 tok/s | Full 4-bit |
| Qwen3.5-35B-A3B (MoE) | 22 GB | 1.42 GB | 1.54 tok/s | Full Q4_K_M |
The MoE model is 10x faster because only 8 of 256 experts activate per token — we load only those 8 from SSD (~14 MB) instead of the full layer (~460 MB).
Split the model by access pattern:
Pinned in RAM (4-6 GB): Attention weights, embeddings, norms, KV cache. Loaded once, stays forever.
Streamed from SSD per token: FFN weights (the bulk of the model). Loaded layer-by-layer, used for one matmul, discarded. Memory never grows.
For each token:
For each layer:
1. Run attention (from RAM — instant)
2. Load FFN weights from SSD (~165-221 MB)
3. Run FFN matmul on GPU
4. Discard FFN weights — memory stays flat
For MoE models, step 2 loads only the 8 active experts (~14 MB), not all 256. That's why MoE is 10x faster.
Interactive agent with web search, shell commands, and chain-of-thought. The 22 GB model on a 16 GB Mac.
Requires pre-built stream files — see research/flash-streaming/ for the split/rebuild tools.
cd research/flash-streaming
python3 moe_agent.pycd research/flash-streaming
pip3 install mlx-lm transformers --break-system-packages
# One-time: download model (~16 GB) and split for streaming
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('mlx-community/Qwen3.5-27B-4bit', local_dir='$HOME/models/qwen35-27b-mlx-4bit')
"
python3 split_dense_27b.py
# Run
python3 flash_stream_27b.pyA research prototype that verifies 8 tokens in one forward pass. Instead of loading experts for each token separately, it computes the set union of active experts across all 8 tokens (~27 unique experts per layer instead of 64) and loads them once.
cd research/flash-streaming
python3 batched_moe.pyThis is verification speed (checking draft tokens), not generation speed. Useful for speculative decoding.
| Command | Action |
|---|---|
/agent |
Agent mode (default) — search, shell, chat |
/raw |
Direct streaming, no tools |
/search <q> |
Quick web search |
/stats |
Session statistics |
/clear |
Reset conversation |
/quit |
Exit |
| File | What it does |
|---|---|
agent.py |
Production agent — routes to search/shell/chat via llama.cpp |
chat.py |
Simple streaming chat with llama.cpp |
dashboard.py |
Real-time monitoring dashboard for llama.cpp |
setup.sh |
One-command install (llama.cpp + model download + config) |
config.example.json |
Example configuration |
web/ |
Web UI (server.py + index.html) |
| File | What it does |
|---|---|
mlx_engine.py |
MLX inference server with 64K context and KV cache persistence |
kv_cache.py |
KV cache save/load for session persistence |
paged_inference.py |
Paged attention experiment |
turboquant.py |
TurboQuant quantization experiments |
benchmark.py |
Benchmarking tools |
The research journey: we built, measured, and iterated. Each file represents a step.
| File | What it does | Key discovery |
|---|---|---|
flash_stream.py |
v1: mmap-based streaming (0.12 tok/s) | Split-model architecture works |
flash_stream_v2.py |
v2: F_NOCACHE direct I/O (0.15 tok/s) | 27% faster than mmap |
flash_stream_27b.py |
27B dense streaming (0.18 tok/s) | Method works on dense + hybrid SSM models |
flash_moe.py |
MoE expert-level streaming engine | Only load active experts from SSD |
moe_agent.py |
Working 35B agent (1.54 tok/s, 1.42 GB) | Coherent 22 GB model on 16 GB Mac |
batched_moe.py |
Batched Union-of-Experts (5.1 tok/s) | ~27 unique experts/layer, not 64 |
expert_io.py |
F_NOCACHE + pread expert reader (8 threads) | Saturate NVMe queue depth |
direct_io.py |
F_NOCACHE + pread for dense FFN layers | Bypass macOS Unified Buffer Cache |
split_mlx_model.py |
Split 35B MoE into pinned + per-layer experts | 16KB alignment for DART IOMMU |
split_dense_27b.py |
Split 27B dense into pinned + per-layer FFN | Same technique, different architecture |
convert_split.py |
GGUF → split safetensors conversion | GGUF is column-major |
convert_aligned.py |
Safetensors → 16KB-aligned binary | Required for F_NOCACHE direct I/O |
dequant_gguf.py |
Custom Q4_K/Q6_K dequantization (numpy) | MLX can't read GGUF Q4_K blocks |
rebuild_pinned.py |
Rebuild pinned weights from MLX golden model | Fix SSM weight dtype issues |
flash_agent.py |
32B dense streaming agent (early version) | Proof of concept |
flash_stream_batched.py |
Batched eval experiment | Proved eval sync isn't the bottleneck |
README.md |
Detailed research writeup with all measurements | Full methodology and results |
These are things we learned the hard way. Each links to the file where it was discovered/fixed.
-
GGUF is column-major —
flat.reshape(ne[1], ne[0]), not.reshape(ne[0], ne[1]).T. The wrong reshape gives correct shapes but garbage output. (dequant_gguf.py,convert_split.py) -
MLX 4-bit is 15% larger than expected — scales + biases at group_size=64 add 0.031 bytes/param. A 32B model is 18.4 GB, not 16 GB. This is why the model doesn't fit in 16 GB RAM even at 4-bit. (
research/flash-streaming/README.md) -
nn.quantize()silently skips MoE experts —SwitchLinearis not a subclass ofnn.Linear. You must pass aclass_predicatethat explicitly includes it. Without this, experts run in float16 and produce garbage. (moe_agent.py) -
gather_qmmeliminates accumulator divergence — 8 separatequantized_matmulcalls compound rounding errors across 40 layers. One batchedgather_qmmcall matches the reference model exactly. (batched_moe.py,flash_moe.py) -
F_NOCACHE is 27% faster than mmap — macOS Unified Buffer Cache adds overhead for sequential streaming workloads.
fcntl(F_NOCACHE)+os.pread()with 16KB alignment bypasses it entirely. (direct_io.py,expert_io.py) -
setattronnn.Moduleleaks memory — Injecting FFN weights into the model tree viasetattrprevents garbage collection. Memory grew 3.6 GB per 16 layers. Fix: usemx.quantized_matmuldirectly on loaded arrays, never touch the model tree. (flash_stream.py) -
Batching layers doesn't help — We tested 8-layer batches (16 evals vs 128). Zero speedup. The bottleneck is SSD I/O, not GPU sync overhead. (
flash_stream_batched.py)
┌──────────────────────────────────────────────┐
│ agent.py — LLM-as-Router │
│ search / shell / chat │
├──────────┬───────────────────────────────────┤
│ llama.cpp│ MLX backend │
│ (fast) │ + KV cache save/load │
│ │ + Flash Streaming (out-of-core) │
│ │ + MoE Expert Sniper (SSD) │
├──────────┴───────────────────────────────────┤
│ Apple Silicon — Unified Memory + NVMe SSD │
└──────────────────────────────────────────────┘
- Qwen3.5 — the models
- llama.cpp — inference engine
- MLX — Apple's ML framework
- Unsloth — GGUF quantizations
- mlx-community — pre-converted MLX models
MIT