Running Local LLMs: From First Run to Fine-Tuned

I clipped this thread by Michael Guo (@michaelzsguo) because it solves a specific pain: when a local model won’t run well, most people blame the model. Guo argues you’re probably stuck on one of five layers, and each one has its own fix.

His framework goes from zero to tuned across five layers: hardware bandwidth, memory math, runtime choice, model selection, and quantization. Published May 9, 2026.

The Five Layers

Hardware. The reflex is to look at GPU compute (FLOPS). For local inference, that instinct is wrong: generating tokens is memory-bandwidth-limited, not compute-limited. With a dense model, every generated token reads the full set of weights from memory into the compute units. Bandwidth determines how fast that read happens, which caps your tokens per second. Apple Silicon’s unified memory architecture gives it an edge here - the GPU addresses one large pool of fast memory directly, so a model that would overflow a discrete card’s VRAM doesn’t have to cross the much slower PCIe link. M4 Max hits 546 GB/s on the high tier. M3 Ultra hits 819 GB/s. RTX 4090 is ~1008 GB/s for comparison, but you’re stuck at 24GB VRAM.
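A rough way to see the bandwidth cap is to divide memory bandwidth by the size of the weights. The sketch below uses the bandwidth figures quoted above; the 5 GB model size and the function name are my own assumptions, and real throughput lands below this ceiling because of KV Cache reads and other overhead.

```python
# Bandwidth figures (GB/s) as quoted in the text above.
BANDWIDTH_GB_S = {"M4 Max": 546, "M3 Ultra": 819, "RTX 4090": 1008}


def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound for a dense model: each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights_gb = 5.0  # assumed: roughly an 8B model at Q4
    for name, bandwidth in BANDWIDTH_GB_S.items():
        ceiling = max_tokens_per_second(bandwidth, weights_gb)
        print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Treat the result as a ceiling for comparing machines, not as a prediction of measured speed.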

Memory. Two numbers matter: weights and KV Cache. Weights are straightforward: 8B parameters at FP16 is 16GB. Q4 quantization drops that to ~5GB. The KV Cache is the hidden variable. It scales linearly with context size. Running 128K context instead of 4K is 32x the KV Cache. The most common beginner mistake is assuming memory is fixed once the model loads. It isn’t. Context size moves the needle most.
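The memory math can be sketched in a few lines. The architecture numbers below (32 layers, 8 KV heads, head dimension 128, FP16 cache) are assumptions for a typical 8B-class model with grouped-query attention, not figures from the thread; read the real values from your model's config.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB: parameters times bits per weight, converted to bytes."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV Cache in GB. The leading 2 covers one key and one value tensor per layer."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


if __name__ == "__main__":
    arch = dict(n_layers=32, n_kv_heads=8, head_dim=128)  # assumed architecture
    print(f"FP16 weights, 8B: {weights_gb(8, 16):.1f} GB")
    for context in (4_096, 131_072):
        print(f"KV Cache at {context} tokens: {kv_cache_gb(context, **arch):.2f} GB")
```

Because context length is a plain multiplier in the formula, going from 4K to 128K scales the cache by exactly 32x regardless of the architecture.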

Runtime. Most people are really using Ollama before they’re using the model. Ollama decides how the model loads, which chat template applies, how many layers go to GPU. For more control, llama.cpp lets you set everything explicitly: context size, GPU layers, Flash Attention, chat template. The --jinja flag is worth remembering - it tells llama.cpp to use the model’s bundled chat template instead of the default, which fixes a lot of mysterious quality problems.
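As an illustration of setting things explicitly, here is a small sketch that assembles a llama-server command line. The helper name and default values are hypothetical; the flags shown (-m, --ctx-size, --n-gpu-layers, --jinja) exist in recent llama.cpp builds as far as I know, but check `llama-server --help` for your version. I left Flash Attention out because its flag syntax has changed between releases.

```python
import shlex


def llama_server_command(model_path: str, ctx_size: int = 8192,
                         gpu_layers: int = 99) -> list[str]:
    """Build an argv list for llama-server with explicit settings (hypothetical helper)."""
    return [
        "llama-server",
        "-m", model_path,                     # path to the GGUF file
        "--ctx-size", str(ctx_size),          # context window; drives KV Cache size
        "--n-gpu-layers", str(gpu_layers),    # layers to offload to the GPU
        "--jinja",                            # use the model's bundled chat template
    ]


if __name__ == "__main__":
    # Prints the command instead of running it, so nothing needs to be installed.
    print(shlex.join(llama_server_command("Qwen3-30B-A3B-Q4_K_M.gguf")))
```

Building the list first makes it easy to pass to `subprocess.run` later without shell-quoting problems.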

Model selection. The GGUF filename tells you most of what you need: Qwen3-30B-A3B-Q4_K_M.gguf breaks down to model family (Qwen3), total parameters (30B), active parameters (3B for MoE), quantization type (Q4_K_M). Base vs Instruct vs Chat matters enormously. Instruct models are fine-tuned for instruction following but are sensitive to prompt format - the wrong template produces mysteriously bad output that gets misdiagnosed as “this model is bad.”
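The filename breakdown can be sketched as a regular expression. This pattern is my own guess at the common naming habit, not an official GGUF rule, so expect it to miss files that are named differently.

```python
import re

# Assumed pattern: family-totalB[-AactiveB][-Variant]-QUANT.gguf
GGUF_NAME = re.compile(
    r"^(?P<family>.+?)"
    r"-(?P<total>\d+(?:\.\d+)?[BM])"
    r"(?:-A(?P<active>\d+(?:\.\d+)?[BM]))?"   # present only for MoE models
    r"(?:-(?P<variant>[A-Za-z]+))?"           # e.g. Instruct, Chat, Base
    r"-(?P<quant>Q\d\w*|F16|BF16|F32)"
    r"\.gguf$"
)


def parse_gguf_name(filename: str):
    """Return the filename's parts as a dict, or None if it doesn't match."""
    match = GGUF_NAME.match(filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_gguf_name("Qwen3-30B-A3B-Q4_K_M.gguf"))
```

A missing `active` field suggests a dense model, and a missing `variant` is a prompt to check the model card before assuming it is instruction-tuned.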

Quantization. K-quants (the K in Q4_K_M) quantize weights in small blocks, each with its own scale, grouped into larger super-blocks, and they consistently outperform the older uniform formats at a similar bit width. Practical rule: memory to spare, use Q8. Quality-sensitive with limited memory, Q5_K_M or Q6_K. Everyday use, Q4_K_M. Memory forces your hand, Q3. A counterintuitive result: a Q4 8B model outperforms an FP16 3B model on most tasks. Choosing a bigger model at reasonable quantization beats choosing a smaller model at full precision.
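One way to turn the practical rule into something mechanical is to pick the highest-precision quant whose weights fit a memory budget. This is my simplification, not Guo's procedure: the bits-per-weight values are approximate, and the 2 GB reserve for KV Cache and runtime overhead is an arbitrary assumption.

```python
# Approximate effective bits per weight, highest precision first (assumed values).
APPROX_BITS_PER_WEIGHT = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.9),
    ("Q3_K_M", 3.9),
]


def pick_quant(params_billion: float, budget_gb: float, reserve_gb: float = 2.0):
    """Return the highest-precision quant that fits, or None if none does."""
    usable_gb = budget_gb - reserve_gb  # leave room for KV Cache and overhead
    for name, bits in APPROX_BITS_PER_WEIGHT:
        if params_billion * bits / 8 <= usable_gb:
            return name
    return None


if __name__ == "__main__":
    for budget in (16, 8, 7):
        print(f"8B model, {budget} GB budget: {pick_quant(8, budget)}")
```

A `None` result means no listed quant of that model fits, which is the cue to choose a smaller model rather than keep lowering precision.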

What I Found Most Useful

The section on Mixture of Experts cleared up something I’d been fuzzy on. MoE models like Qwen3-30B-A3B have total parameters (30B) that determine memory requirements, and active parameters (~3B) that determine compute per inference. You load all 30B weights into memory but only compute on ~3B per pass. The speed isn’t literally 3B speed (all 30B weights must stay resident, and routing picks different experts from token to token), but it’s meaningfully faster than a dense 30B.
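The split between total and active parameters can be put into numbers. The sketch below assumes the common rule of thumb of about 2 FLOPs per active parameter per token and an approximate 4.85 bits per weight for Q4_K_M; both are my assumptions rather than figures from the thread.

```python
def moe_footprint(total_params_b: float, active_params_b: float,
                  bits_per_weight: float) -> tuple[float, float]:
    """Return (resident memory in GB, approximate GFLOPs per generated token)."""
    resident_gb = total_params_b * bits_per_weight / 8  # all experts stay loaded
    gflops_per_token = 2 * active_params_b              # only active params compute
    return resident_gb, gflops_per_token


if __name__ == "__main__":
    moe = moe_footprint(30, 3, 4.85)     # Qwen3-30B-A3B at roughly Q4_K_M
    dense = moe_footprint(30, 30, 4.85)  # hypothetical dense 30B for comparison
    print(f"MoE:   {moe[0]:.1f} GB resident, ~{moe[1]:.0f} GFLOPs/token")
    print(f"Dense: {dense[0]:.1f} GB resident, ~{dense[1]:.0f} GFLOPs/token")
```

Both models need the same memory, while the MoE does about a tenth of the arithmetic per token under these assumptions.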

Also worth remembering: for local agent use and tool calling, latency and instruction-following stability often matter more than benchmark rankings. A responsive 14B that reliably follows instructions beats a slow 70B for day-to-day work.

Crepi il lupo! 🐺