AI Economics

Local AI coding showdown on a 36 GB Mac: Gemma vs Qwen vs North

Last updated 2026-06-13 · ByteCosts

Local AI coding showdown on a 36 GB Mac: Gemma vs Qwen vs North explains Three 4-bit local coding models on an M4 Max, graded by a harness that executes their code. Measured speeds, MTP tuning, a 15-task gauntlet, BFCL and LiveCodeBench runs, and three pelicans. This ByteCosts research article explains the cost mechanics behind the headline, turns the pattern into budgeting questions, and points readers toward calculators that can model the same issue with their own workload. Read it when you need a finance-readable explanation of AI Economics before choosing a model, cloud platform, subscription, or optimization path. The static HTML includes the summary, article body, tables, related tools, and citation before JavaScript runs.

Apply this concept - Prompt cache savings calculator: break-even savings →

Summary

No cloud. No API key. Three 4bit models, Gemma 4, Qwen3.6, and Cohere's NorthMiniCode, running on an M4 Max via llama.cpp, graded by a harness that actually executes their code. Every number below was measured on the machine, not copied from a benchmark table.

Published on ByteCosts: know what AI actually costs before the invoice.

The result, up front 🟠 NorthMiniCode (Cohere, 30BA3B), the dedicated coding model, went 15/15, the only one to write a working quine (s='s=%r;print(s%%s)';print(s%s)), and was the fastest endtoend. Then it drew a pelican standing next to the bicycle instead of riding it. 🟢 Gemma 4 (26BA4B), fastest tokens (106 t/s with MTP), multimodal, 14/15. Asked for a quine, it generated 16,384 tokens of reasoning and never finished. 🔵 Qwen3.6 (35BA3B), drew the best pelican (in motion, on the bike), 14/15, but the slowest by far, and its quine ran fine and was simply wrong. Plus two gotchas that ate hours: xet downloads were ~10× slower than plain HTTPS, and mmap off an exFAT disk hangs llama.cpp for ~50 minutes (fix: nommap). It's all reproducible below, build, models, benchmarks, the gauntlet, and the pelicans.

Same prompt, three local models, one laptop. Left to right: balanced, best, and "wait, the bird fell off." Full story below.

Article body

Published on ByteCosts: know what AI actually costs before the invoice.

Same prompt, three local models, one laptop. Left to right: balanced, best, and "wait, the bird fell off." Full story below.

Key concepts, in plain terms

New to local LLMs? Here is the jargon this piece uses, in one place. Skip to the machine if these are already familiar.

llama.cpp: opensource C/C++ engine that runs LLMs on your own hardware (Apple's Metal GPU on a Mac). It serves all three models here, with no cloud. GGUF: the singlefile format llama.cpp loads a model from (weights plus metadata). Quantization (4bit, Q4KXL): storing the model's numbers at lower precision (4 bits instead of 16) so a model that would otherwise need roughly 50 GB fits in 16 to 21 GB, for a small accuracy cost. UDQ4KXL is Unsloth's tuned 4bit recipe. MoE, "26BA4B" (active params): Mixture of Experts. The model holds 26B parameters but routes each token through only about 4B of them, so it computes like a 4B model while knowing like a 26B one. That is why it runs fast on a laptop. MTP / speculative decoding: a lossless speedup. A small fast draft proposes the next few tokens and the big model verifies them at once; accepted guesses are free. The output is identical to running without it. KV cache: the attention state the model keeps while generating. It grows with the context window and consumes RAM, which is why quantizing it (q8) buys headroom. mmap (nommap): by default llama.cpp memorymaps the model file and pages it in on demand; on an exFAT disk that is pathologically slow, so nommap reads it into RAM once. Reasoning model: a model that emits hidden "thinking" tokens before its answer; llama.cpp returns that thinking separately from the final reply. Multimodal projector (mmproj): a small addon that turns image pixels into tokens, giving a text model vision so it can read screenshots. quine: a program that prints its own exact source code, a classic selfreferential trap. The pelican test: Simon Willison's informal eval, "draw an SVG of a pelican riding a bicycle," which probes spatial and creative reasoning. BFCL (Berkeley FunctionCalling Leaderboard): a benchmark for tool and function calling, picking the right tool and filling its arguments. LiveCodeBench: a contaminationfree coding benchmark; its codeexecution task asks the model to predict a snippet's exact output.

The machine table


Chip	Apple M4 Max
Unified memory	36 GB
macOS	26.5 (build 25F71)
Model storage	external SSD at /Volumes/t800, exFAT, 940 GB free
llama.cpp build dir	internal APFS (~/Development/modelproject)

The machine

Note the split: GGUFs live on the big external exFAT SSD; llama.cpp itself is built and run from the internal disk. That split matters, see the exFAT gotcha below.

What we're building

llama.cpp built with Metal Gemma 4 26BA4B (MoE: 26B total, ~4B active) in GGUF, Unsloth UDQ4KXL An MTP draft model for speculative decoding (spectype draftmtp) The multimodal projector so the model can read screenshots Pi as the terminal coding agent, talking to llama.cpp's OpenAIcompatible server Then two more contenders, Qwen3.6 35BA3B and Cohere's NorthMiniCode 30BA3B, for a threeway headtohead: the same 16 challenges, each graded by a harness that runs the code

The full Gemma walkthrough (Steps 17) is the reproducible recipe; Qwen and North reuse it with the deltas called out. Skip to the showdown if you just want the verdict.

Step 1: Build llama.cpp (Metal)

Built against commit 57fe1f0 (20260613). cmake picked up Metal + Accelerate. The binaries we use: build/bin/llamacli, llamaserver, llamabench.

Step 2: Download the model (and the first gotcha: xet)

Files needed from unsloth/gemma426BA4BitGGUF:

The obvious command, hf download unsloth/gemma426BA4BitGGUF localdir …, crawled at ~0.8 MB/s and kept slowing down (Fetching 3 files: 0%), ETA in hours. That's the xet transfer protocol. A raw singleconnection HTTPS test against the same CDN measured 9.83 MB/s, so the connection was fine; xet was the bottleneck. HFXETHIGHPERFORMANCE=1 did not help.

Fix: disable xet, keep hf's hashverified downloader:

Immediately jumped to 820 MB/s. (Your mileage with xet may vary by region/peering, but if it's slow, HFHUBDISABLEXET=1 is the first thing to try.)

Step 2.5: The big one: exFAT + mmap = a 50minute hang

First benchmark attempt just… sat there. After 49 minutes the process was still at 97% on a single core with only 1.4 GB resident of the 16 GB model.

Cause: llama.cpp memorymaps the GGUF by default and pages it in lazily on access. Random page faults serviced by macOS's exFAT (fskit) driver are pathologically slow, so the model never finished loading.

Fix: nommap (one sequential read into RAM instead of random page faults):

Tradeoff: the whole model sits in RAM (16 GB of 36 GB here, fine, no swap). If you keep models on internal APFS, mmap is fine and you don't need this. The hang is specific to mmapoffexFAT. Every llamacli/llamaserver command below uses nommap.

Step 3: Benchmark methodology

Same tool for every number so deltas are fair: llamacli, greedy, fixedlength output.

ignoreeos temp 0 → exactly 128 greedy tokens every run → reproducible, lownoise. st (singleturn) → generate one response and exit. Plain nocnv dropped into an interactive loop and hung in this build; st is the reliable noninteractive switch. This build prints a compact [ Prompt: X t/s Generation: Y t/s ] summary line. llamabench is great but cannot drive MTP speculative decoding (no spec flags), so it's only a baseline crosscheck; the MTP comparison uses llamacli for both sides.

Prompt (128 tokens generated): "Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases."

Step 4: Baseline vs MTP, and tuning specdraftnmax

Generation throughput, median of 3 interleaved rounds (raw runs in brackets):

Step 4: Baseline vs MTP, and tuning specdraftnmax table

Setup	Generation tok/s	vs baseline
Baseline (main only)	75.1 [74.6, 75.1, 75.9]	1.00×
MTP specdraftnmax 1	103.0 [102.2, 104.3, 103.0]	1.37×
MTP specdraftnmax 2	105.8 [105.7, 106.7, 105.8]	1.41×
MTP specdraftnmax 3	104.4 [96.7, 105.4, 104.4]	1.39×
MTP specdraftnmax 4	98.5 [97.4, 98.5, 100.1]	1.31×
MTP specdraftnmax 5	94.4 [93.8, 94.4, 94.6]	1.26×
MTP specdraftnmax 6	88.3 [87.4, 88.3, 89.3]	1.18×

Step 4: Baseline vs MTP, and tuning specdraftnmax

Prompt processing stays ~283 tok/s across every setting, MTP only touches generation, as expected.

Both models measured here (Qwen's curve, Step 8, shown for contrast). The peak moves with the hardware/model, Gemma tops out at n=2, Qwen at n=3, so sweep yours; don't copy a number.

Takeaways: MTP is clearly worth it: +41% generation throughput on this machine, no quality change (speculative decoding is exact). The sweet spot here is specdraftnmax 2 (n=3 is statistically tied). The reference guide found n=3 best on an M1 Max, Unsloth explicitly say the optimum is hardwaredependent and to sweep 16. On the M4 Max it landed at 2. Sweep on yours. Above n=3, throughput falls off monotonically, drafting too far ahead wastes work when the target model rejects the speculation. Run benchmarks 3× and take the median: single runs swung ±15% here (e.g. one n=3 run read 96.7 vs a median of 104.4).

Step 5: Add the multimodal projector (screenshots)

Gemma 4 26BA4B isn't natively multimodal; you add vision by loading the projector with mmproj. The question is whether that taxes text generation. It doesn't:

Step 5: Add the multimodal projector (screenshots) table

	Generation tok/s (2 runs)
MTP n=2, no projector	106.9, 107.0
MTP n=2, + projector	106.7, 105.9

Step 5: Add the multimodal projector (screenshots)

Within noise, loading the projector costs ~0 on text throughput. So there's no reason to run without it; keep it loaded and you can paste screenshots to the agent for free.

Step 6: Serve it (OpenAIcompatible)

Notes for this machine: nommap again, same exFAT reason. Port 8088, not 8080: 8080 was already taken here (couldn't bind HTTP server socket). Pick a free one: lsof iTCP:8080 sTCP:LISTEN. c 32768, not 65536. With 36 GB RAM and the 16 GB model held resident (no mmap), free memory drops to ~16% with the server up; 32K context is the safe ceiling here. The 64 GB reference machine can afford 65536.

Ready in ~12 s (model warm in cache). Verify:

Headsup: this Gemma 4 build is a reasoning model. llama.cpp returns the chainofthought in reasoningcontent and the final answer in content. With a small maxtokens you get an empty content (finishreason: length) because it spent the budget thinking. Give it room (maxTokens ≥ a few thousand).

A startgemma.sh that wraps this in tmux is in the repo.

Step 7: Wire up Pi as the coding agent

Pi reads providers from ~/.pi/agent/models.json. Don't overwrite it if you already use Pi, merge a new provider in (back it up first). The gemma4local provider:

Key fields: baseUrl → the llama.cpp server; authHeader: false (local, no key); input: ["text","image"] (else Pi treats it textonly and won't send screenshots); reasoning: true (this build emits reasoningcontent). Confirm:

Real task through Pi against the local model:

That's the full loop: Pi → llama.cpp server → Gemma 4 + MTP → correct code, all local.

Does it actually code? 16 challenges

tok/s is meaningless if the output is wrong. So I built a verifiable gauntlet: each task is sent to the server, the code block extracted and executed against hidden assertions in a sandboxed subprocess (codeeval.py). A pass means it genuinely runs correctly. Greedy (temperature 0), maxtokens 8192. Four flavors:

1. Greenfield: write a function from scratch. 2. Bugfixgivenfailingtest: read broken code + the failing test, fix it. 3. Gotchas: tasks LLMs are documented to flub (Easy Problems That LLMs Get Wrong): overlapping counts, a "looks like balancedbrackets but isn't" trap, the 0.1 + 0.2 float classic. Anticheat: eval()/re are banned where they'd trivialize the task. 4. The viral one: Simon Willison's "Generate an SVG of a pelican riding a bicycle", rendered to PNG with headless Chrome.

Gemma 4 26BA4B Q4 + MTP: 14/15 verifiable

Does it actually code? 16 challenges table

	Task	Flavor	Result	Time	tok/s
1	mergeintervals	greenfield	✅	5.7 s	104.6
2	isbalanced	greenfield	✅	6.4 s	107.3
3	changedfiles (parse a diff)	greenfield	✅	27.5 s	102.8
4	LRUCache (O(1))	greenfield	✅	26.9 s	98.4
5	romantoint	greenfield	✅	10.1 s	108.3
6	bugfixmutabledefault	bugfix	✅	3.9 s	107.6
7	bugfixbinarysearch (offbyone)	bugfix	✅	13.3 s	108.8
8	bugfixtobase (skips 0)	bugfix	✅	6.5 s	107.7
9	quine (prints own source)	hard	❌ spiral	169 s†	96.7
10	calcprecedence (no eval)	hard	✅	42.4 s	101.6
11	regexismatch (./, no re)	hard	✅	26.7 s	102.8
12	editdistance (Levenshtein)	hard	✅	14.9 s	106.1
13	countoverlapping	gotcha	✅	22.5 s	103.7
14	deepnesting (≥3 deep)	gotcha	✅	7.7 s	108.2
15	bugfixfloateq (0.1+0.2)	gotcha	✅	34.4 s	102.6

Does it actually code? 16 challenges

It nailed every gotcha, overlapping count, the depth≥3 trap (didn't fall for "balanced brackets"), the floatingpoint comparison, plus a fromscratch regex engine and an expression evaluator with eval() banned.

The one failure is the interesting one. †The quine broke it. Gemma is a reasoning model, and a selfreproducing program is exactly the kind of selfreferential puzzle that sends it into an infinite thinkspiral: it generated 16,384 tokens of reasoning, ~170 seconds straight, and never emitted a finished program (finishreason: length). A 26B model that writes a correct Levenshtein DP and a regex matcher first try, but talks itself to death on a quine. (Hold that thought. At the temperature Gemma's makers actually recommend, this spiral vanishes, and it turns out to be the most useful finding in the whole piece. See the parameters section near the end.)

Prompt: "Generate an SVG of a pelican riding a bicycle." One shot, 39 s, 2.8 KB of SVG:

White body, an actual orange pouch (the pelican tell), an eye, a red bicycle with spoked wheels and pedals, sky/clouds/grass. For a 4bit local model running on a laptop, that's a shockingly coherent pelican.

Step 8: Qwen3.6 35BA3B: the challenger

Same recipe, one difference: Qwen3.6's MTP head is baked into the main GGUF, so you enable speculative decoding with spectype draftmtp and no modeldraft. The download (unsloth/Qwen3.635BA3BMTPGGUF) is 21 GB for the main model. On 36 GB you cannot run Gemma and Qwen at once, stop one first.

Step 8: Qwen3.6 35BA3B: the challenger table

nmax	gen t/s (median of 3)	speedup
baseline	67.2	1.00×
1	89.3	1.33×
2	92.4	1.38×
3	93.5	1.39×
4	83.6	1.24×
5	80.2	1.19×
6	75.3	1.12×

Step 8: Qwen3.6 35BA3B: the challenger

Same MTP win (~1.4×), but the optimum is n=3 vs Gemma's n=2, and Qwen tops out at 93.5 t/s vs Gemma's 105.8, slower, because it's a bigger model (35B vs 26B total).

Gauntlet: 14/15, and the quine plot twist

Step 8: Qwen3.6 35BA3B: the challenger table

Result	Tasks
✅ 14 PASS	all greenfield, all bugfixes, all gotchas, calc (no eval), regex (no re), Levenshtein
❌ 1 FAIL	quine

Step 8: Qwen3.6 35BA3B: the challenger

Both models score 14/15. Both fail only the quine, but in opposite ways: Gemma never finishes: it thinkspirals past 16,384 tokens and emits no complete program. Qwen finishes a complete, running program in 57 s, but it's a wrong quine (stdout != source). It commits to an answer; the answer just isn't selfreproducing.

The other tax is speed. Qwen is slower per token (~85 vs ~105 t/s) and more verbose, so wallclock blew out on the hard tasks: LRUCache 80.8 s (Gemma 26.9 s), regexismatch 99.6 s (Gemma 26.7 s), bugfixfloateq 75.3 s (Gemma 34.4 s).

Where Qwen earns its "better coder" reputation is the openended task. Same prompt, 61 s, 9.7 KB of SVG (vs Gemma's 2.8 KB):

Eyelashes and a blushing cheek, a wing on the handlebar, a foot on the pedal, motion lines, a road with reflectors, a rayed sun. It's not just bigger, it reads as a pelican in motion. Gemma's was clean and correct; Qwen's has intent. Better, but it took 1.6× longer.

Step 8b: The specialist: NorthMiniCode1.0 (Cohere)

A wildcard third contender: unsloth/NorthMiniCode1.0GGUF, Cohere Labs' 30BA3B MoE built specifically for agentic coding (3B active params). Same quant tier (UDQ4KXL, 19 GB).

Gotcha: it won't load on stock llama.cpp. North uses the cohere2moe architecture, which isn't in mainline (you'll get unknown model architecture: 'cohere2moe'). You need the unmerged PR 24260:

(This switches your llama.cpp to the PR branch, additive, so it still runs Gemma and Qwen fine; git checkout master to go back. North's numbers below are from this build, not 57fe1f0.)

No MTP draft, so no speculative speedup, but it doesn't need one. Baseline 91.5 t/s generation and ~577 t/s prompt processing, its prompt speed is 2× Gemma and Qwen, which matters a lot for an agent chewing through tool output and long files.

North is the only one of the three to ace the gauntlet, and it's the only one that wrote a quine. The exact twoliner, verified stdout == source:

Where Gemma spiraled and Qwen guessed wrong, the code specialist produced the canonical Python quine in 6 seconds. And it's the fastest endtoend of all three, because North isn't a chatty reasoning model, it just emits code: most tasks finished in 1.66 s (vs Gemma's 427 s and Qwen's 5100 s) at the same ~90 t/s.

The same prompt, "a pelican riding a bicycle":

The bird isn't on the bike. North drew a (decent) bicycle and a (passable) bird and set them side by side, no spatial reasoning about "riding." The model that crushed every rigorous coding task has the weakest grasp of the openended, creative one. Specialization, made visible.

Agentic tests: tool use & code execution

Solving a function in isolation isn't the job. A coding agent calls tools and reasons about code it didn't write. So I added two suites grounded in the current (2026) benchmarks, graded the way those benchmarks grade:

Tool use → BFCL v4 (Berkeley FunctionCalling Leaderboard, updated Apr 2026). Six categories: a simple call, picking the right tool of several (multiple), two calls at once (parallel / parallelmultiple), irrelevance (correctly not calling a tool), and a stateful multiturn flow (search flights → read the tool result → book the cheaper one). Graded BFCLstyle by matching the function name + argument values. (tooleval.py) Code execution → LiveCodeBench v6 (contaminationfree, continuously refreshed). Its "code execution" task: predict the exact output of a snippet, pure code reasoning, no generation. Seeded with semantic gotchas (mutable default args, latebinding closures, negative modulo), checked against the real output. (execeval.py)

(SWEbench Verified and τ²bench are the other two I'd want, but they need Docker + repos and a simulated user/DB, not a laptopinanafternoon job. Cited, not run. All three models served with jinja for tool calling.)

North runs the table: 6/6 tools, 7/7 execution, on top of 15/15 code. The agentic coding model is, unsurprisingly, the best agent, nailing the stateful multiturn flight booking and every Pythonsemantics gotcha. Two findings make the others human:

Gemma's reasoningspiral is real and repeatable. 6/6 on tools but 5/7 on execution, and the two misses were the gotchas, where it derived the correct answer then repeated it ~40 times until it ran out of tokens. The exact failure mode as the quine: when Gemma overthinks, it can't stop. Qwen reasons but lands the plane (7/7 execution, gotchas included), yet dropped one tool test, calling getweather with empty arguments on the simplest case. Solid reasoning, occasionally sloppy tool args.

The threeway showdown

Three 4bit models, one 36 GB laptop. Coding gauntlet + agentic suites. Here's the whole thing:

The threeway showdown table

	🟢 Gemma 4 26BA4B	🔵 Qwen3.6 35BA3B	🟠 NorthMiniCode 30BA3B
Size (Q4KXL)	16 GB	21 GB	19 GB
MTP speedup	1.41× (n=2)	1.39× (n=3)	n/a
Gen speed (best)	105.8 t/s	93.5 t/s	91.5 t/s
Prompt speed	283 t/s	250 t/s	577 t/s
Coding gauntlet	14 / 15	14 / 15	15 / 15
The quine	❌ spiraled	❌ wrong	✅ canonical
Tool use (BFCL v4)	6 / 6	5 / 6	6 / 6
Codeexecution (LCB v6)	5 / 7	7 / 7	7 / 7
Wallclock per task	fast	slowest	fastest
Pelican 🐦🚲	good (on bike)	best (in motion)	worst (off bike)
Multimodal (sees images)	✅	✅	❌
Reasoning model	yes	yes	no (just codes)

The threeway showdown

One caveat that grows into a whole section below: the quine row and the pertask times are at greedy (temp 0), the setting I used to keep the benchmark reproducible. At each model's recommended temperature the reasoning models largely solve the quine too (Gemma goes 15/15). See "Best parameters for daily coding on a 36 GB Mac" near the end.

Every cell is code that was extracted and executed. The entire field is green except one row, the quine, and North is the only column that clears it.

Gemma wins raw generation; North wins prompt processing (~2×) and finishes the whole gauntlet in a third of Qwen's time, because it writes code instead of paragraphs about code.

All three are genuinely usable. 1415 of 15 verifiable tasks, on a laptop, offline, at 4bit. Local coding agents crossed the "actually good enough" line. Pick by job, not by leaderboard. North is the coding specialist, fastest, aced the gauntlet, wrote a quine, and swept the agentic suites (6/6 tools, 7/7 execution), but it's textonly and has no creative/spatial sense. Qwen is the quality generalist, best on the openended task, but the slowest, by a lot. Gemma is the balance: fastest generation (with MTP), multimodal, solid everywhere, it just can't quine. Reasoning isn't free. Gemma and Qwen think, which helps nuance and burned them on the quine (one spiraled forever, one overthought into a wrong answer). North just writes code, and on rigorous tasks it was both more correct and several times faster endtoend. MTP is worth it where you can get it (~1.4× generation, identical output). North shows the other lever: a small activeparam MoE is fast without a draft model, and its ~577 t/s prompt throughput is the unsung hero for agent workloads.

If I had to keep one on this laptop for daytoday coding: North for pure code, Gemma when I want speed + screenshots, Qwen when I want the nicest output and don't mind waiting.

Best parameters for daily coding on a 36 GB Mac

Everything above used greedy decoding (temperature 0), on purpose: it makes the benchmark reproducible. But greedy is not how you should actually run these models for real work, and it hid a twist.

That quine failure was mostly my fault, not the models'. Greedy always picks the single most likely next token, which is exactly how a model paints itself into a repetition corner. Switch to the temperature each model maker recommends for coding and the spiral disappears:

Best parameters for daily coding on a 36 GB Mac table

Model	Greedy (temp 0)	At the recommended temperature
Gemma 4	quine spirals (0/3), gauntlet 14/15	temp 1.0: quine 3/3, gauntlet 15/15
Qwen3.6	quine wrong	temp 0.7: quine 2/3 correct

Best parameters for daily coding on a 36 GB Mac

Gemma at its recommended temperature 1.0 (yes, high, the Gemma team genuinely recommends high temperature for coding) wrote a correct quine on all three tries and went 15/15 on the full gauntlet, at the same speed. The model could always do it; greedy was the problem. (If you must run neargreedy, a repeatpenalty of 1.3 also breaks the loop. The DRY sampler at 0.8 did not.)

So for daily use, do not run greedy. Use the maker's coding settings:

Best parameters for daily coding on a 36 GB Mac table

Model	temp	topp	topk	other
Gemma 4 26BA4B	1.0	0.95	64	minp 0
Qwen3.6 35BA3B	0.7	0.8	20	repeatpenalty 1.05
NorthMiniCode	0.3 to 1.0	0.95	40	robust; it never spiraled, even at greedy

Best parameters for daily coding on a 36 GB Mac

The other half of "daily" is memory. You are not running a dedicated server box, you are coding while the model runs, so it has to share 36 GB with macOS, your editor, and a browser. With the full setup (MTP draft + projector + 32K context + a q8 KV cache), the picture is tight:

Best parameters for daily coding on a 36 GB Mac table

Model (Q4KXL)	Free RAM with the server up, c 32768
Gemma 4 (16 GB)	~17% (~6 GB)
North (19 GB)	in between
Qwen3.6 (21 GB)	~10% (~3.6 GB), the tightest

Best parameters for daily coding on a 36 GB Mac

Two levers buy headroom, both nearly free for coding:

Quantize the KV cache: cachetypek q80 cachetypev q80 (needs fa on). Roughly halves KV memory with negligible quality cost on code, and the saving grows with context length. Rightsize the context: you rarely need 32K for a single coding turn. In this local setup, dropping to 16K to 24K frees several GB and is often the largest headroom lever. (llama.cpp also reserves an 8 GB prompt cache by default, which you can trim.)

Putting it together, the server I would actually keep running (Gemma, the best allround daily pick: smallest, multimodal, fastest):

Swap in Qwen (specdraftnmax 3, no modeldraft, c 16384, temp 0.7) when you want the nicest output, or North (the PR build, no MTP, temp 0.3) for purecode speed. The headline still holds, North is the specialist, but for a generalist daily driver on 36 GB, Gemma at temperature 1.0 with a q8 KV cache and a rightsized context is the sweet spot.

The real takeaway

A year ago, "local coding model" meant a toy you tolerated. Here are three different 4bit models, none over 21 GB, each solving ~15 of 15 real, executed coding tasks on a laptop, offline, fast enough to drive an agent. The question stopped being "is local good enough?" It is. The new question is "which specialist do I load today?", and you can hold the answer to that on a USB SSD.

The only model that wrote a quine out of the box was the one built to code. The reasoning models got there too, but only at the temperature their makers recommend, not the greedy default I benchmarked on. The only ones that drew a believable pelican were the ones built to think. Nobody got everything for free. That's the whole map of where openweight coding agents are in mid2026, drawn in 16 challenges and one parameter sweep.

Everything here reproduces: the build, the exact GGUFs, the sweep scripts, the gauntlet, and the three pelicans. Clone it, point it at your machine, and sweep specdraftnmax yourself, your optimum won't be mine. Then go argue about the pelicans.

The cost angle, and where this lives

Every number in this piece came from models running on one machine, offline, for $0 in API spend. That is the whole point: local inference is the cheapest token you will ever run. But "local or a cloud API" is a real decision with real math, and the cloud side is easy to underestimate until the invoice lands.

That math is what I build at ByteCosts: sourcebacked pertoken and percontext pricing, modelversusmodel comparisons on standardized assumptions, and calculators that turn a workload into a monthly number before you ship. Know the real cost before the invoice. If this kind of measured, nohandwaving breakdown is your thing, the whole site is more of it.

Hardware: Apple M4 Max, 36 GB, macOS 26.5. llama.cpp 57fe1f0 (Gemma/Qwen) + PR 24260 (North). Models: Unsloth GGUFs at UDQ4KXL. All numbers measured June 2026.

What this article covers

Key concepts, in plain terms
The machine
What we're building
Step 1: Build llama.cpp (Metal)
Step 2: Download the model (and the first gotcha: xet)

Use it with ByteCosts calculators

After reading the research note, open the related calculator and replace the example assumptions with your own users, requests, tokens, seats, or platform usage.

The goal is to convert the article's cost pattern into a concrete monthly run-rate, per-user margin, or break-even point your team can discuss.

Frequently asked questions

Is this article available before JavaScript runs?

Yes. The prerendered HTML includes the article summary, direct answer, key sections, related tools, and citation block for crawlers and readers without JavaScript.

Can I model the article's scenario with my own assumptions?

Yes. Use the related ByteCosts calculators to replace the article's example numbers with your own workload, usage, and pricing assumptions.

Local AI coding showdown on a 36 GB Mac: Gemma vs Qwen vs North. ByteCosts. Updated 2026-06-13. https://bytecosts.com/blog/local-ai-coding-showdown-gemma-qwen-north/

Sources

Machine-readable

Markdown mirror