Tools & Tutorials

The quiet pleasure of running LLMs on your own laptop in 2026.

On a plane to Berlin with no Wi-Fi, I asked a 12-billion-parameter model running on my MacBook to help me debug a regex. It worked. Here's why local AI feels different now.

Tools & Tutorials · Productivity · AI Literacy

Published April 27, 2026

7 min read

I was on a flight to Berlin three weeks ago, somewhere over Greenland with no Wi-Fi, when I got stuck on a regex that wouldn't behave the way I expected. The kind of bug where you read it for the tenth time and start questioning whether \s actually means what you think it means.

I asked the local Llama running on my MacBook. It read the regex, told me where my non-greedy quantifier was getting eaten by an unanchored start, and gave me a fix that worked the first time.
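The regex itself isn't shown here, so the snippet below is only a hypothetical illustration of that class of bug, using Python's `re` module. A lazy quantifier keeps a match short on the right-hand side only; the engine still starts at the leftmost position where a match is possible, so an unanchored start can swallow more than intended. The input string and both patterns are made up for the example.

```python
import re

# Hypothetical input: we want the shortest "BEGIN ... END" span,
# which is the one that starts at the second BEGIN.
text = "BEGIN a BEGIN b END"

# Lazy quantifier, unanchored start: the engine still matches from the
# leftmost BEGIN, so .*? has to expand across the second BEGIN to reach END.
loose = re.search(r"BEGIN.*?END", text)
print(loose.group(0))  # BEGIN a BEGIN b END

# One possible fix: a negative lookahead forbids another BEGIN inside
# the span, so the only viable start is the last BEGIN before END.
tight = re.search(r"BEGIN(?:(?!BEGIN).)*?END", text)
print(tight.group(0))  # BEGIN b END
```

Other fixes (anchoring the pattern, or using a narrower character class) may fit better depending on the input.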

I closed the laptop and watched the clouds for a minute and thought: this is fine, actually. This is great.

I had no internet. I had no API key burning down a quota. I had no usage limit reminding me that this was costing somebody compute. I had no privacy panic about pasting our internal code into a third party's service. I had a model on my laptop that was good enough at this kind of task, and it just answered.

Model	Quantization	RAM needed	Tokens/sec (M-series Mac)	Good for
Llama 3 8B	Q4_K_M	6 GB	~45	Daily chat, drafting, code review
Mistral 7B	Q5	7 GB	~35	Summarization, short-form writing
Phi-3 Mini	Q4	3 GB	~70	Tab-completion, fast lookups
Qwen 2.5 14B	Q4_K_M	10 GB	~22	Best general quality at laptop size
Llama 3 70B	Q4	42 GB	~6	If you have a 64GB+ box and patience

What actually runs on consumer hardware in 2026. Numbers are from my own M3 Max; expect yours to vary by roughly 25% in either direction.
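The RAM column can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight, plus some headroom. The sketch below is a ballpark estimator, not how any runtime actually accounts for memory. The 4.5 bits-per-weight figure and the 1.5 GB overhead are assumptions chosen for illustration.

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float,
                    overhead_gb: float = 1.5) -> float:
    """Ballpark memory for a quantized model, in GB (1 GB = 1e9 bytes).

    weights = parameters * bits_per_weight / 8 bytes.
    overhead_gb is an assumed allowance for KV cache and runtime buffers.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


# Assumed effective bits per weight. Real quantization formats mix
# precisions, so treat this as a rough figure, not a specification.
ASSUMED_BITS = 4.5

for name, params_b in [("8B", 8), ("14B", 14), ("70B", 70)]:
    print(f"{name}: ~{estimate_ram_gb(params_b, ASSUMED_BITS):.1f} GB")
```

Under these assumptions the estimates land within a couple of GB of the 4-bit rows in the table. Longer contexts need a larger overhead allowance.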

That feeling, the "it just answered" feeling, is the thing nobody mentions when they argue about whether local AI is "as good" as frontier API models.

It isn't as good. That's not the point.

Why local feels different

Three things, when you actually live with it for a few weeks.

The latency is local. No network round trip. The first token shows up so fast it changes how you interact with the tool. You start asking it things that aren't worth asking when there's a one-second pause. You ask it to expand a one-line comment. You ask it to explain a snippet of weird logging output. You ask it the kinds of small questions you wouldn't bother sending to a remote model because the friction was too high.
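If you want to put a number on that first-token speed, one approach is to timestamp tokens as they stream in. This is a minimal sketch: `fake_model` is a made-up stand-in that sleeps briefly and then yields three tokens, not any runtime's API. In practice you would wrap whatever streaming iterator your runtime gives you.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_stream(tokens: Iterable[str]) -> Iterator[Tuple[float, str]]:
    """Yield (seconds_since_start, token) for each token in a stream."""
    start = time.perf_counter()
    for token in tokens:
        yield time.perf_counter() - start, token


def fake_model(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model; sleeps once to mimic startup."""
    time.sleep(delay_s)
    yield from ["Hello", ",", " world"]


timings = list(timed_stream(fake_model()))
ttft = timings[0][0]  # time to first token, in seconds
print(f"time to first token: {ttft * 1000:.0f} ms")
```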

The privacy is real. You're going to underestimate how much this matters until you start using it. The amount of mildly confidential code, half-finished sentences, customer support drafts, and personal-life context I now route through the local model is genuinely large. None of it is leaving my laptop. It's like the difference between writing in a paper journal and writing in a Google Doc you forgot was shared. Subtle, but constant.

The model can't update from underneath you. The API you used last Tuesday isn't the API you're using on Friday. Local models change when you decide to change them. If you find a checkpoint that vibes with you, you can keep it for two years. There's no version drift in the middle of your project.
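One way to make "it only changes when I change it" checkable is to record a hash of the weights file and compare it later. The sketch below assumes the checkpoint is a single file on disk. The demo writes a tiny temporary file in place of real weights, and all file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights never sit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_pinned(path: Path, expected_hex: str) -> bool:
    """True if the file on disk still matches the recorded digest."""
    return sha256_of(path) == expected_hex.lower()


# Demo with a throwaway file standing in for a checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "demo.bin"  # hypothetical file name
    weights.write_bytes(b"pretend these are weights")
    recorded = sha256_of(weights)
    unchanged = is_pinned(weights, recorded)
    weights.write_bytes(b"someone swapped the file")
    changed = is_pinned(weights, recorded)

print(unchanged, changed)  # True False
```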

What's actually worth running today

I'm not going to dwell on specific models, because this list will be stale in three months. As of this week, in spring 2026:

For coding tasks, the Llama 4 8B instruct variants run beautifully on any recent Apple Silicon machine and are surprisingly close to mid-tier API models on focused dev work. They are not Claude. They don't need to be. They are good at single-file refactors, regex debugging, and turning sketchy bash into clean bash.

For general writing and Q&A, Mistral Small 4 is the one I keep coming back to. It writes with an actual voice. It doesn't end every paragraph with a feel-good sentence the way the bigger models do.

For longer-context tasks (summarising a 30k-token PDF, for instance), the Gemma 3 12B instruct model is the surprise of the year. The context handling is real. The summaries are accurate. It runs on a 32 GB Mac and doesn't melt anything.

If you're on a Linux box with a 24 GB GPU, you have basically every option open and I won't pretend my opinion matters more than the benchmarks.

[Figure: a client talks over HTTP / stdio to a runtime (Ollama, llama.cpp, LM Studio, mlx), which passes the prompt to a quantized model (7B, 13B, 70B) held in RAM or VRAM. No network. No subscription. No telemetry. Trade-offs: smaller context, slower TTFT, more setup.]

The whole stack lives on one machine. No subscription, no telemetry, no rate limit — just your fan, occasionally.
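To make the client-to-runtime hop concrete, here is a minimal sketch of calling a local Ollama server from Python using only the standard library. It assumes Ollama is running on its default port (11434) and that a model has already been pulled. The model tag in the commented example is an assumption, so substitute whatever you have installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout_s: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]


# Example call, left commented out because it needs a live local server.
# The model tag is an assumption:
# print(ask("llama3:8b", "Explain what \\s matches in a regex."))
```

Nothing in this request leaves the machine, since the only address involved is localhost.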

The tools that disappeared into the background

Two are doing the heavy lifting for me right now.

Ollama. It just runs. I install a model, forget about it, run a command, and it answers. It's the closest thing to "this is a Unix utility" that we've gotten with AI tooling.

LM Studio. Better for when I want to swap models and compare. The UI is fine. I use it when I'm in exploration mode.

I'm sure both of these will be replaced by something better next year. That's fine.

What it's actually good for

If you're trying to figure out whether local AI is worth your weekend, the honest answer is: it depends on what you're doing.

It's great for: regex, small refactors, bash, summarising your own writing, exploring an unfamiliar API, getting unstuck on a single-file problem, anything you don't want leaving your machine.

It's not great for: tasks that require deep reasoning across many files, things where the frontier model's larger world-knowledge actually matters, anything that benefits from the most recent training cutoff.

So mostly, I run a local model for the day-to-day, and reach for the API model for the harder thing. The split has felt right for about six months now.
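That split is a habit, not an automated router, but the two lists above can be written down as a small decision function. The sketch below is hypothetical: the `Task` fields and the rule that confidentiality outranks capability are my reading of the lists, not something any tool enforces.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    confidential: bool = False            # must not leave the machine
    multi_file: bool = False              # needs reasoning across many files
    needs_recent_knowledge: bool = False  # depends on a recent training cutoff


def choose_backend(task: Task) -> str:
    """Return "local" or "api" for a task."""
    if task.confidential:
        return "local"  # assumption: privacy outranks capability
    if task.multi_file or task.needs_recent_knowledge:
        return "api"
    return "local"      # default for day-to-day work


print(choose_backend(Task("fix a regex")))
print(choose_backend(Task("refactor across 40 files", multi_file=True)))
```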

Closing

The frontier models are still better at the hardest tasks. They probably will be for a while. But the gap between "best available API model" and "best model I can run on a $2000 laptop" is, today, narrower than it has any right to be. And the experience of working with a model that's yours, on your machine, without anybody else's quota or terms-of-service watching, has a kind of dignity to it that I didn't expect to start caring about.

If you've been meaning to try it, this is the weekend.