Small Models, Real Work
Three weeks in, what I’ve learned has less to do with the models and more to do with how I think about tools. Each model has a personality. Quirks. A failure mode you won’t find in any benchmark. And a specific job it does surprisingly well.
The context: I started a weekly newsletter on Substack - five article picks, my take on each, out every Sunday. A curated digest with a hard deadline. From the start I wanted to know if small local models could do real work inside that pipeline. Not experiments. Not demos. Actual drafts, on a schedule, every week.
The setup is deliberately unglamorous. An Intel NUC on my desk. 64GB of RAM. No GPU. CPU-only inference through Ollama. Small models - 2 to 14 billion parameters - doing the heavy lifting: ranking candidates, picking the best ones, writing first-draft takes. None of it touches a cloud API. The hardware is already on the network. Every run costs zero.
What I found surprised me. Not because the models were good or bad - but because the question of “good enough” turned out to be the wrong question entirely.
I didn’t do this to prove a point about running AI locally. I did it because I was curious whether models this small could do useful work in a pipeline I already ran. Turns out the answer is complicated.
phi4-mini: The Sprinter
phi4-mini is 2.3 gigabytes. It runs at 20-30 tokens per second on CPU. That’s fast enough for quick iterations, good enough for first passes, and exactly right for the moment when you need something on screen to react to.
I used it to draft newsletter takes. It was fine at getting words down. The problem was which words. About 30-40% of what it produced was publishable. The rest was the kind of writing you’ve seen a hundred times. “Really interesting.” “Pretty wild.” Corporate-adjacent filler that technically says something but lands nowhere. I’d read the take, then read the original article, then write my own take anyway.
Fast and wrong is still wrong. But fast and 30% right? That’s a starting point - if you know going in that the edit pass is where the real work happens.
phi4-mini-reasoning: Handle With Care
phi4-mini-reasoning is excellent at logic, structured analysis, and anything that has a right answer. I reached for it when I needed to work through a problem step by step.
Then I made the mistake of asking it to write something creative.
It didn’t write. It reasoned about writing. Internal think blocks appeared in the output. It hallucinated personas - invented characters to attribute the writing to, as if it needed permission to just produce a paragraph. The output was fascinating in the way that a car crash is fascinating. Completely unusable.
A model built for reasoning will try to reason about everything. Including things that don’t want to be reasoned about.
qwen3.5:9b: The Night Shift
qwen3.5:9b runs at 4.3 tokens per second. Too slow for anything interactive. But it has the best output quality of the smaller models, and that’s what matters when you’re running batch jobs overnight.
I pointed it at 16 coaching PDFs - periodisation frameworks, training methodology documents, athlete case studies. It processed them into structured, indexed knowledge while I slept. When I checked in the morning, the output was solid. Usable. No hallucinations, no invented citations, no creative liberties with the source material.
But I almost never got to that point, because the first overnight run produced nothing.
The Ollama /api/generate endpoint returned empty responses. No error. No timeout. Just blank. I spent an hour debugging before discovering that qwen3.5 runs with thinking mode enabled by default. You have to use /api/chat and explicitly set think to false. Otherwise the model thinks but never speaks.
One wasted overnight run. One hour of debugging. One config flag. That’s the kind of lesson that doesn’t show up in model cards.
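For anyone hitting the same wall, here is a minimal sketch of the working call in Python, using only the standard library. The model name, prompt, and timeout are placeholders; the point is the /api/chat endpoint and the explicit think flag.

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build the /api/chat request body with thinking disabled."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": False,   # without this, the model thinks but never speaks
        "stream": False,  # one complete response instead of a token stream
    }

def draft(prompt: str, model: str = "qwen3.5:9b",
          host: str = "http://localhost:11434") -> str:
    """Send a single chat request to a local Ollama instance."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["message"]["content"]
```

Splitting the payload builder out of the network call also means the important part - the flags - can be checked without a running server.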
qwen3:14b: The One That Changed the Pipeline
I pulled qwen3:14b on a Friday night and ran it side by side with phi4-mini on the same five articles. The difference was immediate.
phi4-mini: 30-40% publishable. qwen3:14b: 70-80%.
It wasn’t just better at writing. It was better at listening. The takes were more specific. They pulled real details from the articles instead of summarising around them. When I asked it to write in my voice - direct, short sentences, no filler - it actually did. phi4-mini would nod along and then write “This is a really fascinating development.”
qwen3:14b runs at 4-5 tokens per second. A full newsletter draft takes about 16 minutes. That’s slow by any interactive standard. But for a pipeline that runs overnight and delivers a draft to my inbox by morning, 16 minutes is nothing.
The right model at the right speed beats the fastest model every time.
The scorecard
After three weeks I sat down and mapped what actually worked where. Not benchmarks - my benchmarks. Real tasks, real results.
| | phi4-mini | phi4-mini-reasoning | qwen3.5:9b | qwen3:14b | qwen2.5-coder:3b |
|---|---|---|---|---|---|
| Size | 2.3GB | 2.9GB | 6.1GB | ~9.3GB | 1.8GB |
| Speed (CPU) | 20-30 tok/s | ~20 tok/s | 4.3 tok/s | 4-5 tok/s | ~20 tok/s |
| Newsletter takes | 30-40% publishable | Wrong tool | Too slow | 70-80% publishable | N/A |
| Coaching PDFs | Not tested | Not tested | 16 PDFs overnight, clean | Not tested | N/A |
| Creative writing | Usable but bland | Think blocks, hallucinated personas | Not tested | Not tested | N/A |
| Code tasks | Not tested | Not tested | Not tested | Not tested | Clean, no hallucinations |
| Voice matching | Poor - corporate speak | N/A | Good on structured tasks | Excellent | N/A |
| Quote extraction | Can’t reliably pull quotes | N/A | N/A | Pulls and uses quotes naturally | N/A |
| Draft time (5 articles) | ~2 min | N/A | N/A | ~16 min | N/A |
| API gotcha | None | Thinking blocks leak into output | /api/generate returns empty - must use /api/chat + think: false | Smart curly quotes - also needs think: false | None |
| Best for | Quick iterations, throwaway drafts | Logic, maths, structured analysis | Overnight batch processing | Production drafting, voice-matched writing | Code generation |
The number that tells the whole story: qwen3:14b doubled the publishable rate at the cost of 8x the time.
For anything interactive - quick iterations, brainstorming, getting something on screen to react to - phi4-mini is still the right reach. It’s fast, it’s good enough to start with, and the edit pass is where the real work happens anyway.
For overnight pipelines where nobody’s waiting? The trade-off is obvious. 16 minutes for a draft I can actually use versus 2 minutes for a draft I’ll rewrite from scratch. I’ll take the 16 minutes every time.
The reasoning model sits in its own lane entirely. It’s the best tool I have for working through a structured problem. But ask it to do anything generative and it will reason about generating instead of just generating. Know which lane you’re in before you reach for it.
And the coder stays in its lane too. Small, fast, does code. I don’t ask it to write prose and it doesn’t disappoint me.
Five models. Five jobs. No single model does everything well, and the moment you accept that, the whole setup starts making sense.
The things I learned that have nothing to do with models
Three weeks of this taught me more about working with small models than any benchmark comparison could.
Tell models what NOT to do. “Avoid clichés” doesn’t work. An explicit list of banned phrases does. I maintain a list - “game-changer”, “deep dive”, “paradigm shift”, a dozen others. The quality check function runs against every draft. If it finds a banned phrase, it re-drafts automatically. This catches more bad output than any prompt engineering trick I’ve tried.
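The check itself is almost embarrassingly simple. A sketch, with a truncated version of the list (the real one has a dozen more entries):

```python
# Partial banned list for illustration; the real list is longer.
BANNED = {"game-changer", "deep dive", "paradigm shift"}

def find_banned(draft: str) -> list[str]:
    """Return every banned phrase present in the draft, case-insensitively."""
    text = draft.lower()
    return sorted(p for p in BANNED if p in text)

def needs_redraft(draft: str) -> bool:
    """True if the draft should be sent back to the model for another pass."""
    return bool(find_banned(draft))
```

A plain substring scan is enough here: the banned phrases are fixed strings, and a false positive just costs one extra redraft.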
Estimate before you run. qwen3:14b at 4-5 tokens per second, 3000 tokens output, five articles plus intro. That’s about 16 minutes. Know that number before you kick off the job, because “I’ll just wait and see” at midnight is how you end up staring at a terminal until 1am.
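The arithmetic is one line. A sketch, where the expected token count and the overhead for prompt processing and model load are your own guesses going in:

```python
def eta_minutes(expected_tokens: int, tokens_per_sec: float,
                overhead_sec: float = 0.0) -> float:
    """Rough wall-clock estimate for a generation job, in minutes."""
    return (expected_tokens / tokens_per_sec + overhead_sec) / 60

# qwen3:14b at ~4.5 tok/s, ~3000 output tokens, a few minutes of
# prompt processing (assumed): eta_minutes(3000, 4.5, overhead_sec=240)
# lands at roughly 15 minutes.
```

The estimate will be off. That's fine. "Roughly 15 minutes" versus "no idea" is the difference between going to bed and staring at a terminal.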
Quality gates beat model upgrades. A banned word list plus a validation function that checks length, tone, and structure catches what the model misses. I added this after week one and it made more difference than switching from phi4-mini to qwen3:14b.
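A minimal version of that gate, assuming placeholder thresholds - the real tone and structure checks are more involved than a word count, but the shape is the same:

```python
def passes_gate(take: str,
                min_words: int = 40,
                max_words: int = 120,
                banned: tuple[str, ...] = ("game-changer", "deep dive",
                                           "paradigm shift")) -> bool:
    """Minimal quality gate: length band plus banned-phrase scan."""
    words = take.split()
    if not (min_words <= len(words) <= max_words):
        return False  # too thin or too rambling to publish
    low = take.lower()
    return not any(p in low for p in banned)
```

Drafts that fail go straight back to the model; only the survivors reach the morning inbox.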
The 70% rule. If a model gets you 70% of the way there, you’ve won. The last 30% is your job - and it should be. That’s where your voice lives. That’s the part that makes the newsletter yours and not just “AI-generated content about this week’s articles.”
What’s coming next
This is the first in a series. I want to dig into the key elements that actually work with small models (spoiler: it’s about constraints, not creativity). The quality pipeline. How to run multiple models on the same box without them fighting over memory.
But the real lesson from these three weeks is simpler than any of that. The models are tools. Good tools. Each one suited to a specific job. The craft isn’t in picking the most powerful model. It’s in knowing which one to reach for, and when to put it down and write the thing yourself.
For reference: my workflow
RSS Feeds + newsletters (100+ items)
│
▼
MORPHEUS — Saturday 8pm cron ──► Fetch, dedupe, score candidates ──► bulletin-input.json (30 items)
│
▼
HYPNOS — Ollama (qwen3:14b) ──► Select 5 picks, write first-draft takes ──► bulletin-draft.json
│
▼
Percy (Sunday morning) ──► Review picks, rewrite takes (quote + insight), write intro + thinking
│
▼
Microsoft Graph API ──► Sends email (percy@raposo.ai)
│
▼
My review + Final publish ──► Substack (What I Read This Week)
The heavy lifting happens on a NUC I already own. qwen3:14b on HYPNOS scores the candidates, picks the best ones, and writes first-draft takes entirely on-device - work that never touches a cloud API. By the time Percy does the review pass, only the shortlist remains. The hardware is sunk, so every draft run costs exactly $0. I can iterate the prompts, rerun the pipeline, experiment freely - no token meter watching. That changes how you build. It’s not about saving money. It’s about using the right tool for the job - and sometimes learning how to use the tools you have access to matters more than asking for a better one.