Not Every Question Deserves Your Best Model

April 18, 2026 · 7 min read

Not Every Question Deserves Your Best Model

Last month I burned through my Claude Pro usage in 12 days. I had three personal projects going — this blog, an automation experiment, and a side app I was using to explore AI-assisted workflows. I wanted to see how far I could push a single subscription across all of them. The answer: not far. I was sending everything — commit messages, quick questions, complex architecture decisions — to the same model, at the same priority. Every prompt got the full treatment. It was like hiring a senior consultant to sort your mail.

That experience forced me to think about something most developers skip: the cost of using AI carelessly. Tokens aren't free, context windows aren't infinite, and the difference between a smart setup and a lazy one can be the difference between a tool that pays for itself and one that drains your budget.

Tokens Are the Currency

Every time you send a prompt, you're spending tokens — on the input (what you send) and the output (what the model generates). A rough mental model: 1 token is about 4 characters in English. A 500-line Ruby file is roughly 4,000-5,000 tokens. Send that file as context with every prompt, and you're burning thousands of tokens before the model even starts thinking about your question.

On a subscription plan, you don't usually think about this. You pay a flat monthly fee and use it until you can't. It feels unlimited until it isn't. But the moment you start working with the API — where you pay per token — the perspective shifts completely. You start seeing every prompt as a transaction, and suddenly you care a lot about what you're sending and how much of it.

Cloud models charge per token. The price difference between tiers is steep. As of early 2026, Claude Opus costs roughly 15x more per token than Haiku. GPT-4o is roughly 17x more than GPT-4o-mini. Those ratios matter when you're making dozens of requests a day.

The question isn't "which model is best." It's "which model is good enough for this task."

Match the Model to the Task

A note before I get into this: most of my experience is with Anthropic's models. I've tried others, but Claude is where I've spent the most time, and it's where I've built the strongest intuition for what works. I've been digging into AI harnesses — systems that route different tasks to different models or agents. I haven't gone deep enough to build a full harness yet, but between the talks I've watched and my own experimenting, I started forming a clear picture of which model fits which job.

Here's how I think about it:

Haiku / local 7B-13B models — Generating boilerplate, writing commit messages, formatting text, explaining simple errors, drafting comments. Tasks where speed matters more than depth. If the model gets it 80% right and you fix the rest in 10 seconds, that's a win.

Sonnet / mid-tier models — Most day-to-day coding. Writing a new controller action, debugging a failing spec, reviewing a pull request, refactoring a method. These models handle context well enough and cost a fraction of the top tier.

Opus / top-tier models — Architecture decisions, complex multi-file refactors, debugging something that spans your entire request cycle, or writing something that requires real reasoning across a large context window. Save these for the problems that actually need the horsepower.

The gotcha: it's tempting to default to the best model for everything because the quality difference is real. But if 70% of your prompts are Haiku-level tasks, you're overspending on 70% of your work.

The reverse gotcha is real too. I've had to move a handful of my own agents from Haiku up to Sonnet after watching them fail on tasks that looked mechanical — refactoring a 400-line file, renaming a symbol across a codebase — but hit a subtle edge case mid-run. Haiku would produce a half-broken output, I'd retry, it would fail differently, and after three attempts I'd have burned more tokens than a single Sonnet run would have cost. Cheap model plus retries is not cheaper than the right model the first time. The rule isn't "always pick the cheapest that could work" — it's "pick the cheapest that will reliably work", which is a different question.

What This Looks Like in Practice: Reviewing a PR

Theory is fine, but this clicked for me when I built a skill that actually uses multiple agents for a single task: reviewing pull requests.

The idea is simple. Instead of one model reading a whole PR and giving generic feedback, the skill detects what changed and dispatches specialized agents based on the diff. It works like a slot system:

  • A base code reviewer runs on every PR — it looks at structure, readability, naming, and obvious issues.
  • An architecture reviewer activates when the diff touches .rb, .tsx, or .js files — it focuses on design patterns and how the change fits the larger system.
  • A database reviewer activates when the diff includes migrations, schema.rb, or query-heavy code — it checks for N+1 risks, missing indexes, and migration safety.
  • A security reviewer activates when auth, permissions, or input handling code is in the diff.
  • A test reviewer looks at whether the change has adequate test coverage.
  • A PR structure reviewer activates when the diff is over 200 lines — it checks whether the PR should have been split.

Each agent gets the same shared context — the diff, the changed files, the project conventions from CLAUDE.md — but reads it through a different lens. The context is fetched once into a temp directory, so you're not burning tokens on five redundant GitHub API calls.

Here's the part that ties back to model selection: not every one of these agents needs Opus. Each reviewer runs in two phases. First, a Haiku checklist scan reads the diff against a static list of things to look for — did the author add a migration without an index, is there a secret in the diff, is a test file missing for new code. Haiku is fast and cheap and perfect for yes/no questions. If it finds nothing, the review ends there. If it flags something, the skill escalates to Sonnet to take a closer look and write the actual finding.

Clean PRs never hit the expensive model at all. Noisy PRs only spend Sonnet tokens on the specific issues Haiku surfaced. The PR structure check stays on Haiku start to finish — counting lines and checking commit boundaries, not doing deep analysis.

One skill, multiple agents, multiple model tiers. Each one matched to the complexity of what it's actually doing. That's the principle in action.

What's Next

Choosing the right model is only half of the equation. The other half — the part I've been spending even more time on — is how to build and manage the context each agent needs. A clean CLAUDE.md, reusable skills and commands, context folders per agent, and the decision of whether to organize agents by language or by area.

That's Part 2. Coming soon.

Leave a Comment

Your email will not be published.