essay

Every architectural decision is a bill waiting to arrive. The small team that knows which bill each decision creates is the one that keeps shipping.

2026-04-22 · 7 min read · chris olson

The small teams I work with cannot afford a FinOps practice in the way Big Tech means the term. There is no cost-engineering org, no monthly cost review with department heads, no dashboard that shows spend per feature team. What they can afford — and what I insist on — is a mindset. The architect is the first line of defense against a bill that will not make sense in six months.

This post is how I think about that on the projects I ship and the projects I advise on.

01 three bills that matter

Most cloud spend on a small team breaks into three lines, and they do not compete in the way you would expect.

  • Compute. The always-on services, the background workers, the databases. The bill that grows linearly with uptime.
  • Model. Every token you send to a frontier model, every embedding you compute, every speech-to-text second you transcribe. The bill that grows linearly with traffic, at a multiple most teams underestimate.
  • People. Not a cloud line item, but the one that determines whether the other two matter. Engineers who cannot explain the bill cannot reduce it.

The failure mode I see most often: teams obsess over the first bill — VM sizing, cluster autoscaling, reserved instances — while the second bill grows without anyone looking at it. By the time the CFO asks, the answer is "we are paying Anthropic $38k a month because someone wired up a naive retrieval loop last quarter."

02 model choice and prompt caching

The easiest win in a modern stack is still the one most teams skip: pick the smallest model that clears the bar, and cache everything you can.

Picking the smallest model is not about being cheap. It is about being deliberate. Most agent pipelines I inherit are running Opus on every turn, including turns that are literally extracting a JSON field from a sentence. That is a 10x cost multiplier for a task Haiku handles without blinking. The engineering work is half an hour; the savings compound forever.
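
To make the compounding concrete, here is a back-of-envelope sketch in Python. It assumes the roughly 10x price gap mentioned above and a hypothetical share of turns that can move to the small model. The function name and the 70% figure are illustrative, not measurements.

```python
def blended_relative_cost(small_fraction: float, price_ratio: float = 10.0) -> float:
    """Cost relative to running every turn on the large model (1.0 = no change).

    small_fraction: share of turns routed to the small model (assumed, 0..1).
    price_ratio: how many times more a large-model turn costs (assumed).
    """
    if not 0.0 <= small_fraction <= 1.0:
        raise ValueError("small_fraction must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    # Small-model turns cost 1/price_ratio each; the rest stay at full price.
    return small_fraction / price_ratio + (1.0 - small_fraction)


if __name__ == "__main__":
    # Hypothetical pipeline where 70% of turns are simple extraction.
    print(f"{blended_relative_cost(0.7):.2f}")  # prints 0.37
```

Under those assumptions the model bill falls to about 37% of the all-large baseline, and the remaining spend is concentrated in the turns where the reasoning is actually needed.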

Prompt caching is the other half. If your system prompt is 4k tokens and it repeats across every request, you are paying for it every time. The Anthropic and Google caches both reward the same behavior: stable prefixes, thoughtful boundaries. I have moved pipelines from $0.12 per call to $0.015 per call without changing what the model does — only changing how I present the context. That is cost engineering that the user never sees.
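
A minimal sketch of what "stable prefixes, thoughtful boundaries" can look like in Python. It is not tied to any provider SDK: the `Prompt` type, `build_prompt`, and `prefix_fingerprint` are hypothetical names, and the only assumption is that the cache matches on an identical leading portion of the prompt. The idea is to keep everything that never changes at the front and push per-request data behind the boundary.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    stable_prefix: str    # system prompt and tool docs: identical on every request
    volatile_suffix: str  # per-request data: timestamps, user query


def build_prompt(system_prompt: str, tool_docs: str,
                 user_query: str, request_time: str) -> Prompt:
    # Everything that never changes goes first, always in the same order.
    prefix = system_prompt + "\n\n" + tool_docs
    # Anything that varies per request goes after the cache boundary.
    # A timestamp placed in the prefix would change it on every call.
    suffix = f"Current time: {request_time}\n\nUser: {user_query}"
    return Prompt(stable_prefix=prefix, volatile_suffix=suffix)


def prefix_fingerprint(prompt: Prompt) -> str:
    # Hash of the cacheable part; log it to confirm the prefix really is stable.
    return hashlib.sha256(prompt.stable_prefix.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = build_prompt("You are a billing assistant.", "TOOLS: lookup_invoice",
                     "Where is invoice 17?", "2026-04-22T09:00:00Z")
    b = build_prompt("You are a billing assistant.", "TOOLS: lookup_invoice",
                     "Cancel my plan.", "2026-04-22T09:05:00Z")
    print(prefix_fingerprint(a) == prefix_fingerprint(b))  # prints True
```

If the fingerprint changes between requests that should share a prefix, something volatile has leaked into the front of the prompt, and the cache is probably not doing its job.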

03 observability of dollars

The same team that can tell me the p99 latency of a request cannot usually tell me its cost. That is a gap in the telemetry, not a limit of the platform.

My rule on every agent pipeline I ship: every tool call, every model call, every retrieval step emits a cost.micro_usd attribute on its OTel span. The span already exists for latency tracking. Adding the cost attribute costs nothing and gives me a per-request dollar amount queryable in the same tool I use for latency. When a bill spikes I can trace it to a specific user flow, a specific prompt, a specific day. When I cannot, I am guessing.
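
A sketch of that rule in Python, under a few stated assumptions. The prices are passed in as parameters because they differ by provider and model, and the numbers in the usage example are made up. `RecordingSpan` is a hypothetical stand-in so the snippet runs without dependencies; it exposes the same `set_attribute(key, value)` method that OpenTelemetry spans provide, so a real span can be passed in its place.

```python
def call_cost_micro_usd(input_tokens: int, output_tokens: int,
                        input_price: float, output_price: float) -> int:
    """Cost of one model call in micro-USD.

    Prices are in USD per million tokens, which is numerically the same
    as micro-USD per token, so no unit conversion is needed.
    """
    return round(input_tokens * input_price + output_tokens * output_price)


class RecordingSpan:
    """Hypothetical stand-in for a tracing span; stores attributes in a dict."""

    def __init__(self) -> None:
        self.attributes: dict[str, int] = {}

    def set_attribute(self, key: str, value: int) -> None:
        self.attributes[key] = value


def record_cost(span, input_tokens: int, output_tokens: int,
                input_price: float, output_price: float) -> int:
    # Attach the cost to the span that already tracks latency for this call.
    cost = call_cost_micro_usd(input_tokens, output_tokens,
                               input_price, output_price)
    span.set_attribute("cost.micro_usd", cost)
    return cost


if __name__ == "__main__":
    span = RecordingSpan()
    # Illustrative prices: 3 USD per million input tokens, 15 per million output.
    record_cost(span, input_tokens=4000, output_tokens=500,
                input_price=3.0, output_price=15.0)
    print(span.attributes)  # prints {'cost.micro_usd': 19500}
```

Storing the value as an integer number of micro-USD avoids floating-point drift when the tracing backend sums it across spans.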

This is the single cheapest FinOps practice I know. It also forces a cultural habit: every engineer on the team sees the cost of the thing they just shipped. You cannot fix a bill nobody reads.

04 what a finops-aware stack looks like

On the projects I ship today, the stack choices are cost choices first. The specifics:

  • Backend language. Rust where the service is long-lived and compute-sensitive, Go everywhere else. "Why I reach for Rust" and "Why I reach for Go" cover the decision tree. A Rust service running at 256 MiB can do the work of a Go service at 512 MiB. At small scale this is invisible; at real traffic it is the difference between two instances and four.
  • Model tier. Haiku or Gemini Flash for extraction, classification, routing — the work that does not need a reasoning model. Opus or Sonnet only for the turns where the reasoning is the product. Never the other way around by default.
  • Frontend. Next.js for content-heavy surfaces where the edge cache pays for itself; Verve for sites like this one where the binary is the CDN's problem and the per-island WASM chunks are small enough to ship lazily. The question is: who is paying for every byte, every render?
  • Mobile. Flutter, one codebase. The bill for two native teams is the largest cost line a small team never talks about. "Why Flutter for mobile" has the math.

None of these are about being cheap. They are about paying for what the product actually needs.
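
The model-tier rule from the list above is small enough to sketch in Python. This assumes the harness already labels each turn with a task kind. The labels, the `Tier` enum, and `pick_tier` are hypothetical names for illustration, and mapping a tier to a concrete model id is left to configuration.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"  # extraction, classification, routing
    LARGE = "large"  # turns where the reasoning is the product


# Hypothetical task labels that justify the large tier. Keep this list short
# and reviewed: adding an entry is a cost decision.
REASONING_TASKS = frozenset({"planning", "multi_step_reasoning"})


def pick_tier(task_kind: str) -> Tier:
    # Default to the small tier; escalate only for explicitly listed tasks.
    # Unknown task kinds stay small, so new code paths are cheap by default.
    return Tier.LARGE if task_kind in REASONING_TASKS else Tier.SMALL


if __name__ == "__main__":
    for kind in ("extraction", "planning", "something_new"):
        print(kind, pick_tier(kind).value)
```

The design choice that matters is the direction of the default: a turn has to earn the large model, not opt out of it.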

05 the cost story I tell clients

When a client asks me to review an architecture, I run the same three questions:

  1. What is the bill going to look like at 10x today's traffic? If the answer is "I am not sure," we stop there and instrument. You cannot optimize what you cannot see.
  2. Which of those costs scale linearly with users, and which scale with engineering headcount? The two are fixed differently. The first is a knob you turn; the second is a hire you do or do not make.
  3. What is the cheapest thing I can change this quarter that will still matter next quarter? Most optimization work is short-half-life. A caching strategy that you throw away in three months was not worth building. A model-tiering rule you enforce in the agent harness pays for itself every request forever.
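
For the first question, even a crude projection beats "I am not sure." The Python sketch below assumes a deliberately simple model: model spend scales per request, compute scales in whole instances. Every number in the usage example is hypothetical, and `project_bill` is an illustrative name, not a tool I ship.

```python
def project_bill(requests_per_month: int, micro_usd_per_request: int,
                 requests_per_instance: int, instance_micro_usd: int) -> dict:
    """Split a monthly bill into a traffic-linear line and a step-wise line."""
    if requests_per_instance <= 0:
        raise ValueError("requests_per_instance must be positive")
    # Integer ceiling division, with at least one always-on instance.
    instances = max(1, -(-requests_per_month // requests_per_instance))
    model = requests_per_month * micro_usd_per_request  # grows with traffic
    compute = instances * instance_micro_usd            # grows in steps
    return {
        "instances": instances,
        "model_usd": model / 1_000_000,
        "compute_usd": compute / 1_000_000,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 1M requests/month at 15,000 micro-USD each,
    # one 40 USD instance per 5M requests.
    today = project_bill(1_000_000, 15_000, 5_000_000, 40_000_000)
    at_10x = project_bill(10_000_000, 15_000, 5_000_000, 40_000_000)
    print(today)
    print(at_10x)
```

With these made-up inputs the compute line doubles at 10x traffic while the model line grows tenfold, which is the kind of asymmetry the second question is meant to surface.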

That is the entire consulting practice in three questions. The honest answer is not "use Rust" or "switch to Flutter." The honest answer is: know what you are paying for, make every engineer on the team know too, and choose the stack that keeps the bill from surprising you.

A small team that operates this way can out-execute a big team that does not. I have seen it happen more than once. It is the argument for the mindset in one sentence.