The fastest way to lose confidence in an automation is to make it unpredictable. Not just unreliable, unpredictable: sometimes it’s brilliant, sometimes it’s nonsense, and the cost swings around like a roulette wheel.

That’s the trap with LLM-powered workflows. Each run feels cheap, so teams bake a model choice into app code or a one-off script, ship it, and move on. Then the workflow scales, usage creeps up, retries kick in, context windows expand, and suddenly finance is asking what changed.

Nothing changed. You just stopped noticing the tiny purchasing decisions you were making on every run.

This is why I think model routing belongs in the automation layer, not scattered through application logic.

The real problem is not model quality, it’s decision sprawl

When teams hard-code model selection, they usually do it for a sensible reason: they want consistency. Pick one model, tune prompts, ship.

But as soon as you have more than one workflow, or more than one task type inside a workflow, that decision stops being ‘one decision’. It becomes dozens:

  • Summarising a call transcript is not the same risk profile as writing a client-facing email.

  • Extracting invoice fields is not the same as interpreting an ambiguous support ticket.

  • A retry after a timeout is not the same as a retry after a guardrail failure.

When these choices live in app code, they’re hard to see, hard to audit, and hard to change. That’s how you end up with ‘mystery spend’: costs are the emergent behaviour of lots of hidden decisions.

The mechanism: every run is a tiny purchasing decision

An LLM workflow doesn’t just ‘run’. It buys tokens.

On every request you decide:

  • Which model to call (and what its price per input and output token is)

  • How much context to send (which directly drives input tokens)

  • How much you let it talk (max output tokens)

  • How many retries you allow (and under what failure conditions)

If you run a workflow 2,000 times a month, even a £0.20 swing per run is £400 a month. Across five workflows at that volume it is £2,000 a month: the difference between ‘fine’ and a line item you have to justify.

The bit most teams miss is that the price is not only about the model. It’s also about the policy around it: retries, fallbacks, and how quickly you escalate to a more expensive option.
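
To make that arithmetic concrete, here is a minimal sketch of how per-run cost falls out of those four decisions. The function names are hypothetical and the prices and token counts are placeholders, not real provider rates. It assumes the worst case, where every allowed retry is billed in full.

```python
def cost_per_attempt(input_tokens: int, output_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost in GBP of a single model call, priced per 1,000 tokens."""
    return ((input_tokens / 1000) * price_in_per_1k
            + (output_tokens / 1000) * price_out_per_1k)


def worst_case_run_cost(input_tokens: int, max_output_tokens: int,
                        price_in_per_1k: float, price_out_per_1k: float,
                        max_retries: int) -> float:
    """Upper bound for one run: the first attempt plus every allowed retry."""
    attempts = 1 + max_retries
    return attempts * cost_per_attempt(
        input_tokens, max_output_tokens, price_in_per_1k, price_out_per_1k
    )


def monthly_cost(cost_per_run: float, runs_per_month: int) -> float:
    """Project a per-run cost (or a per-run swing) over a month of runs."""
    return cost_per_run * runs_per_month


# Placeholder numbers for illustration only, not real provider prices.
single = worst_case_run_cost(4000, 500, 0.01, 0.03, max_retries=0)
with_retries = worst_case_run_cost(4000, 500, 0.01, 0.03, max_retries=2)
print(f"no retries: GBP {single:.3f}, two retries: GBP {with_retries:.3f}")
print(f"0.20 swing at 2,000 runs/month: GBP {monthly_cost(0.20, 2000):.2f}")
```

The same model and the same prompt cost three times as much with two retries allowed, which is why the retry policy belongs in the cost conversation.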

What changed recently: routing is now a workflow problem, not an app problem

n8n has made this significantly easier because you can treat model choice as a first-class node decision. The Model Selector node gives you a single place to route between models, and you can pull in pricing data (for example via OpenRouter) so the workflow can estimate cost per request.

n8n even published a template that compares token costs across hundreds of models using OpenRouter pricing: https://n8n.io/workflows/12100-compare-llm-token-costs-across-350-models-with-openrouter/

And there’s a practical guide to the Model Selector patterns here: https://automategeniushub.com/guide-to-use-the-n8n-model-selector-node/

Those are not ‘nice to have’ features. They are the building blocks of something you can run like an operations system.

The Policy Plane: a routing framework you can actually operate

When I design this for a team, I name the layer on purpose. If you don’t name it, it gets treated like a couple of nodes and forgotten.

I call it the Policy Plane, and it has five steps.

1) Price every run

Before you optimise anything, make cost visible at the point of execution.

This can be as simple as: pull pricing for the candidate models, estimate tokens, and log an estimated £ cost alongside the workflow run ID. The estimate does not need to be perfect to be useful; it needs to be consistent.

The moment you can answer ‘what did this run cost?’ you can start having grown-up conversations about trade-offs.
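
As a rough sketch of what ‘price every run’ could look like, the snippet below builds a log record with an estimated cost. Everything in it is an assumption for illustration: the model names, the prices in the PRICES table, and the four-characters-per-token heuristic, which is only a crude rule of thumb for English text. In a real workflow you would swap in cached pricing data and a proper tokenizer.

```python
import json
import math

# Hypothetical prices in GBP per million tokens; replace with cached pricing data.
PRICES = {
    "small-model": {"input": 0.10, "output": 0.40},
    "large-model": {"input": 2.00, "output": 8.00},
}


def estimate_tokens(text: str) -> int:
    """Crude heuristic (assumption): about four characters per token."""
    return math.ceil(len(text) / 4)


def price_run(run_id: str, model: str, prompt: str, max_output_tokens: int) -> dict:
    """Build a log record with an upper-bound cost estimate for one run."""
    price = PRICES[model]
    input_tokens = estimate_tokens(prompt)
    cost = (input_tokens * price["input"]
            + max_output_tokens * price["output"]) / 1_000_000
    return {
        "run_id": run_id,
        "model": model,
        "input_tokens": input_tokens,
        "max_output_tokens": max_output_tokens,
        "estimated_cost_gbp": round(cost, 6),
    }


record = price_run("run-001", "small-model",
                   "Summarise this call transcript. " * 100, 500)
print(json.dumps(record))
```

Because the estimate uses the output ceiling rather than the actual output length, it is an upper bound. That bias is fine as long as it is applied the same way to every run.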

2) Classify tasks by risk

Most workflows contain multiple tasks, and they do not share the same risk tolerance.

A simple classification that works well:

  • Low-risk: summarise, extract, format, classify (errors are annoying, not dangerous)

  • Medium-risk: drafts for internal review, structured reasoning where a human will check it

  • High-risk: client-facing, financial, legal-adjacent, or anything that can create irreversible downstream actions

This classification is what lets you be cheap where you can, and careful where you must.
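
One way to encode these tiers is a plain lookup table, sketched below. The task-type labels are hypothetical, and treating unknown task types as high-risk is a design choice of this sketch rather than something the tiers require.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical task-type labels mapped onto the three tiers.
TASK_RISK = {
    "summarise": Risk.LOW,
    "extract": Risk.LOW,
    "format": Risk.LOW,
    "classify": Risk.LOW,
    "internal_draft": Risk.MEDIUM,
    "structured_reasoning": Risk.MEDIUM,
    "client_email": Risk.HIGH,
    "financial": Risk.HIGH,
    "legal_adjacent": Risk.HIGH,
}


def classify_risk(task_type: str, irreversible_action: bool = False) -> Risk:
    """Return the risk tier for a task; anything irreversible is always HIGH."""
    if irreversible_action:
        return Risk.HIGH
    # Unknown task types default to HIGH so new tasks start out careful.
    return TASK_RISK.get(task_type, Risk.HIGH)


print(classify_risk("summarise").value)
print(classify_risk("summarise", irreversible_action=True).value)
```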

3) Start cheap, then escalate on signals

If you always start with the most capable model, you are paying for certainty even when you do not need it.

A better pattern is: start with the cheapest model that is likely to succeed, then escalate only when you see a failure signal.

Signals can include:

  • Low confidence score (if you compute one)

  • Missing required fields in structured output

  • Guardrail trip (PII detected, policy violation, hallucinated citation)

  • Excessive retries or timeouts

The point is not to build a perfect classifier. The point is to make escalation explicit and measurable.
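
A minimal sketch of cheap-first routing with explicit escalation reasons might look like this. The required fields, model names, and the fake_call stand-in are all hypothetical; the part worth copying is that every escalation leaves a reason behind that you can count later.

```python
import json

# Hypothetical schema for an invoice-extraction task.
REQUIRED_FIELDS = ("invoice_number", "total")


def escalation_reason(raw_output, confidence=None, min_confidence=0.7):
    """Return a short reason string if the output should escalate, else None."""
    try:
        data = json.loads(raw_output)
    except (TypeError, ValueError):
        return "invalid_json"
    if not isinstance(data, dict):
        return "invalid_json"
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        return "missing_fields:" + ",".join(missing)
    if confidence is not None and confidence < min_confidence:
        return "low_confidence"
    return None


def run_cheap_first(call_model, prompt, models=("cheap-model", "capable-model")):
    """Try models in cost order and record why each step escalated."""
    trail = []
    for model in models:
        output = call_model(model, prompt)
        reason = escalation_reason(output)
        trail.append({"model": model, "escalation_reason": reason})
        if reason is None:
            return {"model": model, "output": output, "trail": trail}
    return {"model": None, "output": None, "trail": trail}


# Stand-in for a real model call so the sketch runs offline.
def fake_call(model, prompt):
    if model == "cheap-model":
        return '{"invoice_number": "INV-1"}'
    return '{"invoice_number": "INV-1", "total": 120.5}'


result = run_cheap_first(fake_call, "Extract the invoice fields.")
print(result["model"], result["trail"][0]["escalation_reason"])
```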

4) Set hard ceilings and a degrade mode

Budgets work when they are enforced in the system, not in someone’s spreadsheet.

I like three ceilings:

  • Per-run max £ (stop or degrade if a single request is about to blow out)

  • Daily budget for the workflow (so one incident does not ruin the month)

  • A degrade mode (for example: switch to summarise-only, or queue for human review) when you hit limits

This is what turns cost control from ‘hope’ into ‘policy’.
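
The ceilings can be sketched as a small check that runs before every model call. The class, its field names, and the limits below are illustrative assumptions; in a real workflow the daily total would live in a shared store and reset each day rather than sit in memory.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    per_run_max_gbp: float
    daily_max_gbp: float
    spent_today_gbp: float = 0.0

    def decide(self, estimated_cost_gbp: float) -> tuple:
        """Return (action, reason) before a call is made."""
        if estimated_cost_gbp > self.per_run_max_gbp:
            return ("degrade", "per_run_ceiling")
        if self.spent_today_gbp + estimated_cost_gbp > self.daily_max_gbp:
            return ("degrade", "daily_budget")
        return ("proceed", None)

    def record(self, actual_cost_gbp: float) -> None:
        """Add the cost of a completed call to today's running total."""
        self.spent_today_gbp += actual_cost_gbp


budget = Budget(per_run_max_gbp=0.05, daily_max_gbp=1.00)
print(budget.decide(0.10))   # a single request over the per-run ceiling
print(budget.decide(0.02))   # a normal request
budget.record(0.99)
print(budget.decide(0.02))   # same request once the day is nearly spent
```

What ‘degrade’ means is up to the workflow: summarise-only, or a queue for human review, as described above. The check only decides that the normal path is closed.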

5) Build a fallback chain you can change without redeploying

When something goes wrong, the worst outcome is silence. The second worst is uncontrolled retries.

A good fallback chain is boring:

Primary model → secondary model → human review queue.

The crucial detail is operational: you must be able to change this chain quickly without shipping code. If a provider degrades, you switch. If a model starts failing on a specific task, you route around it.

That is an operations responsibility, not a feature request.
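
Here is a minimal sketch of a chain that is driven by config rather than code. The JSON shape, the model names, and the flaky_call stand-in are assumptions for illustration. The idea is that changing the order, or removing a degraded provider, means editing the config and nothing else.

```python
import json

# Hypothetical config; in practice this would live in a table or data store.
ROUTES_JSON = """
{
  "invoice_extraction": ["primary-model", "secondary-model", "human_review"]
}
"""


def run_chain(task_type, prompt, call_model, routes):
    """Walk the configured chain in order; stop at the first step that succeeds."""
    failures = []
    for step in routes[task_type]:
        if step == "human_review":
            break
        try:
            output = call_model(step, prompt)
        except Exception as exc:
            failures.append({"model": step, "error": str(exc)})
            continue
        return {"handled_by": step, "output": output, "failures": failures}
    # Reached a human_review step or ran out of models: never fail silently.
    return {"handled_by": "human_review", "output": None, "failures": failures}


# Stand-in for a real model call: the primary provider is having a bad day.
def flaky_call(model, prompt):
    if model == "primary-model":
        raise TimeoutError("provider timed out")
    return f"{model} handled: {prompt}"


routes = json.loads(ROUTES_JSON)
result = run_chain("invoice_extraction", "Extract fields", flaky_call, routes)
print(result["handled_by"], result["failures"])
```

Each model is tried once and then the chain moves on, so there is no uncontrolled retry loop, and the final step always lands somewhere a person can see it.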

What it looks like in n8n (practically)

If you want a concrete mental model, imagine your workflow has a single “routing junction” near the top.

  • First, a small node that tags the request: task type, risk tier, and any known constraints (language, output format).

  • Next, a pricing lookup (cached for the day) so you can attach an estimated £ cost to each candidate model.

  • Then the Model Selector node chooses a primary model based on rules that include both risk tier and cost ceiling.

  • After the call, you run validation: is the JSON valid, are the required fields present, and did the output trip a guardrail?

  • If validation fails, you route to the next model in the chain, or into a human review queue.

The key detail is that you log the decision, not just the result. Store: selected model, estimated cost, failure reason, and which fallback (if any) fired. That is what makes the system debuggable.

Governance without theatre

“Governance” sounds heavy, but for most teams it comes down to a few sensible defaults:

  • A single config table (even a simple JSON blob) that defines your routes and ceilings

  • An audit log for model choice and cost per run

  • A weekly review of the outliers: the runs that were expensive, slow, or escalated

If you do only one thing, make sure you can answer this question quickly: which workflows are escalating most often, and why.

That is how you catch prompt drift, context bloat, and provider degradation before they become a monthly surprise.
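
If the audit log stores the model choice, estimated cost, failure reason, and fallback for each run, the weekly question reduces to a small aggregation. The rows and field names below are made up for illustration; they are not real data.

```python
from collections import Counter

# Hypothetical audit-log rows, one per run.
AUDIT_LOG = [
    {"workflow": "support_triage", "model": "cheap-model",
     "estimated_cost_gbp": 0.002, "failure_reason": None, "fallback": None},
    {"workflow": "support_triage", "model": "cheap-model",
     "estimated_cost_gbp": 0.002, "failure_reason": "missing_fields",
     "fallback": "capable-model"},
    {"workflow": "invoice_extraction", "model": "cheap-model",
     "estimated_cost_gbp": 0.004, "failure_reason": "invalid_json",
     "fallback": "capable-model"},
    {"workflow": "invoice_extraction", "model": "cheap-model",
     "estimated_cost_gbp": 0.004, "failure_reason": "invalid_json",
     "fallback": "human_review"},
]


def escalation_report(rows):
    """Count escalated runs per (workflow, failure reason), most frequent first."""
    counts = Counter(
        (row["workflow"], row["failure_reason"])
        for row in rows
        if row["fallback"] is not None
    )
    return counts.most_common()


for (workflow, reason), n in escalation_report(AUDIT_LOG):
    print(workflow, reason, n)
```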

Two common failure modes (and how to avoid them)

Failure mode 1: cost telemetry after the fact

If you only look at spend in a billing dashboard, you are always late. You need cost context attached to each run so you can correlate spikes with specific workflows, payload sizes, and failure conditions.

Failure mode 2: ‘fallbacks’ that are just retries

Retries are not a strategy. They are a hope that the same call will magically succeed next time.

A real fallback changes something: different model, smaller context, different prompt, or a human review path.

A simple two-week pilot that proves value

If you want to test this without boiling the ocean, do it like this:

  1. Pick one low-risk workflow that already runs often (support triage, meeting summaries, lead enrichment).

  2. Add pricing visibility and log estimated £ cost per run.

  3. Implement a cheap-first route with one explicit escalation condition.

  4. Add one hard ceiling (per-run max £) and a degrade mode (queue for review).

  5. Measure: average cost per run, failure rate, and time-to-recover from provider issues.

In two weeks you should be able to answer three questions with evidence:

  • What does this workflow cost per month at current volume?

  • What is the cheapest route that still meets quality requirements?

  • What happens when the primary model or provider has a bad day?

That is enough to justify rolling the Policy Plane pattern across the rest of your automation estate.

The position I will defend

Once you run automations at scale, model choice is not a product decision. It is an operations policy, and it belongs in your automation layer.

If you keep model selection embedded in app code, you will keep paying for surprises: surprise spend, surprise failure modes, and surprise delays when you need to change something quickly.

If you centralise routing in the workflow, you get something you can actually operate: measurable trade-offs, consistent fallbacks, and a knob you can turn without a release.
