There is a moment every organisation that has shipped an AI agent will recognise.
The demo worked beautifully. The prototype ran for weeks without issue. You showed it to the team, got buy-in, and shipped it. Then the first AI invoice arrived and you found yourself doing the maths twice because the number didn't seem possible.
This is not bad luck. It is a predictable consequence of a small number of architectural decisions made during the build — decisions that look harmless in a prototype and expensive in production. This article names them, shows you what they cost, and explains what the fix looks like in plain terms.
The Number That Surprises People
Before getting into causes, it helps to understand why AI agent costs feel so unpredictable compared to other software costs.
A traditional SaaS tool has a fixed monthly fee. A cloud server has a cost per hour that scales linearly with usage. An AI agent has a cost per task that varies based on how much text gets processed in each run — and that number can swing by a factor of ten depending on how the agent is built, not just how much it is used.
A poorly built agent processing 500 tasks a day can cost more than a well-built agent processing 5,000 tasks a day. Usage volume is not the primary driver. Architecture is.
Here is a real example from a system we rebuilt after discovering a few of these issues. Same task. Same AI model. Same volume. Projected costs:
Before: 48,000 tokens per task · £0.43 per task · £215/day · £6,450/month · £77,400/year
After: 11,000 tokens per task · £0.10 per task · £50/day · £1,500/month · £18,000/year
The difference — £59,400 per year — came from fixing three architectural problems. No new features. No change to what the agent does. Just how it does it.
Problem One: The Agent That Never Stops Talking to the AI
The most common and most expensive pattern in AI agent design is what engineers call the uncontrolled tool loop.
Here is what it looks like in practice. Your agent needs to gather five pieces of information before it can produce a result. In a poorly architected system, this happens sequentially: the agent asks the AI what to do next, the AI says "fetch piece of information one," the agent fetches it, sends the result back to the AI, the AI says "now fetch piece of information two," and so on. Five round-trips to the AI model just to collect data. Then another round-trip for the actual analysis.
Every one of those round-trips costs tokens. And crucially, every round-trip inherits all the context from the previous ones — so by round-trip five, you are sending the AI the full conversation history of everything that happened in round-trips one through four. The context window grows with every step.
A well-built agent separates data collection from AI reasoning entirely. All five pieces of information are gathered simultaneously using regular code — no AI involved. Then the AI is called exactly once, given everything it needs, and asked to produce the result. One round-trip instead of six. Context window stays flat instead of growing with every step.
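The separation described above can be sketched in a few lines of Python. The fetcher functions and `call_model` here are hypothetical stand-ins for the agent's real data sources and AI provider SDK; the point is the shape — concurrent plain-code gathering, then exactly one model call.

```python
import asyncio

# Hypothetical fetchers -- stand-ins for the agent's real data sources.
async def fetch_company_profile(domain: str) -> str:
    return f"profile for {domain}"

async def fetch_recent_news(domain: str) -> str:
    return f"news for {domain}"

async def fetch_tech_stack(domain: str) -> str:
    return f"tech stack for {domain}"

def call_model(prompt: str) -> str:
    # Placeholder for the single AI call (your provider's SDK goes here).
    return f"analysis based on {len(prompt)} chars of context"

async def run_task(domain: str) -> str:
    # Gather all inputs concurrently with regular code -- no AI round-trips.
    profile, news, stack = await asyncio.gather(
        fetch_company_profile(domain),
        fetch_recent_news(domain),
        fetch_tech_stack(domain),
    )
    # One AI call with everything it needs, instead of one call per lookup.
    prompt = f"Analyse this company.\n{profile}\n{news}\n{stack}"
    return call_model(prompt)

result = asyncio.run(run_task("example.com"))
```

Because the fetchers run concurrently, this version is usually faster as well as cheaper — the AI is never in the critical path of data collection.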
The cost impact in plain terms: an agent making six AI calls per task versus one AI call per task, running 500 tasks a day, is paying for roughly five unnecessary AI calls on every single task. At typical API rates, that is the difference between a £1,500 monthly bill and a £9,000 monthly bill. For the same output.
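The compounding effect of the growing context can be made concrete with a rough input-token model. All the numbers below — instruction size, tool-result size, price — are illustrative assumptions, not measurements; the ratio is what matters.

```python
# Rough input-token model, assuming each round-trip resends the full history.
INSTRUCTIONS = 2_000  # tokens of fixed instructions per call (assumed)
TOOL_RESULT = 1_500   # tokens per fetched piece of data (assumed)

def looped_tokens(pieces: int) -> int:
    # Call i carries the instructions plus every result gathered so far;
    # the final call carries all of them for the actual analysis.
    return sum(INSTRUCTIONS + i * TOOL_RESULT for i in range(pieces + 1))

def single_call_tokens(pieces: int) -> int:
    # One call: instructions plus all results, sent once.
    return INSTRUCTIONS + pieces * TOOL_RESULT

loop = looped_tokens(5)       # six calls in total: 34,500 input tokens
flat = single_call_tokens(5)  # one call: 9,500 input tokens
print(loop, flat, round(loop / flat, 2))
```

Under these assumptions the looped design pays for roughly 3.6 times the input tokens per task — before any output tokens are counted — which is how a sixfold difference in call count turns into the kind of bill gap described above.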
Problem Two: The Context Window That Ate Everything
Even if your agent only calls the AI once per task, there is a second cost driver that is easy to miss: how much text you send in that single call.
AI models charge by the token — roughly one token per four characters of text. Everything you send to the model in a single call contributes to the bill: the instructions you give it, the data it needs to analyse, the conversation history if there is one, and the examples you include to guide its behaviour.
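The four-characters-per-token rule of thumb makes it easy to sanity-check a payload before it is sent. This is an approximation — real tokenisers vary by model and language — but it is close enough for budgeting:

```python
def estimate_tokens(text: str) -> int:
    # Rule of thumb: roughly one token per four characters of English text.
    return max(1, len(text) // 4)

def estimate_cost_gbp(text: str, price_per_1k_tokens: float) -> float:
    # price_per_1k_tokens is whatever your provider charges for input tokens.
    return estimate_tokens(text) / 1_000 * price_per_1k_tokens

page = "x" * 200_000  # a 200 KB raw HTML page
print(estimate_tokens(page))  # about 50,000 tokens before any trimming
```

Logging this estimate for every payload an agent sends is one of the cheapest observability measures available: cost problems show up in the logs before they show up on the invoice.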
The two most common ways this gets out of hand:
Raw data dumps. An agent that scrapes a company website and sends the entire HTML to the AI for analysis might be sending 50,000 tokens of raw markup when only 3,000 tokens of actual content are relevant. Extracting the useful text before sending it to the AI — a five-line code change — eliminates the rest. We routinely see 60 to 80 per cent context window reductions from this single change.
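The "five-line" extraction step can be done with nothing but the standard library. A minimal sketch using Python's built-in `html.parser` — production scrapers tend to use a dedicated library, but the principle is identical: strip the markup, keep the text.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><script>var x=1;</script></head>"
        "<body><h1>Acme Ltd</h1><p>We make widgets.</p></body></html>")
print(extract_text(page))  # "Acme Ltd We make widgets."
```

The markup, scripts, and styling never reach the model — only the content the model can actually reason about.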
Unbounded tool results. If your agent calls external APIs or searches the web as part of its workflow, the results of those calls get passed to the AI. A single verbose API response — a long WHOIS record, a detailed news article, a full LinkedIn profile dump — can consume the equivalent of several thousand tokens. Without a size cap on each result, one unusually large response can double the cost of the entire task.
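The size cap itself is a one-function fix. The 8,000-character budget below (roughly 2,000 tokens under the four-characters-per-token rule) is an assumed default — the right limit depends on the tool and should be tuned per result type:

```python
def cap_tool_result(text: str, max_chars: int = 8_000) -> str:
    # Truncate verbose tool output before it reaches the model.
    # Appending a marker tells the model the result was cut, so it
    # does not mistake a truncated response for a complete one.
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[truncated]"
```

Wrapping every tool call in a cap like this turns the worst case from "one verbose response doubles the task cost" into a bounded, predictable maximum per result.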
Neither of these requires rebuilding your agent. They require adding constraints that were never put there in the first place.
Problem Three: Paying Full Price for the Same Thing Every Time
This one is less widely known and almost entirely unaddressed in AI agents built without specialist input.
Most major AI providers offer prompt caching — a mechanism that dramatically reduces the cost of processing text that appears repeatedly across tasks. If your agent uses a detailed set of instructions that gets sent with every task, and those instructions are identical each time, you pay full price on the first call and a fraction of that on every subsequent call within the caching window.
For most AI agents, the instructions sent to the model — what the agent is, what it should do, what format it should produce — are the same on every single task. Without caching enabled, you pay to process those instructions every single time. With caching enabled, you pay once and get a discount on every repetition.
The discount varies by provider, but for high-volume agents, prompt caching alone typically reduces input token costs by 30 to 50 per cent on the repeated portions. On a £6,000 monthly bill, that is £1,800 to £3,000 recovered from a configuration change that takes an hour to implement.
Most prototypes and agency-built agents do not have this enabled. Not because it is hard — because whoever built the agent did not know it existed or did not prioritise it during the initial build.
What This Looks Like on a P&L
It is worth being concrete about the financial profile of an AI agent at different scales, because the decisions feel low-stakes when you are processing a hundred tasks a day and very high-stakes when you are processing ten thousand.
At 200 tasks per day — a modest automation for a small sales or operations team — the difference between a well-built and a poorly built agent is typically £2,000 to £4,000 per month. Meaningful, but not business-critical.
At 2,000 tasks per day — a team-wide deployment or a customer-facing automation — the same architectural problems cost £20,000 to £40,000 per month. At that scale, the AI API bill becomes one of your largest operational line items, and it is entirely variable — meaning a spike in usage produces a cost spike with no warning and no cap unless you have explicitly built one in.
At 20,000 tasks per day — which is not unusual for a product that has found traction — an uncontrolled tool loop combined with no context management and no caching is the kind of cost structure that shows up in a board meeting as a unit economics problem and prompts uncomfortable questions about whether the product is viable at scale.
The architectural decisions that determine which of these scenarios you are in were made during the first two weeks of the build. They are fixable — but fixing them after the fact is significantly more disruptive than building them correctly from the start.
The Three Questions Worth Asking Right Now
If you have an AI agent running in your business — or you are planning to build one — these three questions will tell you quickly whether you have a cost problem waiting to surface.
One: How many times does your agent call the AI model per task? If the answer is more than two or three, ask why each call is necessary. Data collection and AI reasoning should be separated. Every unnecessary round-trip is a direct cost with no benefit.
Two: Does anything in your agent send raw, unprocessed data to the AI? Scraped web pages, API responses, document contents. If the answer is yes, ask whether the AI actually needs all of it or whether the relevant portion could be extracted first. In almost every case, it can.
Three: Does your agent use prompt caching? If the person who built it cannot answer this question confidently, the answer is probably no.
None of these questions requires a full audit to answer. But the answers are the difference between an agent that runs cost-efficiently at scale and one that turns into a budget conversation at your next board meeting.
The Broader Point
The firms that will get the most value from AI agents over the next three years are not necessarily the ones that move fastest. They are the ones that build with operational discipline from the start — treating token cost as a first-class engineering constraint rather than something to worry about later.
Later, in our experience, is always more expensive than now.
QualitaX builds production-grade agentic systems for B2B businesses. If your AI agent costs are higher than they should be — or you want to build something that won't surprise you with the bill — get in touch.