A finance director showed me two vendor quotes side by side. One was a SaaS AI subscription at £400 a month. The other was a self-hosted setup at £18,000 for the build and £1,400 a month for cloud GPUs. He pointed at both and said, “Same use case. Two different worlds. Which one am I supposed to be looking at?”
It is the right question. By 2026 nearly every business AI use case can be solved with either path, and the marketing on both sides will tell you theirs is the obvious answer. The plain-English version: stay on SaaS until one of three things becomes true, and self-host when it does.
The choice you’re facing
SaaS AI means renting access to a foundation model through a vendor’s API. OpenAI for GPT, Anthropic for Claude, Google for Gemini, plus sector-specific platforms. You pay per token, the vendor runs the infrastructure, you switch on or off in minutes. A typical UK SME running 10 million tokens a month sits in the £100 to £500 range.
Self-hosted means running the model yourself, on your own servers or rented cloud GPUs. You pick an open-weight model (Llama, Mistral, DeepSeek, Qwen), deploy an inference engine like vLLM or NVIDIA NIM, and pay for the hardware whether you use it or not. A two-GPU cluster running a quantised 70-billion-parameter model costs around £1,200-£1,500 a month in 2026, before staff time.
Three thresholds decide which side of the line your use case lands on: how much you use the model, what data you put into it, and how quickly you need the answer. Below all three thresholds, SaaS is almost always the right call. Above any one, the question opens up.
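The three thresholds can be written down as a simple decision rule. A sketch only: the 50-million-token figure comes from the volume discussion below, while the 200ms latency cut-off is an illustrative assumption (tighter than a typical SaaS round-trip), not a hard limit.

```python
def recommend_path(tokens_per_month: int,
                   data_is_regulated: bool,
                   latency_budget_ms: int) -> str:
    """SaaS below all three thresholds; otherwise flag which were crossed.
    Cut-off values are illustrative, not hard limits."""
    volume_threshold = 50_000_000    # the ~50M tokens/month financial threshold
    latency_threshold_ms = 200       # tighter than a SaaS round-trip (assumption)

    crossed = []
    if tokens_per_month > volume_threshold:
        crossed.append("volume")
    if data_is_regulated:
        crossed.append("residency")
    if latency_budget_ms < latency_threshold_ms:
        crossed.append("latency")
    if not crossed:
        return "saas"
    return "evaluate self-hosting (" + ", ".join(crossed) + ")"

# A 15M-token/month firm with unregulated data and relaxed latency stays on SaaS.
print(recommend_path(15_000_000, False, 2_000))
```

The point of the sketch is the shape of the rule: any single crossed threshold opens the question; none of them closed keeps you on SaaS.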
A middle category sits between the two paths: managed inference services (Modal, Baseten, TrueFoundry) and multi-model gateways (LiteLLM, Portkey, OpenRouter). They let you self-host without running the infrastructure, or run a hybrid setup without re-engineering. The hybrid pattern is the 2026 default for firms that cross one threshold but not all three.
When SaaS is the right answer
SaaS is the right answer for the exploratory phase, for low to moderate volume, and for tasks where the data is not regulated and the latency is not critical.
The exploratory phase is the easiest case. A team testing a chatbot, automating classification or drafting summaries a few hundred at a time gains nothing from owning infrastructure. SaaS lets you stand up a working version in days, validate the use case, and abandon it cheaply if it does not pan out. Time-to-value is the lever, not unit cost.
Low to moderate volume is the typical state. A 20-person recruitment firm using a SaaS API to generate job descriptions, rank CVs and draft candidate feedback runs around 15 million tokens a month. The bill is £200 to £300. The cost of a self-hosted alternative, even before staff time, is several times higher. Below the 50-million-token mark, SaaS wins on raw maths.
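The raw maths is easy to reproduce. The rates below are illustrative blends spanning the examples above (£100-£500 per 10 million tokens, £200-£300 per 15 million); £1,350 a month is the midpoint of the £1,200-£1,500 self-hosted range, staff time excluded.

```python
def breakeven_tokens(saas_rate_per_m_tokens: float,
                     self_hosted_fixed_monthly: float) -> float:
    """Monthly token volume at which per-token SaaS spend matches a
    fixed self-hosted bill (staff time excluded on both sides)."""
    return self_hosted_fixed_monthly / saas_rate_per_m_tokens * 1_000_000

# Two illustrative blended SaaS rates, in £ per million tokens.
for rate in (20.0, 30.0):
    volume = breakeven_tokens(rate, 1_350)
    print(f"£{rate:.0f}/M tokens -> breakeven at {volume / 1e6:.1f}M tokens/month")
```

At cheaper blended rates the crossover sits above 50 million tokens a month; at pricier model mixes it sits below, which is why "roughly 50 million" is the planning figure rather than a precise line.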
Mainstream low-sensitivity use cases also stay on SaaS. Drafting marketing copy, summarising public meeting notes, generating product descriptions from photos. The data does not need a residency commitment, the use case does not need sub-second latency, and the SaaS providers have already done the work of optimising the model for these tasks at scale.
Vendor switching matters here too. A SaaS deployment can swap from OpenAI to Anthropic to Gemini with a configuration change, especially through a gateway like LiteLLM. That flexibility costs nothing in SaaS and significant engineering time in self-hosted.
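As a concrete sketch, LiteLLM's proxy lets you register two providers behind one model alias in its config file, so swapping vendors is a config edit rather than a code change. The model identifiers below are examples of the documented config shape, not recommendations:

```yaml
# LiteLLM proxy config.yaml -- one alias, two interchangeable providers
model_list:
  - model_name: default-chat          # the name your application calls
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: default-chat          # same alias, different vendor
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
```

Your application only ever asks for `default-chat`; which vendor answers is a deployment decision.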
When self-hosted is the right answer
Self-hosted becomes the right answer when one of three thresholds is crossed: high volume, hard residency, or strict latency.
High volume is the financial threshold. Above roughly 50 million tokens a month of continuous usage, cloud GPU rental costs drop below per-token SaaS pricing. Above 100 million the gap is dramatic. A media company at 200 million tokens a month for news briefs and personalised newsletters can cut its bill by 60-80% by self-hosting on two cloud GPUs.
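The media-company example checks out on the same illustrative assumptions: at £20 per million tokens, 200 million tokens a month costs £4,000 on SaaS against roughly £1,350 of GPU rental, a saving in the middle of the 60-80% band.

```python
tokens_per_month = 200_000_000
saas_rate_per_m_tokens = 20.0    # £ per million tokens (illustrative blend)
self_hosted_monthly = 1_350.0    # £, midpoint of the £1,200-£1,500 range

saas_bill = tokens_per_month / 1_000_000 * saas_rate_per_m_tokens
saving = 1 - self_hosted_monthly / saas_bill
print(f"SaaS £{saas_bill:,.0f}/month vs self-hosted £{self_hosted_monthly:,.0f}/month "
      f"-> {saving:.0%} saving")
```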
Hard data residency is the regulatory threshold. UK financial services firms under FCA model risk supervision, NHS Trusts processing patient data, and any business with customer contracts that explicitly forbid US cloud processing land here. The ICO’s 2026 transfer guidance requires a Transfer Impact Assessment when personal data moves to US-based SaaS APIs, and the NCSC recommends UK-sovereign or on-premises infrastructure for sensitive workloads. SaaS does not satisfy these requirements without significant contractual work; self-hosted on UK infrastructure does.
Strict latency is the operational threshold. SaaS API calls typically add 200-500ms of network round-trip. For interactive voice agents, edge devices, real-time monitoring or anything where the user feels every extra second, the round-trip is the bottleneck. Self-hosted inference on local or edge infrastructure can respond in under 100ms.
The vendor lock-in case sits underneath all three. A business that is wholly dependent on one SaaS provider inherits that provider’s pricing, deprecation schedule and policy decisions. A self-hosted open-weight setup gives you the option to swap models, swap cloud providers or move on-premises without renegotiating contracts. For mission-critical workflows, that optionality is part of the case.
What it costs to get wrong
Each direction has its own failure mode, and the two are mirror images of each other.
SaaS bill shock is the first. A team that starts at £400 a month and finds productive use cases everywhere can be at £5,000 to £10,000 a month within a year. Without a planned trigger, that is unbudgeted spend and the team is too embedded to switch quickly. The fix is to forecast token volume and set a price (say, £3,000 a month) at which you actively evaluate alternatives.
Self-hosted infrastructure debt is the opposite failure. A team that migrates too early ends up running GPU drivers, model serving and security patches without the platform engineering capacity to do it well. An in-house engineer or outsourced platform team can cost £5,000-£15,000 a month, more than the SaaS bill the migration was meant to displace. The fix is a managed inference service, or stay on SaaS until volume demands the move.
Premature migration is the subtle version of the same trap. A use case validated for three months on SaaS does not have enough data to justify a 24-month infrastructure commitment. Migrate when volume has been above the threshold for at least three consecutive months and demand is expected to stay there.
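The backward-looking half of that rule is easy to codify, using the 50-million threshold and three-month window as stated (the forward-looking half, whether demand stays there, remains a judgment call).

```python
def ready_to_migrate(monthly_tokens: list[int],
                     threshold: int = 50_000_000,
                     window: int = 3) -> bool:
    """True if the most recent `window` months were all at or above threshold."""
    if len(monthly_tokens) < window:
        return False
    return all(t >= threshold for t in monthly_tokens[-window:])

# Three consecutive months above 50M tokens clears the bar.
print(ready_to_migrate([30_000_000, 55_000_000, 60_000_000, 70_000_000]))
```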
Vendor lock-in cuts both ways. SaaS lock-in shows up in pricing changes and deprecations you cannot opt out of. Self-hosted lock-in shows up when you have built around a single open-weight family and a better one appears. The mitigation is the same: route through a multi-model gateway from day one, keep prompts portable, and negotiate clear exit and data-export terms.
What to ask before you decide
Five questions, in order, before signing.
One: what is your forecast monthly token volume in twelve months, not today? The decision is about the next year, not the next month. If your forecast crosses the 50-million-token mark, plan for migration even if you start on SaaS.
Two: what data will the AI process, and where do your customer contracts and regulators say that data can live? Read the customer Master Service Agreements before you read the vendor’s pricing page. If a customer prohibits US cloud processing of their data, the SaaS option is off the table for that workload regardless of cost.
Three: what is your acceptable latency? If the answer is “a couple of seconds is fine”, SaaS is uncontested. If the answer is “the user has to feel like they are having a conversation”, self-hosted or edge becomes part of the picture.
Four: who runs the infrastructure if you go self-hosted? If the answer is “we will work that out later”, you are not ready for self-hosted. Use a managed inference service or stay on SaaS until you have the capacity in-house.
Five: what is your switching plan? Both SaaS and self-hosted lock you in differently. A multi-model gateway like LiteLLM or Portkey, plus portable prompts, plus a clear data-export clause in your contracts, is the 2026 baseline for keeping options open.
The honest answer for UK SMEs in 2026 is start on SaaS, set up a gateway, watch the volume and the contracts, and migrate workloads that cross the thresholds. The right architecture is rarely all of one or the other.