SaaS AI vs self-hosted: which one your business actually needs

TL;DR

For the typical UK SME the right answer is SaaS AI first, self-hosted later or never. Three thresholds justify the migration: monthly token volume above roughly 50 million, regulatory or contractual data residency requirements that rule out US cloud APIs, or a sub-200ms latency requirement that the cloud round-trip cannot meet. Below all three, SaaS wins on cost, time-to-value and operational simplicity. Above any one of them, the maths and the compliance picture change.

Key takeaways

- SaaS AI is the right starting point for the typical UK SME. Self-hosted is the migration step.
- Three thresholds justify migration: ~50M tokens a month, hard data residency, sub-200ms latency.
- SaaS wins below those thresholds on cost, speed and operational simplicity.
- The hybrid pattern, SaaS for breadth and self-hosted for sensitive or high-volume workloads, is the 2026 default for firms that cross one threshold but not all three.
- Multi-model gateways (LiteLLM, Portkey, OpenRouter) make hybrid feasible without re-engineering.

A finance director showed me two vendor quotes side by side. One was a SaaS AI subscription at £400 a month. The other was a self-hosted setup at £18,000 for the build and £1,400 a month for cloud GPUs. He pointed at both and said, “Same use case. Two different worlds. Which one am I supposed to be looking at?”

It is the right question. By 2026 nearly every business AI use case can be solved with either path, and the marketing on both sides will tell you theirs is the obvious answer. The plain-English version: SaaS until one of three things is true, self-hosted when it is.

The choice you’re facing

SaaS AI means renting access to a foundation model through a vendor’s API. OpenAI for GPT, Anthropic for Claude, Google for Gemini, plus sector-specific platforms. You pay per token, the vendor runs the infrastructure, you switch on or off in minutes. A typical UK SME running 10 million tokens a month sits in the £100 to £500 range.

Self-hosted means running the model yourself, on your own servers or rented cloud GPUs. You pick an open-weight model (Llama, Mistral, DeepSeek, Qwen), deploy an inference engine like vLLM or NVIDIA NIM, and pay for the hardware whether you use it or not. A two-GPU cluster running a quantised 70-billion-parameter model costs around £1,200-£1,500 a month in 2026, before staff time.

Three thresholds decide which side of the line your use case lands on: how much you use the model, what data you put into it, and how quickly you need the answer. Below all three thresholds, SaaS is almost always the right call. Above any one, the question opens up.

A middle category sits between the two paths: managed inference services (Modal, Baseten, TrueFoundry) and multi-model gateways (LiteLLM, Portkey, OpenRouter). They let you self-host without running the infrastructure, or run a hybrid setup without re-engineering. The hybrid pattern is the 2026 default for firms that cross one threshold but not all three.
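As a concrete illustration of the gateway pattern, a minimal LiteLLM proxy configuration might route everyday traffic to a SaaS API and sensitive traffic to a self-hosted endpoint. This is a sketch: the logical model names and the internal URL below are illustrative, not a recommendation.

```yaml
# litellm proxy config.yaml (sketch): callers ask for a logical model name,
# the gateway decides which backend actually serves it.
model_list:
  - model_name: everyday-chat          # low-sensitivity, low-volume traffic
    litellm_params:
      model: openai/gpt-4o-mini        # SaaS API, pay per token
  - model_name: sensitive-chat         # workloads with residency constraints
    litellm_params:
      model: hosted_vllm/llama-3.1-70b # self-hosted vLLM endpoint
      api_base: https://inference.internal.example/v1
```

Swapping a backend is then a one-line change in this file. Application code keeps asking for `everyday-chat` and never learns which provider answered.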

When SaaS is the right answer

SaaS is the right answer for the exploratory phase, for low to moderate volume, and for tasks where the data is not regulated and the latency is not critical.

The exploratory phase is the easiest case. A team testing a chatbot, automating classification or drafting summaries a few hundred at a time gains nothing from owning infrastructure. SaaS lets you stand up a working version in days, validate the use case, and abandon it cheaply if it does not pan out. Time-to-value is the lever, not unit cost.

Low to moderate volume is the typical state. A 20-person recruitment firm using a SaaS API to generate job descriptions, rank CVs and draft candidate feedback runs around 15 million tokens a month. The bill is £200 to £300. The cost of a self-hosted alternative, even before staff time, is several times higher. Below the 50-million-token mark, SaaS wins on raw maths.

Mainstream low-sensitivity use cases also stay on SaaS. Drafting marketing copy, summarising public meeting notes, generating product descriptions from photos. The data does not need a residency commitment, the use case does not need sub-second latency, and the SaaS providers have already done the work of optimising the model for these tasks at scale.

Vendor switching matters here too. A SaaS deployment can swap from OpenAI to Anthropic to Gemini with a configuration change, especially through a gateway like LiteLLM. That flexibility costs nothing in SaaS and significant engineering time in self-hosted.

When self-hosted is the right answer

Self-hosted becomes the right answer when one of three thresholds is crossed: high volume, hard residency, or strict latency.

High volume is the financial threshold. Above roughly 50 million tokens a month of continuous usage, cloud GPU rental costs drop below per-token SaaS pricing. Above 100 million the gap is dramatic. A media company at 200 million tokens a month for news briefs and personalised newsletters can cut its bill by 60-80% by self-hosting on two cloud GPUs.
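The arithmetic behind the threshold can be sketched in a few lines of Python. The £27-per-million blended rate below is an illustrative figure chosen so the break-even lands at the ~50-million-token mark; real SaaS pricing varies by model and by input/output mix.

```python
def monthly_saas_cost(tokens_m: float, price_per_m: float = 27.0) -> float:
    """SaaS bill in £: monthly tokens (in millions) times a blended
    £-per-million-token rate. 27 is illustrative, not a quoted price."""
    return tokens_m * price_per_m

def monthly_selfhosted_cost(gpu_rental: float = 1350.0) -> float:
    """Self-hosted bill in £: fixed GPU rental, paid whether used or not.
    Staff time deliberately excluded, as in the figures above."""
    return gpu_rental

def breakeven_tokens_m(price_per_m: float = 27.0,
                       gpu_rental: float = 1350.0) -> float:
    """Monthly volume (millions of tokens) at which the two bills are equal."""
    return gpu_rental / price_per_m
```

At 200 million tokens a month this sketch gives a SaaS bill of £5,400 against £1,350 of GPU rental, a 75% saving, squarely in the 60-80% band described above.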

Hard data residency is the regulatory threshold. UK financial services firms under FCA model risk supervision, NHS Trusts processing patient data, and any business with customer contracts that explicitly forbid US cloud processing land here. The ICO’s 2026 transfer guidance requires a Transfer Impact Assessment when personal data moves to US-based SaaS APIs, and the NCSC recommends UK-sovereign or on-premises infrastructure for sensitive workloads. SaaS does not satisfy these requirements without significant contractual work; self-hosted on UK infrastructure does.

Strict latency is the operational threshold. SaaS API calls typically add 200-500ms of network round-trip. For interactive voice agents, edge devices, real-time monitoring or anything where the user feels every extra second, the round-trip is the bottleneck. Self-hosted inference on local or edge infrastructure can respond in under 100ms.

The vendor lock-in case sits underneath all three. A business that is wholly dependent on one SaaS provider inherits that provider’s pricing, deprecation schedule and policy decisions. A self-hosted open-weight setup gives you the option to swap models, swap cloud providers or move on-premises without renegotiating contracts. For mission-critical workflows, that optionality is part of the case.

What it costs to get wrong

Both directions have a failure mode, and the two are mirror images of each other.

SaaS bill shock is the first. A team that starts at £400 a month and finds productive use cases everywhere can be at £5,000 to £10,000 a month within a year. Without a planned trigger, that is unbudgeted spend and the team is too embedded to switch quickly. The fix is to forecast token volume and set a price (say, £3,000 a month) at which you actively evaluate alternatives.

Self-hosted infrastructure debt is the opposite failure. A team that migrates too early ends up running GPU drivers, model serving and security patches without the platform engineering capacity to do it well. An in-house engineer or outsourced platform team can cost £5,000-£15,000 a month, more than the SaaS bill the migration was meant to displace. The fix is a managed inference service, or stay on SaaS until volume demands the move.

Premature migration is the subtle version of the same trap. A use case validated for three months on SaaS does not have enough data to justify a 24-month infrastructure commitment. Migrate when volume has been above the threshold for at least three consecutive months and demand is expected to stay there.
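The three-consecutive-months rule is easy to encode as a guard. A minimal sketch, with the threshold and streak length as assumptions you would tune to your own numbers:

```python
def ready_to_migrate(monthly_tokens_m: list[float],
                     threshold_m: float = 50.0,
                     streak: int = 3) -> bool:
    """True only if the most recent `streak` months were all at or above
    the volume threshold (in millions of tokens per month)."""
    recent = monthly_tokens_m[-streak:]
    return len(recent) == streak and all(v >= threshold_m for v in recent)
```

One busy month does not trigger it; three sustained months above the line do.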

Vendor lock-in cuts both ways. SaaS lock-in shows up in pricing changes and deprecations you cannot opt out of. Self-hosted lock-in shows up when you have built around a single open-weight family and a better one appears. The mitigation is the same: route through a multi-model gateway from day one, keep prompts portable, and negotiate clear exit and data-export terms.

What to ask before you decide

Five questions, in order, before signing.

One: what is your forecast monthly token volume in twelve months, not today? The decision is about the next year, not the next month. If your forecast crosses the 50-million-token mark, plan for migration even if you start on SaaS.
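A crude compound-growth projection is enough to answer question one. The 15% monthly growth rate in the usage note is purely illustrative:

```python
def forecast_tokens_m(current_m: float, monthly_growth: float,
                      months: int = 12) -> float:
    """Project monthly token volume forward at a steady compound
    growth rate (e.g. 0.15 for 15% month on month)."""
    return current_m * (1 + monthly_growth) ** months
```

A firm at 10 million tokens a month growing 15% month on month ends the year above 50 million, so it should start on SaaS but plan the migration now.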

Two: what data will the AI process, and where do your customer contracts and regulators say that data can live? Read the customer Master Service Agreements before you read the vendor’s pricing page. If a customer prohibits US cloud processing of their data, the SaaS option is off the table for that workload regardless of cost.

Three: what is your acceptable latency? If the answer is “a couple of seconds is fine”, SaaS is uncontested. If the answer is “the user has to feel like they are having a conversation”, self-hosted or edge becomes part of the picture.

Four: who runs the infrastructure if you go self-hosted? If the answer is “we will work that out later”, you are not ready for self-hosted. Use a managed inference service or stay on SaaS until you have the capacity in-house.

Five: what is your switching plan? Both SaaS and self-hosted lock you in differently. A multi-model gateway like LiteLLM or Portkey, plus portable prompts, plus a clear data-export clause in your contracts, is the 2026 baseline for keeping options open.

The honest answer for UK SMEs in 2026 is start on SaaS, set up a gateway, watch the volume and the contracts, and migrate workloads that cross the thresholds. The right architecture is rarely all of one or the other.


Frequently asked questions

At what monthly token volume does self-hosted start to make sense?

The conservative cross-over in 2026 sits around 50 million tokens a month for a continuously-used workload. Below that, the GPU rental cost of a self-hosted cluster usually exceeds the SaaS API bill. Above it, and especially above 100 million, the maths flips. The cross-over depends on your model choice and how steady the usage is.

Does the ICO actually prohibit sending data to OpenAI or Anthropic?

No. UK GDPR allows international transfers under specified safeguards. The ICO's 2026 transfer guidance does require a Transfer Impact Assessment when personal data flows to US-based SaaS, and customer contracts in regulated sectors increasingly forbid it. Permitted does not mean frictionless. The compliance work is real.

What is the smallest realistic self-hosted setup for an SME?

A two-GPU cluster running a quantised open-weight model (Llama 3.1 70B at four-bit, or Mixtral 8x22B) on rented cloud GPUs. Around £1,200-£1,500 a month at 2026 prices. Managed inference services like Modal, Baseten or TrueFoundry handle the operational layer if you do not have a platform engineer.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
