The managing director of a fifty-staff UK marketing-services firm sat down last month with four AI vendor proposals on her desk. Vendor A pitched GPT-5.5 for general drafting at £0.025 per thousand input tokens. Vendor B pitched Claude Opus 4.6 for long-context document work. Vendor C pitched Gemini 3.1 Pro for novel reasoning. Vendor D pitched a self-hosted Mistral deployment “to avoid vendor lock-in” at £900 a month for the GPU compute.
All four described their product as “transformer-based AI”. She realised all four were selling the same underlying architecture and competing only on which size and which variant to use for which task. Without a working frame for what a transformer actually is, she could not price the trade-offs. This is the post that gives you that frame.
What is a transformer?
A transformer is a neural network architecture, introduced in the 2017 paper Attention Is All You Need by Vaswani and colleagues, that processes a sequence of information by reading the whole thing at once and using a mechanism called attention to figure out which parts matter for understanding which other parts. Text, images broken into patches, audio, and code can all be fed in.
Before the transformer, the dominant architectures for sequences were recurrent neural networks and long short-term memory networks. They processed one token at a time, passing a hidden state forward like a relay runner with a baton. Training was slow because each step waited for the previous one. Information about early tokens faded as the sequence got longer.
The transformer replaced that relay with parallel, context-aware processing. Every token in an input gets compared against every other token in one pass, with attention weights deciding which comparisons matter. That single change is why training scales efficiently on modern hardware and why capability has grown the way it has since 2017. The Polo Club Transformer Explainer is the cleanest visual walkthrough if you want to see it move.
Why does it matter for your business?
It matters because the transformer is the architecture that made AI commercially useful at scale, and the trade-offs you face when buying AI tools are largely transformer trade-offs in disguise. Model size, context window, inference cost, and hallucination risk are four faces of one architectural choice. Understanding the choice lets you read a vendor pitch the way an engineer reads a spec sheet.
The mechanism inside is called attention. Each token gets converted into three vectors: a query (“what am I looking for?”), a key (“what information do I have?”), and a value (“what context should I contribute?”). The model compares every query against every key, normalises the similarity scores into weights with a softmax, and uses those weights to take a weighted average of the values. Multiple attention heads run in parallel, each free to pick up a different relationship: grammar, semantic similarity, long-range references.
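If you want to see that in working code rather than prose, here is a minimal sketch in Python with NumPy. The four tokens, eight dimensions, and random projection matrices are toy values for illustration; real models learn these matrices from data and run thousands of dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy values: 4 tokens, each an 8-dimensional vector. Real models learn
# these from data; here they are random, purely for illustration.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv  # queries, keys, values

# Compare every query with every key, scale, normalise into weights.
scores = Q @ K.T / np.sqrt(K.shape[-1])  # (4, 4): every token vs every token
weights = softmax(scores)                # each row sums to 1

# Each token's new representation is a weighted average of the values.
output = weights @ V
print(weights.round(2))
```

One attention head is exactly this; a production model runs dozens of heads in parallel, layer after layer.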
That mechanism unlocks scaling laws. Research by Kaplan and colleagues in 2020, refined by Hoffmann and colleagues in 2022, showed that performance improves predictably with parameters, training data, and compute. The Stanford AI Index Report 2025 documents that frontier training compute is doubling roughly every five months, which is why GPT-5.5, Claude Opus 4.6, and Gemini 3.1 Pro have moved as fast as they have.
The shift in economics matters more for your business than the maths. Training a frontier model now costs between $78m and $192m, per Galileo’s 2025 analysis. SMEs cannot do that. Inference, the cost of using an already-trained model, has fallen more than 280-fold for GPT-3.5-level systems between November 2022 and October 2024, per Stanford HAI. The procurement question stopped being “can I afford AI?” two years ago. It is now which size of transformer fits which task, at what total cost of ownership.
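The break-even arithmetic is worth running before any vendor meeting. Here is a rough sketch using only the figures quoted in the scenario above; it counts input tokens only and deliberately ignores output-token pricing, support, and engineering time.

```python
# Break-even between API pricing and self-hosting, using the quoted
# figures from the opening scenario (input tokens only; output pricing,
# egress, and staff time ignored for simplicity).
api_rate_per_1k_tokens = 0.025   # GBP, Vendor A's quoted rate
self_host_monthly = 900.0        # GBP, Vendor D's GPU line

breakeven_tokens = self_host_monthly / api_rate_per_1k_tokens * 1_000
print(f"break-even: {breakeven_tokens / 1e6:.0f} million input tokens per month")
# Below roughly 36 million tokens a month, the API is cheaper before
# you count the staff time self-hosting demands.
```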
Where will you actually meet it?
You meet transformers in nearly every AI tool you touch, usually without the vendor naming the variant. Three architectural shapes account for almost all commercial use. Encoder-only transformers, modelled on BERT, are trained to understand and classify. Decoder-only transformers, including GPT-5.5, Claude Opus 4.6, and Gemini 3.1 Pro, are trained to generate. Encoder-decoder transformers, like Google’s T5, handle text-to-text tasks.
In practice, encoder-only transformers are the engine behind embeddings, semantic search, and document retrieval. Decoder-only transformers are what you use when you ask ChatGPT to draft an email or Copilot to write code. Document AI tools, including Google’s Document AI used by FibroGen to automate invoice processing for a forty-fold ROI per Google Cloud’s case write-up, combine encoder transformers for classification with encoder-decoder hybrids for structured extraction.
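If someone on your team wants to feel the encoder/decoder split directly, a short sketch with the open-source Hugging Face libraries shows both sides in a few lines. The model names here are small open models chosen so the sketch runs on a laptop; they stand in for, and are much weaker than, the frontier systems in any vendor pitch.

```python
# pip install sentence-transformers transformers torch

# Encoder-only: turn text into vectors for semantic search.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small BERT-style encoder
docs = ["Invoice from Acme Ltd, due in 30 days", "Meeting notes, Q3 planning"]
query_vec = encoder.encode("unpaid supplier bills")
doc_vecs = encoder.encode(docs)
print(util.cos_sim(query_vec, doc_vecs))  # the invoice should score higher

# Decoder-only: generate text left to right, one token at a time.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # tiny decoder-only model
print(generator("Draft a one-line follow-up email:", max_new_tokens=30))
```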
The practical takeaway is that “transformer-based AI” tells you almost nothing on its own. Encoder, decoder, or encoder-decoder is the first question. How many parameters is the second. What context window and what training data is the third. A vendor who cannot answer those three is selling you a brand wrapper around someone else’s model.
When to ask about it, and when to ignore it
Ask when you are signing a contract, comparing two vendor quotes, or building a budget for AI tooling that runs into five figures or more. Ignore it when you are using an off-the-shelf consumer tool for ad hoc work. The dividing line is whether the architectural detail changes the price you pay, the data that leaves your firm, or the workflow that breaks if the model gets retired.
Three structural limits are worth pricing up front in any procurement conversation. The first is quadratic attention. Each token attends to every other token, so doubling context length quadruples the attention computation. A ten-thousand-token context costs roughly one hundred million attention operations per layer; a hundred-thousand-token context costs ten billion, per Weights and Biases’ technical write-up. That is why long-context pricing looks the way it does.
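The arithmetic is simple enough to sanity-check yourself. This sketch counts only the pairwise score matrix, setting aside constant factors and the number of heads:

```python
# Attention cost grows with the square of context length.
for context_len in (10_000, 20_000, 100_000):
    pairwise_scores = context_len ** 2  # one score per token pair, per head, per layer
    print(f"{context_len:>7,} tokens -> {pairwise_scores:>14,} scores per head per layer")
# 10,000 -> 100,000,000; 20,000 -> 400,000,000 (doubling quadruples);
# 100,000 -> 10,000,000,000
```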
The second is memory footprint. A seventy-billion-parameter model needs roughly 140GB of memory at 16-bit precision, the standard serving format, or 35GB at four-bit quantisation; the sums are sketched after this paragraph. Self-hosting frontier models means H100 or H200-class GPU spend, which is what Vendor D’s £900-a-month line is paying for. The third is hallucination. Transformers learn patterns and extrapolate from them; they do not consult a ground truth. For regulated work in legal, medical, or financial services, retrieval-augmented generation, fine-tuning, and human review are part of the system, not optional add-ons.
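The memory line is the same kind of back-of-envelope sum. This sketch counts the weights alone; a live deployment adds the KV cache and activations on top:

```python
# GPU memory for a 70-billion-parameter model's weights at each precision.
params = 70e9
for label, bytes_per_param in [("16-bit float", 2), ("8-bit int", 1), ("4-bit int", 0.5)]:
    print(f"{label:>12}: {params * bytes_per_param / 1e9:,.0f} GB")
# 16-bit: 140 GB (two 80GB H100s' worth), 8-bit: 70 GB, 4-bit: 35 GB (fits one H100)
```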
The UK regulatory layer sits over all of this. The ICO publishes detailed AI guidance under UK GDPR. DSIT has published AI principles. The CMA’s AI Foundation Models: Initial Report addresses transformer-based foundation models specifically. UK government research published in February 2026 found only one in six UK businesses currently use AI, with ethical concerns at 80 percent, high cost at 76 percent, and unclear regulation at 72 percent named as the top barriers.
Related concepts worth knowing alongside this
A transformer sits inside a network of related ideas you will hear in the same vendor conversations. A neural network is the broader family the transformer belongs to. Deep learning is the training approach that makes large neural networks practical. A large language model is, in nearly every current commercial case, a decoder-only transformer trained on enormous volumes of text. A foundation model is a large pre-trained transformer that other tools are built on top of, often the layer vendors are selling.
Quantisation compresses a transformer’s weights from 16- or 32-bit floats to 8-bit or 4-bit integers, cutting inference costs by half or more with minimal accuracy loss. Retrieval-augmented generation grounds a transformer’s output in your own documents, reducing hallucination on factual questions. Chain-of-thought prompting asks a transformer to show its reasoning step by step rather than jumping to an answer, which lifts performance on complex tasks. Each of these is a separate post, and each makes more sense once you know what the transformer underneath is doing.
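Of the three, quantisation is the easiest to demystify in code. Here is a toy sketch of symmetric 8-bit quantisation on a single random weight matrix; production schemes work per-channel and are more careful, but the storage-versus-error trade is the same shape.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)  # one toy weight matrix

# Symmetric int8 quantisation: map the float range onto -127..127.
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)      # 4x smaller than 32-bit floats
w_restored = w_int8.astype(np.float32) * scale    # what inference actually uses

print(f"storage: {w.nbytes / 1e6:.0f} MB -> {w_int8.nbytes / 1e6:.0f} MB")
print(f"mean absolute error: {np.abs(w - w_restored).mean():.2e}")
```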
The marketing-services MD ended that meeting with a clearer brief. Roughly seventy percent of her firm’s AI work was short-form drafting, the natural home of a smaller decoder-only model on the GPT-5 Mini end of the price ladder. Twenty percent was long-context document analysis, where Claude Opus 4.6’s million-token beta window earns its premium. Ten percent was novel reasoning, which Gemini 3.1 Pro handles well. The self-hosted Mistral option came off the table for now, because at her token volume the £900 a month was buying capacity she did not yet need. If you would like a working session to map your own AI tooling against the same questions, book a conversation.