What is a transformer in AI? Why modern AI works the way it does

TL;DR

A transformer is a neural network architecture, introduced in the 2017 paper Attention Is All You Need, that reads a whole sequence at once and uses attention to figure out which parts of the input matter for understanding which other parts. Every major AI tool you will be pitched in 2026 runs on a transformer or a transformer hybrid. You do not need to build one. You do need to understand it well enough to ask which variant, what context window, and what it costs at your token volume.

Key takeaways

- A transformer is the neural network architecture that powers every major generative AI system in 2026, including GPT-5.5, Claude Opus 4.6, and Gemini 3.1 Pro.
- The breakthrough is attention. Each token in an input compares itself against every other token in parallel, which is why transformers scale where older architectures hit walls.
- Three variants matter for procurement: encoder-only for understanding and embeddings, decoder-only for generation, encoder-decoder for translation-style tasks.
- Training a frontier model now costs $78m to $192m, but inference at GPT-3.5 quality has fallen 280-fold since November 2022, so SMEs buy access, not capacity.
- Quadratic attention, memory footprint, and hallucination risk are structural, not bugs. Price them up front when you compare vendors.

The managing director of a fifty-staff UK marketing-services firm sat down last month with four AI vendor proposals on the desk. Vendor A pitched GPT-5.5 for general drafting at £0.025 per thousand input tokens. Vendor B pitched Claude Opus 4.6 for long-context document work. Vendor C pitched Gemini 3.1 Pro for novel reasoning. Vendor D pitched a self-hosted Mistral deployment “to avoid vendor lock-in” at £900 a month for the GPU compute.

All four described their product as “transformer-based AI”. She realised the four had picked the same architecture and were competing on which size and which variant to use for which task. Without a working frame for what a transformer actually is, she could not price the trade-offs. This is the post that gives you that frame.

What is a transformer?

A transformer is a neural network architecture, introduced in the 2017 paper Attention Is All You Need by Vaswani and colleagues, that processes a sequence of information by reading the whole thing at once and using a mechanism called attention to figure out which parts matter for understanding which other parts. Text, images broken into patches, audio, and code can all be fed in.

Before the transformer, the dominant architectures for sequences were recurrent neural networks and long short-term memory networks. They processed one token at a time, passing a hidden state forward like a relay runner with a baton. Training was slow because each step waited for the previous one. Information about early tokens faded as the sequence got longer.

The transformer replaced that relay with parallel, context-aware processing. Every token in an input gets compared against every other token in one pass, with attention weights deciding which comparisons matter. That single change is why training scales efficiently on modern hardware and why capability has grown the way it has since 2017. The Polo Club Transformer Explainer is the cleanest visual walkthrough if you want to see it move.

Why does it matter for your business?

It matters because the transformer is the architecture that made AI commercially useful at scale, and the trade-offs you face when buying AI tools are largely transformer trade-offs in disguise. Model size, context window, inference cost, and hallucination risk are four faces of one architectural choice. Understanding the choice lets you read a vendor pitch the way an engineer reads a spec sheet.

The mechanism inside is called attention. Each token gets converted into three vectors: a query (“what am I looking for?”), a key (“what information do I have?”), and a value (“what context should I contribute?”). The model compares every query against every key, normalises the similarity scores into weights, and uses those weights to take a weighted average of the values. Multiple attention heads run in parallel, learning grammar, semantic similarity, and long-range references.
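The query-key-value flow above can be sketched in a few lines of NumPy. This is a toy illustration of scaled dot-product attention with made-up dimensions, not a real model's implementation: three tokens, four-dimensional vectors, one head.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query scores every key,
    the scores are normalised into weights (softmax), and the
    weights take a weighted average of the values."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every query vs every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V                               # weighted average of values

# Toy input: three tokens, four-dimensional vectors
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one context-aware vector per token
```

In a real transformer this runs across dozens of heads and layers at once, but the core idea is exactly this weighted averaging.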

That mechanism unlocks scaling laws. Research by Kaplan and colleagues in 2020, refined by Hoffmann and colleagues in 2022, showed that performance improves predictably with parameters, training data, and compute. The Stanford AI Index Report 2025 documents that frontier training compute is doubling roughly every five months, which is why GPT-5.5, Claude Opus 4.6, and Gemini 3.1 Pro have moved as fast as they have.

The economics shift matters more for your business than the maths. Training a frontier model now costs between $78m and $192m, per Galileo’s 2025 analysis. SMEs cannot do that. Inference, the cost of using an already-trained model, has fallen more than 280-fold for GPT-3.5-level systems between November 2022 and October 2024, per Stanford HAI. The procurement question stopped being “can I afford AI?” two years ago. It is now which size of transformer fits which task, at what total cost of ownership.

Where will you actually meet it?

You meet transformers in nearly every AI tool you touch, usually without the vendor naming the variant. Three architectural shapes account for almost all commercial use. Encoder-only transformers, modelled on BERT, are trained to understand and classify. Decoder-only transformers, including GPT-5.5, Claude Opus 4.6, and Gemini 3.1 Pro, are trained to generate. Encoder-decoder transformers, like Google’s T5, handle text-to-text tasks.

In practice, encoder-only transformers are the engine behind embeddings, semantic search, and document retrieval. Decoder-only transformers are what you use when you ask ChatGPT to draft an email or Copilot to write code. Document AI tools, including Google’s Document AI used by FibroGen to automate invoice processing for a forty-fold ROI per Google Cloud’s case write-up, combine encoder transformers for classification with encoder-decoder hybrids for structured extraction.

The practical takeaway is that “transformer-based AI” tells you almost nothing on its own. Encoder, decoder, or encoder-decoder is the first question. How many parameters is the second. What context window and what training data is the third. A vendor who cannot answer those three is selling you a brand wrapper around someone else’s model.

When to ask about it, and when to ignore it

Ask when you are signing a contract, comparing two vendor quotes, or building a budget for AI tooling that runs into five figures or more. Ignore it when you are using an off-the-shelf consumer tool for ad hoc work. The dividing line is whether the architectural detail changes the price you pay, the data that leaves your firm, or the workflow that breaks if the model gets retired.

Three structural limits are worth pricing up front in any procurement conversation. The first is quadratic attention. Each token attends to every other token, so doubling context length quadruples the attention computation. A ten-thousand-token context costs you roughly one hundred million attention operations per layer; a hundred-thousand-token context costs ten billion, per Weights and Biases' technical write-up. That is why long-context pricing looks the way it does.
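The arithmetic behind those figures is simple enough to check yourself. This sketch counts only the token-pair comparisons per layer, which is the simplification the figures above use:

```python
def attention_ops_per_layer(context_tokens: int) -> int:
    # Self-attention compares every token with every other token,
    # so the comparison count per layer grows with the square of context length.
    return context_tokens ** 2

print(f"{attention_ops_per_layer(10_000):,}")   # 100,000,000
print(f"{attention_ops_per_layer(100_000):,}")  # 10,000,000,000
```

Ten times the context, a hundred times the compute: that multiplier is what a long-context price premium is paying for.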

The second is memory footprint. A seventy-billion-parameter model needs roughly 140GB of memory at 16-bit floating-point precision (two bytes per weight), or 35GB at four-bit quantisation. Self-hosting frontier models means H100 or H200-class GPU spend, which is what Vendor D's £900-a-month line is paying for. The third is hallucination. Transformers learn patterns and extrapolate from them; they do not consult a ground truth. For regulated work in legal, medical, or financial services, retrieval-augmented generation, fine-tuning, and human review are part of the system, not optional add-ons.
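The memory figures follow directly from parameter count times bits per weight. A back-of-envelope helper, assuming weights alone (a running model also needs memory for activations and the attention cache, so treat these as floors, not totals):

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    # parameters x bytes-per-weight, expressed in gigabytes
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(model_memory_gb(70, 16))  # 140.0 GB at 16-bit precision
print(model_memory_gb(70, 4))   # 35.0 GB at 4-bit quantisation
```

Run the same sum against any model a vendor quotes and you can tell at a glance whether their hosting line item is plausible.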

The UK regulatory layer sits over all of this. The ICO publishes detailed AI guidance under UK GDPR. DSIT has published AI principles. The CMA’s AI Foundation Models: Initial Report addresses transformer-based foundation models specifically. UK government research published in February 2026 found only one in six UK businesses currently use AI, with ethical concerns at 80 percent, high cost at 76 percent, and unclear regulation at 72 percent named as the top barriers.

A transformer sits inside a network of related ideas you will hear in the same vendor conversations. A neural network is the broader family the transformer belongs to. Deep learning is the training approach that makes large neural networks practical. A large language model is a decoder-only transformer trained on text. A foundation model is a large pre-trained transformer that other tools are built on top of, often the layer vendors are selling.

Quantisation compresses a transformer’s weights from 32-bit floats to 8-bit or 4-bit integers, cutting inference costs by half or more with minimal accuracy loss. Retrieval-augmented generation grounds a transformer’s output in your own documents, reducing hallucination on factual questions. Chain-of-thought prompting asks a transformer to show its reasoning step by step rather than jumping to an answer, which lifts performance on complex tasks. Each of these is a separate post, and each makes more sense once you know what the transformer underneath is doing.

The marketing-services MD ended that meeting with a clearer brief. Roughly seventy percent of her firm’s AI work was short-form drafting, the natural home of a smaller decoder-only model on the GPT-5 Mini end of the price ladder. Twenty percent was long-context document analysis, where Claude Opus 4.6’s million-token beta window earns its premium. Ten percent was novel reasoning, which Gemini 3.1 Pro handles well. The self-hosted Mistral option came off the table for now, because at her token volume the £900 a month was buying capacity she did not yet need. If you would like a working session to map your own AI tooling against the same questions, book a conversation.

Sources

- Vaswani et al. (2017). Attention Is All You Need. The canonical transformer paper. https://arxiv.org/abs/1706.03762
- Devlin et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. The encoder-only reference. https://arxiv.org/abs/1810.04805
- Stanford HAI (2025). The 2025 AI Index Report. Frontier training compute, inference cost decline, open-weight performance gap. https://hai.stanford.edu/ai-index/2025-ai-index-report
- McKinsey (2025). The State of AI: Global Survey 2025. The 88 percent regular-AI-use figure. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai
- UK Government (2026). AI Adoption Research. The one-in-six UK businesses figure and the barrier breakdown. https://www.gov.uk/government/publications/ai-adoption-research/ai-adoption-research
- Information Commissioner's Office. Artificial intelligence guidance under UK GDPR. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/
- Galileo (2025). How Much Does LLM Training Cost. The $78m to $192m frontier model figures. https://galileo.ai/blog/llm-model-training-cost
- Weights and Biases. The Problem with Quadratic Attention in Transformer Architectures. https://wandb.ai/wandb_fc/tips/reports/The-Problem-with-Quadratic-Attention-in-Transformer-Architectures--Vmlldzo3MDE0Mzcz
- IBM Think. What Is an Attention Mechanism. The plain-English explainer of query, key, value. https://www.ibm.com/think/topics/attention-mechanism
- Polo Club Data Science. Transformer Explainer, the visual interactive walkthrough. https://poloclub.github.io/transformer-explainer/

Frequently asked questions

Do I need to understand transformers to use AI in my business?

Not to use the tools. Understanding the architecture helps when you are evaluating vendors, comparing prices, and asking why a one-million-token context window costs ten times more than a one-hundred-thousand-token one. Vendor pitches that say "transformer-based AI" without naming the variant, the parameter count, or the context window are giving you a marketing line, not a procurement answer. Knowing what to ask back is the value.

Is the transformer the same thing as a large language model?

No. A large language model is one product built on a transformer architecture, specifically a decoder-only transformer trained on next-token prediction. Transformers also power encoder-only systems like BERT, used in semantic search and embeddings, and encoder-decoder systems like T5, used in translation and unified text-to-text tasks. When a vendor says "built on a transformer", ask which variant and what it was trained for.

Are there real alternatives to the transformer in 2026?

State-space models like Mamba have emerged as a promising alternative for very long sequences, where the quadratic cost of attention bites hardest. As of May 2026, Mamba-based systems have not displaced transformers at the frontier, and hybrid architectures combining attention with state-space components are starting to appear. For typical SME workloads on context windows of tens of thousands of tokens, the transformer is the working assumption for the next twelve to eighteen months.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
