The chief operating officer of a 25-person UK B2B SaaS firm had three vendor proposals on her desk and a question she could not put down. Vendor A wanted £1,400 a month for “AI-driven dynamic pricing” and would not say what algorithm sat underneath. Vendor B wanted £350,000 for a twelve-month custom build, promising bespoke pricing optimisation. Vendor C was already there, the HubSpot subscription her team paid £700 a month for, where something called “send-time optimisation” had been quietly improving her email engagement rates for the last eight months.
Three pitches, three price points, and a phrase the first two vendors kept using. Reinforcement learning. By the end of the afternoon she had asked me what it actually meant, and whether the version inside HubSpot was the same shape as the version Vendor B wanted to build her.
What is reinforcement learning?
Reinforcement learning is a way of training software to make decisions by trying things and learning from what happens. An agent takes an action inside an environment, observes the consequences, and receives a reward or penalty signal. Over thousands or millions of cycles it learns a policy, a strategy that maximises total reward over time. The agent has no labelled examples to copy from. It learns by doing.
That last sentence is what makes RL the third paradigm of machine learning. Supervised learning trains on labelled examples where every input has a known correct output. Unsupervised learning finds structure in unlabelled data without being told what to look for. Reinforcement learning has neither. It has a goal, an environment to act in, and a feedback loop. The signature trade-off is exploration against exploitation: trying new actions to see what works, versus repeating actions already known to produce reward. RL fits when the problem is genuinely sequential, the feedback signal is clear, and today’s choice changes tomorrow’s options.
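To see that loop in miniature, here is a toy sketch in Python: an epsilon-greedy agent choosing between three actions with hidden payoff rates. The payoff numbers and the 10% exploration rate are illustrative assumptions, not figures from any real product.

```python
import random

# Toy environment: three actions with hidden average payoffs.
# The numbers are illustrative assumptions, not real data.
TRUE_PAYOFF = {"A": 0.3, "B": 0.5, "C": 0.7}

estimates = {a: 0.0 for a in TRUE_PAYOFF}  # the agent's learned value of each action
counts = {a: 0 for a in TRUE_PAYOFF}
EPSILON = 0.1  # explore 10% of the time, exploit the rest

for step in range(10_000):
    if random.random() < EPSILON:
        action = random.choice(list(TRUE_PAYOFF))   # exploration: try anything
    else:
        action = max(estimates, key=estimates.get)  # exploitation: use what works
    reward = 1.0 if random.random() < TRUE_PAYOFF[action] else 0.0
    counts[action] += 1
    # Incremental average: nudge the estimate towards the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # estimates converge near the hidden payoffs, with C on top
```

The agent never sees the payoff table. After a few thousand cycles its estimates recover it anyway, which is what learning by doing means in practice.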
Why it matters for your business
The reason you have heard of RL in 2026 is RLHF, reinforcement learning from human feedback, the technique that turned raw language models into useful assistants. A pretrained model like GPT-3 learns to predict the next token in text. That makes it fluent and knowledgeable, but not helpful. It can be verbose, evasive, confidently wrong. RLHF reshapes the model’s behaviour using a reward signal built from human preference.
The pipeline runs in three stages. First, human annotators write ideal example responses to a curated set of prompts and the pretrained model is fine-tuned on them. The fine-tuned model then generates multiple responses to new prompts and human evaluators rank them. Those rankings train a separate reward model that learns to predict human preference at scale. Finally, a reinforcement learning algorithm called Proximal Policy Optimization adjusts the model to produce outputs the reward model scores higher, with a penalty that stops it drifting too far from the fine-tuned model’s behaviour.
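Two pieces of arithmetic sit at the centre of stages two and three: the pairwise ranking loss that teaches the reward model to agree with human rankings, and the penalised reward that keeps the policy close to the fine-tuned model. A minimal sketch, with made-up scores and a made-up penalty weight:

```python
import math

def ranking_loss(score_preferred: float, score_rejected: float) -> float:
    """Reward-model training loss: small when the model scores the
    human-preferred response above the rejected one, large otherwise."""
    return -math.log(1 / (1 + math.exp(-(score_preferred - score_rejected))))

def shaped_reward(reward_model_score: float, kl_from_reference: float,
                  beta: float = 0.1) -> float:
    """RL-stage reward: the reward model's score minus a penalty for
    drifting away from the fine-tuned reference model (the constraint
    mentioned above). beta is an illustrative weight."""
    return reward_model_score - beta * kl_from_reference

# Illustrative numbers only.
print(ranking_loss(2.0, 0.5))  # ~0.20: reward model agrees with the humans
print(ranking_loss(0.5, 2.0))  # ~1.70: reward model disagrees, loss is high
print(shaped_reward(2.0, kl_from_reference=4.0))  # 1.6: drift eats into reward
```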
The results were striking. OpenAI reported that RLHF doubled accuracy on adversarial questions, and that human evaluators preferred a 1.3 billion parameter InstructGPT model over GPT-3 at 175 billion parameters, a model 135 times larger, simply because the smaller one had been RLHF-trained. Every modern frontier assistant uses RLHF or one of its successors as the final alignment stage: ChatGPT, Claude, Gemini, Grok. When ChatGPT consistently refuses a harmful request, follows your instructions, or keeps to a particular style, you are seeing RLHF in action.
Where you will meet it
You will meet RL inside vendor platforms long before you ever consider building it. Recommendation engines are the widest touchpoint. Each recommendation is an action, the user’s response is a reward signal, and the system learns a policy that balances familiar items against novel ones. Amazon generates roughly 35% of purchases through its recommendation engine, and Salesforce Einstein Next Best Action embeds the same logic into UK CRM workflows.
Dynamic pricing is the second large use case. Published research shows B2B SaaS firms running RL-based pricing achieve a 14% revenue lift compared with static pricing, and one mid-size firm reported 23% revenue growth in six months alongside a 7% improvement in customer satisfaction, because the system found segments the firm had been undercharging. Email and SMS frequency optimisation in tools like Klaviyo learns when subscriber engagement starts to drop. HubSpot send-time optimisation learns the best moment to email each individual contact. Vercel and similar platforms run contextual bandit A/B testing, shifting traffic mid-experiment towards better-performing variants instead of holding fixed splits.
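The mechanics behind that mid-experiment traffic shifting are simple enough to sketch. Below is a minimal Thompson-sampling A/B test in Python; the two conversion rates are invented for illustration, and production systems layer context features and guardrails on top of this core idea:

```python
import random

# Two page variants with hidden conversion rates (illustrative assumptions).
TRUE_RATE = {"variant_a": 0.04, "variant_b": 0.06}
stats = {v: {"wins": 0, "losses": 0} for v in TRUE_RATE}

for visitor in range(20_000):
    # Thompson sampling: draw a plausible conversion rate for each variant
    # from its Beta posterior, then send this visitor to the higher draw.
    draws = {v: random.betavariate(s["wins"] + 1, s["losses"] + 1)
             for v, s in stats.items()}
    choice = max(draws, key=draws.get)
    converted = random.random() < TRUE_RATE[choice]
    stats[choice]["wins" if converted else "losses"] += 1

for v, s in stats.items():
    print(v, s["wins"] + s["losses"], "visitors")  # traffic drifts to variant_b
```

Unlike a fixed 50/50 split, the losing variant is starved of traffic automatically as evidence accumulates, which is the exploration-against-exploitation trade-off from earlier doing commercial work.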
None of these require you to know RL exists. They require the subscription. The procurement test for distinguishing genuine RL from “AI-powered” marketing is to ask three questions. What feedback signal drives the system? How often does the model retrain? How does it balance exploration against exploitation? A vendor running real RL can answer all three without flinching.
When to ask about it, when to ignore it
Ask about RL when the product is making consequential decisions, particularly about pricing, hiring, lending, or access to services, and when those decisions affect EU customers. From 2 August 2026 the EU AI Act classifies such systems as high-risk under Annex III. Deployer obligations include human oversight, monitoring logs, risk assessment, transparency to affected individuals, and serious-incident reporting. Penalties reach €15m or 3% of global turnover for deployer breaches.
The UK angle is just as live. The Information Commissioner’s Office and Financial Conduct Authority have been clear that existing UK rules already apply to AI-driven decisions. An RL pricing tool that becomes predatory falls under competition law. An RL recruitment screener that encodes historical bias breaches the Equality Act 2010. An RL credit decisioning system must satisfy FCA Consumer Duty obligations. These are not future regimes. They are today’s regulatory exposure for any firm using RL features in vendor products.
Ignore the term when you are evaluating a vendor whose RL is doing low-stakes, easily checked work: recommending a piece of content, optimising a send time, picking a button colour. The question worth asking there is whether the output improves over time, not which algorithm sits underneath.

Ignore custom RL builds entirely if your firm is below £500m revenue. The economics are unforgiving. Custom RL development averages £30,000 to £80,000 in year one for development alone, plus £30,000 to £80,000 annually in cloud infrastructure, with security and compliance adding another 15 to 25% on top. Years two and three typically cost more than year one. Five-year totals land at £200,000 to £500,000. The same outcome inside HubSpot or Salesforce costs the £500 to £2,000 monthly subscription.
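The five-year comparison is plain arithmetic on the figures above:

```python
# Five-year totals using the ranges quoted in the text.
custom_low, custom_high = 200_000, 500_000  # custom RL build, five-year total (£)
sub_low, sub_high = 500, 2_000              # vendor subscription, per month (£)
months = 5 * 12

print(f"Custom build:  £{custom_low:,} to £{custom_high:,}")
print(f"Subscription:  £{sub_low * months:,} to £{sub_high * months:,}")
# Subscription: £30,000 to £120,000 over five years, a fraction of the custom range.
```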
Related concepts
Machine learning is the parent category. Reinforcement learning sits inside it, alongside supervised and unsupervised learning. The three paradigms answer different questions and fit different problems, and a competent vendor can tell you which paradigm their product uses without reaching for the marketing deck.
An AI agent is software that takes goal-directed action across multiple steps, often using RL or related techniques to decide what to do next. RL is one of the mechanisms that lets agents learn from outcomes rather than running fixed scripts.
AI alignment is the broader question RLHF was invented to address. How do you make a powerful AI system behave in ways that match human intent? RLHF is one alignment technique. Direct Preference Optimization, RL from AI Feedback, and Reinforcement Learning with Verifiable Rewards are newer variants gaining traction in 2025 and 2026.
The aim here is to give you enough vocabulary that the next vendor pitching “AI-powered optimisation” has to tell you what feedback signal drives the system, how often it learns, and whether your use of it puts you inside the EU AI Act high-risk category. Treat reinforcement learning as a question to ask your vendors, rather than a technology to build yourself.
If you want to think through where RL might already be running inside your stack, and what the procurement and regulatory questions look like for your specific situation, book a conversation.