What is reinforcement learning (and RLHF)? Why it matters for your business

TL;DR

Reinforcement learning trains an AI agent to make decisions by trial and error, learning from a reward signal rather than from labelled examples. RLHF is the technique that turned raw language models into useful assistants like ChatGPT. For most UK SMEs, the practical question is not whether to build RL, it is whether the RL already running inside HubSpot, Salesforce, or Klaviyo fits your problem, and whether your use of it now sits inside the EU AI Act high-risk category.

Key takeaways

- Reinforcement learning is the third paradigm of machine learning, alongside supervised and unsupervised learning, where an agent learns by acting and receiving rewards.
- RLHF (reinforcement learning from human feedback) is why ChatGPT, Claude, and Gemini behave like assistants instead of next-word predictors.
- RL already runs inside vendor products you may already pay for, including HubSpot send-time optimisation, Salesforce Einstein Next Best Action, and Klaviyo frequency tuning.
- Building custom RL costs £200,000 to £500,000 over five years for an SME and rarely earns its keep below £500m revenue. Buying it inside a SaaS subscription costs the subscription.
- From 2 August 2026, RL systems making consequential decisions about pricing, hiring, lending, or access to services sit in the EU AI Act high-risk category, with deployer obligations and penalties up to €15m or 3% of global turnover.

The chief operating officer of a 25-person UK B2B SaaS firm had three vendor proposals on her desk and a question she could not put down. Vendor A wanted £1,400 a month for “AI-driven dynamic pricing” and would not say what algorithm sat underneath. Vendor B wanted £350,000 for a twelve-month custom build, promising bespoke pricing optimisation. Vendor C was already there, the HubSpot subscription her team paid £700 a month for, where something called “send-time optimisation” had been quietly improving her email engagement rates for the last eight months.

Three pitches, three price points, and a phrase the first two vendors kept using. Reinforcement learning. By the end of the afternoon she had asked me what it actually meant, and whether the version inside HubSpot was the same shape as the version Vendor B wanted to build her.

What is reinforcement learning?

Reinforcement learning is a way of training software to make decisions by trying things and learning from what happens. An agent takes an action inside an environment, observes the consequences, and receives a reward or penalty signal. Over thousands or millions of cycles it learns a policy, a strategy that maximises total reward over time. The agent has no labelled examples to copy from. It learns by doing.

That last sentence is what makes RL the third paradigm of machine learning. Supervised learning trains on labelled examples where every input has a known correct output. Unsupervised learning finds structure in unlabelled data without being told what to look for. Reinforcement learning has neither. It has a goal, an environment to act in, and a feedback loop. The signature trade-off is exploration, trying new actions to see what works, against exploitation, repeating actions known to produce reward. RL fits when the problem is genuinely sequential, the feedback signal is clear, and today’s choice changes tomorrow’s options.
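The exploration-versus-exploitation trade-off is easiest to see in a toy multi-armed bandit. The sketch below is an illustrative epsilon-greedy agent in Python, not anything a vendor actually ships; the three "arms" and their click-through rates are invented for the example.

```python
import random

def epsilon_greedy_bandit(true_rates, steps=10_000, epsilon=0.1, seed=42):
    """Toy epsilon-greedy agent. Each 'arm' is an action (say, an email
    subject line) with an unknown click-through rate the agent must learn."""
    rng = random.Random(seed)
    n = len(true_rates)
    counts = [0] * n          # times each arm has been tried
    values = [0.0] * n        # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                       # explore: random arm
        else:
            arm = max(range(n), key=lambda i: values[i])  # exploit: best so far
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

counts, values = epsilon_greedy_bandit([0.02, 0.05, 0.11])
# Given enough steps, the agent concentrates its pulls on the best arm (index 2)
# while still spending roughly epsilon of its steps exploring the others.
```

The `epsilon` parameter is the whole trade-off in one number: higher means more exploring, lower means more exploiting, and real systems tune or decay it over time.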

Why it matters for your business

The reason you have heard of RL in 2026 is RLHF, reinforcement learning from human feedback, and it is the technique that turned raw language models into useful assistants. A pretrained model like GPT-3 learns to predict the next token in text. That makes it fluent and knowledgeable, but not helpful. It can be verbose, evasive, confidently wrong. RLHF reshapes the model’s behaviour using a reward signal built from human preference.

The pipeline runs in three stages. Human annotators write ideal example responses to a curated set of prompts and the pretrained model is fine-tuned on them. The fine-tuned model then generates multiple responses to new prompts and human evaluators rank them. Those rankings train a separate reward model that learns to predict human preference at scale. Finally, a reinforcement learning algorithm called Proximal Policy Optimization adjusts the fine-tuned model to produce outputs the reward model scores higher, with a constraint that stops it drifting too far from the fine-tuned model's behaviour.
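For the technically curious, the reward model in stage two is commonly trained on those pairwise rankings with a Bradley-Terry style logistic loss. A minimal numeric sketch of that loss, assuming the standard -log sigmoid(r_chosen - r_rejected) form described for InstructGPT-style pipelines (the scores here are made up):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss used to train RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)). The loss shrinks as the model
    learns to score the human-preferred response higher than the other."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

well_ordered = preference_loss(2.0, -1.0)   # preferred response scored higher: low loss
mis_ordered = preference_loss(-1.0, 2.0)    # preferred response scored lower: high loss
```

Averaged over thousands of ranked pairs, minimising this loss is what turns scattered human judgements into a reward signal the PPO stage can optimise against.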

The results were striking. OpenAI reported that RLHF doubled accuracy on adversarial questions, and that human evaluators preferred a 1.3 billion parameter InstructGPT model over GPT-3 at 175 billion parameters, a model 135 times larger, simply because the smaller one had been RLHF-trained. Every modern frontier assistant, ChatGPT, Claude, Gemini, Grok, uses RLHF or one of its successors as the final alignment stage. When ChatGPT refuses a harmful request consistently, follows your instructions, prefers a particular style, you are seeing RLHF in action.

Where you will meet it

You will meet RL inside vendor platforms long before you ever consider building it. Recommendation engines are the widest touchpoint. Each recommendation is an action, the user’s response is a reward signal, and the system learns a policy that balances familiar items against novel ones. Amazon generates roughly 35% of purchases through its recommendation engine, and Salesforce Einstein Next Best Action embeds the same logic into UK CRM workflows.

Dynamic pricing is the second large use case. Published research shows B2B SaaS firms running RL-based pricing achieve a 14% revenue lift compared with static pricing, and one mid-size firm reported 23% revenue growth in six months alongside a 7% improvement in customer satisfaction, because the system found segments the firm had been undercharging. Email and SMS frequency optimisation in tools like Klaviyo learns when subscriber engagement starts to drop. HubSpot send-time optimisation learns the best moment to email each individual contact. Vercel and similar platforms run contextual bandit A/B testing, shifting traffic mid-experiment towards better-performing variants instead of holding fixed splits.
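To make the bandit-style A/B testing idea concrete, here is a toy Thompson-sampling experiment in Python. It is a sketch under simplifying assumptions (two variants, invented conversion rates, no context features), not any platform's actual implementation:

```python
import random

def thompson_ab_test(true_rates, visitors=20_000, seed=7):
    """Toy Thompson-sampling A/B test. Traffic shifts towards the
    better-performing variant as conversion evidence accumulates,
    instead of holding a fixed 50/50 split for the whole experiment."""
    rng = random.Random(seed)
    n = len(true_rates)
    successes = [0] * n
    failures = [0] * n
    assigned = [0] * n
    for _ in range(visitors):
        # Sample a plausible conversion rate for each variant from its
        # Beta posterior, then send this visitor to the highest sample.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n)]
        variant = samples.index(max(samples))
        assigned[variant] += 1
        if rng.random() < true_rates[variant]:
            successes[variant] += 1
        else:
            failures[variant] += 1
    return assigned

assigned = thompson_ab_test([0.030, 0.045])
# In most runs the bulk of traffic ends up on the stronger variant (index 1),
# which is the commercial point: fewer visitors wasted on the losing option.
```

A fixed 50/50 split would have sent half the traffic to the weaker variant for the entire experiment; the bandit approach cuts that cost automatically, which is why platforms market it as "self-optimising" testing.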

None of these require you to know RL exists. They require the subscription. The procurement test for distinguishing genuine RL from “AI-powered” marketing is to ask three questions. What feedback signal drives the system? How often does the model retrain? How does it balance exploration against exploitation? A vendor running real RL can answer all three without flinching.

When to ask about it, when to ignore it

Ask about RL when the product is making consequential decisions, particularly about pricing, hiring, lending, or access to services, and when those decisions affect EU customers. From 2 August 2026 the EU AI Act classifies such systems as high-risk under Annex III. Deployer obligations include human oversight, monitoring logs, risk assessment, transparency to affected individuals, and serious-incident reporting. Penalties reach €15m or 3% of global turnover for deployer breaches.

The UK angle is just as live. The Information Commissioner’s Office and Financial Conduct Authority have been clear that existing UK rules already apply to AI-driven decisions. An RL pricing tool that becomes predatory falls under competition law. An RL recruitment screener that encodes historical bias breaches the Equality Act 2010. An RL credit decisioning system must satisfy FCA Consumer Duty obligations. These are not future regimes. They are today’s regulatory exposure for any firm using RL features in vendor products.

Ignore the term when you are evaluating a vendor whose RL is doing low-stakes, easily checked work: recommending a piece of content, optimising a send time, picking a button colour. The question worth asking there is whether the output improves over time, not which algorithm sits underneath. And ignore custom RL builds entirely if your firm is below £500m revenue. The economics are unforgiving. Custom RL development averages £30,000 to £80,000 in year one for development alone, plus £30,000 to £80,000 annually in cloud infrastructure, with security and compliance adding another 15 to 25% on top. Years two and three typically cost more than year one. Five-year totals land at £200,000 to £500,000. The same outcome inside HubSpot or Salesforce costs the £500 to £2,000 monthly subscription.

Related concepts

Machine learning is the parent category. Reinforcement learning sits inside it, alongside supervised and unsupervised learning. The three paradigms answer different questions and fit different problems, and a competent vendor can tell you which paradigm their product uses without reaching for the marketing deck.

An AI agent is software that takes goal-directed action across multiple steps, often using RL or related techniques to decide what to do next. RL is one of the mechanisms that lets agents learn from outcomes rather than running fixed scripts.

AI alignment is the broader question RLHF was invented to address. How do you make a powerful AI system behave in ways that match human intent? RLHF is one alignment technique. Direct Preference Optimization, RL from AI Feedback, and Reinforcement Learning with Verifiable Rewards are newer variants gaining traction in 2025 and 2026.

The aim here is to give you enough vocabulary that the next vendor pitching “AI-powered optimisation” has to tell you what feedback signal drives the system, how often it learns, and whether your use of it puts you inside the EU AI Act high-risk category. Treat reinforcement learning as a question to ask your vendors, rather than a technology to build yourself.

If you want to think through where RL might already be running inside your stack, and what the procurement and regulatory questions look like for your specific situation, book a conversation.

Sources

GeeksforGeeks (2024). What is reinforcement learning? Plain-English definition of the third ML paradigm. https://www.geeksforgeeks.org/machine-learning/what-is-reinforcement-learning/

AITUDE (2023). Supervised vs unsupervised vs reinforcement learning. The three-paradigm distinction. https://www.aitude.com/supervised-vs-unsupervised-vs-reinforcement/

OpenAI Spinning Up (2020). Introduction to RL. The canonical technical reference for the agent, environment, and policy framing. https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

Build Fast With AI (2024). What is RLHF in LLM training? The three-stage RLHF pipeline reference. https://www.buildfastwithai.com/blogs/what-is-rlhf-llm-training

Cameron R. Wolfe (2023). The story of RLHF: origins and motivations. Source for the InstructGPT 1.3B vs GPT-3 175B human-preference finding. https://cameronrwolfe.substack.com/p/the-story-of-rlhf-origins-motivations

Monetizely (2023). Reinforcement learning for dynamic SaaS pricing. Source for the 14% revenue lift research and the 23% mid-size B2B SaaS case study. https://www.getmonetizely.com/articles/reinforcement-learning-for-dynamic-saas-pricing-the-future-of-subscription-pricing-optimization

Salesforce (2024). Einstein Next Best Action implementation guide. Vendor reference for an explicit RL-based recommendation product used by SMEs. https://help.salesforce.com/s/articleView?id=platform.nba_implementation_checklist.htm

Klaviyo (2024). Engagement frequency optimisation. Vendor reference for RL-style frequency tuning inside a tool widely used by UK SMEs. https://help.klaviyo.com/hc/en-us/articles/10948996125083

SmartDev (2024). Generative AI implementation costs for SMEs. Source for the £200,000 to £500,000 five-year custom build figure. https://smartdev.com/gen-ai-implementation-cost-sme/

EU AI Act (2024). Annex III high-risk AI systems. Regulatory anchor for consequential RL decisions on employment, credit, pricing, and access. https://artificialintelligenceact.eu/annex/3/

Osborne Clarke (2026). Regulatory outlook January 2026: artificial intelligence. Reference for the 2 August 2026 high-risk implementation date and deployer obligations. https://www.osborneclarke.com/insights/regulatory-outlook-january-2026-artificial-intelligence

Frequently asked questions

How do I tell whether a vendor is genuinely doing reinforcement learning or just using “AI-powered” as marketing?

Ask three questions. What feedback signal drives the system, clicks, conversions, revenue, or something else? How often does the model retrain on that signal? How does it balance exploration of new options against exploitation of known winners? A vendor running real RL can answer all three. A vendor running rule-based logic with an “AI” label cannot. If the answers are vague, assume the system is static, not adaptive.

Should I build a custom reinforcement learning system for my business?

Almost certainly not, unless your firm is above £500m revenue and the problem is genuinely sequential at scale. Custom RL averages £200,000 to £500,000 over five years for an SME, and years two and three usually cost more than year one as scaling and feature work compound. Last-mile delivery routing for a logistics firm with hundreds of daily drops is one of the few SME cases where the maths works. For pricing, recommendations, and email cadence, the RL inside vendor platforms is far cheaper and already battle-tested.

Does the EU AI Act apply to my business if I am UK-based?

Yes, if you serve EU customers. The Act applies based on where the system is used and who it affects, not where the firm is incorporated. From 2 August 2026, RL systems making consequential decisions about employment, credit, pricing, or access to goods are classified as high-risk under Annex III. Deployer obligations include human oversight, monitoring, transparency to affected individuals, and serious-incident reporting. Penalties reach €15m or 3% of global turnover for deployer breaches.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
