What is AI alignment? Vendor research vs vendor marketing

A person seated at a meeting-room table reviewing a printed proposal with a pen in hand and a laptop closed beside them
TL;DR

AI alignment is the technical challenge of making an AI system pursue intended goals reliably without harmful side effects. The research field has substance, including Anthropic's Constitutional AI, OpenAI's deliberative alignment, and Reinforcement Learning from Human Feedback. The marketing claim "our AI is aligned with your values" usually does not. For an SME owner the practical move is to treat alignment as the vendor's discipline, governance as your discipline, and to insist on both before deployment.

Key takeaways

- Alignment is the field devoted to making AI systems do what humans want, both in training and after release. The research is real. The marketing claim is often hollow. - Constitutional AI, deliberative alignment and RLHF are the three named techniques worth knowing. A vendor who cannot say which they used is selling a logo. - Alignment is a spectrum, not a binary. The Sydney incident, reward-hacking findings in Claude 3.5 Sonnet and universal jailbreaks documented by the UK AI Security Institute all show that current systems still fail in edge cases. - Under UK law your firm carries the liability when an AI system you deploy breaches data, consumer or sector rules, even if the model was trained by someone else. - Vendor-side alignment plus deployment-side governance equals defensible AI use. Either alone is insufficient.

A 40-staff specialist financial services firm sits down with an AI vendor pitching a compliance assistant. The vendor brochure says the system is “aligned with your firm’s values and FCA regulatory requirements.” The firm’s compliance officer asks the obvious question. What does that actually mean. The vendor’s first answer is “we use Constitutional AI”, which is a real technique. The second question is what the constitution says, whether the firm can read it, and what testing covers the edge cases that matter in regulated work. The vendor cannot produce that documentation in the meeting.

She does not reject the tool. She makes the procurement decision conditional on three things. The vendor publishes its constitution. The vendor provides incident-response documentation. The firm’s own deployment includes human review on every output that reaches a client. She buys what the contract documents, not what the brochure said.

That gap, between the vendor’s marketing claim and the technical research field underneath it, is what alignment is really about for an owner.

What is AI alignment?

AI alignment is the technical challenge of making an AI system behave in line with human intentions and values, in training and after deployment. An aligned system does what its operators want it to. A misaligned system finds loopholes the operators did not anticipate, like a chatbot that agrees with every complaint because that scores highest on satisfaction surveys. The field is about closing those loopholes before the model reaches production.

How alignment is actually built

Three named techniques dominate the 2026 research landscape. Anthropic’s Constitutional AI gives the model a written set of principles to critique its own outputs against, and trains on that self-critique with human feedback. OpenAI’s deliberative alignment trains the model to reason explicitly before answering, which makes its reasoning auditable. Reinforcement Learning from Human Feedback, RLHF, is the most widely deployed method and the one a vendor is likeliest to be using.

Each technique addresses a different failure mode. Constitutional AI scales human oversight by letting the model do some of the cognitive work itself. Deliberative alignment makes the chain of reasoning visible so a reviewer can spot where it goes wrong. RLHF is fast and effective but has known weaknesses, including reward hacking, where the model learns to produce outputs that score well on the reward signal rather than outputs that genuinely satisfy human intent. Process reward models and AI-based feedback are evolutions trying to fix that.

The procurement question is straightforward. Ask the vendor which technique they used and why. A vendor who can answer in plain language is selling a product they understand. A vendor who deflects to a sales engineer or repeats the word “aligned” without naming a method is selling a logo.

Where alignment fails

Alignment is a spectrum, not a binary. The Sydney incident in early 2023, when Microsoft’s Bing Chat began declaring love for users and hostility to their spouses, showed that deployed systems could behave in ways their creators clearly did not intend. Anthropic published reward-hacking observations in Claude 3.5 Sonnet in early 2026. The UK AI Security Institute’s frontier evaluations have identified universal jailbreaks for every frontier system tested.

None of that means current systems are unsafe for business use. A model trained with state-of-the-art techniques is genuinely safer than one trained without them, and the techniques continue to improve. It is also true that adversarial prompts, distribution shift, and edge cases will keep producing failures, and that vendors who treat those failures as embarrassments to bury are a worse risk than vendors who publish postmortems.

For an owner the implication is practical. The right vendor question is not “is your AI aligned” but “how do you find alignment failures, and how will I hear about them when they happen in my deployment”.

When alignment becomes your procurement problem

Alignment moves from interesting to load-bearing in three contexts. Regulated environments come first. If you operate in financial services, healthcare or legal practice, alignment is part of your demonstrable compliance picture and the FCA, ICO and EU AI Act expect documented processes around it. Consequential decisions about individuals come second, including hiring, lending and performance review. Client-facing outputs come third.

In each, your firm wears the legal exposure when the model misbehaves, even if the model was trained elsewhere. Hiring shortlists, lending recommendations and credit assessments are areas where misalignment can produce direct discrimination. Anything an AI system produces under your name needs to behave as if your most cautious senior reviewer signed it off.

Alignment matters less when the system sits inside the business as a productivity helper, drafting internal notes or summarising documents under human review, where the worst-case output is a clumsy first draft rather than a regulatory breach. Even there, do not ignore it entirely. A vendor who has thought seriously about alignment is usually a vendor who has thought seriously about reliability, and the two correlate.

The split worth holding onto is this. Alignment is the vendor’s discipline, covering how the model was trained and what testing was done before release. Governance is your discipline, covering what you deploy the model for, what data you give it, who reviews its outputs, what audit trail you keep, and what happens when it fails. The one-page AI risk register and the twelve-question vendor due-diligence list are the deployment-side counterparts to this post. Both gates need to pass before a tool goes live.

Hallucination is the failure mode where a model produces confident, fluent output that is factually wrong. It is one of the things alignment training is trying to reduce, not a phenomenon separate from alignment. A vendor’s alignment story should include how the model is trained and evaluated against hallucination on the kinds of question your firm will actually ask.

Prompt injection is the failure mode where a malicious or careless input changes what the model does. Closely related to jailbreaks, where users coax a model into ignoring its safety guidelines through clever framing. Both are alignment-and-deployment problems. The vendor controls how well the model resists such attempts. You control what data and instructions ever reach it.

Interpretability is the degree to which humans can understand why a model produced a given output. Resilience to distribution shift describes how well the model keeps behaving correctly when inputs drift from the training set. Explainability emphasises human-readable justifications for outputs. Responsible AI is a broader umbrella covering alignment plus fairness, transparency, accountability and privacy. None of these are interchangeable with alignment, and a vendor who flattens them into one slogan has not done the work.

The vocabulary is there to give you enough purchase for the next vendor conversation. When the brochure says “our AI is aligned”, you can ask which technique, what evidence, and what happens when it fails, and treat the answers as the start of the contract conversation rather than the end.

If you want to talk about how to set up the deployment-side governance that sits alongside vendor alignment work, book a conversation.

Sources

Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. The foundational paper on the constitution-and-self-critique training method. https://arxiv.org/abs/2212.08073 Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. OpenAI. The InstructGPT paper that productionised RLHF. https://arxiv.org/abs/2203.02155 Christiano, P. et al. (2017). Deep Reinforcement Learning from Human Preferences. The foundational RLHF paper. https://arxiv.org/abs/1706.03762 Anthropic (2024). Many-shot Jailbreaking of Language Models. A vendor-published alignment-failure analysis showing the value of incident transparency. https://www.anthropic.com/research/many-shot-jailbreaking UK AI Security Institute (2025). 2025 year in review: frontier model evaluations. The universal-jailbreak finding for every frontier system tested. https://www.aisi.gov.uk/blog/our-2025-year-in-review UK Information Commissioner's Office (2023). Guidance on AI and data protection. The regulatory anchor for UK SME accountability when deploying AI. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/ European Union (2024). Artificial Intelligence Act, Official Journal text. The legal framework defining high-risk AI systems and provider obligations. https://eur-lex.europa.eu/eli/reg/2024/1689/oj OpenAI (2024). Deliberative alignment: reasoning enables safer language models. The explicit-reasoning approach to alignment. https://openai.com/index/deliberative-alignment/ National Cyber Security Centre (2024). AI and cyber security: what you need to know. UK guidance treating hallucination, bias and prompt injection as intrinsic to current systems. https://www.ncsc.gov.uk/guidance/ai-and-cyber-security-what-you-need-to-know Financial Conduct Authority (2024). AI update. The FCA's published expectations on validation, governance and accountability for AI in regulated firms. https://www.fca.org.uk/publications/corporate-documents/ai-update

Frequently asked questions

What does "our AI is aligned with your values" actually mean?

On its own, very little. It is a marketing claim that becomes meaningful only when the vendor names a specific technique, points to evidence, and shows incident-response documentation. Ask three follow-ups. Which alignment method did you use, in plain language. What testing has validated your claims. What happens, and how am I told, when the system behaves unexpectedly in production. A vendor who answers all three has built alignment as a product. A vendor who deflects has built it as a slogan.

Is alignment a solved problem in 2026?

No. Current techniques catch many issues before deployment, but the UK AI Security Institute has identified universal jailbreaks for every frontier system it has tested, and Anthropic published reward-hacking observations in Claude 3.5 Sonnet in early 2026. Treat alignment as an ongoing process. The right vendor question is "how do you find and respond to alignment failures" rather than "is your AI aligned".

If the vendor handles alignment, what is left for me to do?

A great deal. Alignment is the vendor's discipline, covering how the model was trained and tested. Governance is your discipline, covering what you deploy the model for, what data you feed it, who reviews its outputs, what audit trail you keep, and what happens when it fails. UK law holds your firm liable for the system's behaviour in your name regardless of vendor claims. Treat the two as separate gates and pass both before going live.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation

Related reading

If any of this sounds familiar, let's talk.

The next step is a conversation. No pitch, no pressure. Just an honest discussion about where you are and whether I can help.

Book a conversation