What is data classification? Why it matters for your business

TL;DR

Data classification assigns sensitivity labels to your information, giving AI tools, copilots, and automation the instruction they need to enforce access rules. Without those labels, AI systems will reach whatever they can. UK regulators, including the ICO, FCA, and NCSC, treat classification as a precondition for responsible AI deployment. For SMEs with any AI tool connected to operational data, classification should happen before the rollout goes live.

Key takeaways

- Data classification assigns sensitivity labels to your information so AI tools, copilots, and automated systems know which data they are permitted to access and use.
- Without classification, a copilot with access to shared drives will surface client records, employee data, or draft board papers based on query relevance, not sensitivity.
- UK regulators including the ICO, FCA, and NCSC treat data classification as a precondition for responsible AI deployment; its absence is an auditable gap.
- The ICO fined British Airways £20 million for security failures affecting approximately 400,000 customers' personal data, demonstrating what regulators expect of organisations running complex systems with sensitive data flowing through them.
- Classification urgency scales with your AI footprint: if your tool connects to operational or client-facing data, classification should be in place before go-live, not after.

A few weeks into deploying a copilot across her team, the MD of a small professional services firm asked a question nobody had thought to raise: could the assistant read the client due-diligence files? The tool had access to the shared drives. The shared drives held everything. That question, asked late, describes the problem data classification exists to prevent.

What is data classification?

Data classification labels your information according to its sensitivity so proportionate controls can apply. The standard approach uses four tiers: public (website content, marketing material), internal (routine communications and policies), confidential (client records, contracts, employee data), and highly confidential (trade secrets, financial transactions, special-category personal data). Labels travel with the data and drive decisions about access, encryption, retention, and which AI systems can read it.
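To make the idea concrete, here is a minimal Python sketch of the four tiers as labels that look up their controls. The names (`Tier`, `Controls`, `controls_for`) and every setting in the table, including the retention periods, are illustrative assumptions rather than a recommended policy; real values would come from your own legal and security requirements.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """The four sensitivity tiers, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    HIGHLY_CONFIDENTIAL = 3


@dataclass(frozen=True)
class Controls:
    encrypt_at_rest: bool
    retention_years: int   # placeholder figures, not guidance
    ai_access: str         # "allowed", "by_approval", or "blocked"


# Hypothetical policy table: one row of controls per label.
CONTROLS = {
    Tier.PUBLIC: Controls(False, 1, "allowed"),
    Tier.INTERNAL: Controls(False, 3, "by_approval"),
    Tier.CONFIDENTIAL: Controls(True, 6, "blocked"),
    Tier.HIGHLY_CONFIDENTIAL: Controls(True, 6, "blocked"),
}


def controls_for(tier: Tier) -> Controls:
    """Look up the controls that a label implies."""
    return CONTROLS[tier]
```

Because the tiers are ordered, a check such as `Tier.CONFIDENTIAL > Tier.INTERNAL` is valid, which keeps "at or below this level" rules simple to express.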

The labels themselves are not new. Large enterprise security frameworks have used them for years. What AI changes is the stakes. When a copilot, chatbot, or automation tool connects to your file storage, email, or CRM, it reads whatever is available. Without classification, “whatever is available” can include client medical histories, employee salary data, or draft board papers. Classification is the instruction set that tells an AI tool which data it is permitted to work with.

Classification can be maintained manually, and for smaller firms with limited AI integrations, a spreadsheet and clear policies may be adequate. Increasingly, though, the labelling itself is done by AI. Proofpoint’s enterprise data-loss prevention platform runs parallel detectors across email, cloud storage, and endpoints, identifying personally identifiable information, health records, payment card data, and source code in real time, then triggering automated protections based on those labels. The classifier does in seconds what would take weeks to do by hand.
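The sketch below shows the general pattern of running several detectors over a piece of text, in heavily simplified form. It is not Proofpoint's method: the two detectors (a regular expression for email addresses and a Luhn checksum for card-like numbers), the category names, and the `detect` function are all assumptions chosen for illustration, and production classifiers rely on far richer signals.

```python
import re

# Deliberately simple patterns; real detectors are much more thorough.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:      # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def detect(text: str) -> set:
    """Return the set of sensitive-data categories found in `text`."""
    found = set()
    if EMAIL_RE.search(text):
        found.add("email")
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        found.add("payment_card")
    return found
```

Which tier each detected category maps to is a policy decision, so the sketch stops at reporting what it found.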

Why does it matter for your business?

Without classification, AI tools connected to your systems have no instruction about what they should and should not read. A copilot with access to shared drives surfaces information based on relevance to the query, not sensitivity. Classification is the mechanism for drawing the line: public and carefully selected internal content can feed a third-party AI tool; confidential and highly confidential material cannot.
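That line can be expressed as a filter applied before any content reaches a third-party tool. The following is a minimal sketch under stated assumptions: the `Document` record, its `approved_for_ai` flag (standing in for "carefully selected" internal content), and the `ai_eligible` function are hypothetical names, and unlabelled files are blocked by default as a cautious design choice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Document:
    path: str
    label: Optional[str]           # None means not yet classified
    approved_for_ai: bool = False  # hypothetical per-document sign-off


def ai_eligible(doc: Document) -> bool:
    """Decide whether a document may be sent to a third-party AI tool."""
    if doc.label == "public":
        return True
    if doc.label == "internal":
        return doc.approved_for_ai   # only selected internal content
    # Confidential, highly confidential, and unlabelled files are blocked.
    return False


docs = [
    Document("brochure.pdf", "public"),
    Document("holiday-policy.docx", "internal", approved_for_ai=True),
    Document("team-notes.docx", "internal"),
    Document("client-contract.pdf", "confidential"),
    Document("old-export.csv", None),
]
eligible = [d.path for d in docs if ai_eligible(d)]
```

Here `eligible` holds only the brochure and the approved policy document; everything else stays out of the tool's reach until someone makes an explicit decision about it.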

The regulatory position is clear. The ICO’s UK GDPR guidance requires organisations to identify and classify personal data assets so they can apply appropriate controls, and its AI and data-protection guidance extends those obligations explicitly to AI deployments, requiring data minimisation, purpose limitation, and security to be maintained when training and running AI systems.

In 2020, the ICO fined British Airways £20 million for security failures affecting approximately 400,000 customers’ personal data. Marriott received an £18.4 million fine the same year for failing to protect around 339 million guest records. Neither case involved AI specifically, but both illustrate what regulators expect of organisations running complex systems with sensitive data flowing through them.

For businesses in regulated sectors, the Financial Conduct Authority has made the expectation explicit. Its joint AI Public-Private Forum with the Bank of England identified data lineage and quality as central obligations when deploying AI in credit, trading, or customer-facing contexts. Knowing what data feeds your AI is a conduct obligation.

IBM’s 2023 Cost of a Data Breach report adds the commercial case. Organisations with extensive AI and automation in their security operations had breach costs averaging $1.76 million lower than those with limited use of those capabilities. Classification is foundational to that automation.

Where will you actually meet it?

Many founders first encounter data classification at one of three practical moments: when a security insurer or compliance auditor asks how sensitive data is classified; when a third-party AI platform’s terms of service require specifying what categories of data will be processed; or when a regulator questions how sensitive data is handled in automated workflows.

The EU AI Act has created a fourth moment. Under Article 6 and Annex III, AI systems used for credit scoring, employment decisions, or access to essential services are classified as high-risk, and the compliance obligations for those systems include risk management, data governance, and logging. All of those obligations assume the organisation can identify which datasets feed which AI application and whether those datasets include special categories of personal data. Tools like Atlan’s AI Application Risk Classifier already use this framework to score AI applications by regulatory risk, requiring classified metadata as an input.

The Capita cyber incident in 2023 illustrated the consequence of the alternative. The Pensions Regulator confirmed that personal data from up to 90 pension schemes may have been exposed following the breach, with risk commentary pointing to the difficulty of protecting large volumes of unstructured, poorly labelled data held in legacy systems. The NCSC’s 2023 Annual Review identified ransomware as the most significant cyber threat to UK businesses and recommended tight classification of sensitive data as a core mitigation, precisely because unclassified data is the easiest to exfiltrate.

Classification also creates operational benefit beyond risk reduction. Knostic, a data security firm, cites industry research showing that organisations which master unstructured data classification cut preparation time for new AI workloads by more than half. Once your data is labelled, connecting a new AI tool becomes a policy decision rather than a discovery exercise.

When does classification deserve investment, and when do basic controls suffice?

The size of your AI footprint determines how urgently classification deserves attention. If a copilot, chatbot, or automation tool connects to any system holding client records, employee data, financial information, or legally privileged material, classification should happen before the deployment goes live, not after. The question at that point is whether to do it manually or to invest in an automated tool.

For a small firm running standard office tools with no external AI integrations, basic access controls, role-based permissions, and staff training on what not to paste into a public AI system may deliver adequate risk reduction without a dedicated classification tool.

The picture shifts once any of these conditions applies: you are in a regulated sector; you are connecting AI tools to client-facing or operational data; you are building a knowledge base that AI systems will query; or you are processing special-category personal data under GDPR. In those situations, classification is the governance layer that everything else depends on.

Automated classifiers are also not infallible, and that matters. Knostic’s architecture handles uncertainty through confidence scoring: predictions above a 0.95 probability threshold are auto-accepted, those in the 0.60 to 0.94 range are queued for human review, and low-confidence items are quarantined pending policy. That human-in-the-loop step is particularly important where GDPR data-subject rights are engaged. A label is only as useful as the policy it enforces.
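The three-way routing described above can be sketched in a few lines. This is an illustration of the pattern, not Knostic's implementation: the function name and outcome strings are assumptions, and so is the handling of boundary values, where a score of exactly 0.95 is treated as auto-accepted and anything from 0.60 up to (but not including) 0.95 goes to review.

```python
AUTO_ACCEPT_THRESHOLD = 0.95   # from the thresholds quoted above
REVIEW_THRESHOLD = 0.60


def route_prediction(confidence: float) -> str:
    """Route a classifier prediction according to its confidence score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= AUTO_ACCEPT_THRESHOLD:
        return "auto_accept"
    if confidence >= REVIEW_THRESHOLD:
        return "human_review"
    return "quarantine"          # held back until policy decides
```

For example, `route_prediction(0.72)` returns `"human_review"`, so a person confirms the label before any policy acts on it.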

Data classification connects to three concepts you will encounter when building an AI governance framework. Data residency determines where classified data can be stored, including which countries or cloud regions are permitted. Data minimisation sets how much of it you should hold in the first place. Access control uses the classification labels to decide which people, roles, and AI systems can read each tier.

The EU AI Act’s risk classification for AI applications is related but distinct. It classifies the application itself (prohibited, high-risk, limited-risk, minimal-risk) rather than the data flowing through it. A complete governance picture requires both: which data sits at what sensitivity level, and which AI application is rated at what risk level. High-risk AI systems generally require stricter data governance as part of compliance, so the two classifications interact.

Attribute-based access control is the technical mechanism that makes data classification machine-actionable. Instead of manually configuring permissions for every file and every user, the system reads the label on the data and the attributes of the user’s role, then enforces the appropriate policy automatically. Zero-trust architectures extend this by requiring data labels to be verified before any system accesses a resource.
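A minimal sketch of that decision, assuming a much simpler model than a real policy engine: the `Subject` attributes, the `AI_CEILING` cap on what AI systems may read, and the `can_read` function are all hypothetical. Unlabelled resources are denied by default, in the spirit of verifying the label before granting access.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    HIGHLY_CONFIDENTIAL = 3


@dataclass(frozen=True)
class Subject:
    name: str
    clearance: Tier             # highest tier this role may read
    is_ai_system: bool = False


# Assumed policy: AI systems never read above INTERNAL, whatever their role.
AI_CEILING = Tier.INTERNAL


def can_read(subject: Subject, resource_label: Optional[Tier]) -> bool:
    """Compare the subject's attributes with the label on the data."""
    if resource_label is None:
        return False            # no verified label, no access
    ceiling = subject.clearance
    if subject.is_ai_system:
        ceiling = min(ceiling, AI_CEILING)
    return resource_label <= ceiling


partner = Subject("partner", Tier.HIGHLY_CONFIDENTIAL)
copilot = Subject("copilot", Tier.CONFIDENTIAL, is_ai_system=True)
```

With these definitions, `can_read(partner, Tier.CONFIDENTIAL)` is true while `can_read(copilot, Tier.CONFIDENTIAL)` is false: the policy is written once against labels and attributes rather than configured file by file.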

For founders building toward an AI governance policy, the ICO’s guidance on AI and data protection, the NCSC’s joint guidance on secure AI system development, and the FCA’s AI Public-Private Forum outputs all assume the organisation can describe what data feeds its AI systems and why. Classification is what makes that description possible. If you are connecting AI tools to internal systems and have not yet answered the question the MD raised, that is where to begin.

If you would like to think through what a classification approach looks like in your specific business context, book a conversation.

Sources

- ICO (2024). Security of personal data: Guide to the UK GDPR. Establishes the requirement to identify and classify personal data assets before applying appropriate technical and organisational controls. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/security/security-of-personal-data/
- ICO (2024). AI and data protection. Extends data minimisation, purpose limitation, and security obligations explicitly to AI training and deployment. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/ai-and-data-protection/
- Financial Conduct Authority and Bank of England (2022). AI Public-Private Forum: Final Report. Sets out expectations for data lineage and quality when deploying AI in regulated financial contexts. https://www.fca.org.uk/publication/corporate/ai-public-private-forum-final-report.pdf
- EU AI Act (2024). Article 6: Classification rules for high-risk AI systems. Defines high-risk categories and the data governance obligations that assume organisations can identify which datasets feed which AI system. https://artificialintelligenceact.eu/article/6/
- NCSC (2023). Guidelines for secure AI system development. Joint guidance from UK, US, and Five Eyes partners recommending data classification, provenance verification, and zero-trust principles for AI data supply chains. https://www.ncsc.gov.uk/guidance/guidelines-for-secure-ai-system-development
- NCSC (2023). Annual Review 2023. Identifies ransomware as the most significant cyber threat to UK businesses and recommends tight classification of sensitive data as a core mitigation. https://www.ncsc.gov.uk/report/annual-review-2023
- ICO (2020). Penalty notice: British Airways. £20 million fine for security failures affecting approximately 400,000 customers' personal data, illustrating regulatory expectations for sensitive data protection in complex IT systems. https://ico.org.uk/action-weve-taken/enforcement/british-airways/
- The Pensions Regulator (2023). Update on the Capita cyber incident. Confirmed personal data from up to 90 pension schemes may have been exposed, with commentary pointing to the difficulty of protecting large volumes of unclassified legacy data. https://www.thepensionsregulator.gov.uk/en/media-hub/press-releases/2023-press-releases/update-on-capita-cyber-incident
- IBM Security (2023). Cost of a Data Breach Report 2023. Found organisations using AI and automation extensively in security operations had breach costs averaging $1.76 million lower than those without. https://www.ibm.com/reports/data-breach
- Proofpoint (2024). AI data classification for proactive data protection. Describes enterprise AI-driven classifiers identifying PII, PHI, PCI, and source code across cloud, email, and endpoints in real time. https://www.proofpoint.com/us/blog/dspm/ai-data-classification-proactive-data-protection

Frequently asked questions

Do I need to classify data before rolling out an AI copilot or chatbot?

Yes, if your tool connects to any system holding client records, employee data, financial information, or legally privileged material. Classification gives the tool an explicit instruction about which data it is permitted to read. Without it, a copilot with access to shared drives will surface whatever is available to it, regardless of how sensitive that information is.

What are the four standard data classification tiers?

Public (website content, marketing material), internal (routine communications and policies), confidential (client records, contracts, employee data), and highly confidential (trade secrets, financial transactions, special-category personal data under GDPR). Many AI governance frameworks map tool-access permissions directly to these four levels. The labels are straightforward to apply; the discipline is in keeping them current as your data grows.

What does the ICO expect in terms of data classification for AI?

The ICO's UK GDPR guidance requires organisations to identify and classify personal data assets before applying security controls. Its AI and data-protection guidance extends those obligations to AI deployments, requiring data minimisation, purpose limitation, and security to apply when training and running AI systems. In practice, that means knowing what data your AI processes and at what sensitivity level.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
