What is data classification?

A few weeks into deploying a copilot across her team, the MD of a small professional services firm asked a question nobody had thought to raise. Could the assistant read the client due-diligence files? The tool had access to the shared drives. The shared drives held everything. That question, asked late, describes the problem data classification exists to prevent.

What is data classification?

Data classification labels your information according to its sensitivity so that proportionate controls can apply. The standard approach uses four tiers: public (website content, marketing material), internal (routine communications and policies), confidential (client records, contracts, employee data), and highly confidential (trade secrets, financial transactions, special-category personal data). Labels travel with the data and drive decisions about access, encryption, retention, and which AI systems can read it.
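As a rough illustration of labels driving controls, the four tiers can be modelled as an ordered type with a lookup table of controls per tier. This is a minimal sketch: the names `Tier`, `Controls`, and `controls_for` are hypothetical, and the specific settings are placeholders rather than recommendations.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """The four tiers from the text, ordered from least to most sensitive."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    HIGHLY_CONFIDENTIAL = 4


@dataclass(frozen=True)
class Controls:
    """Hypothetical control settings attached to a tier."""
    encrypt_at_rest: bool
    restricted_access: bool


# Illustrative values only; your own policy decides the real settings.
CONTROLS_BY_TIER = {
    Tier.PUBLIC: Controls(encrypt_at_rest=False, restricted_access=False),
    Tier.INTERNAL: Controls(encrypt_at_rest=False, restricted_access=False),
    Tier.CONFIDENTIAL: Controls(encrypt_at_rest=True, restricted_access=True),
    Tier.HIGHLY_CONFIDENTIAL: Controls(encrypt_at_rest=True, restricted_access=True),
}


def controls_for(tier: Tier) -> Controls:
    """Look up the controls that apply to data carrying this label."""
    return CONTROLS_BY_TIER[tier]
```

Because the tiers are ordered, a rule such as "anything at confidential or above must be encrypted" becomes a simple comparison rather than a list of special cases.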

The labels themselves are not new. Large enterprise security frameworks have used them for years. What AI changes is the stakes. When a copilot, chatbot, or automation tool connects to your file storage, email, or CRM, it reads whatever is available. Without classification, “whatever is available” can include client medical histories, employee salary data, or draft board papers. Classification is the instruction set that tells an AI tool which data it is permitted to work with.

Classification can be maintained manually, and for smaller firms with limited AI integrations, a spreadsheet and clear policies may be adequate. Increasingly, though, the labelling itself is done by AI. Proofpoint’s enterprise data-loss prevention platform runs parallel detectors across email, cloud storage, and endpoints, identifying personally identifiable information, health records, payment card data, and source code in real time, then triggering automated protections based on those labels. The classifier does in seconds what would take weeks to do by hand.
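To give a feel for what a single detector does, here is a toy sketch that flags email addresses and likely payment card numbers in a piece of text. It is not how Proofpoint or any commercial product works; the function names and label strings are assumptions, and real classifiers combine many more signals than two regular expressions and a checksum.

```python
import re

# Simplified patterns for illustration; production detectors are far stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(candidate: str) -> bool:
    """Apply the Luhn checksum used by payment card numbers."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    checksum = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        checksum += digit
    return checksum % 10 == 0


def detect(text: str) -> set:
    """Return the set of hypothetical sensitivity labels found in the text."""
    found = set()
    if EMAIL_RE.search(text):
        found.add("email_address")
    for match in CARD_RE.finditer(text):
        if luhn_valid(match.group()):
            found.add("payment_card")
    return found
```

The checksum step matters: without it, any long run of digits (an order number, a phone number) would be flagged as a card, which is the kind of false positive that makes manual review so slow.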

Why does it matter for your business?

Without classification, AI tools connected to your systems have no instruction about what they should and should not read. A copilot with access to shared drives surfaces information based on relevance to the query, not sensitivity. Classification draws that line. Public and carefully selected internal content can feed a third-party AI tool; confidential and highly confidential material cannot.
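That line can be expressed as a small gate in front of any third-party AI tool. The sketch below assumes each document carries a `label` field and, for internal content, an `approved_for_ai` flag recording that someone has deliberately selected it; both field names are hypothetical. Treating unlabelled documents as off-limits is an assumption of this example, not something the rule above specifies.

```python
def may_send_to_third_party_ai(doc: dict) -> bool:
    """Decide whether a document may be passed to an external AI tool."""
    label = doc.get("label")
    if label == "public":
        return True
    if label == "internal":
        # Internal content is eligible only when explicitly selected.
        return bool(doc.get("approved_for_ai", False))
    # Confidential, highly confidential, or unlabelled: deny by default
    # (the unlabelled case is an assumption made for this sketch).
    return False


documents = [
    {"name": "pricing-page.md", "label": "public"},
    {"name": "holiday-policy.docx", "label": "internal", "approved_for_ai": True},
    {"name": "team-chat-export.txt", "label": "internal"},
    {"name": "client-dd-file.pdf", "label": "confidential"},
    {"name": "untitled.xlsx"},
]

eligible = [d["name"] for d in documents if may_send_to_third_party_ai(d)]
```

Only the public page and the explicitly approved internal policy end up in `eligible`; the due-diligence file never reaches the tool, whatever the query.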

The regulatory position is clear. The ICO’s UK GDPR guidance requires organisations to identify and classify personal data assets so they can apply appropriate controls, and its AI and data-protection guidance extends those obligations explicitly to AI deployments, requiring data minimisation, purpose limitation, and security to be maintained when training and running AI systems.

In 2020, the ICO fined British Airways £20 million for security failures affecting approximately 400,000 customers’ personal data. Marriott received an £18.4 million fine the same year for failing to protect around 339 million guest records. Neither case involved AI specifically, but both illustrate what regulators expect of organisations running complex systems with sensitive data flowing through them.

For businesses in regulated sectors, the Financial Conduct Authority has made the expectation explicit. Its joint AI Public-Private Forum with the Bank of England identified data lineage and quality as central obligations when deploying AI in credit, trading, or customer-facing contexts. Knowing what data feeds your AI is a conduct obligation.

IBM’s 2023 Cost of a Data Breach report adds the commercial case. Organisations with extensive AI and automation in their security operations had breach costs averaging $1.76 million lower than those with limited use of those capabilities. Classification is foundational to that automation.

Where will you actually meet it?

Many founders first encounter data classification at one of three practical moments. A security insurer or compliance auditor asks how sensitive data is classified. A third-party AI platform’s terms of service require specifying what categories of data will be processed. Or a regulator questions how sensitive data is handled in automated workflows.

The EU AI Act has created a fourth moment. Under Article 6 and Annex III, AI systems used for credit scoring, employment decisions, or access to essential services are classified as high-risk, and the compliance obligations for those systems include risk management, data governance, and logging. All of those obligations assume the organisation can identify which datasets feed which AI application and whether those datasets include special categories of personal data. Tools like Atlan’s AI Application Risk Classifier already use this framework to score AI applications by regulatory risk, requiring classified metadata as an input.

The Capita cyber incident in 2023 illustrated the consequence of the alternative. The Pensions Regulator confirmed that personal data from up to 90 pension schemes may have been exposed following the breach, with risk commentary pointing to the difficulty of protecting large volumes of unstructured, poorly labelled data held in legacy systems. NCSC’s 2023 Annual Review identified ransomware as the most significant cyber threat to UK businesses and recommended tight classification of sensitive data as a core mitigation, precisely because unclassified data is the easiest to exfiltrate.

Classification also creates operational benefit beyond risk reduction. Knostic, a data security firm, cites industry research showing that organisations which master unstructured data classification cut preparation time for new AI workloads by more than half. Once your data is labelled, connecting a new AI tool becomes a policy decision rather than a discovery exercise.

When does classification deserve investment, and when do basic controls suffice?

The size of your AI footprint determines how urgently classification deserves attention. If a copilot, chatbot, or automation tool connects to any system holding client records, employee data, financial information, or legally privileged material, classification should happen before the deployment goes live, not after. The question at that point is whether to do it manually or to invest in an automated tool.

For a small firm running standard office tools with no external AI integrations, basic access controls, role-based permissions, and staff training on what not to paste into a public AI system may deliver adequate risk reduction without a dedicated classification tool.

The picture shifts when you are in a regulated sector, connecting AI tools to client-facing or operational data, building a knowledge base that AI systems will query, or processing special-category personal data under GDPR. In those situations, classification is the governance layer that everything else depends on.

Automated classifiers are also not infallible, and that matters. Knostic’s architecture handles uncertainty through confidence scoring. Predictions above a 0.95 probability threshold are auto-accepted, those in the 0.60 to 0.94 range are queued for human review, and low-confidence items are quarantined pending policy. That human-in-the-loop step is particularly important where GDPR data-subject rights are engaged. A label is only as useful as the policy it enforces.
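The routing logic described above can be sketched in a few lines. This is an illustration of the thresholds as stated, not Knostic's implementation; the function name and return values are hypothetical, and treating a score of exactly 0.95 as auto-accepted is an assumption, since the description leaves that boundary open.

```python
AUTO_ACCEPT_THRESHOLD = 0.95
HUMAN_REVIEW_THRESHOLD = 0.60


def route_prediction(confidence: float) -> str:
    """Route a classifier prediction according to its confidence score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= AUTO_ACCEPT_THRESHOLD:
        return "auto_accept"
    if confidence >= HUMAN_REVIEW_THRESHOLD:
        return "human_review"
    return "quarantine"
```

For example, `route_prediction(0.97)` is accepted automatically, `route_prediction(0.75)` goes to a reviewer, and `route_prediction(0.30)` is quarantined until policy says what to do with it.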

Data classification connects to three concepts you will encounter when building an AI governance framework. Data residency determines where classified data can be stored, including which countries or cloud regions are permitted. Data minimisation sets how much of it you should hold in the first place. Access control uses the classification labels to decide which people, roles, and AI systems can read each tier.

The EU AI Act’s risk classification for AI applications is related but distinct. It classifies the application itself (prohibited, high-risk, limited-risk, minimal-risk) rather than the data flowing through it. A complete governance picture requires both answers: which data sits at what sensitivity level, and which AI application is rated at what risk level. High-risk AI systems generally require stricter data governance as part of compliance, so the two classifications interact.

Attribute-based access control is the technical mechanism that makes data classification machine-actionable. Instead of manually configuring permissions for every file and every user, the system reads the label on the data and the attributes of the user’s role, then enforces the appropriate policy automatically. Zero-trust architectures extend this by requiring data labels to be verified before any system accesses a resource.
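A minimal sketch of that idea follows, assuming each user or AI system carries a `clearance` attribute and each resource carries a classification `label`. The class names, attribute names, and the simple "clearance must meet or exceed the label" rule are all assumptions for illustration; real attribute-based policies usually weigh more attributes, such as purpose, location, or device.

```python
from dataclasses import dataclass

# Tiers ranked from least to most sensitive.
TIER_RANK = {
    "public": 0,
    "internal": 1,
    "confidential": 2,
    "highly_confidential": 3,
}


@dataclass(frozen=True)
class Subject:
    name: str
    kind: str       # "person" or "ai_system" (hypothetical attribute)
    clearance: str  # highest tier this subject may read


@dataclass(frozen=True)
class Resource:
    path: str
    label: str      # classification label attached to the data


def can_read(subject: Subject, resource: Resource) -> bool:
    """Allow access only when the subject's clearance covers the label."""
    if subject.clearance not in TIER_RANK or resource.label not in TIER_RANK:
        return False  # unknown or missing attributes are denied
    return TIER_RANK[subject.clearance] >= TIER_RANK[resource.label]


copilot = Subject(name="team-copilot", kind="ai_system", clearance="internal")
partner = Subject(name="engagement-partner", kind="person", clearance="confidential")
dd_file = Resource(path="clients/acme/due-diligence.pdf", label="confidential")
```

Here `can_read(copilot, dd_file)` is false and `can_read(partner, dd_file)` is true, without anyone configuring permissions on the file itself. Changing what the copilot may see becomes a one-line change to its clearance.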

For founders building toward an AI governance policy, the ICO’s guidance on AI and data protection, the NCSC’s joint guidance on secure AI system development, and the FCA’s AI Public-Private Forum outputs all assume the organisation can describe what data feeds its AI systems and why. Classification is what makes that description possible. If you are connecting AI tools to internal systems and have not yet answered the question the MD raised, that is where to begin.

If you would like to think through what a classification approach looks like in your specific business context, book a conversation.
