A practice manager receives a commissioner data request: patient breakdowns by primary diagnosis, GP practice code, and assessment type. All three fields exist in the EPR, but referral sources are entered as free text, date formats vary by clinician, and nobody agrees on what counts as a long-term assessment. The extract takes three days instead of three hours. A data dictionary would have prevented much of that.
What is a healthcare data dictionary?
A healthcare data dictionary is a controlled reference list covering every data field your organisation collects: name, definition, data type, accepted format, and allowed values. It spans patient identifiers, clinical episodes, coding systems, and reporting outputs. The NHS publishes several working examples, including the NHS Wales Data Dictionary and the Arden & GEM client-level data specification, so smaller providers can adapt existing work rather than build from scratch.
The NHS Wales Data Dictionary defines the structure for patient details, referral information, attendance records, and contract data, with clear item definitions and accepted formats. The Arden & GEM client-level data specification covers fields such as Person Unique Identifier, GP Practice Name, and Assessment Type, noting which event types each applies to.
At larger scale, a clinical data dictionary built at a 500-bed cancer centre collected 12,994 local terms from 98 forms across 23 departments. After cleaning, 9,418 terms were mapped to a common standard and validated by 30 clinicians and nurses. The Scottish COVID-19 research dataset provides another example: a multi-table dictionary covering accident and emergency attendances, inpatient episodes, derived condition flags, and prescribing histories, all grounded in ICD-10 codes.
A small provider will not need that depth. The principle holds regardless of scale: one agreed reference that everyone uses, rather than a drift of local interpretations across your EPR, your reporting templates, and your staff operating procedures.
Why does this matter for your practice?
Under UK GDPR, health data is special category data, which requires documenting what you collect and why. A data dictionary is direct evidence of data minimisation and purpose limitation: which fields exist, who is responsible for each, and what purpose each serves. The ICO’s Data Sharing Code of Practice requires clear field-level specifications whenever health data passes to a third party.
The documentation obligation intensifies if you are using AI tools with patient data. The ICO’s AI and data protection guidance requires organisations to identify exactly what personal data is used in AI training or inference, and to process only what is necessary for stated purposes. The NCSC’s guidelines for secure AI system development advise maintaining accurate data inventories and field-level metadata to support access controls and security restrictions.
The EU AI Act, adopted in 2024, classifies many medical AI systems as high-risk and requires detailed documentation of data governance and training data quality. UK providers using EU-hosted AI services or serving EU markets may be indirectly affected.
One enforcement example shows what absent documentation costs. In 2017, the ICO found Royal Free London NHS Foundation Trust in breach of data protection law after 1.6 million patient records were shared with Google DeepMind without appropriate legal basis or documentation of purpose. Separately, the Medical Defence Union links poor data documentation directly to avoidable clinical negligence claims.
Where will you actually meet it?
A working dictionary organises fields into three domains: patient records, clinical episodes, and reporting outputs. Each domain has mandatory fields that appear on every record and optional fields that apply only to specific workflows. The goal is one agreed definition per field, maintained in a document that clinicians, admin staff, and whoever handles data submissions can all refer to.
Patient data. Core fields drawn from NHS guidance include an internal patient ID (mandatory on every record), NHS number as a 10-digit numeric field, date of birth expressed as dd/mm/yyyy and converted to age band for external analytics, and GP practice code used to disaggregate data by catchment area. The Arden & GEM specification notes that missing GP practice data can be corrected through the NHS batch tracing service, making accuracy from the outset worth enforcing.
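The patient field rules above can be checked at the point of entry. The following is a minimal sketch in Python, not a reference implementation: the function names are hypothetical, the ten-year band width is an assumption (the source does not specify one), and the check-digit step follows the published Modulus 11 rule for NHS numbers.

```python
from datetime import date, datetime


def is_valid_nhs_number(value: str) -> bool:
    """Check the 10-digit format and the Modulus 11 check digit."""
    digits = value.replace(" ", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights 10 down to 2 apply to the first nine digits.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 can never match, so the number is invalid.
    return check != 10 and check == int(digits[9])


def parse_dob(text: str) -> date:
    """Parse a dd/mm/yyyy date of birth; raises ValueError on impossible dates."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def age_band(dob: date, on: date, width: int = 10) -> str:
    """Convert a date of birth to an age band such as '30-39' (width is an assumption)."""
    # Subtract one year if the birthday has not yet occurred in the reference year.
    age = on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"
```

Passing the reference date explicitly, rather than reading today's date inside the function, keeps an extract reproducible when it is re-run later.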
Clinic and episode data. One row per inpatient or outpatient episode, linkable where relevant to an accident and emergency attendance. Standard fields include episode ID, referral source constrained to a defined pick list (GP referral, self-referral, other provider), and assessment type categorised as long-term or short-term to enable Care Act reporting.
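One way to hold referral source and assessment type to a pick list is to model each list as an enumeration and reject anything else. This is a sketch under assumptions: the class names, the helper `coerce`, and the exact label spellings are illustrative, and your own dictionary should define the real values.

```python
from enum import Enum


class ReferralSource(Enum):
    GP_REFERRAL = "GP referral"
    SELF_REFERRAL = "Self-referral"
    OTHER_PROVIDER = "Other provider"


class AssessmentType(Enum):
    LONG_TERM = "Long-term"
    SHORT_TERM = "Short-term"


def coerce(pick_list, raw: str):
    """Return the matching pick-list member, ignoring case and outer whitespace."""
    cleaned = raw.strip().casefold()
    for member in pick_list:
        if member.value.casefold() == cleaned:
            return member
    allowed = ", ".join(m.value for m in pick_list)
    # Free-text entries are rejected rather than silently stored.
    raise ValueError(f"{raw!r} is not in the pick list ({allowed})")
```

For example, `coerce(ReferralSource, " gp referral ")` resolves to `ReferralSource.GP_REFERRAL`, while a free-text entry such as "Dr Smith rang" raises an error that can be shown to the person entering the data.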
Reporting and outcomes. These are derived fields: binary condition flags built from ICD-10 diagnostic codes or community prescribing data, such as a chronic lower respiratory disease flag derived from previous COPD hospitalisations or inhaler prescriptions.
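A derived flag of this kind can be sketched in a few lines. The example assumes the flag is keyed on the ICD-10 block J40 to J47 (chronic lower respiratory diseases); the exact code list and prescribing criteria used by any published dataset are not given here, so the function name and the `inhaler_prescribed` parameter are hypothetical.

```python
import re

# ICD-10 block J40-J47 is "Chronic lower respiratory diseases".
# Assumption: the flag covers the whole block, not only COPD (J44).
_CLRD_PATTERN = re.compile(r"^J4[0-7]")


def has_clrd_flag(icd10_codes, inhaler_prescribed: bool = False) -> int:
    """Return 1 if any diagnosis code falls in J40-J47 or an inhaler was prescribed."""
    from_diagnosis = any(
        _CLRD_PATTERN.match(code.strip().upper()) for code in icd10_codes
    )
    return int(from_diagnosis or inhaler_prescribed)
```

Recording the derivation rule in the dictionary next to the flag's name means two people building the same report arrive at the same count.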
For the template structure itself, a minimal column set covers FieldName, Label, Description, DataType, Format, AllowedValues or CodeSystem, Mandatory, SourceSystem, Owner, Purpose, and SharingRules. That last column earns its place. You can mark specific fields as “never sent to third-party AI tools” or “anonymised only for external reporting”, and enforce those rules in integration scripts and staff operating procedures. Free-text fields should be flagged explicitly, with a note on what should not be entered there.
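To show how the SharingRules column could be enforced in an integration script, here is a minimal sketch. It assumes three hypothetical rule values ("open", "anonymised_only", "never_third_party_ai") and a hypothetical helper `payload_for_ai_tool`; the example entries are illustrative, not taken from any published specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldEntry:
    """One row of the dictionary, mirroring the minimal column set."""
    field_name: str
    label: str
    description: str
    data_type: str
    format: str
    allowed_values: str
    mandatory: bool
    source_system: str
    owner: str
    purpose: str
    sharing_rule: str  # assumed values: open, anonymised_only, never_third_party_ai


DICTIONARY = {
    entry.field_name: entry
    for entry in [
        FieldEntry("nhs_number", "NHS number", "National patient identifier",
                   "string", "10 digits", "", True, "EPR", "Practice manager",
                   "Patient matching", "never_third_party_ai"),
        FieldEntry("assessment_type", "Assessment type", "Long- or short-term",
                   "string", "pick list", "Long-term|Short-term", True, "EPR",
                   "Clinical lead", "Care Act reporting", "open"),
    ]
}


def payload_for_ai_tool(record: dict, dictionary: dict) -> dict:
    """Keep only fields the dictionary marks as open; unknown fields are dropped."""
    return {
        name: value
        for name, value in record.items()
        if name in dictionary and dictionary[name].sharing_rule == "open"
    }
```

The filter is deny-by-default: a field that has not been documented never leaves the system, which gives staff a practical reason to keep the dictionary current.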
When is a dictionary worth building, and when can you skip it?
Whether a dictionary is worth the effort depends on three things: whether you share data outside your organisation, whether you use AI tools where patient data could be in scope, and whether your EPR vendor maintains a schema you can rely on. A single-site practice with no external reporting has a weaker case than one submitting regular commissioner returns or building an analytics layer.
Build one if any of these apply: you submit regular data returns to a commissioner or NHS body; you share patient data with any third party, whether a platform, a research partner, or a subcontracted service; you use AI tools where patient data could be in scope; or you are planning to switch EPR providers, where a documented field set makes migration considerably less painful. A well-maintained dictionary also reduces EPR vendor lock-in.
Hold off if none of those conditions apply and your vendor’s built-in schema already covers your reporting needs. A ten-person practice on a single off-the-shelf system with no data sharing and no AI tools in scope has limited return on a standalone dictionary.
One practical note on implementation: clinician buy-in is where dictionary projects commonly stall. The 500-bed cancer centre project mentioned earlier needed validation from 30 clinicians and nurses; without comparable engagement from the people who enter data, the dictionary becomes a document nobody reads. For a small provider, the approach that works is to start in a spreadsheet, run a short session with the three or four people who handle data submissions, and iterate from what your systems actually contain.
What else should you know before you start?
A data dictionary connects to several governance obligations and standards that small healthcare providers need to understand. The most important are the clinical coding systems your fields reference, the pseudonymisation rules governing what you can share externally, and the data protection impact assessment process the ICO expects when processing health data at scale or introducing AI tools.
Clinical coding. SNOMED CT is the NHS standard terminology for clinical concepts; the UK edition is maintained by NHS England, which absorbed NHS Digital in 2023. ICD-10 is used for diagnostic coding in hospital episode records. Where a structured code exists, the dictionary should reference it and constrain the field accordingly. Free text is harder to aggregate, harder to validate, and harder to share safely.
Pseudonymisation. The ICO is clear that pseudonymised data is still personal data and must be treated as such. A common pitfall is misplaced confidence in anonymisation: a combination of date of birth, postcode, and a rare condition code can be re-identifying even without a name attached. The dictionary should flag high-risk field combinations and define what pseudonymisation means in practice for each reporting output.
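Flagged combinations can be checked automatically before a reporting output is released. The sketch below assumes hypothetical field names and a single listed combination taken from the example above; which combinations count as high-risk is a judgment for your own governance lead, not something this code decides.

```python
# Assumption: field names match the FieldName column of your dictionary.
HIGH_RISK_COMBINATIONS = [
    frozenset({"date_of_birth", "postcode", "condition_code"}),
]


def risky_combinations(output_fields):
    """Return each flagged combination that is fully present in an output."""
    present = set(output_fields)
    return [
        sorted(combo)
        for combo in HIGH_RISK_COMBINATIONS
        if combo <= present  # subset test: every field in the combo is included
    ]
```

An extract containing date of birth, postcode, and condition code would be reported for review, while one using an age band in place of date of birth would pass this particular check.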
Data Protection Impact Assessments. A DPIA is required before high-risk processing, and the ICO treats AI systems processing health data as high-risk by default. The dictionary is the natural input to a DPIA: it lists the fields being processed, their purpose, and who has access. Keeping the dictionary current makes the DPIA significantly less painful when you need one.
NHS information governance guidance. NHS England’s information governance guidance emphasises clear data specifications when creating datasets or sharing with third parties. For smaller providers working adjacent to the NHS as referral sources or subcontractors, alignment with those standards reduces friction at every exchange.



