An SME owner watched her new AI sales assistant produce three different recommendations for what was clearly the same customer account. Same company, slightly different spelling. The tool treated them as three opportunities. She was about to walk into a meeting and say the AI was broken. The AI was not broken. The customer existed as three records in her CRM, with subtly different names and overlapping but inconsistent contact details, and the AI had done exactly what AI does. It learned the pattern in the data and replicated it confidently at scale.
This is the standard SME data triad: duplicates, conflicts, and missing values. Every AI tool pointed at SME records surfaces them in the first week, and many owners spend a fortnight in confused investigation before recognising the pattern. The fixes are proportionate, well understood, and roughly three weeks of effort for a small firm. They are not a multi-month enterprise data project.
What is the SME data triad and why does AI expose it?
The triad is duplicates (same customer recorded multiple times across systems), conflicts (same field with different values in different systems), and missing values (the field the AI needs is empty for half the entries). It exists because SME data accumulates without governing architecture, growing organically from spreadsheets to CRMs to accounting tools to email platforms, each storing customer information independently. AI exposes it immediately because AI trains on the exact data you feed it.
Gartner research summarised by data quality vendors finds that duplication rates of ten to thirty per cent are common in firms without deliberate quality initiatives, which describes the typical SME. A small firm commonly runs between five and fifteen disparate systems by year three, each with its own field definitions and update schedules, and no single point of authority for which version is correct. The AI is best read as a high-resolution scanner for the data layer underneath. It makes the problem visible at a speed and scale that manual review never did, which is uncomfortable but useful.
Why do duplicate records form in SME systems?
Duplicates form through five predictable mechanisms: manual entry variations (“John Smith” one day, “J. Smith” the next), system migrations that fail to match source and destination records, third-party imports without deduplication checks, web forms that create new contacts instead of updating existing ones, and integration webhooks that fire twice on retry. The underlying enabler is the absence of a unique identifier at record creation.
The discipline that catches around eighty per cent of duplicates is simpler than firms typically assume. Match on email address as the primary key for prospects and customers, because if two records share an email they are almost certainly the same person. Add fuzzy matching for near-duplicates (“jon.smith@company.com” against “jon_smith@company.com”). Configure the CRM to alert users when they are about to create a duplicate, and merge accumulated duplicates into golden records rather than deleting either side, so the activity history survives. Process in batches of five hundred to one thousand, not ten thousand.
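The matching discipline above can be sketched in a few lines. This is a minimal illustration, not a specific CRM's API: the field names (`email`, `name`) and the normalisation rules are assumptions, and the fuzzy threshold is a starting point to tune against your own data.

```python
from difflib import SequenceMatcher


def normalise_email(email: str) -> str:
    """Lowercase and strip dot/underscore separators from the local part,
    so near-duplicates like jon.smith@ and jon_smith@ collapse to one key."""
    local, _, domain = email.strip().lower().partition("@")
    return local.replace(".", "").replace("_", "") + "@" + domain


def find_duplicates(records: list[dict]) -> dict[str, list[dict]]:
    """Group records by normalised email: the exact-match pass that
    catches the bulk of duplicates."""
    groups: dict[str, list[dict]] = {}
    for rec in records:
        groups.setdefault(normalise_email(rec["email"]), []).append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Second pass: flag near-duplicate names for human review."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


records = [
    {"name": "Jon Smith",   "email": "jon.smith@company.com"},
    {"name": "John Smith",  "email": "jon_smith@company.com"},
    {"name": "Ana Pereira", "email": "ana@other.co"},
]
dupes = find_duplicates(records)
# The two Smith records normalise to the same email key and are
# flagged together; "Jon Smith" vs "John Smith" also clears the
# fuzzy threshold, so either pass would have caught them.
```

Note the sketch only flags candidates; the merge into a golden record, with activity history preserved, remains a supervised step, which is why the batch sizes stay at five hundred to one thousand.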
When do conflicting values become a business problem?
Conflicts become a problem when the same fact has two values across systems and no point of authority decides which is correct. A customer’s address changed last week, the CRM has the new one, the accounting system still holds the old. Sales records the contact as “Decision Maker”; marketing’s platform stores “Stakeholder” for the same role field. Each value is plausible, no rule says which wins, and reconciliation eats hours of meeting time.
The fix does not require rebuilding the data architecture. It requires a half-day meeting and a one-page document. Declare a system of record for each critical data domain. The CRM is authoritative for customer contact information and conversation history because customer-facing teams update it daily and have the incentive to keep it accurate. The accounting system is authoritative for billing address, invoice amount, and payment status because finance depends on it for statutory reporting. Write the split down, configure integrations to sync from the authoritative side, and conflicts become deterministic rather than negotiated.
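The one-page rule sheet translates directly into a lookup that makes conflict resolution deterministic. The domain-to-system split below is an illustrative example of the kind of mapping the half-day meeting produces; the field and system names are assumptions, not any particular product's schema.

```python
# Which system wins for each field: the rule sheet as data.
SYSTEM_OF_RECORD = {
    "contact_email":   "crm",         # customer-facing teams keep it current
    "contact_phone":   "crm",
    "billing_address": "accounting",  # finance depends on it for reporting
    "payment_status":  "accounting",
}


def resolve(field: str, values_by_system: dict[str, str]) -> str:
    """Return the value held by the authoritative system for this field.
    No negotiation, no reconciliation meeting: the rule sheet decides."""
    authority = SYSTEM_OF_RECORD[field]
    return values_by_system[authority]


# A live conflict: the two systems disagree on the billing address.
winner = resolve("billing_address",
                 {"crm": "12 New Road", "accounting": "4 Old Street"})
# Accounting is authoritative for billing_address, so "4 Old Street" wins,
# and the integration should sync that value back into the CRM.
```

In practice the same logic lives inside the integration configuration rather than standalone code: syncs flow one way, from the authoritative side out.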
Where does missing data actually hurt and where does it not?
Missing data hurts where the AI tool actually uses the field for predictions, and almost nowhere else. An AI outreach system needs an email address; missing emails shrink the addressable list one for one. A revenue forecasting model needs opportunity value and sales stage; absent values produce useless forecasts. The fields that feel important are often not the fields the model uses.
The prioritisation rule that saves weeks of unnecessary work is to identify the three to five fields genuinely critical to the priority AI use case, then measure completeness on only those fields against the most recent one thousand records. If completeness is above eighty per cent, deploy and stop worrying. If it is below fifty per cent, either backfill the last two years or redesign the AI function to work without that field. Sixty to eighty per cent is a judgement call. An email completeness of seventy-five per cent still lets an outreach tool reach three-quarters of the audience, which is fine.
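The prioritisation rule is mechanical enough to script. A minimal sketch, assuming records arrive as a list of dicts with the newest last; the thresholds come straight from the rule above, and the field names are placeholders.

```python
def completeness(records: list[dict], field: str) -> float:
    """Share of the most recent 1,000 records with a non-empty value."""
    recent = records[-1000:]
    filled = sum(1 for r in recent if r.get(field))
    return filled / len(recent)


def triage(records: list[dict], critical_fields: list[str]) -> dict[str, str]:
    """Apply the deploy / backfill / judgement thresholds per field."""
    decisions = {}
    for field in critical_fields:
        score = completeness(records, field)
        if score > 0.8:
            decisions[field] = "deploy"                 # good enough, stop worrying
        elif score < 0.5:
            decisions[field] = "backfill or redesign"   # field is effectively absent
        else:
            decisions[field] = "judgement call"         # the grey zone
    return decisions


# 750 of the last 1,000 records have an email: 75% complete,
# which lands in the judgement-call band, and for an outreach
# tool reaching three-quarters of the audience is fine.
records = [{"email": "x@y.com"}] * 750 + [{"email": ""}] * 250
decisions = triage(records, ["email"])
```

The point of scripting it is scope control: the check runs only over the three to five critical fields, so nobody is tempted to backfill the forty fields the model never reads.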
How long should an SME spend on data clean-up before deploying AI?
Roughly three weeks of elapsed effort, with the heaviest labour in the first week. Week one is deduplication and conflict resolution: three days deduping the primary customer list, two days resolving conflicts in the five most-accessed fields and writing the one-page rule sheet. Week two is missing-data triage: identify which fields the AI tool actually uses, measure completeness, backfill only the fields below eighty per cent. Week three sets the maintenance discipline.
After that, the continuing cost is a monthly fifteen-minute review by a named data steward, usually the person already managing the CRM or accounting system. They track three metrics: duplicate rate (target below two per cent), conflict rate on authoritative fields (target zero), and completeness on the AI-critical fields (target above eighty per cent). When a metric drifts, they investigate why. This is a federated governance model, not an enterprise master data management programme, and it is the proportionate approach for a small firm.
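The steward's three checks can be reduced to a small function. The thresholds are the targets from the text; the input counts are assumed to come from whatever exports the firm's CRM and accounting systems provide.

```python
def monthly_review(total: int, duplicates: int,
                   conflicts: int, critical_filled: int) -> dict[str, bool]:
    """The fifteen-minute review as three pass/fail flags."""
    return {
        "duplicate_rate_ok": duplicates / total < 0.02,      # target below 2%
        "conflict_rate_ok":  conflicts == 0,                 # target zero
        "completeness_ok":   critical_filled / total > 0.8,  # target above 80%
    }


report = monthly_review(total=1000, duplicates=12,
                        conflicts=0, critical_filled=860)
# 1.2% duplicates, zero conflicts, 86% completeness: all three pass.
# Any failing flag is the trigger to investigate why the metric drifted.
```

Keeping the output to three booleans is deliberate: the review stays at fifteen minutes because the steward investigates causes only when a flag flips, not every month.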
The principle that sits underneath all of this is that data readiness is the prerequisite for every AI use case, not a one-off project that fixes the problem permanently. The triad will reappear if the discipline lapses. The good news is that the fixes compound. Clean vendor lists for invoice AI help the knowledge base. Standardised transaction codes for financial AI help reconciliation. The first deployment carries the heaviest readiness cost; subsequent deployments inherit the work and run cheaper. The owners who recognise this stop blaming the AI and start budgeting for the data layer underneath it.
If you want help working out which clean-up matters for your firm’s priority AI use case, book a conversation.



