Duplicate, conflicting, missing: the three SME data problems AI exposes first

TL;DR

When an SME runs its first AI tool over its customer or business records, three problems show up with predictable regularity: the same customer appears as three records, the same field has two values across systems, and the field the AI actually needs is empty for half the entries. Owners who recognise these as the standard SME data triad fix them in roughly three weeks of effort, not a multi-month overhaul.

Key takeaways

- Duplicates, conflicts, and missing values are the standard SME data triad. Every AI tool surfaces them in the first week, and the fixes are well understood.
- Gartner research finds duplication rates of ten to thirty per cent are common in firms without a deliberate data quality discipline, which describes the typical SME.
- Conflicts are an architectural problem. Declare one system of record per data domain in a half-day meeting, document it, and configure integrations to sync from the authoritative system.
- Missing data only matters for the fields the AI actually uses. Backfill the three to five fields critical to your priority use case, not the twenty that look important.
- The proportionate clean-up sequence is roughly three weeks: one week deduplication and conflict resolution, two weeks missing-data triage and phased backfill, then a monthly fifteen-minute review to prevent reaccumulation.

An SME owner watched her new AI sales assistant produce three different recommendations for what was clearly the same customer account. Same company, slightly different spelling. The tool treated them as three opportunities. She was about to walk into a meeting and say the AI was broken. The AI was not broken. The customer existed as three records in her CRM, with subtly different names and overlapping but inconsistent contact details, and the AI had done exactly what AI does. It picked up the pattern in the data and replicated it confidently at scale.

This is the standard SME data triad: duplicates, conflicts, and missing values. Every AI tool pointed at SME records surfaces them in the first week, and many owners spend a fortnight in confused investigation before recognising the pattern. The fixes are proportionate, well understood, and roughly three weeks of effort for a small firm. They are not a multi-month enterprise data project.

What is the SME data triad and why does AI expose it?

The triad is duplicates (same customer recorded multiple times across systems), conflicts (same field with different values in different systems), and missing values (the field the AI needs is empty for half the entries). It exists because SME data accumulates without governing architecture, growing organically from spreadsheets to CRMs to accounting tools to email platforms, each storing customer information independently. AI exposes it immediately because an AI tool works from exactly the data you feed it.

Gartner research summarised by data quality vendors finds duplication rates of ten to thirty per cent are common in firms without deliberate quality initiatives, which describes the typical SME. A small firm commonly runs between five and fifteen disparate systems by year three, each with its own field definitions and update schedules, and no single point of authority for which version is correct. The AI is best read as a high-resolution scanner for the data layer underneath. It makes the problem visible at a speed and scale that manual review never did, which is uncomfortable but useful.

Why do duplicate records form in SME systems?

Duplicates form through five predictable mechanisms: manual entry variations (“John Smith” once, “J. Smith” another), system migrations that fail to match source and destination, third-party imports without deduplication checks, web forms that create new contacts instead of updating existing ones, and integration webhooks that fire twice on retry. The underlying enabler is the absence of a unique identifier at record creation.

The discipline that catches around eighty per cent of duplicates is simpler than firms typically assume. Match on email address as the primary key for prospects and customers, because if two records share an email they are almost certainly the same person. Add fuzzy matching for near-duplicates (“jon.smith@company.com” against “jon_smith@company.com”). Configure the CRM to alert users when they are about to create a duplicate, and merge accumulated duplicates into golden records rather than deleting either side, so the activity history survives. Process in batches of five hundred to one thousand, not ten thousand.
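The matching steps above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular CRM's merge function: the record layout (`name`, `email`, `phone`, `activity`), the function names, and the 0.9 similarity threshold are all assumptions made for the example, and the fuzzy pass uses the standard library's `difflib` as a stand-in for a dedicated matching tool.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise_email(email: str) -> str:
    """Lower-case and trim so trivially different spellings compare equal."""
    return email.strip().lower()


def group_by_email(records: list[dict]) -> dict[str, list[dict]]:
    """Exact-match pass: records sharing a normalised email are treated as one person."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        groups[normalise_email(record["email"])].append(record)
    return dict(groups)


def near_duplicates(emails: list[str], threshold: float = 0.9) -> list[tuple[str, str]]:
    """Fuzzy pass: flag pairs of distinct emails that look alike, for human review."""
    flagged = []
    for i, first in enumerate(emails):
        for second in emails[i + 1:]:
            if first != second and SequenceMatcher(None, first, second).ratio() >= threshold:
                flagged.append((first, second))
    return flagged


def merge_golden(duplicates: list[dict]) -> dict:
    """Merge rather than delete: keep the first non-empty value per field, pool all activity."""
    golden: dict = {"activity": []}
    for record in duplicates:
        for field, value in record.items():
            if field == "activity":
                golden["activity"].extend(value)
            elif value and not golden.get(field):
                golden[field] = value
    return golden
```

The fuzzy pass compares every pair, which is one practical reason to work through a customer list in batches of a few hundred records. Flagged pairs are only candidates; a person should confirm them before merging.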

When do conflicting values become a business problem?

Conflicts become a problem when the same fact has two values across systems and no point of authority decides which is correct. A customer’s address changed last week, the CRM has the new one, the accounting system still holds the old. Sales records the contact as “Decision Maker”; marketing’s platform stores “Stakeholder” for the same role field. Each value is plausible, no rule says which wins, and reconciliation eats hours of meeting time.

The fix does not require rebuilding the data architecture. It requires a half-day meeting and a one-page document. Declare a system of record for each critical data domain. The CRM is authoritative for customer contact information and conversation history because customer-facing teams update it daily and have the incentive to keep it accurate. The accounting system is authoritative for billing address, invoice amount, and payment status because finance depends on it for statutory reporting. Write the split down, configure integrations to sync from the authoritative side, and conflicts become deterministic rather than negotiated.
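As a rough sketch, the one-page rule sheet can be expressed as a lookup table that decides which value wins when two systems disagree. The field names and the `crm` / `accounting` labels below are hypothetical, chosen to mirror the split described above; a real integration would read them from whatever fields your systems actually expose.

```python
from typing import Optional

# Hypothetical rule sheet: which system is authoritative for each field.
SYSTEM_OF_RECORD = {
    "contact_email": "crm",
    "contact_phone": "crm",
    "billing_address": "accounting",
    "payment_status": "accounting",
}


def resolve(field: str, values_by_system: dict[str, Optional[str]]) -> Optional[str]:
    """Return the authoritative value for a field, as declared in the rule sheet."""
    owner = SYSTEM_OF_RECORD.get(field)
    if owner is None:
        raise KeyError(f"no system of record declared for {field!r}")
    return values_by_system[owner]


def find_conflicts(crm: dict[str, str], accounting: dict[str, str]) -> dict[str, Optional[str]]:
    """Map each field where the two systems disagree to the value that should win."""
    winners = {}
    for field in SYSTEM_OF_RECORD:
        values = {"crm": crm.get(field), "accounting": accounting.get(field)}
        if values["crm"] != values["accounting"]:
            winners[field] = resolve(field, values)
    return winners
```

A field that is missing from the table raises an error instead of silently picking a side, which keeps undeclared domains visible until someone assigns them an owner.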

Where does missing data actually hurt and where does it not?

Missing data hurts where the AI tool actually uses the field for predictions, and almost nowhere else. An AI outreach system needs an email address; missing emails shrink the addressable list one for one. A revenue forecasting model needs opportunity value and sales stage; absent values produce useless forecasts. The fields that feel important are often not the fields the model uses.

The prioritisation rule that saves weeks of unnecessary work is to identify the three to five fields genuinely critical to the priority AI use case, then measure completeness on only those fields against the most recent one thousand records. If completeness is above eighty per cent, deploy and stop worrying. If it is below fifty per cent, either backfill the last two years or redesign the AI function to work without that field. Between fifty and eighty per cent is a judgement call. An email completeness of seventy-five per cent still lets an outreach tool reach three-quarters of the audience, which is fine.
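The completeness check is simple enough to run over an export. The sketch below assumes records are dictionaries ordered oldest first, and it treats everything between the two thresholds as a judgement call; the function names and that reading of the middle band are assumptions for illustration.

```python
def is_filled(value: object) -> bool:
    """A field counts as present if it is not None and not blank text."""
    return value is not None and str(value).strip() != ""


def completeness(records: list[dict], field: str, sample_size: int = 1000) -> float:
    """Share of the most recent records with a value for the field, from 0.0 to 1.0."""
    recent = records[-sample_size:]  # assumes the list is ordered oldest first
    if not recent:
        return 0.0
    return sum(is_filled(record.get(field)) for record in recent) / len(recent)


def triage(ratio: float) -> str:
    """Apply the thresholds: above 80% deploy, below 50% backfill or redesign."""
    if ratio > 0.80:
        return "deploy"
    if ratio < 0.50:
        return "backfill or redesign"
    return "judgement call"
```

Run it once per AI-critical field, for example `triage(completeness(customers, "email"))`, and ignore the fields the tool never reads.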

How long should an SME spend on data clean-up before deploying AI?

Roughly three weeks of elapsed effort, with the heaviest labour in the first week. Week one is deduplication and conflict resolution: three days deduping the primary customer list, two days resolving conflicts in the five most-accessed fields and writing the one-page rule sheet. Week two is missing-data triage: identify which fields the AI tool actually uses, measure completeness, backfill only the fields below eighty per cent. Week three sets the maintenance discipline.

After that, the continuing cost is a monthly fifteen-minute review by a named data steward, usually the person already managing the CRM or accounting system. They track three metrics: duplicate rate (target below two per cent), conflict rate on authoritative fields (target zero), and completeness on the AI-critical fields (target above eighty per cent). When a metric drifts, they investigate why. This is a federated governance model, not an enterprise master data management programme, and it is the proportionate approach for a small firm.
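The monthly review amounts to comparing three numbers against their targets. A minimal sketch, assuming the steward already has the three rates as fractions (the `MonthlyMetrics` class and `drifted` function are hypothetical names for this example):

```python
from dataclasses import dataclass


@dataclass
class MonthlyMetrics:
    duplicate_rate: float  # share of records that are duplicates
    conflict_rate: float   # share of authoritative fields in conflict
    completeness: float    # completeness on the AI-critical fields


def drifted(metrics: MonthlyMetrics) -> list[str]:
    """Return the names of the metrics that have missed their targets."""
    issues = []
    if metrics.duplicate_rate >= 0.02:  # target: below two per cent
        issues.append("duplicate rate")
    if metrics.conflict_rate > 0.0:     # target: zero
        issues.append("conflict rate")
    if metrics.completeness <= 0.80:    # target: above eighty per cent
        issues.append("completeness")
    return issues
```

An empty list means nothing needs investigating that month; anything else names where the steward should start.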

The principle that sits underneath all of this is that data readiness is the prerequisite for every AI use case, not a one-off project that fixes the problem permanently. The triad will reappear if the discipline lapses. The good news is that the fixes compound. Clean vendor lists for invoice AI help the knowledge base. Standardised transaction codes for financial AI help reconciliation. The first deployment carries the heaviest readiness cost; subsequent deployments inherit the work and run cheaper. The owners who recognise this stop blaming the AI and start budgeting for the data layer underneath it.

If you want help working out which clean-up matters for your firm’s priority AI use case, book a conversation.

Sources

- Gartner research on data quality, summarised in OvalEdge's catalogue of common SME data quality issues including duplication rates of ten to thirty per cent in firms without deliberate quality initiatives. https://www.ovaledge.com/blog/data-quality-problems
- IBM Think (2024). The Cost of Poor Data Quality. Industry analysis of how data quality problems compound across systems and the financial impact on operating margins. https://www.ibm.com/think/insights/cost-of-poor-data-quality
- IBM Think (2024). System of Record. Explainer on the architectural principle that resolves conflicting values across multiple business systems. https://www.ibm.com/think/topics/system-of-record
- Forrester Blogs (2024). Poor change management kills CRM success. Forrester analysis of why CRM data quality degrades after rollout and what stops it. https://www.forrester.com/blogs/poor-change-management-kills-crm-success-here-is-how-to-get-it-right/
- Microsoft Learn (2024). Merge duplicate records in Dynamics 365 Business Central. Technical documentation on the merge function that consolidates customer activity history into a single golden record. https://learn.microsoft.com/en-us/dynamics365/business-central/sales-how-merge-duplicate-records
- Microsoft Learn (2024). Set up duplicate detection rules to keep your data clean. Configuration guide for preventing duplicates at the point of record creation. https://learn.microsoft.com/en-us/power-platform/admin/set-up-duplicate-detection-rules-keep-data-clean
- DataVersity (2024). Managing Missing Data in Analytics. Analysis of why missing data is often more damaging than inaccurate data for downstream models. https://www.dataversity.net/articles/managing-missing-data-in-analytics/
- SR Analytics (2024). Why 95 percent of AI projects fail. Research summary identifying data quality as the dominant cause of AI pilot failure in mid-sized firms. https://sranalytics.io/blog/why-95-of-ai-projects-fail/
- Data Ladder (2024). Fuzzy Matching 101. Practitioner guide to the matching logic that catches eighty per cent of duplicate records in customer databases. https://dataladder.com/fuzzy-matching-101/
- AWS for SMBs (2024). Data Governance Strategy: Five Steps for SMBs. Practical framework for federated data governance without enterprise overhead. https://aws.amazon.com/smart-business/resources-for-smb/data-governance-strategy-5-steps-for-smbs/

Frequently asked questions

How do I know if my SME data is clean enough for AI?

Sample five hundred to one thousand customer records and check three numbers. Duplication rate above two per cent is a problem. Completeness below eighty per cent on any field the AI needs is a problem. Conflicts between authoritative systems on contact details are a problem. If all three are inside those bounds, the data is clean enough to deploy. If any one is outside, do the proportionate clean-up first and the AI tool will pay back faster.

Do I need master data management software for an SME?

No. Enterprise master data management is built for organisations running fifty or more systems with thousands of users and statutory reporting obligations across regions. An SME running five to fifteen systems with a few dozen users can achieve the same outcome with a one-page rule sheet, a named data steward, and a monthly fifteen-minute review. The discipline is what matters, not the software.

What if my CRM and accounting system both think they own the customer record?

That is the most common SME conflict. Split the field-level ownership rather than the customer record. The CRM owns contact details, conversation history, and pipeline stage because customer-facing teams keep that current. Accounting owns billing address, invoice amount, and payment status because finance teams depend on it for statutory reporting. Document the split, configure integrations to sync from the authoritative side, and the conflict stops.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
