Your IT consultant mentioned it at the end of a call. Something about running a ‘small language model’ on your own server rather than sending everything to OpenAI. You made a note. Six months on, it’s still there, partly because you weren’t sure what it was, and partly because you weren’t sure it mattered.
It likely matters more than you’d expect. The idea is less complicated than the label suggests.
What is a small language model?
A small language model (SLM) is a generative AI model with far fewer parameters than the large frontier models hosted by OpenAI, Google, and Anthropic. An SLM typically has under 10 billion parameters, yet it can still handle many of the same kinds of text tasks: drafting emails, summarising documents, answering questions. The difference is that it can run on a single GPU or a well-specified laptop rather than needing specialised computing infrastructure.
Think of it as the difference between a specialist and a full-service advisory firm. The specialist knows one domain deeply and costs less per engagement. The firm knows a great deal about everything, which is useful when questions are genuinely complex and broad, but you’re paying for capacity you won’t always use.
The UK Parliament’s POST research note on large language models provides useful context: the ‘large’ in LLM refers to parameter count, and the distinction between large and small is informal and shifts as training techniques improve. Well-known SLMs include Meta’s Llama 3 8B model and Mistral’s 7B model. Both run on a single modern GPU or a powerful laptop, and both are open-weight, meaning the trained model weights can be downloaded and deployed on your own infrastructure.
Why does it matter for your business?
For an owner-managed business running AI daily, cost and data control are the two pressures that accumulate. SLMs run on lighter hardware and typically cost considerably less per task than frontier cloud models. They can also run on your own server, keeping client data on your infrastructure rather than routing it to a third-party cloud. Both your monthly bill and your GDPR exposure change.
Thoughtworks UK’s analysis of small language models notes faster response times, lower costs, and reduced energy consumption compared with larger models. Thoughtworks also describes what they call the ‘specialised worker’ approach: rather than routing every task through one large cloud model, a set of smaller SLMs each handle a specific job, which is often considerably cheaper for high-volume, repetitive workflows.
The ICO’s guidance on generative AI makes clear that organisations must know where their data is stored and processed, and whether international transfers are occurring. An SLM running on a UK server gives you a cleaner answer to that question than a US-hosted cloud model. The CMA’s initial report on AI foundation models adds a separate angle: open-weight SLMs like Llama 3 and Mistral reduce dependency on a small number of large cloud providers, which aligns with the regulator’s concerns about market concentration in AI infrastructure.
Where will you actually meet it?
You’ll encounter SLMs most commonly in three configurations: on-device tools that process data without a network connection (a mobile app that generates site-visit reports before the engineer returns to the office), inside line-of-business software (a CRM that drafts follow-up emails from call notes), and self-hosted internal assistants that answer questions from your own documents, procedures, and case files.
The internal knowledge assistant is where many owner-managed businesses find the most immediate practical return. Rather than staff searching scattered documents or asking colleagues for answers, a well-configured SLM looks up your own policies and procedures before responding. The model works from your documents rather than needing encyclopaedic coverage.
Client-facing applications work well when the service scope is narrow. A firm with a defined offering, such as a fixed-process clinic, a specialist tax adviser, or a managed IT provider, can deploy a focused bot that handles common questions accurately and keeps client data within its own infrastructure. The NCSC’s guidance on security considerations for AI as a service is directly relevant here: your attack surface expands when data flows out to external systems, so on-device or on-premise SLMs reduce that exposure compared with cloud-routed alternatives.
When should you ask about it, and when should you ignore it?
Consider an SLM when you have a specific, repetitive text task running at high frequency, involving client or staff data, with no need for creative breadth or open-ended reasoning. Stick with a frontier cloud model when the task is complex, brand-critical, or infrequent enough that the cost difference doesn’t justify the additional setup. The deciding factor is the task, not the technology.
Thoughtworks UK is candid about where SLMs fall short: limited general knowledge, lower accuracy on complex tasks, and less nuanced language generation than frontier models. If you’re producing long-form marketing content, handling sensitive advisory communications, or asking the model to reason across unfamiliar territory, a larger model will usually serve you better.
There’s a timing dimension worth keeping in mind. Frontier models have been getting cheaper and faster each year, and that trend continues. If the cost gap between large and small models narrows significantly over the next few years, the case for the extra setup behind a self-hosted SLM becomes harder to justify for a small operation.
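One way to make that timing judgement concrete is a rough break-even calculation. The sketch below is illustrative only: the function name and every figure in it (setup cost, monthly hosting, per-task prices, task volume) are hypothetical placeholders, not prices quoted by any provider.

```python
import math

def breakeven_months(setup_cost, selfhost_monthly, cloud_cost_per_task,
                     selfhost_cost_per_task, tasks_per_month):
    """Months until self-hosting has paid back its setup cost.

    Returns None when the self-hosted option is not cheaper per month,
    in which case the setup cost is never recovered.
    """
    cloud_monthly = cloud_cost_per_task * tasks_per_month
    slm_monthly = selfhost_monthly + selfhost_cost_per_task * tasks_per_month
    monthly_saving = cloud_monthly - slm_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(setup_cost / monthly_saving)

# Hypothetical figures: 6,000 setup, 150 a month hosting, 20,000 tasks a month.
today = breakeven_months(6000, 150, 0.02, 0.002, 20000)        # 29 months
# Same business after the (assumed) cloud price falls to a quarter.
after_drop = breakeven_months(6000, 150, 0.005, 0.002, 20000)  # None
print(today, after_drop)
```

With these made-up numbers the setup pays back in 29 months at today's price, and never pays back once the cloud price falls far enough. Running the same sum with your own volumes and quotes shows how sensitive the decision is to the trend described above.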
The FCA’s 2023 discussion paper on AI in financial services makes clear that regulated firms remain accountable for any AI system under Consumer Duty and operational resilience requirements. A smaller, more controllable model with clear training data may be easier to audit and explain to a regulator, which is a genuine consideration if you’re in a sector with active oversight.
What else connects to small language models?
SLMs sit inside a wider AI vocabulary you’ll encounter as you build capability in your business. Three terms come up regularly alongside them: RAG, or retrieval-augmented generation (where the model searches your documents before answering rather than relying solely on its training), fine-tuning (adapting the model on your own data to improve accuracy on specific tasks), and agent frameworks (where multiple models coordinate to complete multi-step tasks).
RAG is often the better starting point for an owner-managed business than fine-tuning, because you don’t need to retrain the model at all. You give it a reference library of your own documents and the system searches that library before the model responds. This produces practical results for many businesses with less specialist overhead than a fine-tuning project.
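For readers who want to see the shape of the idea, here is a minimal sketch of the retrieve-then-answer pattern. It is a simplification under stated assumptions: production systems usually search by meaning using embeddings, whereas this version just counts shared words, and the document titles, contents, and function names are all hypothetical. The final step, sending the prompt to a model, is deliberately left out.

```python
import re

def tokenize(text):
    """Lowercase a string and return its words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Return titles of up to top_k documents sharing the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(body)), title) for title, body in documents.items()]
    scored = [s for s in scored if s[0] > 0]    # drop documents with no overlap
    scored.sort(key=lambda s: (-s[0], s[1]))    # best score first, then by title
    return [title for _, title in scored[:top_k]]

def build_prompt(question, documents, top_k=2):
    """Assemble the text a model would receive: retrieved passages, then the question."""
    titles = retrieve(question, documents, top_k)
    context = "\n\n".join(f"[{t}]\n{documents[t]}" for t in titles)
    return (
        "Answer using only the reference material below. "
        "If the answer is not there, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

# Hypothetical internal documents standing in for a real reference library.
LIBRARY = {
    "Holiday policy": "Staff request annual leave through the HR portal at least two weeks in advance.",
    "Expenses policy": "Submit expenses with receipts by the fifth working day of each month.",
    "Onboarding checklist": "New clients receive an engagement letter and an identity check.",
}

print(build_prompt("How do I request annual leave?", LIBRARY))
```

The point of the pattern is that the model only sees the passages that match the question, so its answer is anchored in your documents and nothing about the model itself has been changed.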
Agent frameworks are increasingly where SLMs are being deployed as specialised components: a larger orchestrating model coordinates several SLMs, each handling a specific step in a workflow, which is often considerably cheaper than routing the entire chain through a single frontier model.
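The routing idea can be sketched in a few lines. This is an illustration rather than any particular framework's API: the worker functions are placeholders that return labelled strings instead of calling real models, the task names are hypothetical, and the orchestrator here is a plain lookup table where a real system might use a larger model to decide.

```python
def summarise_worker(text):
    # Placeholder for a call to a small model set up for summarising.
    return f"[summary-slm] {text[:40]}"

def classify_worker(text):
    # Placeholder for a call to a small model set up for classification.
    return f"[classify-slm] {text[:40]}"

def frontier_fallback(text):
    # Placeholder for a call to a larger cloud model.
    return f"[frontier] {text[:40]}"

# Each known task type maps to its specialised worker.
WORKERS = {"summarise": summarise_worker, "classify": classify_worker}

def route(task_type, text):
    """Send a task to its specialised worker, or to the fallback if none exists."""
    handler = WORKERS.get(task_type, frontier_fallback)
    return handler(text)

def run_workflow(steps):
    """Run (task_type, text) steps in order and collect each result."""
    return [route(task_type, text) for task_type, text in steps]

results = run_workflow([
    ("classify", "Customer asks about invoice date"),
    ("summarise", "Call notes from Tuesday"),
    ("negotiate", "Unusual contract query"),  # no worker, so it falls back
])
print(results)
```

Only the third step reaches the larger model, which is where the cost saving in this kind of design comes from.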
If your work involves personal data, the ICO’s DPIA guidance and the NCSC and CISA joint guidance on secure AI system development are both practical starting points, regardless of which model size you choose. The principles in that guidance hold across the board: minimise the data sent, encrypt in transit, apply strong access controls, and log what the system does. Model size changes the cost and data-residency picture. It doesn’t change the governance obligations.
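Two of those principles, minimising the data sent and logging what the system does, can be illustrated with a short sketch. It makes several assumptions: the two patterns below catch only simple email addresses and one common UK phone format, so they are nowhere near a complete personal-data filter, and the logger name and function name are hypothetical. Encryption in transit and access controls sit elsewhere in the system and are not shown.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai_gateway")  # hypothetical logger name

# Illustrative patterns only; real personal data takes many more forms.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"(?:\+44\s?\d{4}|\b0\d{4})\s?\d{6}\b"),
}

def minimise(text):
    """Replace matched personal details with labels before text leaves the business.

    Returns the cleaned text and a count of replacements per label.
    """
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    # Log what was removed, never the removed values themselves.
    log.info("prompt prepared, redactions: %s", counts)
    return text, counts

clean, counts = minimise("Email jo@example.co.uk or call 07700 900123 about the renewal.")
print(clean)
```

The same wrapper works whether the text is then sent to a self-hosted SLM or a cloud model, which reflects the point that the obligations do not depend on model size.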
A small language model is a practical tool for well-defined, high-frequency jobs. If your team is handling volume text tasks and you’re uncomfortable with where that data is going, it’s worth a proper assessment of whether an SLM-based approach fits. Book a conversation to think through which AI tools make sense for your business, and which would be wasted on problems they’re not built for.



