A consultant recommends setting up a “model router” for your new customer chat system. You nod. You write it down. Later, alone, you search for what it actually means, and many explanations assume you already know what an API call costs, why you’d run more than one model, and what throughput means in practice. This post assumes none of that. It explains what a model router is, how it works, and when a firm like yours genuinely needs one.
What is an AI model router?
An AI model router is software that sits between your application and two or more AI models, directing each incoming request to whichever model makes sense for that task. A short, factual question might go to a small, cheap model. A long document requiring careful reasoning goes to a more capable one. The router makes that choice automatically, in milliseconds.
The analogy Microsoft uses is a smart switchboard. Instead of one phone line to “the AI”, you have several, and the switchboard decides in real time which line handles each call. Azure’s model router works this way for OpenAI’s GPT family, automatically distributing requests based on how complex each prompt appears to be.
Amazon’s Bedrock platform offers a similar pattern with Anthropic’s models: a router predicts which model in a pool will give the best outcome for cost and quality, then routes accordingly. The user and the application see a single endpoint. The routing happens behind the scenes.
The economics behind this matter. Public pricing on AI APIs spans a roughly 300-fold range, from around US$0.10 per million tokens at the cheap end to US$30 or more at the high end. A router that sends straightforward requests to cheap models and complex ones to expensive models can, in principle, keep costs down without degrading quality for the tasks that genuinely need it. MindStudio, a commercial routing provider, reports organisations seeing 30 to 70 per cent cost reductions using this approach, though those figures come from vendor marketing materials rather than independent benchmarks.
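To make that arithmetic concrete, here is a minimal sketch in Python. The two prices are the ends of the range quoted above. The monthly volume (50 million tokens) and the share of traffic a router might send to the cheap model (60 per cent) are assumptions chosen purely for illustration, not measurements from any real deployment.

```python
# Prices from the range quoted above, in US$ per million tokens.
CHEAP_PRICE = 0.10
PREMIUM_PRICE = 30.00


def monthly_cost(tokens_millions: float, share_to_cheap: float) -> float:
    """Blended monthly cost when a given share of tokens goes to the cheap model."""
    cheap = tokens_millions * share_to_cheap * CHEAP_PRICE
    premium = tokens_millions * (1 - share_to_cheap) * PREMIUM_PRICE
    return cheap + premium


# Hypothetical workload: 50 million tokens a month.
everything_premium = monthly_cost(50, share_to_cheap=0.0)
routed = monthly_cost(50, share_to_cheap=0.6)  # assumed: 60% of traffic is simple
saving = 1 - routed / everything_premium

print(f"All premium: ${everything_premium:,.2f}")
print(f"Routed:      ${routed:,.2f}")
print(f"Saving:      {saving:.0%}")
```

Under these assumptions the bill falls from about $1,500 to about $603 a month, a saving of roughly 60 per cent. The saving tracks the share of traffic that can safely go to the cheap model, which is why your own mix of simple and complex queries matters more than any headline figure.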
How does a router decide where each request goes?
The router analyses the prompt, estimates how complex it is, then picks a model from a pre-configured pool. Azure’s router considers the full request including conversation history, then scores it for likely difficulty. AWS describes two routing strategies: static rules that assign request types to fixed models, and dynamic routing that uses machine learning to predict which model will perform best.
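The difference between static and dynamic routing is easier to see in a few lines of Python. This is a toy sketch, not how Azure or AWS implement routing: the model names, the request types, the cue words, and the 0.3 threshold are all hypothetical, and a real dynamic router would use a trained predictor rather than word counting.

```python
from dataclasses import dataclass


@dataclass
class Request:
    kind: str            # e.g. "faq", "booking", "advice" (hypothetical labels)
    prompt: str
    history: list[str]   # earlier turns in the conversation


# Static routing: a fixed table from request type to model.
STATIC_RULES = {"faq": "small-model", "booking": "small-model", "advice": "large-model"}


def route_static(req: Request) -> str:
    # Unknown request types fall back to the more capable model.
    return STATIC_RULES.get(req.kind, "large-model")


def estimate_difficulty(req: Request) -> float:
    """Toy stand-in for a learned predictor: returns a score between 0 and 1."""
    text = " ".join(req.history + [req.prompt]).lower()
    length_score = min(len(text.split()) / 200, 1.0)
    cue_words = ("why", "explain", "compare", "interpret", "draft")
    cue_score = sum(word in text for word in cue_words) / len(cue_words)
    return 0.5 * length_score + 0.5 * cue_score


def route_dynamic(req: Request, threshold: float = 0.3) -> str:
    # Dynamic routing: score the whole request, then pick a model.
    return "large-model" if estimate_difficulty(req) >= threshold else "small-model"


simple = Request("faq", "What are your opening hours?", [])
hard = Request("advice", "Explain why clause 4 applies and compare it with clause 7, then draft a reply.", [])
print(route_static(simple), route_dynamic(simple))
print(route_static(hard), route_dynamic(hard))
```

The static table is predictable and easy to audit, but it only knows the categories you thought of in advance. The dynamic version looks at the request itself, history included, which is closer in spirit to what the managed routers describe.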
Azure exposes three routing modes. “Balanced” aims for the best overall cost-quality mix. “Cost” aggressively prefers cheaper models and only escalates when the prompt seems too hard for them. “Quality” pushes everything to the highest-capability models regardless of price. You configure the mode once, and the router applies it to every request.
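One way to picture the three modes is as different thresholds for escalating to the expensive model. The Python sketch below assumes the router has already scored a request for difficulty on a scale from 0 to 1. The threshold values and model names are invented for illustration; Azure does not publish its internals in this form.

```python
from enum import Enum


class Mode(Enum):
    BALANCED = "balanced"
    COST = "cost"
    QUALITY = "quality"


# Hypothetical thresholds: the difficulty score at which a request is
# escalated to the more capable model under each mode.
ESCALATION_THRESHOLD = {
    Mode.COST: 0.8,      # escalate only when the prompt looks hard
    Mode.BALANCED: 0.5,  # middle ground
    Mode.QUALITY: 0.0,   # every request goes to the capable model
}


def pick_model(difficulty: float, mode: Mode) -> str:
    if difficulty >= ESCALATION_THRESHOLD[mode]:
        return "large-model"
    return "small-model"


# The same moderately difficult request lands differently under each mode.
for mode in Mode:
    print(mode.value, pick_model(0.6, mode))
```

The point is that the mode is a single setting that shifts the cut-off for every request, which is why you only configure it once.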
After selecting a model, the router forwards the request and returns the response. Microsoft’s documentation notes that the router itself adds only a negligible fraction to the total processing time.
Routers also handle failure gracefully. If a model is unavailable, hits its rate limit, or returns a low-confidence response, a well-built router will retry on a different model from the pool rather than simply failing. Azure shows the model name it used in each API response, so you can audit which model actually served which request. That audit trail matters for regulated firms.
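The fallback behaviour can be sketched in a few lines of Python. Everything here is hypothetical: the exception type, the model names, and the `call_model` function stand in for whatever your platform provides, and a production router would also handle timeouts and retry limits.

```python
from typing import Callable


class ModelUnavailable(Exception):
    """Raised by the (hypothetical) model client on an outage or rate limit."""


def call_with_fallback(prompt: str, models: list[str],
                       call_model: Callable[[str, str], str]) -> dict:
    """Try each model in order and record which one actually answered."""
    errors = []
    for name in models:
        try:
            reply = call_model(name, prompt)
        except ModelUnavailable as exc:
            errors.append(f"{name}: {exc}")
            continue  # move on to the next model in the pool
        return {"model": name, "reply": reply}
    raise RuntimeError("all models failed: " + "; ".join(errors))


# A fake client in which the small model is currently rate-limited.
def fake_call(name: str, prompt: str) -> str:
    if name == "small-model":
        raise ModelUnavailable("rate limit hit")
    return f"[{name}] answer to: {prompt}"


result = call_with_fallback("Where is my order?", ["small-model", "large-model"], fake_call)
print(result["model"])  # prints "large-model"
```

Returning the model name alongside the reply mirrors the audit idea: whoever reviews the logs later can see which model served the request, not just what it said.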
Where will you actually meet model routing?
If you run a small UK services firm, you will most likely encounter model routing through the managed cloud platforms you already use or are evaluating, rather than by building your own routing layer. Azure, AWS Bedrock, and commercial platforms like MindStudio all offer routing as a built-in feature. You configure it; you don’t engineer it from the ground up.
The most common scenario for a firm of five to fifty people is a customer-facing chatbot with a mix of query types. Appointment booking, FAQ responses, and simple status enquiries can go to cheaper, faster models. More complex queries (anything requiring detailed reasoning, legal interpretation, or nuanced advice drafting) go to a more capable model. The routing is invisible to the customer; it happens inside the platform.
A second scenario is internal tooling. If your staff use AI for a mix of tasks, such as quick email drafts, document summaries, and first-pass reports, a router can ensure the cheap model handles the quick tasks and the expensive model handles the ones that genuinely need it.
You also meet routing indirectly when cloud providers manage it for you by default. Azure’s router is presented as a first option for general-purpose OpenAI workloads. You don’t always need to know it is happening.
When does routing make sense, and when should you ignore it?
Model routing is worth considering when you have high query volume, mixed task complexity, and meaningful AI spend. Microsoft specifically recommends it for user-facing applications like customer support chatbots where latency matters and many requests are simple. For a firm with modest internal AI use, a straightforward setup with one well-chosen model is usually more practical.
The business case strengthens as volume grows. If you send only a few dozen AI requests a day, a router is one more service to configure, monitor, and secure, and that overhead will cost more in time than it saves in API fees.
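A rough upper bound shows why. The Python sketch below uses the per-token prices quoted earlier in this post (US$0.10 and US$30 per million tokens) and assumes, purely for illustration, 1,000 tokens per request and a 30-day month. It asks: if a router moved every single request from the most expensive model to the cheapest, how much could you possibly save?

```python
def max_monthly_saving(requests_per_day: int, tokens_per_request: int,
                       premium_price: float = 30.0, cheap_price: float = 0.10,
                       days: int = 30) -> float:
    """Upper bound on savings: every request moves from premium to cheap."""
    tokens_millions = requests_per_day * tokens_per_request * days / 1_000_000
    return tokens_millions * (premium_price - cheap_price)


# Assumed figures: 1,000 tokens per request, 30 days a month.
small_firm = max_monthly_saving(requests_per_day=36, tokens_per_request=1_000)
busy_chatbot = max_monthly_saving(requests_per_day=5_000, tokens_per_request=1_000)

print(f"Three dozen requests a day: at most ${small_firm:,.2f} a month")
print(f"5,000 requests a day:       at most ${busy_chatbot:,.2f} a month")
```

At three dozen requests a day the ceiling is about $32 a month, and that is before accounting for the requests that genuinely need the better model. At 5,000 requests a day the ceiling is closer to $4,500 a month, which is where the effort of configuring and monitoring a router starts to pay for itself.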
Routing also gets complicated in regulated environments. If you are a firm in financial services or working with NHS contracts, adding multiple AI providers means multiple supplier chains to document, monitor, and take responsibility for. The FCA’s expectations on third-party AI are clear: regulated firms remain accountable for outcomes even when using external models. Azure itself recommends direct, single-model deployments for specialised or compliance-sensitive workloads, not routing. Adding routing complexity to those situations increases audit work rather than reducing it.
The NCSC has also noted that routing introduces an additional surface area to secure. Authentication, logging, and rate-limit management all need attention. For a small firm without a technical team or a technical partner, this is a real consideration.
What related concepts should you know?
Model routing sits in a wider set of ideas around how organisations deploy and manage AI models at any kind of scale. If someone in a supplier conversation mentions prompt routers, AI gateways, multi-agent architectures, or inference proxies, they are pointing at adjacent patterns with overlapping goals. Understanding the rough shape of each helps you ask better questions rather than defer to the jargon.
A prompt router is the closest relative. Where a model router chooses between AI models, a prompt router may also direct requests to non-AI tools, databases, or code functions. Some platforms use the terms interchangeably.
An AI gateway sits at a different layer. It manages authentication, rate limiting, logging, and cost controls across all AI API calls from your organisation, regardless of which model is used. Think of it as the security and billing layer, with the router sitting inside it or alongside it.
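As a rough picture of how the two layers relate, here is a toy gateway in Python with a router sitting behind it. The class, the API key names, the request cap, and the word-count routing rule are all hypothetical; real gateways are products you configure, not code you write.

```python
import time
from typing import Callable


class Gateway:
    """Toy gateway: checks the API key, caps requests per key, logs every call."""

    def __init__(self, valid_keys: set[str], max_requests_per_key: int,
                 route: Callable[[str], str]):
        self.valid_keys = set(valid_keys)
        self.max_requests = max_requests_per_key
        self.route = route                  # the router sits behind the gateway
        self.counts: dict[str, int] = {}
        self.log: list[dict] = []

    def handle(self, api_key: str, prompt: str) -> str:
        if api_key not in self.valid_keys:              # authentication
            raise PermissionError("unknown API key")
        used = self.counts.get(api_key, 0)
        if used >= self.max_requests:                   # rate limiting
            raise RuntimeError("rate limit reached for this key")
        self.counts[api_key] = used + 1
        model = self.route(prompt)                      # routing decision
        self.log.append({"time": time.time(), "key": api_key, "model": model})
        return model


# Hypothetical routing rule: long prompts go to the larger model.
def simple_router(prompt: str) -> str:
    return "large-model" if len(prompt.split()) > 50 else "small-model"


gateway = Gateway({"team-a"}, max_requests_per_key=2, route=simple_router)
print(gateway.handle("team-a", "What time do you close?"))  # prints "small-model"
```

The gateway never decides which model is best. It decides whether the call is allowed at all and keeps the record, then hands the routing decision to whatever sits behind it.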
Multi-agent architectures take routing a step further. Instead of choosing between models for a single request, they chain multiple AI calls together, with each agent handling part of a task. Routing logic becomes part of the coordination between agents.
For a small firm at the early stages of AI adoption, the practical value in knowing about these patterns is mainly conversational: you can follow the discussion and ask informed questions when a supplier or consultant raises them.
The terminology around AI infrastructure is expanding quickly, and model routing is one of the more useful concepts to have in your vocabulary before you sit down with a supplier. You don’t need to build it. You may need to configure it. Knowing what it is, and when it helps versus when it adds friction, puts you in a much stronger position than nodding along.



