A 22-staff recruitment firm I work with trialled an AI screening assistant for candidate emails. The pilot ran 50 enquiries a day for a month, well within free-tier limits. Promotion to general availability went smoothly until the Friday of a major job-fair week, when 500 candidate emails landed in four hours. The screening API started returning HTTP 429 “Too Many Requests” errors. The automation had no retry logic, and dozens of legitimate candidates received the firm’s automated apology email instead of a screening response.
Investigation took an afternoon. The firm was on OpenAI Tier 1 at 500,000 tokens per minute and the spike had needed Tier 2 capacity. Upgrading would have cost roughly £45 a month. The cost of the failure was a frosty conversation with the firm’s largest client about why their candidate pipeline went silent at the worst possible moment. The owner’s question afterwards: what is a rate limit, and why did nobody flag it in the pilot?
What are rate limits?
A rate limit is a vendor-imposed cap on how fast you can call an API. It is measured three ways and your business can hit any of them: requests per minute, tokens per minute (where a token is roughly four characters of English), and daily or monthly quotas. A typical setup might allow 5,000 requests per minute and 1 million tokens per minute, and you can be throttled by either ceiling.
When you exceed a limit the API returns an HTTP 429 status code, usually with a Retry-After header telling you when to come back. MDN's HTTP reference documents the mechanics of both. Every major vendor also exposes response headers like x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens, so a careful team can monitor utilisation in real time rather than discovering the ceiling under load.
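To make that concrete, here is a minimal Python sketch using the requests library against OpenAI's chat completions endpoint, with a placeholder key and an illustrative model name; the header names are OpenAI's, and other vendors use similar but not identical ones.

```python
import requests

API_KEY = "sk-..."  # placeholder; load from a secrets manager in production

def screen_candidate(email_text: str) -> requests.Response:
    """Send one screening request and report remaining rate-limit headroom."""
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gpt-4o-mini",  # illustrative model name
            "messages": [{"role": "user", "content": email_text}],
        },
        timeout=30,
    )

    # Vendor headers expose live headroom; these names are OpenAI's.
    print("requests left:", response.headers.get("x-ratelimit-remaining-requests"))
    print("tokens left:", response.headers.get("x-ratelimit-remaining-tokens"))

    if response.status_code == 429:
        # Retry-After says how many seconds to wait before trying again.
        print("rate limited, retry after:", response.headers.get("Retry-After"))

    return response
```

Logging those two headers from the pilot onwards is how a team spots 70 percent utilisation weeks before the Friday it becomes 100.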
Why do they matter for your business?
They matter because they are the moment your AI integration stops being a cost question and becomes an availability question. A pilot that handled 50 enquiries a day with comfortable headroom can fail abruptly at 500, not because the use case stopped working, but because the vendor tier is a ceiling and the business crossed it. Customer-facing automation that breaks under load is more visible than a slightly higher subscription would have been.
There are three reasons vendors enforce rate limits, and they sit on top of each other. The first is infrastructure protection: a shared API serving thousands of customers cannot let one runaway script monopolise capacity. The second is fair allocation, ensuring a £1m turnover SME gets predictable performance even when a Fortune 500 customer is on the same backbone. The third is commercial: tiers are how vendors monetise, with lower-priced tiers carrying lower limits. The first two benefit you. The third bites firms scaling past their pilot tier without realising they have done so.
Where will you actually meet them?
You meet rate limits in three places: the vendor’s tier documentation, the live usage dashboard, and the production incident itself when a customer-facing automation starts returning 429s under load and the on-call engineer has to read response headers to work out which limit was hit. The first two are where careful teams catch the problem early. The third is where a recruitment, logistics, or services firm typically discovers it the hard way.
The 2026 vendor landscape is tiered but the shape varies. OpenAI runs a five-tier ladder based on cumulative spend, from $5 unlocking Tier 1 at 500,000 tokens per minute up to $1,000 unlocking Tier 5 at 40 million. Anthropic uses a similar ladder, recently raised after the SpaceX Colossus 1 compute partnership, with Tier 1 now around 500,000 input tokens per minute for Opus. Google Vertex AI uses regional quotas requested through the cloud console rather than spend-based tiers. AWS Bedrock applies per-model per-region quotas with a provisioned throughput option for predictable workloads. Azure’s OpenAI service splits by deployment type and requires explicit quota requests through the portal.
Three SME scenarios where it bites
The pattern repeats across firms I work with. The first is customer-facing AI on Friday-afternoon load spikes, the recruitment pattern that opens this post. The second is overnight batch processing, where a professional services firm tries to run 10,000 documents through a synchronous API and hits a token-per-minute ceiling that turns a one-hour job into a six-hour one. OpenAI’s Batch API at half the cost and 24-hour turnaround would have been the right tool. The third is agentic workflows, where a single user query triggers five to twelve internal API calls and the team’s capacity model underestimates real consumption by a factor of five.
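For the overnight-document scenario, here is a minimal sketch of the batch route, assuming OpenAI's Python SDK and Batch API with a pre-built requests.jsonl file; the file name and setup are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# requests.jsonl holds one document-processing request per line, prepared ahead of time.
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# The batch completes asynchronously within a 24-hour window, at roughly half
# the synchronous price, instead of fighting a tokens-per-minute ceiling all night.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```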
When to plan for them, when to ignore them
The decision rule is operational, not technical. Ignore rate limits while you are prototyping or running a sub-10-staff pilot on free or entry-tier limits. The free tier on OpenAI, Anthropic and Google is generous enough that a typical small SME running administrative AI tasks will not hit a ceiling for months. Watch the dashboard, note actual consumption, and revisit once you have real data.
Plan for rate limits the moment any of four conditions appears. Customer-facing automation in production, where a 429 means a real customer sees a failure. Multi-vendor integration, where total consumption needs modelling across services. Predictable peak loads (seasonal hiring, payroll month-end, holiday logistics) where the spike is forecastable. Or agentic workflows where one user query triggers many internal calls. In any of those, model peak load with a 20 to 30 percent safety margin, pick the tier that covers it, build exponential backoff and jitter into the retry logic regardless, and review quarterly as the business grows.
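A minimal sketch of that retry logic: hand-rolled exponential backoff with full jitter, wrapped around whatever function makes the API call. The make_request callable is a placeholder, and many teams reach for a library such as tenacity instead.

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 5):
    """Retry a rate-limited API call with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        response = make_request()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break  # out of retries; surface the failure rather than hiding it

        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            delay = float(retry_after)  # honour the vendor's explicit hint
        else:
            # Full jitter: a random wait between 0 and an exponentially growing cap.
            delay = random.uniform(0, min(60, 2 ** attempt))
        time.sleep(delay)

    raise RuntimeError("still rate limited after retries; escalate rather than drop the request")
```

Wrapping the earlier screening call as call_with_backoff(lambda: screen_candidate(email)) is the difference between a Friday spike degrading into short delays and dozens of candidates getting the apology email.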
What to ask the vendor before signing
Six questions are worth asking before any contract is signed, and they are rarely on a standard procurement checklist. What are the default rate limits for the tier I am buying, per API key and globally? What headers and dashboards will tell me when I am at 70 percent utilisation? What is the timeline to increase limits, hours or weeks? Does upgrading reduce per-call cost or just buy headroom? When I exceed a limit, do you return 429 or queue and delay? And the one that matters most: have rate limits or model behaviour changed unannounced recently? Anthropic's April 23 Claude Code postmortem is a useful reference for the kind of incident that last question surfaces.
Related concepts
Tokens are the unit counted on the tokens-per-minute axis of every rate limit, roughly four characters or three-quarters of an English word. Knowing your typical request size in tokens turns a vendor’s TPM limit into a real capacity figure. A 1,500-token query against a 500,000 TPM ceiling gives 333 requests a minute of headroom, the number that tells you whether the pilot scales.
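The same arithmetic written out, so the agentic fan-out from the earlier scenario can be folded in; the figures are the illustrative ones used above.

```python
def user_queries_per_minute(tpm_limit: int, tokens_per_request: int, calls_per_query: int = 1) -> float:
    """Translate a tokens-per-minute ceiling into user-facing capacity."""
    return tpm_limit / tokens_per_request / calls_per_query

# A 1,500-token query against a 500,000 TPM ceiling: roughly 333 requests a minute.
print(user_queries_per_minute(500_000, 1_500))

# The same workload behind an agentic pipeline making 5 internal calls per query:
# roughly 66 user queries a minute, before any safety margin.
print(user_queries_per_minute(500_000, 1_500, calls_per_query=5))
```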
Input and output tokens are rate-limited separately by some vendors, with output capacity typically a fraction of input because generation is computationally more expensive than reading. Anthropic’s ladder runs input limits at five to ten times output limits at every tier. If your workload is output-heavy, the binding constraint is the output ceiling.
Prompt caching is the architectural lever that buys headroom without a tier upgrade. Cached input tokens count against rate limits at a fraction of the standard rate on many vendors, so a workload reusing the same system prompt across thousands of calls can sit comfortably on a lower tier than the equivalent uncached workload. For volume-heavy use cases this is often a bigger lever than the tier choice itself.
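A minimal sketch of what that looks like in practice, assuming Anthropic's prompt-caching syntax, an illustrative model name, and a long screening rubric reused on every call.

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SCREENING_RUBRIC = "..."  # the long, stable system prompt reused on every call

def screen(email_text: str):
    return client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": SCREENING_RUBRIC,
                # Mark the stable prefix cacheable: subsequent calls read it from
                # cache, which is billed (and on some vendors rate-limited) at a
                # fraction of the standard input-token rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": email_text}],
    )
```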
Inference cost is the broader frame around this. Rate limits exist because inference is finite. The vendors raising tier ceilings are doing so because they are bringing more compute online. The 2026 landscape is more generous than 2024, but tiers still exist and a careful team treats them as a first-class architectural concern.
If you want to talk through where your own usage sits and what to bake into production code before the next load spike, book a conversation.



