What are rate limits? Why they matter for your AI bill

TL;DR

A rate limit is a vendor cap on how fast you can use an AI service, measured in requests per minute, tokens per minute, or daily quotas. Hit one and the API returns an HTTP 429 error. Many UK SMEs discover rate limits in production, when a customer-facing pilot scales from 50 enquiries a day to 500 and the screening bot starts failing mid-shift. The fix is rarely expensive once you know the shape, but the surprise is.

Key takeaways

- A rate limit caps how fast you can call an AI API, measured three ways: requests per minute, tokens per minute, and daily quotas. You can hit any of them.
- Vendors set limits to protect shared infrastructure, allocate capacity fairly, and price tiers commercially. The third reason is the one that bites SMEs scaling beyond a pilot.
- The 2026 landscape is tiered. OpenAI runs a five-tier spend-based ladder, Anthropic raised limits significantly after the SpaceX compute deal, Google uses regional quotas, AWS Bedrock is per-model and per-region, and Azure is split by deployment type.
- When you hit a limit the API returns HTTP 429. Production code should treat 429 as a flow-control signal, retrying with exponential backoff and jitter rather than failing or hammering the endpoint.
- Ignore rate limits at the prototype stage. Plan for them the moment you put customer-facing automation, multi-vendor integration, or agentic workflows into production.

A 22-staff recruitment firm I work with trialled an AI screening assistant for candidate emails. The pilot ran 50 enquiries a day for a month, well within free-tier limits. Promotion to general availability went smoothly until the Friday of a major job-fair week, when 500 candidate emails landed in four hours. The screening API started returning HTTP 429 “Too Many Requests” errors. The automation had no retry logic, and dozens of legitimate candidates received the firm’s automated apology email instead of a screening response.

Investigation took an afternoon. The firm was on OpenAI Tier 1 at 500,000 tokens per minute and the spike had needed Tier 2 capacity. Upgrading would have cost roughly £45 a month. The cost of the failure was a frosty conversation with the firm’s largest client about why their candidate pipeline went silent at the worst possible moment. The owner’s question afterwards: what is a rate limit, and why did nobody flag it in the pilot?

What are rate limits?

A rate limit is a vendor-imposed cap on how fast you can call an API. It is measured three ways and your business can hit any of them: requests per minute, tokens per minute (where a token is roughly four characters of English), and daily or monthly quotas. A typical setup might allow 5,000 requests per minute and 1 million tokens per minute, and you can be throttled by either ceiling.

When you exceed a limit the API returns an HTTP 429 status code, usually with a Retry-After header telling you when to come back. The mechanics are documented in MDN's HTTP reference. Every major vendor also exposes response headers like x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens, so a careful team can monitor utilisation in real time rather than discovering the ceiling under load.
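As an illustration, a small monitoring hook can parse those headers on every response and flag when either axis is running low. The x-ratelimit-limit-* names below follow the same convention as the remaining-* headers but are an assumption; check your vendor's documentation for the exact set it sends:

```python
def rate_limit_status(headers, warn_fraction=0.3):
    """Parse x-ratelimit-* response headers and flag any axis whose
    remaining capacity has dropped below warn_fraction of its limit."""
    status = {}
    for axis in ("requests", "tokens"):
        remaining = headers.get(f"x-ratelimit-remaining-{axis}")
        limit = headers.get(f"x-ratelimit-limit-{axis}")
        if remaining is None or limit is None:
            continue  # the vendor does not expose this axis
        remaining, limit = int(remaining), int(limit)
        status[axis] = {
            "remaining": remaining,
            "limit": limit,
            "low": remaining < limit * warn_fraction,
        }
    return status

# Example with made-up header values: plenty of request headroom,
# but the token axis is nearly exhausted.
example = rate_limit_status({
    "x-ratelimit-limit-requests": "5000",
    "x-ratelimit-remaining-requests": "4200",
    "x-ratelimit-limit-tokens": "500000",
    "x-ratelimit-remaining-tokens": "90000",
})
print(example)
```

Wiring something like this into a dashboard alert at 70 percent utilisation is the cheap way to find out about a ceiling before your customers do.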

Why do they matter for your business?

They matter because they are the moment your AI integration stops being a cost question and becomes an availability question. A pilot that handled 50 enquiries a day with comfortable headroom can fail abruptly at 500, not because the use case stopped working, but because the vendor tier is a ceiling and the business crossed it. Customer-facing automation that breaks under load is more visible than a slightly higher subscription would have been.

There are three reasons vendors enforce rate limits, and they sit on top of each other. The first is infrastructure protection: a shared API serving thousands of customers cannot let one runaway script monopolise capacity. The second is fair allocation, ensuring a £1m turnover SME gets predictable performance even when a Fortune 500 customer is on the same backbone. The third is commercial: tiers are how vendors monetise, with lower-priced tiers carrying lower limits. The first two benefit you. The third bites firms scaling past their pilot tier without realising they have done so.

Where will you actually meet them?

You meet rate limits in three places: the vendor’s tier documentation, the live usage dashboard, and the production incident itself when a customer-facing automation starts returning 429s under load and the on-call engineer has to read response headers to work out which limit was hit. The first two are where careful teams catch the problem early. The third is where a recruitment, logistics, or services firm typically discovers it the hard way.

The 2026 vendor landscape is tiered but the shape varies. OpenAI runs a five-tier ladder based on cumulative spend, from $5 unlocking Tier 1 at 500,000 tokens per minute up to $1,000 unlocking Tier 5 at 40 million. Anthropic uses a similar ladder, recently raised after the SpaceX Colossus 1 compute partnership, with Tier 1 now around 500,000 input tokens per minute for Opus. Google Vertex AI uses regional quotas requested through the cloud console rather than spend-based tiers. AWS Bedrock applies per-model per-region quotas with a provisioned throughput option for predictable workloads. Azure’s OpenAI service splits by deployment type and requires explicit quota requests through the portal.

Three SME scenarios where it bites

The pattern repeats across firms I work with. The first is customer-facing AI on Friday-afternoon load spikes, the recruitment pattern that opens this post. The second is overnight batch processing, where a professional services firm tries to run 10,000 documents through a synchronous API and hits a token-per-minute ceiling that turns a one-hour job into a six-hour one. OpenAI’s Batch API at half the cost and 24-hour turnaround would have been the right tool. The third is agentic workflows, where a single user query triggers five to twelve internal API calls and the team’s capacity model underestimates real consumption by a factor of five.
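The batch arithmetic is worth making explicit. A rough sketch, with illustrative numbers rather than vendor figures, shows how a TPM ceiling sets a hard floor on batch duration before any code even runs:

```python
def batch_duration_minutes(doc_count, tokens_per_doc, tpm_ceiling):
    """Floor estimate of wall-clock minutes for a synchronous batch
    whose throughput is capped by the tokens-per-minute ceiling."""
    return (doc_count * tokens_per_doc) / tpm_ceiling

# Illustrative numbers, not vendor figures: 10,000 documents averaging
# 3,000 tokens each against a 500,000 TPM tier.
print(batch_duration_minutes(10_000, 3_000, 500_000))  # 60.0
```

Real runs come in slower than the floor once retries and request overhead are added, which is why a nominally one-hour job can stretch to several.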

When to plan for them, when to ignore them

The decision rule is operational, not technical. Ignore rate limits while you are prototyping or running a sub-10-staff pilot on free or entry-tier limits. The free tier on OpenAI, Anthropic and Google is generous enough that a typical small SME running administrative AI tasks will not hit a ceiling for months. Watch the dashboard, note actual consumption, and revisit once you have real data.

Plan for rate limits the moment any of four conditions appears: customer-facing automation in production, where a 429 means a real customer sees a failure; multi-vendor integration, where total consumption needs modelling across services; predictable peak loads (seasonal hiring, payroll month-end, holiday logistics), where the spike is forecastable; or agentic workflows, where one user query triggers many internal calls. In any of those, model peak load with a 20 to 30 percent safety margin, pick the tier that covers it, build exponential backoff and jitter into the retry logic regardless, and review quarterly as the business grows.
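That planning step fits in a few lines. The tier names and ceilings below are illustrative assumptions, not any vendor's real figures:

```python
def required_tpm(peak_rpm, tokens_per_request, safety_margin=0.25):
    """Turn a forecast peak load into the tokens-per-minute figure
    a tier must cover, with a 20 to 30 percent margin applied."""
    return peak_rpm * tokens_per_request * (1 + safety_margin)

def pick_tier(required, tiers):
    """Return the lowest tier whose TPM ceiling covers the requirement,
    or None if no listed tier does."""
    for name, ceiling in sorted(tiers.items(), key=lambda kv: kv[1]):
        if ceiling >= required:
            return name
    return None  # no tier covers it: batch the work or rethink the design

# Assumed ceilings for the sketch only.
tiers = {"tier-1": 500_000, "tier-2": 2_000_000}
need = required_tpm(120, 1_500)  # 120 peak requests/min at 1,500 tokens each
print(need, pick_tier(need, tiers))  # 225000.0 tier-1
```

Rerunning this with each quarter's actual peak figures is the review loop in miniature.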

What to ask the vendor before signing

Six questions are worth asking before any contract is signed, and they are rarely on a standard procurement checklist:

- What are the default rate limits for the tier I am buying, per API key and globally?
- What headers and dashboards will tell me when I am at 70 percent utilisation?
- What is the timeline to increase limits: hours or weeks?
- Does upgrading reduce per-call cost, or just buy headroom?
- When I exceed a limit, do you return 429, or queue and delay?
- And the one that matters most: have rate limits or model behaviour changed unannounced recently? Anthropic's April 23 Claude Code postmortem is a reference for the kind of incident this question surfaces.

Tokens are the unit counted on the tokens-per-minute axis of every rate limit, roughly four characters or three-quarters of an English word. Knowing your typical request size in tokens turns a vendor’s TPM limit into a real capacity figure. A 1,500-token query against a 500,000 TPM ceiling gives 333 requests a minute of headroom, the number that tells you whether the pilot scales.
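The same arithmetic generalises to both axes: the binding constraint is whichever ceiling you reach first. A minimal sketch:

```python
def effective_rpm(rpm_ceiling, tpm_ceiling, tokens_per_request):
    """Real requests-per-minute headroom: the binding constraint is
    whichever ceiling you reach first."""
    return min(rpm_ceiling, tpm_ceiling // tokens_per_request)

# 5,000 RPM and 500,000 TPM at 1,500 tokens per request:
# the token axis binds first.
print(effective_rpm(5_000, 500_000, 1_500))  # 333
```

If your requests were tiny, say 50 tokens, the request axis would bind instead, which is why both ceilings belong in the capacity model.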

Input and output tokens are rate-limited separately by some vendors, with output capacity typically a fraction of input because generation is computationally more expensive than reading. Anthropic’s ladder runs input limits at five to ten times output limits at every tier. If your workload is output-heavy, the binding constraint is the output ceiling.

Prompt caching is the architectural lever that buys headroom without a tier upgrade. Cached input tokens count against rate limits at a fraction of the standard rate on many vendors, so a workload reusing the same system prompt across thousands of calls can sit comfortably on a lower tier than the equivalent uncached workload. For volume-heavy use cases this is often a bigger lever than the tier choice itself.

Inference cost is the broader frame around this. Rate limits exist because inference is finite. The vendors raising tier ceilings are doing so because they are bringing more compute online. The 2026 landscape is more generous than 2024, but tiers still exist and a careful team treats them as a first-class architectural concern.

If you want to talk through where your own usage sits and what to bake into production code before the next load spike, book a conversation.

Sources

OpenAI (2026). API rate limits guide: the five-tier spend-based ladder, including per-model RPM and TPM. https://developers.openai.com/api/docs/guides/rate-limits
Anthropic (2026). Higher limits announcement following the SpaceX Colossus 1 compute partnership, raising Claude tier ceilings significantly. https://www.anthropic.com/news/higher-limits-spacex
Google Cloud (2026). Vertex AI generative AI quotas: regional and per-model limits, with the explicit quota-request workflow. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/quotas
AWS (2026). Bedrock service quotas: per-model and per-region defaults, plus the provisioned throughput option for predictable workloads. https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html
Microsoft (2026). Azure OpenAI Service quotas and limits: split by deployment type, with the portal-based quota increase request flow. https://learn.microsoft.com/en-us/azure/foundry/openai/quotas-limits
Mozilla Developer Network (2026). HTTP 429 Too Many Requests: reference documentation including Retry-After header behaviour. https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/429
Anthropic (2026). April 23 postmortem on the Claude Code reasoning regression: a worked example of vendor-side changes draining rate limits unexpectedly. https://www.anthropic.com/engineering/april-23-postmortem
OpenAI (2026). Batch API documentation: the 50% discount alternative for non-real-time workloads, with separate, higher rate limits. https://developers.openai.com/api/docs/guides/batch
Atlassian Developer (2026). Rate limiting and retries guidance for app developers, including the exponential backoff with jitter pattern. https://developer.atlassian.com/platform/app-migration/rate-limiting-and-retries/
Telnyx (2026). Rate limit headers explainer: the x-ratelimit-* header set used by major vendors. https://telnyx.com/resources/rate-limit-headers

Frequently asked questions

How do I find out which rate limit my account is on right now?

Every major vendor surfaces this in two places. The first is the developer console: OpenAI shows your tier on the limits page, Anthropic on the workspace settings, Google through Vertex AI quotas, AWS through the Bedrock service quotas console. The second is the response headers on every API call, fields like x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens. If your team has not surfaced these in a dashboard, that is the first hour of work to do.

What does HTTP 429 actually look like in production, and how should the code respond?

429 is the status code returned when you exceed a limit, usually with a Retry-After header saying how long to wait. Production code should respond with exponential backoff and jitter: wait roughly one second on the first retry, two to three on the second, four to six on the third, with random jitter so a hundred clients hitting the limit at once do not all retry at the same instant. After three to five attempts, the request should fail over to a manual queue rather than retry forever. Treating 429 as a normal flow-control signal rather than an error is the architectural shift.
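A minimal sketch of that pattern, assuming a caller that exposes the status code and any Retry-After value. The api_call shape here is hypothetical, not a real SDK interface:

```python
import random
import time

def call_with_backoff(api_call, max_attempts=5, base_delay=1.0):
    """Retry on HTTP 429 with exponential backoff and full jitter.
    api_call() returns (status_code, retry_after_seconds_or_None, body);
    that shape is an assumption for this sketch, not a real SDK interface."""
    for attempt in range(max_attempts):
        status, retry_after, body = api_call()
        if status != 429:
            return body
        # Honour Retry-After when the vendor sends it; otherwise double
        # the delay each attempt, then add jitter so a crowd of throttled
        # clients does not retry in lockstep.
        delay = retry_after if retry_after is not None else base_delay * (2 ** attempt)
        time.sleep(random.uniform(0, delay))
    # Out of attempts: surface the failure for manual handling.
    raise RuntimeError("rate-limited after retries: route to a manual queue")
```

Full jitter (a uniform draw between zero and the backoff ceiling) is one common variant; the point is that retries spread out rather than synchronise.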

Should I just buy the highest tier to avoid the problem?

Usually no. Higher tiers are not always more expensive per call, but they often require cumulative spend history that a new account does not have, and the headroom is wasted if your real load sits in a lower tier. The sensible sequence is to model peak load, add 20 to 30 percent for growth and surprises, pick the tier that covers it, and build retry logic anyway because vendor changes and unexpected load spikes happen regardless of tier.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
