What are scaling laws? Why your AI bill behaves the way it does

TL;DR

Scaling laws are the empirical equations describing how AI capability rises as you spend more on training compute, parameters, training data, and now inference reasoning. They set the floor for what frontier capability costs. Frontier compute has grown roughly 4 to 5 times per year for the past eight years, reasoning models add a per-task premium, and mid-tier models trained on Chinchilla-aligned ratios keep closing the gap. Owners who match model tier to task on cost-per-task-completed avoid paying frontier prices for routine work.

Key takeaways

- Scaling laws are empirical power-law relationships between AI capability and the compute, parameters, and data used to train a model. They set the price floor for frontier capability.
- The Kaplan paper from OpenAI in 2020 set the original curves. The Chinchilla paper from DeepMind in 2022 added the rule that tokens and parameters should scale roughly equally, around 20 tokens per parameter.
- From late 2024 a third axis arrived. Reasoning models like o1 and o3 spend extra compute at inference to think before answering. That premium can be more than four times the per-token rate of the base model.
- Frontier model bills oscillate because compute capacity is constrained, vendors are introducing tiered pricing on the same model, and reasoning calls inside agent loops compound fast.
- The procurement rule is to benchmark cost per task completed, not cost per token, and to design the application so swapping a model is a configuration change rather than a rebuild.

A 50-staff specialist consultancy sat through a board meeting in May 2026 with one item on the agenda: a three-year AI spend forecast through 2029. The COO had built a model on top of last year’s invoice with a 15% annual increase, the way the firm forecasts its Microsoft 365 line. The CFO did not believe the number. The OpenAI bill had jumped from £4,800 a month to £11,200 in nine months, the firm had switched some workflows onto Claude Opus, and the consulting team was now using o3 for complex client analysis at roughly thirty times the per-token cost of GPT-5.

The owner could not explain to the board why the prices kept moving. The honest answer is that the firm is buying frontier capability on a market where the cost of that capability is set by physics, not by SaaS competition. The shape of that physics has a name. Scaling laws.

What are scaling laws?

Scaling laws are empirical equations describing how AI model capability improves as you give the model more training compute, more parameters, or more training data. They are power-law relationships, which means improvements arrive predictably but with diminishing returns. The Kaplan paper from OpenAI established the original curves in 2020. The Chinchilla paper from DeepMind refined them in 2022 with the rule that tokens and parameters should scale together at around 20 to 1.

What this means for you is that the cost of frontier capability has a floor. There is no clever workaround that gets GPT-5-level reasoning at GPT-3-level prices. Earlier generations, including GPT-3, were undertrained relative to the Chinchilla ratio at around 1.7 tokens per parameter, which is one of the reasons later models built on similar compute budgets pulled ahead so sharply. From 2023 onwards, most frontier developers have rebalanced. The curves you are paying for are well-measured, and the industry has spent billions confirming they hold.
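The Chinchilla ratio and the GPT-3 comparison above reduce to one multiplication. A minimal sketch, using only the figures already quoted (20 tokens per parameter, GPT-3's roughly 175 billion parameters and 300 billion training tokens):

```python
# Illustrative sketch of the Chinchilla rule of thumb: roughly 20 training
# tokens per model parameter for compute-optimal training.
CHINCHILLA_TOKENS_PER_PARAM = 20

def optimal_tokens(params: float) -> float:
    """Compute-optimal training-token budget for a given parameter count."""
    return params * CHINCHILLA_TOKENS_PER_PARAM

# GPT-3: ~175bn parameters trained on ~300bn tokens,
# i.e. about 1.7 tokens per parameter -- far below the Chinchilla ratio.
gpt3_params = 175e9
gpt3_tokens = 300e9
print(f"GPT-3 actual ratio: {gpt3_tokens / gpt3_params:.1f} tokens/param")
print(f"Chinchilla-optimal for 175bn params: "
      f"{optimal_tokens(gpt3_params) / 1e12:.1f} trillion tokens")
```

By this arithmetic GPT-3 would have wanted around 3.5 trillion training tokens, more than ten times what it got, which is the undertraining the paragraph above describes.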

The practical posture for an owner is to treat the curves as the operating physics of the market, not as a research footnote. You do not need the equations. You need the picture: more compute buys more capability, predictably, at increasing absolute cost. Every pricing decision a vendor makes sits on that picture.

Why does your AI bill behave the way it does?

Your bill behaves the way it does because three different scaling axes now sit underneath it, and each one prices differently. Pretraining scaling sets the cost of the base model. Post-training scaling, the fine-tuning and alignment work on top, makes mid-tier models more capable per pound. Test-time scaling, introduced commercially with OpenAI’s o1 in late 2024, lets a model spend extra inference compute reasoning before it answers, at proportionally extra cost.

Three direct consequences for an SME. First, frontier prices are not falling smoothly. Compute capacity is constrained, and OpenAI now offers GPT-5.5 at four price points on the same model: Priority at 2.5 times standard, then Standard, Flex, and Batch at lower rates. Second, reasoning models add a per-task multiplier that compounds inside agent loops. Tianpan Co's analysis shows a single query that costs 7 tokens with a fast model can cost 603 tokens with an aggressively configured reasoning model, and one agent task often runs twelve sequential calls. Third, mid-tier models keep closing the gap. Claude Haiku 4.5 reaches Claude Sonnet 4-level coding performance at roughly a third of the cost. Read vendor announcements through this lens and the moves stop looking arbitrary.
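The compounding in the second point is simple arithmetic, and worth writing down once. The token counts below come from the figures just quoted; the twelve-call loop is the typical agent task mentioned above, and everything is illustrative rather than a benchmark of any specific product:

```python
# Back-of-envelope compounding of reasoning-model cost inside an agent loop.
# Per-query token counts are the figures cited in the text; treat them as
# illustrative, not as a measurement of any particular vendor's models.
FAST_TOKENS_PER_QUERY = 7
REASONING_TOKENS_PER_QUERY = 603
CALLS_PER_AGENT_TASK = 12

def tokens_per_task(tokens_per_query: int,
                    calls: int = CALLS_PER_AGENT_TASK) -> int:
    """Total tokens one agent task burns across its sequential calls."""
    return tokens_per_query * calls

fast = tokens_per_task(FAST_TOKENS_PER_QUERY)            # 84 tokens/task
reasoning = tokens_per_task(REASONING_TOKENS_PER_QUERY)  # 7,236 tokens/task
print(f"fast: {fast} tokens/task, reasoning: {reasoning} tokens/task")
print(f"multiplier: {reasoning / fast:.0f}x before any per-token price premium")
```

The token multiplier alone is around 86x, and that is before multiplying by the reasoning model's higher per-token rate. This is why a workflow that looked cheap in a single-query test can dominate the invoice once it runs as an agent.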

Where will you actually meet scaling laws in practice?

You meet them every time you choose a model, read a vendor announcement, or design an automation. The moment you pick GPT-5.5 over Haiku 4.5 you are buying a position on the scaling curve. The moment you switch a workflow onto o3 you are activating the test-time-compute axis. None of those decisions feel technical, but the cost shape they create lands directly in your invoice.
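The "swapping a model is a configuration change" posture from the key takeaways can be sketched in a few lines: the model identifier lives in configuration, never in application code. The workflow names, model labels, and `call_llm` helper below are hypothetical placeholders, not any vendor's SDK:

```python
# Sketch: keep the model choice per workflow in configuration so a tier
# change is a one-line edit, not a rebuild. All names are illustrative.
import os

MODEL_BY_WORKFLOW = {
    "client_analysis": os.getenv("MODEL_CLIENT_ANALYSIS", "frontier-reasoning"),
    "email_triage":    os.getenv("MODEL_EMAIL_TRIAGE", "mid-tier-fast"),
}

def call_llm(workflow: str, prompt: str) -> str:
    """Resolve the model from config, then call the vendor API (stubbed here)."""
    model = MODEL_BY_WORKFLOW[workflow]
    # ... real vendor API call would go here; stubbed for the sketch ...
    return f"[{model}] {prompt[:40]}"

print(call_llm("email_triage", "Summarise this inbox thread"))
```

With this shape, the quarterly benchmark recommended later in the piece becomes an environment-variable change rather than an engineering project.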

You also meet scaling laws when announcements mention the data wall, synthetic data, or the EU AI Act’s 10^25 FLOP threshold. Frontier labs have already trained on a meaningful share of high-quality web text, which is why pretraining scaling slowed and the industry pivoted to test-time compute. The EU regulatory threshold uses training compute as a proxy for capability, which is a regulator quietly acknowledging that scaling laws hold. You are not running into research, you are running into the operating environment of the market.

A third place owners meet scaling laws is in the gap between a vendor demo and the bill three months later. The demo runs on the highest tier because that produces the cleanest output. Production traffic at frontier rates compounds quickly, especially inside agent loops. Treat any demo run on Opus or o3 as the upper bound, then ask the vendor to rerun the same task on the next tier down before you sign anything.

When should you upgrade to frontier, stay on mid-tier, or fine-tune?

Match the model tier to the task on cost-per-task-completed, not cost-per-token. For high-value knowledge work where a 30-minute saving justifies a £30 token bill, frontier reasoning models are usually the right answer. Legal analysis, complex code generation, regulated compliance research, and R&D-intensive design optimisation all qualify. The capability premium translates directly into protected revenue or saved partner time, and the maths works at almost any frontier price.

For repetitive work at scale (100,000 chatbot queries a month, a million document classifications, route optimisation across 10,000 deliveries a day), mid-tier or fine-tuned smaller models will finish the work at around 10% of the cost with negligible quality loss. For real-time interaction (voice support, live chat, trading), latency rules out reasoning models entirely. Run a quarterly benchmark on three real queries from your business across tiers. The cheap mistake is paying frontier prices for templated work. The expensive mistake is asking a too-small model to do hard reasoning.
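Cost-per-task-completed is a single formula: what one attempt costs, divided by the fraction of attempts that actually succeed, since failed runs still burn tokens. A minimal sketch, where every price, token count, and completion rate is a hypothetical placeholder for your own benchmark numbers:

```python
# Quarterly benchmark sketch: rank tiers on cost per task COMPLETED,
# not cost per token. All inputs below are illustrative placeholders.
def cost_per_task_completed(price_per_m_tokens: float,
                            tokens_per_task: int,
                            completion_rate: float) -> float:
    """Cost of one successfully completed task; failures still burn tokens."""
    cost_per_attempt = price_per_m_tokens * tokens_per_task / 1_000_000
    return cost_per_attempt / completion_rate

tiers = {
    "frontier-reasoning": cost_per_task_completed(60.0, 40_000, 0.95),
    "mid-tier":           cost_per_task_completed(3.0, 12_000, 0.85),
    "small-fine-tuned":   cost_per_task_completed(0.8, 10_000, 0.60),
}
for name, cost in sorted(tiers.items(), key=lambda kv: kv[1]):
    print(f"{name}: £{cost:.4f} per completed task")
```

Note how the ranking can flip: a cheap model with a poor completion rate on hard tasks can cost more per completed task than a mid-tier model that finishes reliably, which is exactly the "too-small model doing hard reasoning" mistake.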

A frontier model is one of the small set of leading-edge systems sitting at the top of the scaling curve. Inference cost is what you pay every time a model answers, and TCO for AI is the full year-three view including infrastructure, integration, and people. Together, the three cover the spend story end to end.

Chain-of-thought prompting is the technique that test-time compute scaling turned into a separate pricing axis. Hybrid AI pricing is the procurement architecture that makes scaling laws workable in practice, mixing tiers across workflows so you pay frontier prices only where they earn their keep. The vocabulary on this page sits underneath those decisions. The next vendor announcement that mentions training compute, tokenizer efficiency, or reasoning-mode pricing will read differently with the curves in mind.

Sources

- Kaplan et al. (OpenAI, 2020). Scaling Laws for Neural Language Models. The foundational paper establishing the original power-law relationship between compute, parameters, and data. https://arxiv.org/abs/2001.08361
- Hoffmann et al. (DeepMind, 2022). Training Compute-Optimal Large Language Models, the Chinchilla paper. The 20-tokens-per-parameter finding that reset compute allocation across the industry. https://arxiv.org/abs/2203.15556
- NVIDIA (2025). How Scaling Laws Drive Smarter, More Powerful AI. Plain-English summary of the three scaling axes: pretraining, post-training, and test-time. https://blogs.nvidia.com/blog/ai-scaling-laws/
- OpenAI (2024). OpenAI o1 System Card. The canonical description of test-time compute scaling and chain-of-thought reasoning at inference. https://openai.com/index/openai-o1-system-card/
- Epoch AI (2024). Training compute of frontier AI models grows by 4-5x per year. The headline empirical finding on frontier compute growth. https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
- Anthropic (2026). Introducing Claude Haiku 4.5. The worked example of post-training scaling delivering Sonnet-level performance at roughly a third of the cost. https://www.anthropic.com/news/claude-haiku-4-5
- Tianpan Co. (2026). The Reasoning Model Premium in Agent Loops. Cost-compounding analysis when reasoning models are called inside agent workflows. https://tianpan.co/blog/2026-04-10-reasoning-model-premium-agent-loops
- Pure Storage (2026). Frontier AI is getting more expensive, and worse, less predictable. The 2026 pricing-volatility narrative including OpenAI's tiered pricing on the same model. https://blog.purestorage.com/perspectives/frontier-ai-is-getting-more-expensive/
- Interconnects AI (2025). The data wall and synthetic data. Why pretraining scaling slowed and labs pivoted to test-time compute. https://www.interconnects.ai/p/the-data-wall
- EU AI Act (2024). Article 51, classification of general-purpose AI models with systemic risk at the 10^25 FLOP training-compute threshold. https://artificialintelligenceact.eu/article/51/

Frequently asked questions

Do I actually need to read the Kaplan and Chinchilla papers?

No. The two findings you need are simple enough to carry in a board meeting. Capability rises predictably with training compute, parameters, and data. Tokens and parameters should be balanced, roughly 20 to 1. Anything beyond that is detail your model providers handle. What you do with the framing is the procurement decision: which tier of model to buy for which type of task in your business.

Will AI prices keep falling like cloud computing did?

They fall on the cheap axis and rise on the expensive one. Mid-tier model prices have dropped sharply over the past two years and will keep falling. Frontier model prices are oscillating because compute capacity is the binding constraint, and reasoning-mode pricing is structurally expensive because each query buys real inference compute. Forecast on the assumption that the price of leading-edge capability stays high.

How should I forecast AI spend through 2028 without guessing?

Build the forecast around cost per task completed for each workflow, not a flat percentage uplift on last year. Assume frontier compute keeps growing 4 to 5 times per year, mid-tier prices keep falling, and reasoning-mode costs stay elevated. Run a quarterly benchmark across tiers on real queries from your business. That discipline will move the forecast within tolerance for three years out.
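The per-workflow discipline above can be sketched as a tiny compound-trend calculation: tasks per month times cost per completed task, with each tier on its own price trajectory instead of one flat uplift. All numbers below (volumes, per-task costs, annual price changes) are illustrative assumptions, not benchmarks:

```python
# Workflow-level forecast sketch: each tier compounds its own price trend
# rather than applying one flat uplift. All inputs are illustrative.
def three_year_forecast(tasks_per_month: int,
                        cost_per_task: float,
                        annual_price_change: float) -> list[float]:
    """Annual spend for years 1-3, compounding the tier's price trajectory."""
    return [tasks_per_month * 12 * cost_per_task * (1 + annual_price_change) ** y
            for y in range(3)]

# Mid-tier workflow on falling prices vs a frontier workflow on sticky prices
print(three_year_forecast(2_000, 0.04, -0.30))  # mid-tier, assumed -30%/yr
print(three_year_forecast(150, 2.50, 0.05))     # frontier, assumed +5%/yr
```

Splitting the forecast this way surfaces the pattern the FAQ describes: the cheap axis keeps getting cheaper while the frontier line stays stubbornly high, which a single percentage uplift on last year's invoice can never show.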

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
