A 50-staff specialist consultancy sat through a board meeting in May 2026 with one item on the agenda: a three-year AI spend forecast through 2029. The COO had built the model by taking last year's invoice and applying a 15% annual increase, the way the firm forecasts its Microsoft 365 line. The CFO did not believe the number. The OpenAI bill had jumped from £4,800 a month to £11,200 in nine months, the firm had switched some workflows onto Claude Opus, and the consulting team was now using o3 for complex client analysis at roughly thirty times the per-token cost of GPT-5.
The owner could not explain to the board why the prices kept moving. The honest answer is that the firm is buying frontier capability in a market where the cost of that capability is set by physics, not by SaaS competition. The shape of that physics has a name. Scaling laws.
What are scaling laws?
Scaling laws are empirical equations describing how AI model capability improves as you give the model more training compute, more parameters, or more training data. They are power-law relationships, which means improvements arrive predictably but with diminishing returns. The Kaplan paper from OpenAI established the original curves in 2020. The Chinchilla paper from DeepMind refined them in 2022 with the rule that tokens and parameters should scale together, at roughly 20 training tokens per parameter.
What this means for you is that the cost of frontier capability has a floor. There is no clever workaround that gets GPT-5-level reasoning at GPT-3-level prices. Earlier generations, including GPT-3, were undertrained relative to the Chinchilla ratio at around 1.7 tokens per parameter, which is one of the reasons later models built on similar compute budgets pulled ahead so sharply. From 2023 onwards, most frontier developers have rebalanced. The curves you are paying for are well-measured, and the industry has spent billions confirming they hold.
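The arithmetic behind that undertraining claim is simple enough to check on the back of an envelope. A minimal sketch, using the roughly 20-to-1 Chinchilla ratio and the 1.7 figure from the text above; GPT-3's published parameter count is real, but the training-token figure here is the commonly cited approximation, used only to illustrate the ratio:

```python
# Chinchilla-style compute allocation, per the ratios quoted in the text.
CHINCHILLA_RATIO = 20  # training tokens per parameter (DeepMind, 2022)

def optimal_tokens(params: float) -> float:
    """Training tokens a model of `params` parameters should see under Chinchilla."""
    return params * CHINCHILLA_RATIO

gpt3_params = 175e9   # GPT-3's published parameter count
gpt3_tokens = 300e9   # commonly cited approximate training-token count

# Actual ratio GPT-3 was trained at: ~1.7 tokens per parameter
print(gpt3_tokens / gpt3_params)

# What Chinchilla would prescribe for the same parameter count, in trillions
print(optimal_tokens(gpt3_params) / 1e12)
```

The gap between 1.7 and 20 is why later models built on similar compute budgets, but fed an order of magnitude more data per parameter, pulled ahead so sharply.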
The practical posture for an owner is to treat the curves as the operating physics of the market, not as a research footnote. You do not need the equations. You need the picture: more compute buys more capability, predictably, at increasing absolute cost. Every pricing decision a vendor makes sits on that picture.
Why does your AI bill behave the way it does?
Your bill behaves the way it does because three different scaling axes now sit underneath it, and each one prices differently. Pretraining scaling sets the cost of the base model. Post-training scaling, the fine-tuning and alignment work on top, makes mid-tier models more capable per pound. Test-time scaling, introduced commercially with OpenAI’s o1 in late 2024, lets a model spend extra inference compute reasoning before it answers, at proportionally extra cost.
Three direct consequences for an SME. First, frontier prices are not falling smoothly. Compute capacity is constrained, and OpenAI now offers GPT-5.5 at four price points for the same model: Priority at 2.5 times the standard rate, then Standard, Flex and Batch at successively lower rates. Second, reasoning models add a per-task multiplier that compounds inside agent loops. Tianpan Co's analysis shows a single query that costs 7 tokens with a fast model can cost 603 tokens with an aggressively configured reasoning model, and one agent task often runs twelve sequential calls. Third, mid-tier models keep closing the gap. Claude Haiku 4.5 reaches Claude Sonnet 4-level coding performance at roughly a third of the cost. Read vendor announcements through this lens and the moves stop looking arbitrary.
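To see how the second consequence compounds, here is a back-of-envelope sketch using the 7-versus-603 token figures and the twelve-call agent loop from the paragraph above. The per-million-token price is a hypothetical placeholder for scale only, not any vendor's actual rate:

```python
# Reasoning-mode multiplier inside an agent loop (illustrative figures).
fast_tokens_per_call = 7         # fast model, per the cited analysis
reasoning_tokens_per_call = 603  # aggressively configured reasoning model
calls_per_agent_task = 12        # sequential calls in one typical agent task
price_per_million = 10.0         # hypothetical £ per million tokens, for scale

def task_cost(tokens_per_call: int) -> float:
    """Cost of one agent task at the assumed per-token rate."""
    return tokens_per_call * calls_per_agent_task * price_per_million / 1e6

fast = task_cost(fast_tokens_per_call)
reasoning = task_cost(reasoning_tokens_per_call)
print(f"fast model:      £{fast:.5f} per task")
print(f"reasoning model: £{reasoning:.5f} per task")
print(f"multiplier:      {reasoning / fast:.0f}x")
```

The multiplier, roughly 86 times, is per task before any retries. At a few tasks it is noise; at production volumes it is the line the CFO queries.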
Where will you actually meet scaling laws in practice?
You meet them every time you choose a model, read a vendor announcement, or design an automation. The moment you pick GPT-5.5 over Haiku 4.5 you are buying a position on the scaling curve. The moment you switch a workflow onto o3 you are activating the test-time-compute axis. None of those decisions feel technical, but the cost shape they create lands directly in your invoice.
You also meet scaling laws when announcements mention the data wall, synthetic data, or the EU AI Act’s 10^25 FLOP threshold. Frontier labs have already trained on a meaningful share of high-quality web text, which is why pretraining scaling slowed and the industry pivoted to test-time compute. The EU regulatory threshold uses training compute as a proxy for capability, which is a regulator quietly acknowledging that scaling laws hold. You are not running into research, you are running into the operating environment of the market.
A third place owners meet scaling laws is in the gap between a vendor demo and the bill three months later. The demo runs on the highest tier because that produces the cleanest output. Production traffic at frontier rates compounds quickly, especially inside agent loops. Treat any demo run on Opus or o3 as the upper bound, then ask the vendor to rerun the same task on the next tier down before you sign anything.
When should you upgrade to frontier, stay on mid-tier, or fine-tune?
Match the model tier to the task on cost-per-task-completed, not cost-per-token. For high-value knowledge work where a 30-minute saving justifies a £30 token bill, frontier reasoning models are usually the right answer. Legal analysis, complex code generation, regulated compliance research, and R&D-intensive design optimisation all qualify. The capability premium translates directly into protected revenue or saved partner time, and the maths works at almost any frontier price.
For repetitive work at scale (100,000 chatbot queries a month, a million document classifications, route optimisation across 10,000 deliveries a day), mid-tier or fine-tuned smaller models will finish the work at around 10% of the cost with negligible quality loss. For real-time interaction (voice support, live chat, trading), latency rules out reasoning models entirely. Run a quarterly benchmark on three real queries from your business across tiers. The cheap mistake is paying frontier prices for templated work. The expensive mistake is asking a too-small model to do hard reasoning.
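Cost-per-task-completed can be made concrete with a small sketch. Every figure here is a hypothetical placeholder, not a real price list; the point is the shape of the calculation, which you would repopulate from your own quarterly benchmark:

```python
# Compare tiers on cost per *completed* task, not cost per token.
# All prices, token counts, and completion rates are illustrative assumptions.
tiers = {
    # tier name: (£ per million tokens, tokens per task, completion rate)
    "frontier-reasoning": (60.0, 9000, 0.98),
    "mid-tier":           (3.0,  1200, 0.92),
    "fine-tuned-small":   (0.4,  1200, 0.85),
}

def cost_per_completed_task(price_per_m: float, tokens: int, rate: float) -> float:
    attempt_cost = price_per_m * tokens / 1e6  # cost of one attempt
    return attempt_cost / rate                 # spread over successful completions

for name, (price, tokens, rate) in tiers.items():
    print(f"{name:20s} £{cost_per_completed_task(price, tokens, rate):.4f} per completed task")
```

On templated work the frontier tier can come out two orders of magnitude more expensive per completed task, which is exactly the cheap mistake named above. Flip the completion rates for genuinely hard reasoning and the ranking inverts.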
Related concepts
A frontier model is one of the small set of leading-edge systems sitting at the top of the scaling curve. Inference cost is what you pay every time a model answers, and TCO for AI is the full year-three view including infrastructure, integration, and people. The three together cover the spend story end to end.
Chain-of-thought prompting is the technique that test-time compute scaled into a separate axis. Hybrid AI pricing is the procurement architecture that makes scaling laws workable in practice, mixing tiers across workflows so you pay frontier prices only where they earn their keep. The vocabulary on this page sits underneath those decisions. The next vendor announcement that mentions training compute, tokenizer efficiency, or reasoning-mode pricing will read differently with the curves in mind.



