Reducing Risk in Token Cost and Performance in AI Systems

Managing system stability and business cost. Model providers measure usage in tokens; this strategy covers how to measure and control both cost and system performance.

Dec 2025 12 min

Two messages on launch day

Two messages land on the launch-day Slack channel. The first one is “the system is down, users are getting errors.” The second one, usually a few days later, is “we need to talk about the bill.” Both are token problems. They look like different problems because one is measured in HTTP errors and the other is measured in invoices, but the underlying cause is the same. Token usage went somewhere the team had not modelled.

This article is about modelling it before launch day, and observing it after.

What it is

Model providers have specific token rate limits and costs, and both are tied to the same underlying unit. Token usage drives both system stability and business cost. In 2026, those limits are increasingly tiered based on payment history and usage reputation, which means the limits a new account starts with on day one are often lower than the limits the system needs.

Why it matters

Model providers do not measure usage in compute. They measure it in tokens. Every transaction counts, on both the input and output sides, with different limits and costs depending on which type of token is involved. Get the modelling wrong and the system becomes unstable, expensive, or both. This article covers how to measure and control both.

Strategy summary

The approach splits into system stability and token expenditure. The strategy applies to both, and the design principles below sit above both.

Design and safety first

Pre-release

Post-release

Examples of rate limits

Tier-based rate limit expansion

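Rate limit increases are rarely just a matter of paying more. Most providers gate the higher tiers behind specific account milestones, typically a combination of cumulative spend and time elapsed since the first successful payment.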

Rate limit tiers

Tier | Qualification | Usage limits (approx)
Free | User must be in an allowed geography | $100 / month
Tier 1 | $5 minimum deposit | $100 / month
Tier 2 | $50 paid + 7 days since first successful payment | $500 / month
Tier 3 | $100 paid + 7 days since first successful payment | $1,000 / month
Tier 4 | $250 paid + 14 days since first successful payment | $5,000 / month
Tier 5 | $1,000 paid + 30 days since first successful payment | $200,000 / month

Model-specific limits

The table below shows a relatively simple example. Real rate limiting on production services often gets considerably more involved.

Model-specific rate limits

Model | Token limits (TPM) | Request limits (RPM) | Batch queue limits
gpt-5.1 | 500,000 | 500 | 900,000 TPD
gpt-5-mini | 500,000 | 500 | 5,000,000 TPD
gpt-4.1-mini | 200,000 | 500 | 2,000,000 TPD

Example calculation: If gpt-4.1-mini is used and the average prompt input plus output is 10,000 tokens per request, that allows 20 requests per minute before hitting the TPM limit. If the average request is 30 tokens, the TPM limit allows over 6,000 requests per minute, but the 500 RPM limit becomes the binding constraint instead.
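The same arithmetic can be written as a small helper. This is a minimal sketch: the `RateLimits` class and `max_requests_per_minute` function are illustrative names, not part of any provider SDK, and the figures are the gpt-4.1-mini values from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateLimits:
    tokens_per_minute: int    # TPM
    requests_per_minute: int  # RPM


def max_requests_per_minute(limits: RateLimits, avg_tokens_per_request: int) -> tuple[int, str]:
    """Return the sustainable request rate and the name of the limit that binds first."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # How many average-sized requests fit inside the token budget.
    by_tokens = limits.tokens_per_minute // avg_tokens_per_request
    if by_tokens < limits.requests_per_minute:
        return by_tokens, "TPM"
    return limits.requests_per_minute, "RPM"


# Figures from the gpt-4.1-mini row of the table above.
GPT_41_MINI = RateLimits(tokens_per_minute=200_000, requests_per_minute=500)

print(max_requests_per_minute(GPT_41_MINI, 10_000))  # (20, 'TPM')
print(max_requests_per_minute(GPT_41_MINI, 30))      # (500, 'RPM')
```

Running it against the real average request size from staging, rather than a guess, shows which of the two limits needs attention first.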

Examples of cost (per 1 million tokens)

Token costs by model

Model | Input cost | Cached input | Output cost
gpt-4.1 | $2.00 | $1.00 | $8.00
gpt-4.1-mini | $0.40 | $0.20 | $1.60
gpt-4.1-nano | $0.10 | $0.05 | $0.40

The differences across models are significant. In this table, output tokens cost four times as much as input tokens, each step up in model size multiplies the price by four or five, and the gap between gpt-4.1-nano and gpt-4.1 is twentyfold.
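To make the price gap concrete, the sketch below prices one request on each model. The prices are copied from the example table above and will drift over time; `Pricing`, `PRICING`, and `request_cost` are illustrative names, and the 8,000-input / 2,000-output request is an assumed workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per 1 million tokens."""
    input: float
    cached_input: float
    output: float


# Copied from the example table above; check current provider pricing before relying on it.
PRICING = {
    "gpt-4.1": Pricing(2.00, 1.00, 8.00),
    "gpt-4.1-mini": Pricing(0.40, 0.20, 1.60),
    "gpt-4.1-nano": Pricing(0.10, 0.05, 0.40),
}


def request_cost(model: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Cost in USD of one request. cached_tokens is the cached portion of input_tokens."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    price = PRICING[model]
    uncached = input_tokens - cached_tokens
    total = (
        uncached * price.input
        + cached_tokens * price.cached_input
        + output_tokens * price.output
    )
    return total / 1_000_000


for model in PRICING:
    cost = request_cost(model, input_tokens=8_000, output_tokens=2_000)
    print(f"{model}: ${cost:.4f} per request")
# gpt-4.1: $0.0320 per request
# gpt-4.1-mini: $0.0064 per request
# gpt-4.1-nano: $0.0016 per request
```

Multiplying the per-request figure by expected daily volume gives a first-order budget before any traffic exists.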


Part 1: managing system stability

This half is about making sure the application does not fail under load.

Account for all rate limits

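If a foundation model is being used, rate limits typically apply to both input and output tokens, and they are split along several dimensions, such as tokens per minute (TPM), requests per minute (RPM), and tokens per day (TPD) for batch queues.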

All of them have to be accounted for. When a rate limit is hit, most providers start rejecting requests immediately. If the TPM limit is hit in the first 20 seconds of a minute, every request in the remaining 40 seconds is rejected until the window resets.

The binding constraint is whichever limit gets hit first. A flood of small requests can hit the RPM limit while the TPM is at 5%. A handful of very large requests can hit the TPM with only a few calls per minute. Both happen in production, often within the same week.
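One way to see both limits at once is to track them client-side before a call is sent. The sketch below is an assumption-laden illustration: `DualRateLimiter` is a hypothetical class, it uses a rolling 60-second window, and real providers may count windows differently, so treat it as an early-warning layer rather than a replica of the provider's accounting.

```python
import time
from collections import deque


class DualRateLimiter:
    """Tracks requests and tokens over a rolling window and enforces both limits."""

    def __init__(self, tokens_per_minute, requests_per_minute,
                 window_seconds=60.0, clock=time.monotonic):
        self.tpm = tokens_per_minute
        self.rpm = requests_per_minute
        self.window = window_seconds
        self.clock = clock            # injectable so the limiter can be tested
        self._events = deque()        # (timestamp, tokens) per admitted request
        self._tokens_in_window = 0

    def _evict_expired(self, now):
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, estimated_tokens: int) -> bool:
        """Admit the request only if neither RPM nor TPM would be exceeded."""
        now = self.clock()
        self._evict_expired(now)
        if len(self._events) + 1 > self.rpm:
            return False
        if self._tokens_in_window + estimated_tokens > self.tpm:
            return False
        self._events.append((now, estimated_tokens))
        self._tokens_in_window += estimated_tokens
        return True

    def utilisation(self) -> dict:
        """Fraction of each limit currently in use, for dashboards and alerts."""
        self._evict_expired(self.clock())
        return {
            "rpm": len(self._events) / self.rpm,
            "tpm": self._tokens_in_window / self.tpm,
        }


limiter = DualRateLimiter(tokens_per_minute=200_000, requests_per_minute=500)
if limiter.try_acquire(estimated_tokens=10_000):
    pass  # safe to call the provider here
print(limiter.utilisation())
```

Exposing `utilisation()` as a metric makes the "RPM at 100%, TPM at 5%" situation visible before the provider starts rejecting calls.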

The type of call matters too. Embedding a large document into a vector database will spike token consumption, which can either push the embedding endpoint over its own limit or starve other calls that share it.

Evaluate consumption

This is the test that catches the worst surprises. What happens when a user pastes a 100-page document into the input?
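A cheap guard in front of the model call answers that question before the provider does. The sketch below assumes roughly 4 characters per token, which is only a rough heuristic for English text and not a real tokenizer; `guard_input`, `InputTooLargeError`, and the 20,000-token cap are hypothetical.

```python
import math

# Rough heuristic for English text. This is an assumption, not a tokenizer;
# use the provider's tokenizer where exact counts matter.
CHARS_PER_TOKEN = 4


class InputTooLargeError(ValueError):
    """Raised when an input is estimated to exceed the per-request token cap."""


def estimate_tokens(text: str) -> int:
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def guard_input(text: str, max_input_tokens: int) -> int:
    """Return the estimated token count, or raise before the request is sent."""
    estimated = estimate_tokens(text)
    if estimated > max_input_tokens:
        raise InputTooLargeError(
            f"input is about {estimated:,} tokens; the cap is {max_input_tokens:,}"
        )
    return estimated


# Simulate a 100-page paste, assuming about 3,000 characters per page.
document = ("x" * 3_000) * 100

try:
    guard_input(document, max_input_tokens=20_000)
except InputTooLargeError as err:
    print(err)  # input is about 75,000 tokens; the cap is 20,000
```

Whether the application rejects, truncates, or chunks the oversized input is a product decision; the point is that the decision is made deliberately instead of by the rate limiter.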

Monitor pre-release and post-release usage

Track token consumption in staging to forecast production needs.

Example scenario: “A test with 10 virtual users in staging hit a 100,000 TPM limit. The system cannot scale to the 1,000 users expected at launch without a rate limit increase or an architectural change.”

That kind of finding only works if pre-release testing exercises the system at representative volume. Token consumption per user is the unit. Multiply it out and compare to the limit before launch, not after.
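The multiplication in the staging scenario above can be sketched as follows. It assumes consumption scales linearly with concurrent users, which is a simplification; `forecast_tpm` and `Forecast` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Forecast:
    tokens_per_user_per_minute: float
    required_tpm: float
    limit_multiple: float  # required_tpm divided by the current limit
    fits: bool


def forecast_tpm(observed_tpm: int, test_users: int,
                 expected_users: int, tpm_limit: int) -> Forecast:
    """Scale observed staging consumption to the expected user count (linear assumption)."""
    if test_users <= 0 or tpm_limit <= 0:
        raise ValueError("test_users and tpm_limit must be positive")
    per_user = observed_tpm / test_users
    required = per_user * expected_users
    return Forecast(per_user, required, required / tpm_limit, required <= tpm_limit)


# The staging scenario above: 10 virtual users consumed the full 100,000 TPM limit.
f = forecast_tpm(observed_tpm=100_000, test_users=10,
                 expected_users=1_000, tpm_limit=100_000)
print(f"{f.required_tpm:,.0f} TPM needed, {f.limit_multiple:.0f}x the current limit")
# 10,000,000 TPM needed, 100x the current limit
```

A result of 100x is a signal for an architectural change as well as a limit request, since few tier jumps cover that gap on their own.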

Plan for scalability

Do not assume rate limits can be expanded immediately. This is one of the most common production failures. A request to increase a rate limit can take days, sometimes weeks. It needs to be part of pre-release planning, not a recovery option.

Example scenario: “The launch was a success. After 500 users, the system hit its TPM limit and failed for all new users. The team had not requested a limit increase, and the system was down for 48 hours.”
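The waiting periods in the example tier table can be turned into a planning check. The sketch below uses only the thresholds from that table; `current_tier` and `latest_first_payment` are hypothetical helpers, the dates are made up, and any manual limit-increase request would add its own lead time on top.

```python
from datetime import date, timedelta

# (tier, minimum USD paid, minimum days since first successful payment),
# copied from the example tier table above. Highest tier first.
TIERS = [
    ("Tier 5", 1000, 30),
    ("Tier 4", 250, 14),
    ("Tier 3", 100, 7),
    ("Tier 2", 50, 7),
    ("Tier 1", 5, 0),
]


def current_tier(paid_usd: float, first_payment: date, today: date) -> str:
    """Highest tier whose spend AND waiting-period conditions are both met."""
    days_since_payment = (today - first_payment).days
    for name, min_paid, min_days in TIERS:
        if paid_usd >= min_paid and days_since_payment >= min_days:
            return name
    return "Free"


def latest_first_payment(tier: str, launch: date) -> date:
    """Latest first-payment date that still satisfies the tier's waiting period by launch."""
    for name, _, min_days in TIERS:
        if name == tier:
            return launch - timedelta(days=min_days)
    raise KeyError(tier)


launch = date(2026, 3, 1)  # hypothetical launch date

# Paying $1,200 nine days before launch does not unlock Tier 5.
print(current_tier(1200, first_payment=date(2026, 2, 20), today=launch))  # Tier 3

# To be on Tier 5 at launch, the first payment must land by this date.
print(latest_first_payment("Tier 5", launch))  # 2026-01-30
```

The first call shows why budget alone does not fix a late request: the time condition cannot be bought.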

Mitigation strategies


Part 2: managing business viability

The first half kept the system standing. This half keeps it affordable.

Account for all token types

The fundamentals are the same as the observability strategy laid out in Observability, with cost tracking layered on top.

Track token usage graphically

Pipe the token data into an observability platform such as Grafana or Datadog. The questions worth answering on a dashboard are:

Calculate cost per feature

Example scenario: “Cost tracking showed that the ‘auto-summarise’ feature, which was rarely used, was responsible for 40% of the total bill. It was running on GPT-4 for every new document, even when no user ever read the summary. The team moved it to on-demand and cut nearly 40% of the cost overnight.”

The same dashboard makes it possible to identify the lowest-value features quickly, which matters in two situations: long-term cost optimisation, and emergency throttling when usage spikes unexpectedly.
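Per-feature attribution only needs each model call to be tagged with the feature that triggered it. The sketch below assumes usage records are available as `(feature, cost_usd)` pairs; the record shape, the feature names, and the costs are all hypothetical.

```python
from collections import defaultdict


def cost_by_feature(records):
    """Aggregate (feature, cost_usd) records into per-feature rows, costliest first."""
    totals = defaultdict(float)
    calls = defaultdict(int)
    for feature, cost_usd in records:
        totals[feature] += cost_usd
        calls[feature] += 1
    grand_total = sum(totals.values())
    rows = [
        {
            "feature": feature,
            "calls": calls[feature],
            "cost_usd": total,
            "share": total / grand_total if grand_total else 0.0,
            "cost_per_call": total / calls[feature],
        }
        for feature, total in totals.items()
    ]
    return sorted(rows, key=lambda row: row["cost_usd"], reverse=True)


# Hypothetical usage records: one (feature, cost) pair per model call.
records = (
    [("auto-summarise", 0.20)] * 200   # few calls, expensive each
    + [("chat", 0.01)] * 3_500
    + [("search", 0.005)] * 5_000
)

for row in cost_by_feature(records):
    print(f"{row['feature']}: {row['share']:.0%} of spend over {row['calls']} calls")
# auto-summarise: 40% of spend over 200 calls
# chat: 35% of spend over 3500 calls
# search: 25% of spend over 5000 calls
```

Sorting by share next to call count is what surfaces the pattern in the scenario above: a feature with few calls and a large share of the bill is the first candidate for on-demand execution or throttling.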

Cost mitigations

Conclusion

Token management is not a single decision. It is an iterative process that combines safety design, rate limit planning, and cost observability. Done early, it keeps the system stable for users and sustainable for the business. Done after the fact, it means the team is making stability and cost decisions in the middle of an incident, with a Slack channel full of red dots and a bill that has already been spent.

Final thoughts

Back to the two messages on launch day. With this strategy in place, both messages still get sent occasionally, because no production system runs perfectly under all loads. But they get sent earlier, with numbers attached. The system-down message comes from a dashboard alert hitting 80% of TPM, not from users hitting 429s. The cost message comes from the daily spend chart trending upward, not from the monthly invoice. Both problems become forecasts instead of incidents, and that is the only meaningful difference between a system that scales and one that merely survives.