Reducing Risk in Token Cost and Performance in AI Systems

Managing system stability and business cost. Model providers measure usage in tokens; this strategy covers how to measure and control both costs and system performance.

Dec 2025 12 min

What is it

Model providers typically impose specific token rate limits and costs. Unmanaged token usage can result both in excessive cost and in unstable systems if rate limits are hit unexpectedly. In 2025, these limits are increasingly tiered based on your account’s payment history and usage “reputation”.

Why it matters

Model providers measure usage not in compute but in tokens. Tokens are accounted for on every transaction, both input and output, and each type carries its own limits and costs. This practice covers how to measure and control both costs and system performance.

Strategy Summary

This strategy is split into two main areas: system stability and token expenditure. Many of the techniques below, however, apply to both. Within each area, the practice runs through three phases:

- Design & Safety First

- Pre-release

- Post-release

Examples of Rate Limits

Rate Limit Expansion Example (Tier-based)

This illustrates how complicated increasing rate limits can be. You cannot simply pay for a higher limit; you must often wait for specific milestones.

Rate Limit Tiers

| Tier | Qualification | Usage limits (approx.) |
|------|---------------|------------------------|
| Free | User must be in an allowed geography | $100 / month |
| Tier 1 | $5 minimum deposit | $100 / month |
| Tier 2 | $50 paid + 7 days since first successful payment | $500 / month |
| Tier 3 | $100 paid + 7 days since first successful payment | $1,000 / month |
| Tier 4 | $250 paid + 14 days since first successful payment | $5,000 / month |
| Tier 5 | $1,000 paid + 30 days since first successful payment | $200,000 / month |

Model Specific Limits (Example)

The table below shows some example limits. These are relatively simple; with other services, rate limiting can be far more complicated.

Model-Specific Rate Limits

| Model | Token limits (TPM) | Request limits (RPM) | Batch queue limits |
|-------|--------------------|----------------------|--------------------|
| gpt-5.1 | 500,000 | 500 | 900,000 TPD |
| gpt-5-mini | 500,000 | 500 | 5,000,000 TPD |
| gpt-4.1-mini | 200,000 | 500 | 2,000,000 TPD |

Example Calculation: If gpt-4.1-mini were used and the average request consumed 10,000 tokens (input plus output), the 200,000 TPM limit would allow 20 requests per minute. If the average usage were only 30 tokens per request, the token limit would in theory allow over 6,000 requests per minute; however, the 500 Requests per Minute (RPM) limit would be hit first instead.
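The arithmetic above can be captured in a small helper: the achievable request rate is whichever binds first, the token-derived rate or the hard RPM cap. The limits below are taken from the example table.

```python
def effective_rpm(tpm_limit: int, rpm_limit: int, avg_tokens_per_request: int) -> int:
    """Return the request rate actually achievable per minute: the lower of
    the token-derived rate (TPM / tokens-per-request) and the hard RPM limit."""
    token_bound = tpm_limit // avg_tokens_per_request
    return min(token_bound, rpm_limit)

# gpt-4.1-mini example limits from the table: 200,000 TPM, 500 RPM
print(effective_rpm(200_000, 500, 10_000))  # 20  -> the token limit binds
print(effective_rpm(200_000, 500, 30))      # 500 -> the RPM limit binds first
```

Running this for both scenarios reproduces the numbers in the example: 20 requests per minute for large prompts, and an RPM-capped 500 for tiny ones.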

Examples of Cost (Per 1 Million Tokens)

Token Costs by Model

| Model | Input cost | Cached input | Output cost |
|-------|------------|--------------|-------------|
| gpt-4.1 | $2.00 | $1.00 | $8.00 |
| gpt-4.1-mini | $0.40 | $0.20 | $1.60 |
| gpt-4.1-nano | $0.10 | $0.05 | $0.40 |

The cost differences between models are significant: per 1 million tokens, the most capable model here costs 20× more than the smallest for both input and output, so model choice is a major cost lever.
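A per-request cost can be computed directly from the pricing table. The sketch below uses the rates above; the 50%-cached example request is illustrative, not from the source.

```python
# Prices in USD per 1 million tokens, from the table above
PRICES = {
    "gpt-4.1":      {"input": 2.00, "cached": 1.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "cached": 0.20, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "cached": 0.05, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """USD cost of one request; cached input tokens are billed at the cached rate."""
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    return (uncached * p["input"]
            + cached_tokens * p["cached"]
            + output_tokens * p["output"]) / 1_000_000

# 10,000 input tokens (half served from cache) plus 2,000 output tokens
print(request_cost("gpt-4.1-mini", 10_000, 2_000, cached_tokens=5_000))  # → 0.0062
```

Multiplying such a per-request cost by forecast request volume gives the monthly bill estimate used later in capacity planning.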


Part 1: Managing System Stability (Rate Limits)

This section focuses on ensuring the application doesn’t fail under load.

Account for All Rate Limits

If a foundation model is being used, rate limits typically apply to both input and output tokens. These are also split by window and unit, for example: tokens per minute (TPM), requests per minute (RPM), and daily batch queue limits (TPD).

All of these must be accounted for in capacity calculations so that the deployed system does not hit a rate limit unexpectedly. Most providers, once a limit is hit, immediately stop accepting requests. For example, if the TPM limit is exhausted in the first 20 seconds of a minute, no traffic will be accepted for the remaining 40 seconds until the window resets.

Rate limits bind at whichever limit is hit first. Example: 100 small requests might exhaust your RPM limit even though you have used only 5% of your TPM. Equally, if the model context window is big enough, a few large prompts per minute could exhaust the TPM limit on their own.

The type of call must also be accounted for. For example, embedding data before uploading it to a vector database is likely to consume a large spike of tokens. This could push the system over its rate limit, or block other calls to the embedding endpoint for the duration of the upload.
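One way to smooth such an embedding-upload spike is to plan the upload as per-minute batches against a token budget. This is a minimal sketch: it assumes no single document exceeds the budget, ignores the RPM limit, and leaves no headroom for interactive traffic, all of which a production version would need to handle.

```python
def plan_batches(doc_token_counts: list[int], tpm_budget: int) -> list[list[int]]:
    """Group documents into per-minute batches so that no batch's total
    token count exceeds tpm_budget. Assumes each document fits the budget."""
    batches, current, used = [], [], 0
    for tokens in doc_token_counts:
        if used + tokens > tpm_budget and current:
            batches.append(current)  # flush: this minute's budget is spent
            current, used = [], 0
        current.append(tokens)
        used += tokens
    if current:
        batches.append(current)
    return batches

# Four documents against a 100,000 TPM budget → two one-minute batches
print(plan_batches([40_000, 40_000, 30_000, 50_000], 100_000))
```

Each returned batch is sent in its own one-minute window, keeping the embedding endpoint under its TPM limit.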

Evaluate Consumption

This is a critical test. What happens if a user pastes a 100-page document into your RAG system’s input?

Monitor Pre- and Post-Release Usage

Track your token consumption in staging to forecast your needs in production.

Example Scenario: “A test with 10 virtual users in staging hit a 100,000 TPM limit. This proves that the system cannot scale to the 1,000 users expected at launch.”
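A staging measurement like this can be turned into a production forecast. The sketch below uses a naive linear-scaling assumption and an illustrative 1.2× headroom factor; both are assumptions, not figures from the source, and real usage rarely scales perfectly linearly.

```python
def forecast_tpm(staging_users: int, staging_tpm: int,
                 expected_users: int, headroom: float = 1.2) -> int:
    """Linearly extrapolate token demand per minute from a staging load test,
    with a safety headroom factor. Naive: assumes usage scales with user count."""
    per_user = staging_tpm / staging_users
    return round(per_user * expected_users * headroom)

# 10 virtual users consumed 100,000 TPM in staging; forecast for 1,000 users
print(forecast_tpm(10, 100_000, 1_000))  # → 12000000
```

The forecast (12M TPM here) can then be compared against your current tier's limit to decide whether an increase must be requested before launch.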

Plan for Scalability (The “What If It Works?” Problem)

Do not assume rate limits can be expanded immediately. This is a common failure point. A request to increase a rate limit can take days or weeks. This must be part of your pre-release planning.

Example Scenario: “The app launch was a success, but after 500 users, the system hit its TPM limit and failed for all new users. The team had not requested a limit increase, and the system was down for 48 hours until the rate limit could be increased.”

Mitigation Strategies
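A common mitigation is to retry rate-limited calls with exponential backoff plus jitter, so that many clients do not retry in lockstep. This is a minimal sketch; `RateLimitError` is a hypothetical stand-in for your provider SDK's HTTP 429 exception, not a real library class.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's HTTP 429 / rate-limit exception."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Invoke `call`, retrying on RateLimitError with exponential backoff
    plus random jitter; re-raises if the final attempt still fails."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Usage: with_backoff(lambda: client.create_completion(prompt))
```

Backoff only smooths transient spikes; it does not substitute for the capacity planning and limit-increase requests described above.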


Part 2: Managing Business Viability (Cost)

This section focuses on ensuring the LLM-based application is financially sustainable.

Account for All Token Types (Observability)

Track Token Usage Graphically

Pipe your token data into an observability platform (like Grafana, Datadog, etc.) to answer questions such as: Which features or endpoints consume the most tokens? Is usage trending toward a rate limit or the monthly budget?

Calculate Cost-per-Feature

Example Scenario: “After tracking costs, it was found that the ‘auto-summarise’ feature, which was rarely used, was responsible for 40% of the total cost. This was because it ran on GPT-4 for every new document, even if no user ever read the summary. The team immediately changed this to be an ‘on-demand’ feature, saving 40% of their bill.”
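Cost-per-feature analysis like this amounts to aggregating tagged usage records. The sketch below assumes each model call is already logged as a (feature, cost) pair; the log data shown is illustrative, chosen to reproduce the 40% share from the scenario.

```python
from collections import defaultdict

def cost_by_feature(usage_log: list[tuple[str, float]]) -> dict[str, tuple[float, float]]:
    """Aggregate USD cost per feature tag, returning for each feature its
    total cost and its share of the overall spend."""
    totals = defaultdict(float)
    for feature, cost in usage_log:
        totals[feature] += cost
    grand_total = sum(totals.values())
    return {f: (c, c / grand_total) for f, c in totals.items()}

# Illustrative log: (feature, cost-in-USD) per call
log = [("chat", 30.0), ("auto-summarise", 40.0), ("chat", 30.0)]
for feature, (cost, share) in cost_by_feature(log).items():
    print(f"{feature}: ${cost:.2f} ({share:.0%})")
```

With this breakdown on a dashboard, an expensive but rarely-used feature stands out immediately, as in the scenario above.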

This also enables less important features to be identified and switched off in emergencies when system-wide token usage is too high.

Cost Mitigations

Conclusion

Evaluating token usage is an iterative process. By combining safety design (circuit breakers), rate limit planning (tiers and regions), and cost observability (dashboards), you ensure that your model application remains both stable for users and sustainable for the business.