What is it
Model providers impose specific token rate limits and per-token costs. Unmanaged token usage can result both in excessive cost and in unstable systems when rate limits are hit unexpectedly. In 2025, these limits are increasingly tiered based on your account’s payment history and usage “reputation”.
Why it matters
Model providers measure usage not in compute but in tokens. Tokens are counted on every transaction, both input and output, with many different limits and costs depending on the token type. This practice covers how to measure and control both costs and system performance.
Strategy Summary
This strategy is split into two main areas: system stability and token expenditure. In practice the two overlap, and many of the mitigations below apply to both.
Design & Safety First
- The “Agentic” Circuit Breaker: Any system where an LLM can trigger a follow-up call (Agents, multi-step chains) must have a hard loop limit (e.g., “Max 5 turns”) to prevent an infinite, expensive recursion.
- Hard Boundaries: Implement input-side token limits. If your context window or budget allows for 8,000 tokens, reject a 10,000-token PDF before it hits the API to save money and prevent 429 errors.
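The two safeguards above can be combined into a single guard that runs before any API call. This is a minimal sketch under assumed budgets (5 turns, 8,000 input tokens) and a rough 4-characters-per-token estimate; a real system would use the provider’s tokenizer instead.

```python
MAX_TURNS = 5             # assumed hard loop limit for agentic chains
MAX_INPUT_TOKENS = 8_000  # assumed input-side token budget

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    Swap in the provider's tokenizer for accurate counts."""
    return len(text) // 4

def guarded_call(prompt: str, call_llm, turns_used: int) -> str:
    """Reject runaway loops and oversized inputs before spending tokens."""
    if turns_used >= MAX_TURNS:
        raise RuntimeError(f"Circuit breaker: exceeded {MAX_TURNS} agent turns")
    tokens = estimate_tokens(prompt)
    if tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"Error: Your input is {tokens} tokens, "
            f"but the maximum is {MAX_INPUT_TOKENS}."
        )
    return call_llm(prompt)
```

Rejecting the request client-side costs nothing; letting it reach the API costs money and counts against your rate limits either way.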
Pre-release
- Calculate token rate limits, and account for the different rate limits that apply to different LLMs and endpoints.
- Calculate token costs, accounting for the different costs associated with different models and different types of tokens (for example: input, output, or cached).
- Monitor pre-release usage as part of the testing evaluations.
- Evaluate for unbounded consumption. For example: in agentic actions or when the LLM is able to output large numbers of tokens.
- Do not assume rate limits can be expanded immediately; there are often conditions attached to expanding rate limits, as well as hard caps.
- If the release is close to the current token limits, assess if the release risk is too high or mitigate it by setting up additional resources.
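The last check above can be made concrete as a pre-release headroom calculation. The figures and the 20% safety margin here are illustrative assumptions, not provider numbers.

```python
def release_headroom(expected_rpm: float, avg_tokens_per_request: int,
                     tpm_limit: int, safety_margin: float = 0.2) -> dict:
    """Check whether expected launch traffic fits under the current TPM
    limit, leaving headroom for spikes (margin is an assumed policy)."""
    required_tpm = expected_rpm * avg_tokens_per_request
    usable_tpm = tpm_limit * (1 - safety_margin)
    return {
        "required_tpm": required_tpm,
        "usable_tpm": usable_tpm,
        "safe_to_release": required_tpm <= usable_tpm,
    }
```

If `safe_to_release` is false, either request a limit increase well before launch or provision additional resources (another region or model deployment).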
Post-release
- Observability: Record the token usage for each call to an LLM.
- Account for all Tokens: Ensure that each token type is recorded, for example: embedding models, main model calls, cached tokens, batched tokens, input and output.
- Structured Storage: Store the token usage in a location that can persist usage statistics over time and where the data can be retrieved for analysis.
- Visual Monitoring: Track token usage graphically and compare against the pre-release numbers for anomalies.
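The observability points above amount to writing one structured record per call. A minimal sketch, assuming the `usage` dict mirrors the token-count fields most chat APIs return; field names and the in-memory store are placeholders for your real storage layer.

```python
import time

def record_usage(store: list, model: str, usage: dict, call_type: str) -> None:
    """Append one structured usage record per LLM call, covering every
    token type so nothing is missed in later cost analysis."""
    store.append({
        "ts": time.time(),
        "model": model,
        "call_type": call_type,  # e.g. "main", "embedding", "batch"
        "input_tokens": usage.get("prompt_tokens", 0),
        "cached_tokens": usage.get("cached_tokens", 0),
        "output_tokens": usage.get("completion_tokens", 0),
    })
```

In production the `store` would be a time-series database or log pipeline, so usage can be graphed and compared against pre-release baselines.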
Examples of Rate Limits
Rate Limit Expansion Example (Tier-based)
This illustrates how complicated increasing rate limits can be. You cannot simply pay for a higher limit; you must often wait for specific milestones.
Rate Limit Tiers
| Tier | Qualification | Usage limits (Approx) |
|---|---|---|
| Free | User must be in an allowed geography | $100 / month |
| Tier 1 | $5 minimum deposit | $100 / month |
| Tier 2 | $50 paid + 7 days since first successful payment | $500 / month |
| Tier 3 | $100 paid + 7 days since first successful payment | $1,000 / month |
| Tier 4 | $250 paid + 14 days since first successful payment | $5,000 / month |
| Tier 5 | $1,000 paid + 30 days since first successful payment | $200,000 / month |
Model Specific Limits (Example)
The table below shows some example limits. These are relatively simple; with other services, rate limiting can be far more complicated.
Model-Specific Rate Limits
| Model | Token limits (TPM) | Request limits (RPM) | Batch queue limits |
|---|---|---|---|
| gpt-5.1 | 500,000 | 500 | 900,000 TPD |
| gpt-5-mini | 500,000 | 500 | 5,000,000 TPD |
| gpt-4.1-mini | 200,000 | 500 | 2,000,000 TPD |
Example Calculation: If gpt-4.1-mini were used and the average request consumed 10,000 tokens (input plus output), the 200,000 TPM limit would allow roughly 20 requests per minute. If the average request used only 30 tokens, the TPM limit would in theory allow over 6,000 requests per minute; however, the 500 Requests per Minute (RPM) limit would be hit first.
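The arithmetic above can be checked in a few lines, using the gpt-4.1-mini limits from the table:

```python
TPM_LIMIT = 200_000  # gpt-4.1-mini token limit from the table
RPM_LIMIT = 500

def max_rpm(avg_tokens_per_request: int) -> int:
    """Throughput is capped by whichever limit binds first."""
    return min(TPM_LIMIT // avg_tokens_per_request, RPM_LIMIT)
```

With 10,000-token requests, TPM is the bottleneck (20 requests/minute); with 30-token requests, the RPM limit caps throughput long before TPM would.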
Examples of Cost (Per 1 Million Tokens)
Token Costs by Model
| Model | Input Cost | Cached Input | Output Cost |
|---|---|---|---|
| gpt-4.1 | $2.00 | $1.00 | $8.00 |
| gpt-4.1-mini | $0.40 | $0.20 | $1.60 |
| gpt-4.1-nano | $0.10 | $0.05 | $0.40 |
This shows the significant differences between models, both in capability and in input and output costs per 1 million tokens.
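Using the prices in the table, the cost of a single request can be computed directly. This is a minimal sketch; it assumes cached tokens are a subset of the input tokens and are billed at the cached rate.

```python
PRICES = {  # USD per 1 million tokens, from the table above
    "gpt-4.1":      {"input": 2.00, "cached": 1.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "cached": 0.20, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "cached": 0.05, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    """Cost of one request in USD; cached input tokens are billed
    at the cheaper cached rate."""
    p = PRICES[model]
    fresh_input = input_tokens - cached_tokens
    return (fresh_input * p["input"]
            + cached_tokens * p["cached"]
            + output_tokens * p["output"]) / 1_000_000
```

A 1,000-token-in / 500-token-out request costs 20x more on gpt-4.1 than on gpt-4.1-nano, which is why model routing (covered below) matters.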
Part 1: Managing System Stability (Rate Limits)
This section focuses on ensuring the application doesn’t fail under load.
Account for All Rate Limits
If a foundation model is being used, there are typically rate limits applied to both input and output tokens. These are also split based on:
- The endpoint/model: For example, an embedding endpoint will have a different limit to a foundation model.
- The region: The model might be deployed in different regions with global limits, regional limits, and model limits.
- The limit type: For example, there are typically different limits such as: RPM (Requests Per Minute), TPM (Tokens Per Minute), and RPD (Requests Per Day).
All of these must be accounted for in the calculations to ensure that the deployed system does not hit a rate limit unexpectedly. Most models will stop responding immediately once a rate limit is hit. For example: if the TPM limit is exhausted in the first 20 seconds of a minute, no traffic will be accepted for the remaining 40 seconds until the window resets.
Whichever rate limit is reached first applies. Example: 100 small requests might hit your RPM limit even if you’ve only used 5% of your TPM. Equally, if the model’s context window is big enough, a few large prompts per minute could exhaust the TPM limit on their own.
The type of call must also be accounted for. For example, if data is embedded before being uploaded to a vector database, this will likely consume a large number of tokens in a spike, which could easily exceed the rate limit or block other calls to the embedding endpoint during the upload.
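One way to tame such a spike is to throttle the bulk upload so it only consumes an allotted share of the endpoint's TPM. A minimal sketch, assuming a rough 4-characters-per-token estimate and a fixed one-minute window; `embed_fn` stands in for your embedding client.

```python
import time

def throttled_embed(chunks, embed_fn, tpm_budget: int,
                    est_tokens=lambda c: len(c) // 4):
    """Spread a bulk embedding job over time so the spike stays within
    an assumed share of the endpoint's TPM limit."""
    spent, window_start = 0, time.monotonic()
    for chunk in chunks:
        cost = est_tokens(chunk)
        if spent + cost > tpm_budget:
            # Wait out the rest of the current minute window, then reset.
            time.sleep(max(0.0, 60 - (time.monotonic() - window_start)))
            spent, window_start = 0, time.monotonic()
        embed_fn(chunk)
        spent += cost
```

Giving the backfill only part of the budget (say, 50% of TPM) leaves the rest free for live traffic hitting the same endpoint.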
Evaluate Consumption
This is a critical test. What happens if a user pastes a 100-page document into your RAG system’s input?
- Unbounded inputs: These must be tested. A good system should have an input-side token/character limit that rejects the request before it is sent to the LLM (e.g., “Error: Your input is 12,000 tokens, but the maximum is 8,000.”).
- Output Limits: Additionally, LLM outputs (where possible) should be limited by using the inbuilt token output limits models provide (e.g., 500 tokens per response).
- 429 Observability: Log all errors and look for the specific token rate limit codes, for example: the 429 Error (Too Many Requests). These may be intermittent and hard to spot; therefore, observability is key.
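When a 429 does occur, the standard recovery pattern is exponential backoff with jitter (also listed under the mitigations below). A minimal sketch; `RateLimitError` is a stand-in for whatever exception your provider SDK raises on HTTP 429.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's HTTP 429 exception."""

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry on rate-limit errors, doubling the wait each attempt and
    adding random jitter so clients don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; surface the error
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Each failed attempt should also be logged, so intermittent 429s show up in your observability dashboards rather than being silently absorbed by retries.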
Monitor Pre- and Post-Release Usage
Track your token consumption in staging to forecast your needs in production.
Example Scenario: “A test with 10 virtual users in staging hit a 100,000 TPM limit. This proves that the system cannot scale to the 1,000 users expected at launch.”
Plan for Scalability (The “What If It Works?” Problem)
Do not assume rate limits can be expanded immediately. This is a common failure point. A request to increase a rate limit can take days or weeks. This must be part of your pre-release planning.
Example Scenario: “The app launch was a success, but after 500 users, the system hit its TPM limit and failed for all new users. The team had not requested a limit increase, and the system was down for 48 hours until the rate limit could be increased.”
Mitigation Strategies
- Model Versioning: Model providers often rate limit per model version. Consider if other models could be used (e.g., if 4.1-mini is being used for the main call, consider if 4.1-nano could be used for calls that require less sophistication).
- Regional Backups: Model providers often rate limit per region. Consider having other models ready as a backup in another region that can be expanded to if the rate limit is close.
- Load Balancing: If sharing token limits between multiple production environments, consider which ones have more fluctuation in load and therefore need more headroom than others.
- Semantic Caching: If a user asks a question very similar to one asked 5 minutes ago, serve the cached result instead of hitting the LLM.
- Model Routing: Use a “Small-to-Large” strategy. Route simple queries to a cheap model and complex reasoning to a premium model.
- Exponential Backoff: Ensure the system doesn’t just “retry” immediately, but waits longer each time it fails.
- Peak Planning: Create plans ahead of time for anticipated high-load periods.
- Smaller Models: Using smaller models usually results in higher rate limits.
- Model Lifecycle: Older models being phased out often have lower limits to encourage migration.
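Semantic caching, one of the mitigations above, can be sketched with a similarity check over query embeddings. Everything here is illustrative: the threshold is an assumption, and `embed_fn` stands in for a call to your embedding endpoint.

```python
import math

class SemanticCache:
    """Serve a stored answer when a new query's embedding is close
    enough to a previously answered one, avoiding a fresh LLM call."""

    def __init__(self, embed_fn, threshold: float = 0.95):
        self.embed_fn = embed_fn
        self.threshold = threshold   # assumed similarity cut-off
        self.entries = []            # list of (embedding, answer)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query: str):
        query_emb = self.embed_fn(query)
        for emb, answer in self.entries:
            if self._cosine(query_emb, emb) >= self.threshold:
                return answer  # cache hit: no main-model tokens spent
        return None

    def put(self, query: str, answer: str):
        self.entries.append((self.embed_fn(query), answer))
```

Note the trade-off: cache lookups still spend embedding tokens, which are typically far cheaper than main-model tokens, and a production version would use a vector store rather than a linear scan.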
Part 2: Managing Business Viability (Cost)
This section focuses on ensuring the LLM-based application is financially sustainable.
Account for All Token Types (Observability)
- Record the token usage for every single call.
- Input tokens and output tokens often have different costs.
- Track the cost of different models, including embedding calls, main model calls, cached tokens, and batched tokens.
Track Token Usage Graphically
Pipe your token data into an observability platform (like Grafana, Datadog, etc.) to answer:
- “What is the total cost per day per production environment?”
- “What is the average cost per user query?”
- “Which feature is costing the most money?”
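Answering these questions reduces to aggregating logged usage records by the dimension of interest. A minimal sketch, assuming records are dicts carrying a `feature` tag, the model used, and token counts; prices are USD per 1 million tokens.

```python
from collections import defaultdict

def cost_per_feature(records, prices):
    """Roll logged usage records up into a cost-per-feature view."""
    totals = defaultdict(float)
    for r in records:
        p = prices[r["model"]]
        totals[r["feature"]] += (r["input_tokens"] * p["input"]
                                 + r["output_tokens"] * p["output"]) / 1_000_000
    return dict(totals)
```

Grouping by `ts` or an environment tag instead of `feature` gives the per-day and per-environment views; in practice this aggregation usually lives in the dashboard query rather than application code.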
Calculate Cost-per-Feature
Example Scenario: “After tracking costs, it was found that the ‘auto-summarise’ feature, which was rarely used, was responsible for 40% of the total cost. This was because it ran on GPT-4 for every new document, even if no user ever read the summary. The team immediately changed this to be an ‘on-demand’ feature, saving 40% of their bill.”
This also enables less important features to be identified and switched off in emergencies where token usage is too high in the system.
Cost Mitigations
- Prompt Caching: A huge cost saver. Cached input tokens can be 50-90% cheaper when the same context is reused within the provider’s caching window.
- Batch Pricing: Many providers offer a Batch API (non-real-time) which is usually 50% cheaper for tasks that don’t need an instant response.
- Feature Flags: Use feature flags to switch off features that may use a significant amount of tokens.
- Smaller Models: Using smaller models usually results in lower costs.
- Model Lifecycle: Older models being phased out often have higher costs to encourage migration.
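The impact of prompt caching can be estimated before building anything. A minimal sketch: given an assumed cache hit rate and the fresh vs cached prices, it returns the fraction of input-token spend saved.

```python
def caching_savings(input_tokens: int, cache_hit_rate: float,
                    in_price: float, cached_price: float) -> float:
    """Fraction of input-token spend saved by prompt caching, under an
    assumed hit rate (prices are USD per 1 million tokens)."""
    full_cost = input_tokens * in_price
    actual_cost = input_tokens * ((1 - cache_hit_rate) * in_price
                                  + cache_hit_rate * cached_price)
    return 1 - actual_cost / full_cost
```

With the gpt-4.1-mini prices from the cost table ($0.40 fresh, $0.20 cached), a 50% hit rate saves 25% of input spend, and a near-perfect hit rate approaches the 50% ceiling set by the price ratio.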
Conclusion
Evaluating token usage is an iterative process. By combining safety design (circuit breakers), rate limit planning (tiers and regions), and cost observability (dashboards), you ensure that your model application remains both stable for users and sustainable for the business.