Two messages on launch day
Two messages land on the launch-day Slack channel. The first one is “the system is down, users are getting errors.” The second one, usually a few days later, is “we need to talk about the bill.” Both are token problems. They look like different problems because one is measured in HTTP errors and the other is measured in invoices, but the underlying cause is the same. Token usage went somewhere the team had not modelled.
This article is about modelling it before launch day, and observing it after.
What it is
Model providers have specific token rate limits and costs, and both are tied to the same underlying unit. Token usage drives both system stability and business cost. In 2026, those limits are increasingly tiered based on payment history and usage reputation, which means the limits available on day one are often lower than the limits the system needs.
Why it matters
Model providers do not measure usage in compute. They measure it in tokens. Every transaction counts, on both the input and output sides, with different limits and costs depending on which type of token is involved. Get the modelling wrong and the system becomes unstable, expensive, or both. This article covers how to measure and control both.
Strategy summary
The strategy splits into two concerns: system stability and token expenditure. The design principles below apply to both, as do the pre-release and post-release checklists.
Design and safety first
- The agentic circuit breaker. Any system where an LLM can trigger a follow-up call, whether that is an agent, a multi-step chain, or a recursive tool-use loop, must have a hard turn limit. Five turns is a sensible default. Without it, a single malformed response can trigger an unbounded, expensive recursion.
- Hard boundaries. Implement input-side token limits. If your context window or budget allows for 8,000 tokens, reject a 10,000-token PDF before it hits the API. This saves money and prevents 429 errors at the same time.
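The circuit breaker can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: `step` is a hypothetical function that wraps one model call and returns the output plus the next input (or `None` when the agent is done), and `run_agent` and `TurnLimitExceeded` are names invented for this example.

```python
from typing import Callable, Optional

MAX_TURNS = 5  # the default suggested above; tune per workflow


class TurnLimitExceeded(RuntimeError):
    """Raised when the loop reaches its hard turn limit."""


def run_agent(step: Callable[[str], tuple[str, Optional[str]]],
              first_input: str,
              max_turns: int = MAX_TURNS) -> str:
    """Run a follow-up loop with a hard ceiling on model calls.

    `step` is a hypothetical wrapper around one model call. It returns
    (output, next_input); next_input is None when the agent is finished.
    """
    current = first_input
    for _ in range(max_turns):
        output, next_input = step(current)
        if next_input is None:
            return output
        current = next_input
    # The model kept asking for another turn: stop instead of recursing.
    raise TurnLimitExceeded(f"stopped after {max_turns} turns")
```

Raising an exception, rather than quietly returning the last output, makes a runaway loop visible in logs and alerts.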
Pre-release
- Calculate token rate limits, accounting for the different limits that apply to different models and endpoints.
- Calculate token costs, accounting for the different prices on different models and different token types: input, output, cached.
- Monitor token usage as part of the standard pre-release evaluation.
- Evaluate for unbounded consumption, especially in agentic actions or any flow where the LLM can produce large outputs.
- Do not assume rate limits can be expanded immediately. Most providers attach conditions to limit increases, and some have hard ceilings.
- If the planned release is close to the current rate limits, treat that as a release risk and either delay or stand up additional resource.
Post-release
- Observability. Record the token usage on every call.
- Account for all token types. Embedding models, main model calls, cached tokens, batched tokens, input, output. All of them.
- Structured storage. Store the usage data somewhere it can persist over time and be retrieved for analysis.
- Visual monitoring. Plot token usage and compare against the pre-release numbers. Anomalies are easier to see on a chart than in a log.
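As a rough sketch of the recording and storage points above, the snippet below writes one row per token type per call into SQLite. The table layout, the function names, and the `usage` dictionary shape are assumptions made for illustration; a real system would map its provider's usage fields into whatever store it already runs.

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS token_usage (
    ts          REAL    NOT NULL,
    environment TEXT    NOT NULL,
    feature     TEXT    NOT NULL,
    model       TEXT    NOT NULL,
    token_type  TEXT    NOT NULL,
    tokens      INTEGER NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the usage store. Pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_usage(conn: sqlite3.Connection, environment: str, feature: str,
                 model: str, usage: dict[str, int],
                 ts: Optional[float] = None) -> None:
    """Store one row per token type for a single call."""
    ts = time.time() if ts is None else ts
    rows = [(ts, environment, feature, model, token_type, count)
            for token_type, count in usage.items()]
    conn.executemany("INSERT INTO token_usage VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def totals_by_type(conn: sqlite3.Connection) -> dict[str, int]:
    """Total tokens per token type across all recorded calls."""
    cur = conn.execute(
        "SELECT token_type, SUM(tokens) FROM token_usage GROUP BY token_type")
    return dict(cur.fetchall())
```

Storing the token type as a column keeps embeddings, cached input, and output in one table, so a dashboard query can slice by any of them.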
Examples of rate limits
Tier-based rate limit expansion
Rate limit increases are rarely just a matter of paying more. Most providers gate the higher tiers behind specific account milestones, such as a minimum spend combined with a minimum account age.
Rate limit tiers
| Tier | Qualification | Usage limits (approx) |
|---|---|---|
| Free | User must be in an allowed geography | $100 / month |
| Tier 1 | $5 minimum deposit | $100 / month |
| Tier 2 | $50 paid + 7 days since first successful payment | $500 / month |
| Tier 3 | $100 paid + 7 days since first successful payment | $1,000 / month |
| Tier 4 | $250 paid + 14 days since first successful payment | $5,000 / month |
| Tier 5 | $1,000 paid + 30 days since first successful payment | $200,000 / month |
Model-specific limits
The table below shows a relatively simple example. Real rate limiting on production services often gets considerably more involved.
Model-specific rate limits
| Model | Token limits (TPM) | Request limits (RPM) | Batch queue limits |
|---|---|---|---|
| gpt-5.1 | 500,000 | 500 | 900,000 TPD |
| gpt-5-mini | 500,000 | 500 | 5,000,000 TPD |
| gpt-4.1-mini | 200,000 | 500 | 2,000,000 TPD |
Example calculation: If gpt-4.1-mini is used and the average prompt input plus output is 10,000 tokens per request, that allows 20 requests per minute before hitting the TPM limit. If the average request is 30 tokens, the TPM limit allows over 6,000 requests per minute, but the 500 RPM limit becomes the binding constraint instead.
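The same calculation can be written as a small helper. This is a sketch under the simplifying assumption that every request uses the average number of tokens; the function name is invented for this example.

```python
def max_requests_per_minute(tpm_limit: int, rpm_limit: int,
                            avg_tokens_per_request: int) -> tuple[int, str]:
    """Return the sustainable requests per minute and the limit that binds."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # How many average-sized requests fit inside the token budget.
    by_tokens = tpm_limit // avg_tokens_per_request
    if by_tokens < rpm_limit:
        return by_tokens, "TPM"
    return rpm_limit, "RPM"


# gpt-4.1-mini figures from the table above: 200,000 TPM and 500 RPM.
large = max_requests_per_minute(200_000, 500, 10_000)  # (20, "TPM")
small = max_requests_per_minute(200_000, 500, 30)      # (500, "RPM")
```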
Examples of cost (per 1 million tokens)
Token costs by model
| Model | Input cost | Cached input | Output cost |
|---|---|---|---|
| gpt-4.1 | $2.00 | $1.00 | $8.00 |
| gpt-4.1-mini | $0.40 | $0.20 | $1.60 |
| gpt-4.1-nano | $0.10 | $0.05 | $0.40 |
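The differences across models are significant. Output tokens are usually around four times the cost of input tokens, as they are in every row here, and a step up in model capability multiplies the price again: gpt-4.1 costs five times as much as gpt-4.1-mini and twenty times as much as gpt-4.1-nano.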
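To make the price gap concrete, here is a minimal per-request cost sketch using the prices in the table above. The `PRICES` dictionary and `request_cost` function are illustrative names, and `input_tokens` is assumed to mean uncached input only.

```python
from decimal import Decimal

# USD per 1 million tokens, copied from the table above.
PRICES = {
    "gpt-4.1":      {"input": Decimal("2.00"), "cached_input": Decimal("1.00"), "output": Decimal("8.00")},
    "gpt-4.1-mini": {"input": Decimal("0.40"), "cached_input": Decimal("0.20"), "output": Decimal("1.60")},
    "gpt-4.1-nano": {"input": Decimal("0.10"), "cached_input": Decimal("0.05"), "output": Decimal("0.40")},
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> Decimal:
    """Cost in USD of one request. input_tokens counts uncached input only."""
    price = PRICES[model]
    total = (input_tokens * price["input"]
             + cached_input_tokens * price["cached_input"]
             + output_tokens * price["output"])
    return total / Decimal(1_000_000)


# The same 8,000-token prompt with a 500-token answer on two models.
mini_cost = request_cost("gpt-4.1-mini", 8_000, 500)  # 0.004 USD
full_cost = request_cost("gpt-4.1", 8_000, 500)       # 0.02 USD
```

`Decimal` is used so that small per-request amounts do not pick up floating-point rounding when summed over millions of calls.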
Part 1: managing system stability
This half is about making sure the application does not fail under load.
Account for all rate limits
If a foundation model is being used, rate limits typically apply to both input and output tokens, and they are split along several dimensions:
- Endpoint and model. An embedding endpoint will have a different limit to a foundation model.
- Region. The same model deployed in different regions can have global limits, regional limits, and model-specific limits stacked on top of each other.
- Limit type. RPM (requests per minute), TPM (tokens per minute), RPD (requests per day), and others.
All of them have to be accounted for. When a rate limit is hit, most providers reject further requests immediately rather than slowing them down. If the TPM limit is hit in the first 20 seconds of a minute, every request in the remaining 40 seconds is rejected until the window resets.
The binding constraint is whichever limit gets hit first. A flood of small requests can hit the RPM limit while the TPM is at 5%. A handful of very large requests can hit the TPM with only a few calls per minute. Both happen in production, often within the same week.
The type of call matters too. Embedding a large document into a vector database will spike token consumption, which can either push the embedding endpoint over its own limit or starve other calls that share it.
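One way to see which limit is closest is to track both on the client side. The sketch below keeps a rolling 60-second window of calls and reports utilisation against each limit. It assumes a sliding window, which may not match how a given provider counts; the `RateWindow` class is invented for this example, and the provider's own usage headers or dashboard remain the source of truth.

```python
import time
from collections import deque
from typing import Callable


class RateWindow:
    """Client-side view of requests and tokens over a rolling window."""

    def __init__(self, tpm_limit: int, rpm_limit: int,
                 window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.tpm_limit = tpm_limit
        self.rpm_limit = rpm_limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._events: deque = deque()  # (timestamp, tokens) per call

    def _prune(self, now: float) -> None:
        # Drop calls that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()

    def record(self, tokens: int) -> None:
        """Register one call and the tokens it consumed."""
        now = self._clock()
        self._prune(now)
        self._events.append((now, tokens))

    def utilisation(self) -> dict[str, float]:
        """Fraction of each limit used in the current window."""
        self._prune(self._clock())
        tokens = sum(t for _, t in self._events)
        return {"TPM": tokens / self.tpm_limit,
                "RPM": len(self._events) / self.rpm_limit}

    def binding_constraint(self) -> str:
        """Name of the limit that is currently closest to its ceiling."""
        usage = self.utilisation()
        return max(usage, key=usage.get)
```

An alert on either utilisation figure crossing a threshold, 80% for example, gives warning before the provider starts rejecting calls.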
Evaluate consumption
This is the test that catches the worst surprises. What happens when a user pastes a 100-page document into the input?
- Unbounded inputs need to be tested explicitly. A good system rejects oversized requests on the input side before they reach the LLM, with a clear error like “Your input is 12,000 tokens, the maximum is 8,000.”
- Output limits should be set. Most models support a max output token parameter. Use it. 500 tokens is a reasonable starting point for short-response systems.
- 429 errors need observability. Log them, watch for the specific rate limit codes, and treat intermittent ones as signal rather than noise. A 429 that happens once a day under low load is the early warning of a 429 storm at peak.
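The input-side check can be sketched as a guard that runs before any API call. The token count here is a crude four-characters-per-token estimate, which is an assumption for illustration only; a production system would pass in the provider's real tokenizer through the `count_tokens` parameter. All names are hypothetical.

```python
from typing import Callable

MAX_INPUT_TOKENS = 8_000  # example budget from the text


class InputTooLarge(ValueError):
    """Raised when a request exceeds the input token budget."""


def rough_token_count(text: str) -> int:
    """Crude estimate of about 4 characters per token. Assumption only."""
    return max(1, len(text) // 4)


def check_input(text: str,
                count_tokens: Callable[[str], int] = rough_token_count,
                max_tokens: int = MAX_INPUT_TOKENS) -> int:
    """Return the token count, or raise before the request reaches the LLM."""
    n = count_tokens(text)
    if n > max_tokens:
        raise InputTooLarge(
            f"Your input is {n:,} tokens, the maximum is {max_tokens:,}.")
    return n
```

Because the rejection happens locally, an oversized paste costs nothing and never counts against the TPM limit.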
Monitor pre-release and post-release usage
Track token consumption in staging to forecast production needs.
Example scenario: “A test with 10 virtual users in staging hit a 100,000 TPM limit. The system cannot scale to the 1,000 users expected at launch without a rate limit increase or an architectural change.”
That kind of finding only works if pre-release testing exercises the system at representative volume. Token consumption per user is the unit. Multiply it out and compare to the limit before launch, not after.
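The multiplication can be made explicit. This sketch assumes token consumption scales linearly with concurrent users, which is a simplification; the function names are invented for the example.

```python
def required_tpm(observed_tpm: float, test_users: int, launch_users: int) -> float:
    """Extrapolate staging consumption to launch volume, assuming linear scaling."""
    if test_users <= 0:
        raise ValueError("test_users must be positive")
    per_user_tpm = observed_tpm / test_users
    return per_user_tpm * launch_users


def shortfall_factor(limit_tpm: float, observed_tpm: float,
                     test_users: int, launch_users: int) -> float:
    """How many times larger the limit would need to be. Above 1.0 is a release risk."""
    return required_tpm(observed_tpm, test_users, launch_users) / limit_tpm


# Scenario above: 10 staging users consumed 100,000 TPM, launch expects 1,000.
needed = required_tpm(100_000, 10, 1_000)             # 10,000,000 TPM
factor = shortfall_factor(100_000, 100_000, 10, 1_000)  # 100x the current limit
```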
Plan for scalability
Do not assume rate limits can be expanded immediately. This is one of the most common production failures. A request to increase a rate limit can take days, sometimes weeks. It needs to be part of pre-release planning, not a recovery option.
Example scenario: “The launch was a success. After 500 users, the system hit its TPM limit and failed for all new users. The team had not requested a limit increase, and the system was down for 48 hours.”
Mitigation strategies
- Model versioning. Providers often rate limit per model version. If gpt-4.1-mini is doing the heavy lifting, consider whether gpt-4.1-nano could handle the simpler calls and free up the main quota.
- Regional backups. Providers often rate limit per region. Standing up a backup deployment in a second region gives a fallback when the primary is close to its ceiling.
- Load balancing. If a token quota is shared across multiple production environments, the environments with the most variable load need the most headroom.
- Semantic caching. If a user asks a question that closely matches one asked five minutes ago, serve the cached result instead of hitting the model.
- Model routing. Route simple queries to a cheap model and reserve the premium model for complex reasoning. The “small to large” pattern.
- Exponential backoff. A retry strategy that waits longer on each failure, rather than retrying immediately and amplifying the problem.
- Plan ahead for known peaks. Marketing campaigns, product launches, end-of-month batches.
- Smaller models usually have higher rate limits.
- Older models being phased out usually have lower rate limits, designed to push migration.
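Of the mitigations above, exponential backoff is the easiest to sketch. The version below adds random jitter so that many clients do not retry in lockstep. `RateLimited` is a hypothetical stand-in for whatever exception a provider's client raises on HTTP 429, and the delay values are illustrative defaults rather than recommendations from any provider.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(call: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call` on RateLimited, doubling the delay ceiling each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

If the provider returns a retry-after hint with the 429, honouring that value is usually better than a computed delay.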
Part 2: managing business viability
The first half kept the system standing. This half keeps it affordable.
Account for all token types
- Record token usage on every call.
- Input tokens and output tokens have different costs, often by a factor of three or four.
- Track usage across all token types: embeddings, main model calls, cached tokens, batched tokens.
The fundamentals are the same as the observability strategy laid out in Observability, with cost tracking layered on top.
Track token usage graphically
Pipe the token data into an observability platform such as Grafana or Datadog. The questions worth answering on a dashboard are:
- What is the total cost per day, per production environment?
- What is the average cost per user query?
- Which feature is costing the most money?
Calculate cost per feature
Example scenario: “Cost tracking showed that the ‘auto-summarise’ feature, which was rarely used, was responsible for 40% of the total bill. It was running on GPT-4 for every new document, even when no user ever read the summary. The team moved it to on-demand and cut most of that 40% overnight.”
The same dashboard makes it possible to identify the lowest-value features quickly, which matters in two situations: long-term cost optimisation, and emergency throttling when usage spikes unexpectedly.
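A per-feature breakdown is a simple aggregation once each call is tagged with the feature that made it. The sketch below assumes cost records arrive as `(feature, cost_usd)` pairs; the function name and the figures in the usage example are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable


def cost_share_by_feature(records: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Return each feature's share of total cost, largest first."""
    totals: dict[str, float] = defaultdict(float)
    for feature, cost_usd in records:
        totals[feature] += cost_usd
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    shares = ((feature, cost / grand_total) for feature, cost in totals.items())
    return dict(sorted(shares, key=lambda item: item[1], reverse=True))


# Illustrative figures only.
shares = cost_share_by_feature([
    ("chat", 20.0),
    ("auto-summarise", 40.0),
    ("chat", 15.0),
    ("search", 25.0),
])
most_expensive = next(iter(shares))  # first key is the costliest feature
```

Setting this ranking against usage figures for each feature is what exposes cases like the rarely used summary that dominates the bill.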
Cost mitigations
- Prompt caching. A significant cost saver. Reusing context can be 50% to 90% cheaper if it falls within the cache window.
- Batch pricing. Many providers offer a batch API at roughly 50% of real-time pricing, for tasks that do not need instant responses.
- Feature flags. Use them to switch off the most token-hungry features during cost spikes or incident response.
- Smaller models usually cost less.
- Older models being phased out usually cost more, again designed to push migration.
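The effect of caching and batching on the bill can be estimated before committing to either. This sketch uses the gpt-4.1-mini prices from the cost table and the roughly 50% batch discount mentioned above. The cached fraction is an assumption to be replaced with measured data, and the function names are invented for the example.

```python
def blended_input_price(input_price: float, cached_price: float,
                        cached_fraction: float) -> float:
    """Effective input price per 1M tokens for a given share of cached tokens."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    return cached_fraction * cached_price + (1.0 - cached_fraction) * input_price


def saving_fraction(original_price: float, new_price: float) -> float:
    """Share of cost removed by moving from original_price to new_price."""
    return 1.0 - new_price / original_price


def batch_price(realtime_price: float, discount: float = 0.5) -> float:
    """Price under a batch API. The 50% default is the rough figure cited above."""
    return realtime_price * (1.0 - discount)


# gpt-4.1-mini: $0.40 input, $0.20 cached input, $1.60 output per 1M tokens.
# Assume, for illustration, that 75% of input tokens are served from cache.
effective_input = blended_input_price(0.40, 0.20, 0.75)
input_saving = saving_fraction(0.40, effective_input)
batched_output = batch_price(1.60)
```

Running the same numbers with the cache hit rate actually observed in staging shows whether restructuring prompts for caching is worth the effort.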
Conclusion
Token management is not a single decision. It is an iterative process that combines safety design, rate limit planning, and cost observability. Done early, it keeps the system stable for users and sustainable for the business. Done after the fact, it means the team is making both decisions in the middle of an incident, with a Slack channel full of red dots and a bill that has already been spent.
Final Thoughts
Back to the two messages on launch day. With this strategy in place, both messages still get sent occasionally, because no production system runs perfectly under all loads. But they get sent earlier, with numbers attached. The system-down message comes from a dashboard alert hitting 80% of TPM, not from users hitting 429s. The cost message comes from the daily spend chart trending upward, not from the monthly invoice. Both problems become forecasts instead of incidents, and that is the only meaningful difference between a system that scales and one that survives.