Rate Limit Handling
How to handle GonkaGate 429 responses, current request and token limits, throttling, and retry budgets safely.
Read error.code before you retry a 429.
In GonkaGate, insufficient_quota means the prepaid USD balance is too low for this request. rate_limit_exceeded and transfer_agent_capacity_reached are usually temporary. Keep that branch logic in one shared helper so every caller follows the same retry policy.
Decide whether to retry
| If you get | What it usually means | What to do |
|---|---|---|
| 429 + insufficient_quota | The prepaid USD balance is too low for this request | Stop retrying. Surface balance or top-up state, then retry only after funds are available. |
| 429 + rate_limit_exceeded | Your traffic hit a request limit | Honor Retry-After when present and retry with a small backoff budget. |
| 429 + transfer_agent_capacity_reached | Temporary capacity pressure | Wait, retry carefully, and keep the retry budget small. |
| 429 without a known code | This is usually still temporary throttling or capacity pressure | Treat it as temporary throttling first, but log the full response details if it repeats. |
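The table above can be sketched as a single decision function that every caller shares. This is a minimal illustration, not part of any GonkaGate SDK: the `RetryDecision` type and the `decideOn429` name are hypothetical, and only the three `error.code` values come from this page.

```typescript
// Hypothetical names for illustration; only the error.code strings come from the docs.
type RetryDecision = "stop_and_surface_billing" | "retry_with_backoff" | "retry_and_log";

function decideOn429(errorCode: string | undefined): RetryDecision {
  switch (errorCode) {
    case "insufficient_quota":
      // Billing state: retrying cannot succeed until funds are available.
      return "stop_and_surface_billing";
    case "rate_limit_exceeded":
    case "transfer_agent_capacity_reached":
      // Temporary: honor Retry-After and keep the retry budget small.
      return "retry_with_backoff";
    default:
      // Unknown or missing code: treat as temporary, log full details if it repeats.
      return "retry_and_log";
  }
}
```

Keeping the mapping in one place means a new `error.code` only has to be handled once.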
Current authenticated model-request limits
For POST /v1/chat/completions and other authenticated /v1/* model requests, GonkaGate checks multiple rate-limit buckets. The most restrictive exhausted bucket wins.
| Scope | Request limit | Burst and concurrency |
|---|---|---|
| Regular API key (gp-...) | 600 RPM | 200 requests per 10 seconds and 50 concurrent requests |
| Regular API key + source IP | 600 RPM | 200 requests per 10 seconds |
| Owning account | 3,000 RPM | 1,000 requests per 10 seconds and 200 concurrent requests |
| Source IP | 3,000 RPM | 1,000 requests per 10 seconds |
| Distinct source IPs for one regular key | 200 IPs/hour | Allows normal multi-region and serverless IP churn |
| Distinct regular keys from one source IP | 200 keys/hour | Applies when many customer keys exit through the same backend IP |
RPM means requests per minute. Burst is a separate 10-second request window.
TPM means estimated tokens per minute. GonkaGate currently tracks token windows for telemetry and rate-limit headers: 2,000,000 TPM for regular API key and key+IP scopes, and 10,000,000 TPM for owning account and source IP scopes. Token windows do not currently block requests by themselves; request, burst, concurrency, distinct, and cooldown checks can return 429 rate_limit_exceeded.
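One way to stay inside the request, burst, and concurrency windows is to pace traffic on the client before it reaches GonkaGate. The sketch below is an illustration under assumptions: `LocalRequestGate` and its methods are hypothetical names, the defaults copy the regular API key row above, and a single process only sees its own traffic, so it cannot account for other workers that share the key, account, or source IP.

```typescript
// Hypothetical client-side gate; defaults mirror the regular API key limits above.
class LocalRequestGate {
  private startTimes: number[] = []; // request start timestamps (ms) in the last minute
  private inFlight = 0;
  private perMinute: number;
  private perTenSeconds: number;
  private maxConcurrent: number;

  constructor(perMinute = 600, perTenSeconds = 200, maxConcurrent = 50) {
    this.perMinute = perMinute;
    this.perTenSeconds = perTenSeconds;
    this.maxConcurrent = maxConcurrent;
  }

  // Returns true and records the request only if every local window has room.
  tryAcquire(nowMs: number = Date.now()): boolean {
    this.startTimes = this.startTimes.filter((t) => nowMs - t < 60_000);
    const inBurstWindow = this.startTimes.filter((t) => nowMs - t < 10_000).length;
    if (
      this.inFlight >= this.maxConcurrent ||
      this.startTimes.length >= this.perMinute ||
      inBurstWindow >= this.perTenSeconds
    ) {
      return false;
    }
    this.startTimes.push(nowMs);
    this.inFlight += 1;
    return true;
  }

  // Call when the request finishes, whether it succeeded or failed.
  release(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
}
```

A caller would check `tryAcquire()` before sending, queue or delay the request when it returns false, and call `release()` in a `finally` block. The server-side buckets remain the source of truth, so the retry helper is still needed.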
The standard buckets above apply to normal varied traffic. Repeated synthetic requests, especially many minimal or near-identical prompts against the same key, source IP, and model, can trigger a behavioral cooldown even when the RPM, burst, and concurrency buckets are still inside their published limits. Those responses still use 429 rate_limit_exceeded and include Retry-After.
Repeated local request-limit 429 responses can also activate a short cooldown for key+IP, key, or owning account scopes. The default trigger is 3 eligible local request-limit blocks in 60 seconds. Active cooldown responses keep the same 429 rate_limit_exceeded contract and set Retry-After from the cooldown TTL.
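Because repeated request-limit blocks can escalate into a cooldown, a client may want to pause on its own before it gets there. The following is a rough sketch with hypothetical names (`ThrottleTracker`, `record429`, `shouldPause`). The threshold of 2 within 60 seconds is an assumption chosen to sit just below the default trigger described above; the client cannot know which blocks the server counts as eligible, so treat it as a conservative heuristic.

```typescript
// Hypothetical tracker: counts recent 429 rate_limit_exceeded responses seen by this process.
class ThrottleTracker {
  private hits: number[] = []; // timestamps (ms) of recent 429 responses
  private pauseThreshold: number;
  private windowMs: number;

  constructor(pauseThreshold = 2, windowMs = 60_000) {
    this.pauseThreshold = pauseThreshold;
    this.windowMs = windowMs;
  }

  // Record one 429 rate_limit_exceeded response.
  record429(nowMs: number = Date.now()): void {
    this.hits.push(nowMs);
  }

  // True when enough recent 429s were seen that new requests should wait.
  shouldPause(nowMs: number = Date.now()): boolean {
    this.hits = this.hits.filter((t) => nowMs - t < this.windowMs);
    return this.hits.length >= this.pauseThreshold;
  }
}
```

When `shouldPause()` returns true, waiting for the most recent Retry-After value before sending anything else keeps the pause aligned with what the server asked for.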
These are traffic limits, not spend limits. A per-key USD limit configured through API key management caps spending for that regular key, while the account prepaid USD balance remains shared across the account.
Public playground limits
POST /v1/public/chat/completions uses separate public-traffic guards. Anonymous playground requests are limited to 20 requests per minute and 50 requests per hour per browser principal, and the same 20 requests per minute and 50 requests per hour per source IP. Anonymous public playground concurrency is 2 in-flight requests per principal.
The shared public guard allows 10,000 RPM, 10,000,000 estimated TPM, and 1,000 in-flight requests across all public playground traffic. When authenticated user-billed playground mode is enabled, those requests use the authenticated /v1/* API-key buckets above, plus a playground guard of 4 in-flight requests per authenticated principal.
Use one shared retry helper
```ts
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function requestWithRateLimitHandling(
  makeRequest: () => Promise<Response>,
  maxRetries = 3
): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const response = await makeRequest();
    if (response.status !== 429) {
      return response;
    }

    // Read error.code from a clone so the original body stays unread.
    const body = await response
      .clone()
      .json()
      .catch(() => null);
    const errorCode = body?.error?.code;

    // Billing state, not throttling: stop immediately.
    if (errorCode === "insufficient_quota") {
      throw new Error("insufficient_quota");
    }

    if (attempt === maxRetries) {
      throw new Error("retry_budget_exhausted");
    }

    // Prefer the server's Retry-After; otherwise back off exponentially, capped at 8 seconds.
    const retryAfterSeconds = Number(response.headers.get("Retry-After") ?? "0");
    const waitMs =
      retryAfterSeconds > 0 ? retryAfterSeconds * 1000 : Math.min(1000 * 2 ** attempt, 8000);
    await sleep(waitMs);
  }

  // Only reached if the loop body never runs, for example with a negative maxRetries.
  throw new Error("retry_budget_exhausted");
}
```

This baseline does four things: branch on error.code, stop on insufficient_quota, honor Retry-After, and cap retries.
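HTTP allows Retry-After to carry either a number of seconds or an HTTP-date. This page does not say which form GonkaGate sends, so a defensive parser can accept both. The sketch below is an illustration only; `retryAfterToMs` is a hypothetical helper name, and a `null` result means the caller should fall back to its own backoff.

```typescript
// Hypothetical helper: convert a Retry-After header value into a wait in milliseconds.
function retryAfterToMs(headerValue: string | null, nowMs: number = Date.now()): number | null {
  if (headerValue === null) {
    return null;
  }
  const trimmed = headerValue.trim();

  // Delay-seconds form, for example "5".
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }

  // HTTP-date form, for example "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) {
    return null; // Unparseable: let the caller use its own backoff.
  }
  return Math.max(0, dateMs - nowMs); // Never return a negative wait.
}
```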
Read these fields first
- HTTP status 429
- error.code in the JSON body
- Retry-After when the server tells you exactly how long to wait
- x-ratelimit-* headers when your client exposes them, so you can log the current limit, remaining allowance, and reset window
- x-request-id for repeated failures or support escalation
Treat error.message as human-readable context only. Do not build retry logic from the message text.
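For logging, the fields above can be gathered into one record. The sketch below is an illustration with hypothetical names (`HeaderBag`, `ThrottleDiagnostics`, `collectThrottleDiagnostics`). It copies every header that starts with x-ratelimit- rather than assuming specific header names, because this page does not list them.

```typescript
// Structural type so the sketch accepts fetch Headers or a plain Map of header values.
type HeaderBag = { forEach(callback: (value: string, key: string) => void): void };

interface ThrottleDiagnostics {
  status: number;
  errorCode: string | null;
  retryAfter: string | null;
  requestId: string | null;
  rateLimitHeaders: Record<string, string>;
}

// Hypothetical helper: collect the fields worth logging from a throttled response.
function collectThrottleDiagnostics(
  status: number,
  headers: HeaderBag,
  body: unknown
): ThrottleDiagnostics {
  const diagnostics: ThrottleDiagnostics = {
    status,
    errorCode: null,
    retryAfter: null,
    requestId: null,
    rateLimitHeaders: {},
  };

  headers.forEach((value, key) => {
    const name = key.toLowerCase();
    if (name === "retry-after") {
      diagnostics.retryAfter = value;
    } else if (name === "x-request-id") {
      diagnostics.requestId = value;
    } else if (name.startsWith("x-ratelimit-")) {
      diagnostics.rateLimitHeaders[name] = value;
    }
  });

  // Only error.code is read; error.message is deliberately left out of any logic.
  const code = (body as { error?: { code?: unknown } } | null)?.error?.code;
  if (typeof code === "string") {
    diagnostics.errorCode = code;
  }
  return diagnostics;
}
```

With fetch, a call might look like `collectThrottleDiagnostics(response.status, response.headers, parsedBody)`, where `parsedBody` is the JSON body read from a clone of the response.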
Common mistakes
- Treating every 429 as retryable throttling. In GonkaGate, insufficient_quota is a billing state, not a backoff case.
- Ignoring Retry-After when it is present. That usually creates synchronized retries and more throttling.
- Hiding insufficient_quota behind automatic retries. Stop and show a billing or top-up state instead.
- Assuming that many generated gp-... keys each get a separate account-level quota. Per-key buckets are separate, but keys under one GonkaGate account still share that account’s aggregate bucket.
- Treating a synthetic load test with identical minimal prompts as proof of normal sustained throughput. Use representative varied prompts, or coordinate the test window with support.
- Letting workers, cron jobs, or batch traffic retry forever. Keep the retry budget small and make interactive traffic the priority.
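When Retry-After is absent, adding random jitter to the fallback backoff helps keep many workers from retrying at the same instant, and a total wait budget keeps background jobs from retrying forever. The sketch below is one possible approach and not a GonkaGate recommendation: the function names, the "full jitter" strategy, and the 10-second default budget are all assumptions for illustration.

```typescript
// Hypothetical helper: exponential backoff with full jitter, capped at capMs.
function jitteredDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 8000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(baseMs * 2 ** attempt, capMs);
  return Math.floor(random() * ceiling); // Uniform in [0, ceiling).
}

// Plans retry delays, stopping once the attempt count or total wait budget is used up.
function planRetryDelays(
  maxRetries = 3,
  totalBudgetMs = 10_000,
  random: () => number = Math.random
): number[] {
  const delays: number[] = [];
  let spentMs = 0;
  for (let attempt = 0; attempt < maxRetries; attempt += 1) {
    const delay = jitteredDelayMs(attempt, 1000, 8000, random);
    if (spentMs + delay > totalBudgetMs) {
      break; // The next wait would exceed the budget, so give up instead.
    }
    delays.push(delay);
    spentMs += delay;
  }
  return delays;
}
```

Batch and cron callers could use a smaller `maxRetries` or `totalBudgetMs` than interactive callers, so that interactive traffic keeps priority during throttling.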
See also
- Management API Keys for creating regular gp-... keys programmatically and setting per-key USD spend limits.
- GonkaGate API Error Handling for the same retry-or-stop logic across 401, 403, 5xx, and other non-429 failures.
- Pricing for prepaid USD balance rules behind insufficient_quota.
- Create a chat completion for the exact POST /v1/chat/completions request and response contract.