GonkaGate Docs

Rate Limit Handling

How to handle GonkaGate 429 responses, current request and token limits, throttling, and retry budgets safely.

Read error.code before you retry a 429. In GonkaGate, insufficient_quota means the prepaid USD balance is too low for this request. rate_limit_exceeded and transfer_agent_capacity_reached are usually temporary. Keep that branch logic in one shared helper so every caller follows the same retry policy.

Decide whether to retry

| If you get | What it usually means | What to do |
| --- | --- | --- |
| 429 + insufficient_quota | The prepaid USD balance is too low for this request | Stop retrying. Surface balance or top-up state, then retry only after funds are available. |
| 429 + rate_limit_exceeded | Your traffic hit a request limit | Honor Retry-After when present and retry with a small backoff budget. |
| 429 + transfer_agent_capacity_reached | Temporary capacity pressure | Wait, retry carefully, and keep the retry budget small. |
| 429 without a known code | Usually still temporary throttling or capacity pressure | Treat it as temporary throttling first, but log the full response details if it repeats. |
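The table above can be expressed as a small pure function so that every caller branches the same way. This is a minimal sketch: the function name `classify429` and the decision labels are illustrative assumptions, not part of the GonkaGate API. Only the three error codes come from this page.

```typescript
// Hypothetical decision labels; only the error codes are taken from the docs.
type RetryDecision = "stop_and_surface_billing" | "retry_with_backoff" | "retry_and_log";

// Map the error.code of a 429 response to one of the rows in the table above.
function classify429(errorCode: string | null | undefined): RetryDecision {
  switch (errorCode) {
    case "insufficient_quota":
      // Billing state: retrying cannot succeed until funds are available.
      return "stop_and_surface_billing";
    case "rate_limit_exceeded":
    case "transfer_agent_capacity_reached":
      // Temporary: honor Retry-After and keep the retry budget small.
      return "retry_with_backoff";
    default:
      // Unknown or missing code: treat as temporary, but log the response details.
      return "retry_and_log";
  }
}
```

Keeping the decision separate from the sleep-and-retry loop makes it easy to unit test and to reuse from workers, cron jobs, and interactive request paths.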

Current authenticated model-request limits

For POST /v1/chat/completions and other authenticated /v1/* model requests, GonkaGate checks multiple rate-limit buckets. The most restrictive exhausted bucket wins.

| Scope | Request limit | Burst and concurrency |
| --- | --- | --- |
| Regular API key (gp-...) | 600 RPM | 200 requests per 10 seconds and 50 concurrent requests |
| Regular API key + source IP | 600 RPM | 200 requests per 10 seconds |
| Owning account | 3,000 RPM | 1,000 requests per 10 seconds and 200 concurrent requests |
| Source IP | 3,000 RPM | 1,000 requests per 10 seconds |
| Distinct source IPs for one regular key | 200 IPs/hour | Allows normal multi-region and serverless IP churn |
| Distinct regular keys from one source IP | 200 keys/hour | Applies when many customer keys exit through the same backend IP |

RPM means requests per minute. Burst is a separate 10-second request window.
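To stay under these buckets before the server has to reject anything, a client can keep its own local guard. The sketch below is an assumption-heavy illustration: the class name `LocalRequestGuard` is hypothetical, the defaults copy the regular API key row (600 RPM, 200 requests per 10 seconds, 50 concurrent), and the sliding-window arithmetic is a client-side approximation, not a description of how GonkaGate counts requests internally.

```typescript
// Hypothetical client-side guard. It tracks a 60-second window, a separate
// 10-second burst window, and an in-flight counter for one API key.
class LocalRequestGuard {
  private timestamps: number[] = [];
  private inFlight = 0;
  private readonly perMinute: number;
  private readonly perTenSeconds: number;
  private readonly maxConcurrent: number;

  constructor(perMinute = 600, perTenSeconds = 200, maxConcurrent = 50) {
    this.perMinute = perMinute;
    this.perTenSeconds = perTenSeconds;
    this.maxConcurrent = maxConcurrent;
  }

  // Returns true and records the request only if all three local budgets allow it.
  tryAcquire(nowMs: number): boolean {
    // Forget requests that have left the 60-second window.
    this.timestamps = this.timestamps.filter((t) => nowMs - t < 60000);
    const inBurstWindow = this.timestamps.filter((t) => nowMs - t < 10000).length;

    if (
      this.inFlight >= this.maxConcurrent ||
      this.timestamps.length >= this.perMinute ||
      inBurstWindow >= this.perTenSeconds
    ) {
      return false;
    }

    this.timestamps.push(nowMs);
    this.inFlight += 1;
    return true;
  }

  // Call once per acquired request when it finishes, whether it succeeded or failed.
  release(): void {
    this.inFlight = Math.max(0, this.inFlight - 1);
  }
}
```

A caller would check `guard.tryAcquire(Date.now())` before sending, queue or delay the request when it returns false, and call `guard.release()` in a `finally` block. A guard like this only covers one process, so account-level and source-IP buckets shared across processes can still return 429.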

TPM means estimated tokens per minute. GonkaGate currently tracks token windows for telemetry and rate-limit headers: 2,000,000 TPM for regular API key and key+IP scopes, and 10,000,000 TPM for owning account and source IP scopes. Token windows do not currently block requests by themselves; request, burst, concurrency, distinct, and cooldown checks can return 429 rate_limit_exceeded.

The standard buckets above apply to normal varied traffic. Repeated synthetic requests, especially many minimal or near-identical prompts against the same key, source IP, and model, can trigger a behavioral cooldown even when the RPM, burst, and concurrency buckets are still inside their published limits. Those responses still use 429 rate_limit_exceeded and include Retry-After.

Repeated local request-limit 429 responses can also activate a short cooldown for key+IP, key, or owning account scopes. The default trigger is 3 eligible local request-limit blocks in 60 seconds. Active cooldown responses keep the same 429 rate_limit_exceeded contract and set Retry-After from the cooldown TTL.
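Because a few blocks inside one minute can escalate into a cooldown, a client may want to pause on its own before it reaches that point. The following is a minimal sketch of such a breaker. The class name `ThrottleBreaker` and its default threshold of 2 are assumptions chosen to sit below the documented default trigger of 3 blocks in 60 seconds. Which blocks count as "eligible" is decided by the server, so the client-side count is only an approximation.

```typescript
// Hypothetical client-side breaker: counts recent 429 rate_limit_exceeded
// responses and signals when the caller should pause instead of retrying.
class ThrottleBreaker {
  private blocks: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold = 2, windowMs = 60000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Record the time at which a 429 rate_limit_exceeded response was received.
  record429(nowMs: number): void {
    this.blocks.push(nowMs);
  }

  // True when enough recent blocks have accumulated that another attempt
  // risks triggering or extending a cooldown.
  shouldPause(nowMs: number): boolean {
    this.blocks = this.blocks.filter((t) => nowMs - t < this.windowMs);
    return this.blocks.length >= this.threshold;
  }
}
```

When `shouldPause` returns true, background work can wait for the Retry-After value from the last response, or for the rest of the 60-second window, before sending again.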

These are traffic limits, not spend limits. A per-key USD limit configured through API key management caps spending for that regular key, while the account prepaid USD balance remains shared across the account.

Public playground limits

POST /v1/public/chat/completions uses separate public-traffic guards. Anonymous playground requests are limited to 20 requests per minute and 50 requests per hour per browser principal, and the same 20 requests per minute and 50 requests per hour per source IP. Anonymous public playground concurrency is 2 in-flight requests per principal.

The shared public guard allows 10,000 RPM, 10,000,000 estimated TPM, and 1,000 in-flight requests across all public playground traffic. When authenticated user-billed playground mode is enabled, those requests use the authenticated /v1/* API-key buckets above, plus a playground guard of 4 in-flight requests per authenticated principal.

Use one shared retry helper

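// Resolve after the given number of milliseconds.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function requestWithRateLimitHandling(
  makeRequest: () => Promise<Response>,
  maxRetries = 3
): Promise<Response> {
  for (let attempt = 0; attempt <= maxRetries; attempt += 1) {
    const response = await makeRequest();

    // Anything other than 429 goes straight back to the caller.
    if (response.status !== 429) {
      return response;
    }

    // Read error.code from a clone so the original response body stays unread.
    const body = await response
      .clone()
      .json()
      .catch(() => null);
    const errorCode = body?.error?.code;

    // Billing state, not throttling: never retry automatically.
    if (errorCode === "insufficient_quota") {
      throw new Error("insufficient_quota");
    }

    // The last allowed attempt also returned 429: give up.
    if (attempt === maxRetries) {
      throw new Error("retry_budget_exhausted");
    }

    // Prefer the server's Retry-After value in seconds. A missing, zero, or
    // non-numeric value falls back to exponential backoff capped at 8 seconds.
    const retryAfterSeconds = Number(response.headers.get("Retry-After") ?? "0");
    const waitMs =
      retryAfterSeconds > 0 ? retryAfterSeconds * 1000 : Math.min(1000 * 2 ** attempt, 8000);

    await sleep(waitMs);
  }

  // Unreachable in practice; keeps the return type satisfied.
  throw new Error("retry_budget_exhausted");
}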

This baseline does four things: branch on error.code, stop on insufficient_quota, honor Retry-After, and cap retries.
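The helper reads Retry-After with `Number(...)`, which only understands the delay-in-seconds form. The HTTP specification also allows Retry-After to carry an HTTP date, and a fixed backoff schedule lets many clients retry at the same moment. The sketch below shows one way to handle both. The function name `computeWaitMs` and its parameters are assumptions for illustration, and this page does not state whether GonkaGate ever sends the date form.

```typescript
// Hypothetical wait calculator: accepts Retry-After as seconds or as an HTTP date,
// and adds jitter to the fallback backoff so clients do not retry in lockstep.
function computeWaitMs(
  retryAfter: string | null,
  attempt: number,
  nowMs: number = Date.now(),
  random: () => number = Math.random
): number {
  if (retryAfter !== null) {
    const value = retryAfter.trim();
    if (/^\d+$/.test(value)) {
      // Delay-seconds form, for example "2".
      const seconds = Number(value);
      if (seconds > 0) {
        return seconds * 1000;
      }
    } else {
      // HTTP-date form, for example "Thu, 01 Jan 1970 00:00:05 GMT".
      const dateMs = Date.parse(value);
      if (!Number.isNaN(dateMs) && dateMs > nowMs) {
        return dateMs - nowMs;
      }
    }
  }

  // Fallback: the same capped exponential ceiling as the helper above,
  // with the actual wait drawn from the upper half of that range.
  const ceiling = Math.min(1000 * 2 ** attempt, 8000);
  return Math.round(ceiling / 2 + random() * (ceiling / 2));
}
```

Inside the retry loop, this would replace the two lines that compute `waitMs`, called as `computeWaitMs(response.headers.get("Retry-After"), attempt)`. The clock and random source are parameters only so the function can be tested deterministically.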

Read these fields first

  • HTTP status 429
  • error.code in the JSON body
  • Retry-After when the server tells you exactly how long to wait
  • x-ratelimit-* headers when your client exposes them, so you can log the current limit, remaining allowance, and reset window
  • x-request-id for repeated failures or support escalation

Treat error.message as human-readable context only. Do not build retry logic from the message text.
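The fields listed above can be gathered into one structured log record. This is a minimal sketch: the function name `collectRateLimitFields` and the record shape are illustrative assumptions. Because this page names the rate-limit headers only as x-ratelimit-*, the sketch collects every header with that prefix instead of assuming specific header names.

```typescript
// Hypothetical log record shape for a 429 response.
interface RateLimitLogFields {
  status: number;
  errorCode: string | null;
  retryAfter: string | null;
  requestId: string | null;
  rateLimitHeaders: Record<string, string>;
}

// Collect the fields worth logging from a response's status, headers, and parsed JSON body.
function collectRateLimitFields(
  status: number,
  headers: Headers,
  body: unknown
): RateLimitLogFields {
  const rateLimitHeaders: Record<string, string> = {};
  headers.forEach((value, key) => {
    const name = key.toLowerCase();
    if (name.startsWith("x-ratelimit-")) {
      rateLimitHeaders[name] = value;
    }
  });

  // Read error.code defensively; error.message is deliberately not used for logic.
  const code = (body as { error?: { code?: unknown } } | null | undefined)?.error?.code;

  return {
    status,
    errorCode: typeof code === "string" ? code : null,
    retryAfter: headers.get("retry-after"),
    requestId: headers.get("x-request-id"),
    rateLimitHeaders,
  };
}
```

With a fetch `Response`, a caller could pass `response.status`, `response.headers`, and the result of `response.clone().json()` (or `null` if parsing fails), then log the returned object alongside the retry decision.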

Common mistakes

  • Treating every 429 as retryable throttling. In GonkaGate, insufficient_quota is a billing state, not a backoff case.
  • Ignoring Retry-After when it is present. That usually creates synchronized retries and more throttling.
  • Hiding insufficient_quota behind automatic retries. Stop and show a billing or top-up state instead.
  • Assuming that many generated gp-... keys each get a separate account-level quota. Per-key buckets are separate, but keys under one GonkaGate account still share that account’s aggregate bucket.
  • Treating a synthetic load test with identical minimal prompts as proof of normal sustained throughput. Use representative varied prompts, or coordinate the test window with support.
  • Letting workers, cron jobs, or batch traffic retry forever. Keep the retry budget small and make interactive traffic the priority.

