Best Practices

Optimize your API usage for cost, performance, and reliability.

Cost Optimization

Keep your API costs under control with these strategies:

Real-time monitoring: Track costs via usage.total_cost_usd in the API response. Every response includes the exact request cost.

Low balance warnings: Set up alerts when balance drops below a threshold (e.g., $10) to avoid service interruption.

cost-tracking.ts
// Track costs after each request
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.gonkagate.com/v1",
  apiKey: process.env.GONKAGATE_API_KEY, // key stored in an environment variable (name is an example)
});

// Starting balance, fetched from your account before the first request
let runningBalance = parseFloat(initialBalance);

async function sendMessage(content: string) {
  const response = await client.chat.completions.create({
    model: "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    messages: [{ role: "user", content }],
  });

  // total_cost_usd is a gateway extension field, so it is not in the SDK's types
  const usage = response.usage as any;
  runningBalance -= usage.total_cost_usd;

  if (runningBalance < 10) {
    showLowBalanceWarning();
  }

  return response;
}

Performance

Optimize response times and throughput:

Use streaming for long responses. Streaming delivers tokens as they are generated, turning a 5-10 second wait for a complete response into immediate feedback.

streaming.ts
const stream = await client.chat.completions.create({
  model: "Qwen/Qwen3-8B",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}

Cache repeated queries on the application side. At temperature: 0, identical prompts yield identical responses, so a cache hit saves both latency and cost.
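A minimal in-memory sketch of such a cache, keyed by prompt. Here callModel is a stand-in for whatever wrapper around client.chat.completions.create your application uses:

caching.ts

```typescript
// Application-side cache for deterministic (temperature: 0) completions.
const responseCache = new Map<string, string>();

async function cachedCompletion(
  prompt: string,
  callModel: (prompt: string) => Promise<string>
): Promise<string> {
  const hit = responseCache.get(prompt);
  if (hit !== undefined) {
    return hit; // cache hit: no API call, no cost
  }
  const answer = await callModel(prompt);
  responseCache.set(prompt, answer);
  return answer;
}
```

For long-running services, consider bounding the cache size or adding an expiry so it does not grow without limit.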

Tip

Batch related requests: send multiple messages in a single conversation instead of separate API calls.
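One way to apply this tip is to fold several pending questions into a single user turn. The buildMessages helper and the ChatMessage shape below are illustrative; the message format matches the Chat Completions API:

batching.ts

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Fold several pending questions into one user turn, so a single API call
// replaces N separate calls (one set of request and prompt overhead).
function buildMessages(history: ChatMessage[], questions: string[]): ChatMessage[] {
  return [...history, { role: "user", content: questions.join("\n") }];
}
```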

Reliability

Build robust applications that handle failures gracefully:

Implement retries with exponential backoff. On 429 (rate limit) or 503 (network unavailable) errors, retry with an increasing delay.

retry.ts
async function requestWithRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      const status = error?.status;
      if (status === 429 || status === 503) {
        const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
        await new Promise(r => setTimeout(r, delay));
        continue;
      }
      throw error;
    }
  }
  throw new Error("Max retries exceeded");
}

Set reasonable timeouts: we recommend 30-60 seconds for chat completions and 10-15 seconds for embeddings.
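If your HTTP client exposes a timeout option, prefer it. Otherwise, a generic Promise.race wrapper works with any promise-returning call; withTimeout is a name made up for this sketch:

timeout.ts

```typescript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer to avoid a leak.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// e.g. 45s for a chat completion, 15s for an embedding request:
// const response = await withTimeout(client.chat.completions.create({ ... }), 45_000);
```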

Security

Keep your API keys and data secure:

  • Never expose API keys in frontend code or public repositories
  • Store keys in environment variables, not in code
  • Rotate keys regularly, especially if you suspect a leak
  • Use separate keys for development and production
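A small helper makes the environment-variable rule easy to enforce at startup. The requireEnv name and the GONKAGATE_API_KEY variable are illustrative:

env.ts

```typescript
// Fail fast at startup if a required secret is missing, instead of
// discovering it on the first API call.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage (variable name is an example):
// const client = new OpenAI({
//   baseURL: "https://api.gonkagate.com/v1",
//   apiKey: requireEnv("GONKAGATE_API_KEY"),
// });
```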

Error Handling

Handle errors gracefully for a better user experience:

Check the HTTP status before parsing the response. 4xx codes indicate client errors; 5xx codes indicate server-side problems that are usually temporary.

error_handling.py
from openai import OpenAI, APIError

client = OpenAI(base_url="https://api.gonkagate.com/v1", api_key="gp-...")

try:
    response = client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": "Hello"}]
    )
except APIError as e:
    if e.status_code == 401:
        print("Invalid API key")
    elif e.status_code == 402:
        print("Insufficient balance")
    elif e.status_code == 429:
        print("Rate limit exceeded")
    elif e.status_code == 503:
        print("Network temporarily unavailable")
    else:
        print(f"Unexpected error: {e}")