Most applications writing against an LLM today pin one provider. The choice happens early — usually around the first prototype — and then it gets hard to change. Pricing shifts, latency drifts, a new model comes out, the regulator asks where prompts are going, and the application team rebuilds the integration.
Apinizer's AI Gateway gives you one OpenAI-compatible endpoint your applications speak to, and routes the request across 17 providers behind the gateway. Switching is a manifest change, not a sprint.
The shape of the contract
Applications keep calling POST /v1/chat/completions against the
gateway. The gateway is responsible for picking a provider, translating
the request to that provider's native format, and translating the
response back.
# routes.yaml — APIops manifest
ai_routes:
- name: chat-mid-tier
match:
model: "gpt-4o-mini"
targets:
- provider: anthropic
model: claude-haiku-4-5
weight: 60
max_latency_ms: 800
- provider: openai
model: gpt-4o-mini
weight: 40
max_latency_ms: 800
fallback: ollama/llama-3.1
The application doesn't change. The route does.
What "OpenAI-compatible facade" actually means
It means three things:
- The request shape coming into the gateway is OpenAI's.
- The response shape going back is OpenAI's.
- Streaming works — including SSE chunked responses.
Anthropic's messages field maps to OpenAI's messages. Bedrock's
inferenceConfig maps to OpenAI's temperature / top_p. Gemini's
safetySettings get filled from the Apinizer policy chain. The gateway
handles all of this; the application stays on the OpenAI SDK it already
uses.
Quotas and audit
Every request goes through the same MessageContext your REST traffic
uses. Per-credential quotas are enforced before the provider call. The
audit trail captures the prompt, the chosen route, the provider, the
token count, and the response time — alongside REST audit data, in the
same Elasticsearch index.
Your operators don't need a second observability stack for AI traffic. The Analytics Engine they're already running picks it up.
Failover that doesn't lose traffic
Failover is policy-driven, not retry-on-error. If the chosen provider returns a 5xx, exceeds the latency target, or hits a quota ceiling, the next provider in the route runs. The application sees one response shape, one timeout, one log line.
Self-hosted fallbacks (vLLM, Ollama, Llama) are first-class — many regulated customers run a self-hosted "last-resort" provider so traffic never leaves the cluster, even if every external provider is down.
What's next
The route format above is stable for 2026.04. We're working on semantic routing — choose a provider per-prompt based on prompt classification, not just per-route — for 2026.09. If you have a use case that wants per-prompt routing, ping us.