AI Gateway — agentic plane

NEW

Govern every AI request. Tokens, cost, and risk — one gateway.

Track every token. Cap every budget. Route across 17+ providers behind one OpenAI-compatible endpoint. Apinizer governs your LLM, MCP, and agent traffic on the same runtime that already runs your REST APIs — same audit, same identity, same operators on call.

Request a demo Read the docs

Providers17+ LLMs
StandardsOpenAI · MCP · A2A
ModalitiesChat · Embed · Audio · Image · Video

Token economics and cost control — first-class

Every prompt, every response, every embedding — counted, attributed, and capped. Set token budgets per project, team, user, or API key, in any time window. The same three-tier permission model (System / Project / Team) that owns REST quotas owns AI spend, so the people who own the workload also own the bill.

Live token tracking — input, output, cached, and total — per request
Per-window ceilings: minute, hour, day, month, or contract period
Per-scope ceilings: user, API key, team, project, or model class
Hard caps, soft caps with warnings, and burst windows for short spikes
Cost attribution back to a project or cost center — finance gets a line item, not a mystery
Auto fall-back to a cheaper model or a cached answer when the budget tips

Tracked per request: input · output · cached · total tokens
Quota windows: minute · hour · day · month · custom
Quota scopes: user · API key · team · project · model
Enforcement: hard cap · soft warning · graceful fallback
Reporting: cost by project · model · team · time range

Cost-aware multi-LLM routing

Write the application once against an OpenAI-compatible endpoint. The gateway decides which model actually answers — based on cost, latency, model class, or a per-prompt classifier. Drop in a cheaper model for summarisation, send tier-one customers to the frontier model, fall back to a self-hosted model when a provider degrades.

OpenAI-compatible request and response shape — no client rewrites
Weighted routing across cost, latency, success rate, or model class
Per-prompt classifier route — cheap model for the easy 80%, frontier for the hard 20%
Provider fall-back chains — degrade gracefully when one provider stalls or rate-limits
Streaming, function calling, tool use, batch, and file upload — preserved across providers
Pinned model versions and canary releases — promote a cheaper model behind a 5% slice first

FrontierOpenAI · Anthropic · Gemini · Vertex AI
Cloud-hostedBedrock · Azure OpenAI · Databricks · DashScope
Open-weightsCohere · Mistral · Llama · DeepSeek · xAI · Cerebras
Self-hostedOllama · vLLM · Hugging Face · in-cluster
Embeddings & visionthe matching family on each provider

Response cache today — semantic cache coming soon

Skip the token bill on repeat prompts. Apinizer's two-tier Cache fronts LLM responses on an exact-prompt match — a local in-pod tier for sub-millisecond hits, a Hazelcast cluster for cross-pod truth. Same cache the gateway already runs for REST responses, with the same invalidation and the same operator dashboard. Semantic cache, with embedding-similarity matching, is on the roadmap for an upcoming release.

Two-tier response cache — local in-pod tier plus the Hazelcast cluster shared with the REST gateway
Exact-match keying on prompt + system message + model + tools — never blends user contexts
Configurable TTL per route, with stampede protection for hot prompts
Per-project, per-model, and per-template invalidation — atomic on redeploy
Live hit-rate, token-spend-saved, and latency-saved reports per cache bucket
Coming soon: semantic cache with embedding-similarity matching for prompts that mean the same thing

Match (today): exact prompt + system + model + tools
Tiers: local in-pod · Hazelcast cluster
TTL: per route · stampede-safe
Invalidation: project · model · template · atomic on redeploy
Reports: hit rate · tokens saved · latency saved
Roadmap: semantic similarity match — coming soon

Prompt engineering with guardrails

Prompt templates, system messages, decorators, and few-shot examples live on the gateway — versioned, reviewed, and promoted across environments like any other artifact. Application code stops carrying the prompt; engineering owns the prompt the way it owns the schema.

Prompt templates with parameter binding from request, identity, and project context
System prompts and prompt decorators applied at the gateway, not in the client
Prompt versioning with rollback, A/B canary, and reviewer sign-off
Few-shot examples and tool descriptions managed as artifacts
Per-environment promotion through APIops — dev to test to prod

yaml

# Prompt template — versioned, reviewed, deployed via APIopsname: support-summaryversion: 3system: |  You are a support summariser for {{project.name}}.  Keep it under 80 words. Cite the ticket id.tools: [ticket.lookup, kb.search]guardrails: [no-pii, no-credentials, on-topic]

Prompt firewall — injection, jailbreak, and data loss

Block the patterns that put regulated AI projects on hold: prompt injection, jailbreak chains, credential and PII exfiltration, off-topic prompts that waste budget, and tool-use abuse. Apinizer runs the guards inline — the bad request never reaches the model.

Inline prompt guards — injection, jailbreak, role override, system-prompt extraction
Off-topic and cost guards — refuse essays on company time
PII detection and redaction on both the prompt and the response
Credential and secret patterns blocked before they reach the provider
Tool-use guard — only the tools allowed for the calling identity are exposed
Custom redaction policies per project — the team owns its own definitions

Inline guards: injection · jailbreak · role override
Data guards: PII · credentials · secrets
Budget guards: off-topic · oversize · loop detection
Tool guard: scoped per identity
Decision: block · redact · downgrade · alert

AI observability — every prompt, every token, every model

The Analytics Engine ingests AI traffic next to REST traffic. One query answers cost-by-team, latency-by-provider, error-rate-by-model, and which prompt template tripped the firewall last night. Operators see token spend in the same dashboard where they see request rate.

Token spend by user, project, team, model, and time window
Latency, time-to-first-token, and throughput per provider and per model
Error rate, retry rate, and timeout rate per provider
Tool-use and function-call traces — see the agent's reasoning chain
Prompt firewall hits with the decision, the rule, and the offending substring
Real-time anomaly detection on token spend and latency — the bill never surprises you twice

Cost dashboardstokens · dollars · by project, team, model
Latency dashboardsTTFT · p50 · p99 · by provider
Reliabilityerror rate · retry rate · timeout rate
Securityfirewall hits · redaction count · injection attempts
Anomaliesspend spikes · latency spikes · cache miss surges

MCP servers and agent-to-agent governance

Agents talk through the gateway like any other client. Generate Model Context Protocol servers from the APIs you already published, decide which agent can see which tool, and audit every agent-to-agent message — with the same identity surface that fronts your REST and AI traffic.

Auto-generate MCP servers from existing REST, SOAP, and gRPC APIs
Per-agent identity provisioned in Identity Manager — scoped tokens, not shared API keys
Tool-level RBAC — which agent can call which tool, on which project
Agent-to-agent (A2A) message audit at the persistence layer — same record as REST
Context Mesh — agents consume API data and event streams through one governed surface
Per-agent quotas and rate limits — runaway agents cap themselves

MCP: auto-generated from API catalog
Identity: per-agent, scoped, audited
RBAC: tool-level · project-level
A2A: audit + replay at persistence layer
Quotas: per agent · per tool · per task

One gateway, one audit, one runtime

AI Gateway is not a side-car. It is a layer of policies on the same gateway that runs your REST, gRPC, WebSocket, SOAP, and GraphQL traffic. Same identity, same audit, same observability, same operators. There is no second control plane to learn, no second pager rotation, no second invoice.

Same gateway runtime — REST, gRPC, WebSocket, SOAP, GraphQL, and AI on one process
Same identity surface — OAuth 2.0, OIDC, JWT, mTLS, SAML for humans, agents, and partners
Same audit at the persistence layer — bypass rejected at compile time
Same three-tier permission model — System, Project, Team — across API and AI
Same hot-deploy path — change a prompt or a route without restarting a pod
Same Kubernetes posture — self-hosted, air-gap friendly, no data leaves the cluster

Runtimeone gateway process · API + AI
Identityhumans · agents · partners on one surface
Auditpersistence-layer, immutable, replayable
RBACSystem / Project / Team — everywhere
Deployhot — prompts, routes, models
PostureKubernetes-native · air-gap friendly

Use cases

Where teams put it to work.

Stop the AI bill from running away

Token budgets per project. Response cache on the hot path. Cheap-model fall-back when the budget tips. Cost attribution by team and project. The AI line item stops being a surprise, and finance gets a chargeback report instead of a Slack message.

Per-project, per-team, and per-user token ceilings
Response cache with hit-rate and savings reports (semantic cache on the roadmap)
Cheaper-provider fall-back inside a latency target
Cost attribution back to the project that ran the workload
Monthly chargeback exports for finance

In the box

What's included

The capabilities below are part of the standard install — no add-on SKUs and no separate licenses.

AI traffic types

Chat completions — streaming and batch
Embeddings — single and batch
Audio — transcription, text-to-speech, translation
Image — generation, edit, variation
Video — generation across supported providers
Function calling and tool use
Agent-to-Agent (A2A) messages
Model Context Protocol (MCP) interactions

Cost & token governance

Live token tracking — input, output, cached, total
Ceilings per window — minute, hour, day, month, custom
Ceilings per scope — user, API key, team, project, model
Soft warnings, hard caps, burst windows
Auto fall-back to cached answer or cheaper model on budget tip
Chargeback exports back to the project that ran the workload

Security & guardrails

Prompt injection and jailbreak guards
PII detection and redaction on prompts and responses
Credential and secret pattern blocking
Off-topic and oversize prompt guards
Tool-use RBAC per identity
Per-project redaction policies
Audit trail at the persistence layer

Operability

Same gateway runtime as REST, gRPC, and WebSocket
Same identity, same audit, same RBAC across API and AI
Hot deploy for prompts, routes, and model selection
Three-tier permission model (System / Project / Team)
Live cost, latency, and reliability dashboards
Kubernetes-native, air-gap-friendly deployment

Resources

Keep going

AI Gateway docs

Configure providers, set token budgets, write prompt firewall policies, and observe AI traffic alongside REST.

Read the docs

Cost & token playbook

Patterns for project budgets, chargeback, response cache tuning, and cheap-model fall-back chains.

See the playbook

Provider quickstarts

Drop-in recipes for OpenAI, Anthropic, Bedrock, Azure OpenAI, Gemini, and self-hosted Llama or vLLM.

Browse providers

Prompt firewall reference

The guard catalog — injection, jailbreak, PII, credentials, off-topic, tool-use — with policy snippets.

Read the catalog

AI observability guide

Cost, latency, reliability, and firewall dashboards in the Analytics Engine — one query for API and AI.

Open the guide

Architecture overview

How the AI plane shares one runtime with the API Gateway, Identity Manager, Cache, and Analytics Engine.

See the suite

Migration from a side-car gateway

A short field guide for teams running a dedicated AI gateway today — what to keep, what to retire, what to consolidate.

Read the guide

Govern every AI request

Bring tokens, agents, and risk under one control plane.

A 30-minute walkthrough of the Apinizer AI Gateway — token budgets, multi-LLM routing, response cache, prompt firewall, MCP, and AI observability — on a Kubernetes of your choice.

Request a demo Read the docs