AI Gateway — agentic plane

NEW

Govern every AI request. Tokens, cost, and risk — one gateway.

Track every token. Cap every budget. Route across 17+ providers behind one OpenAI-compatible endpoint. Apinizer governs your LLM, MCP, and agent traffic on the same runtime that already runs your REST APIs — same audit, same identity, same operators on call.

  • Providers: 17+ LLMs
  • Standards: OpenAI · MCP · A2A
  • Modalities: Chat · Embed · Audio · Image · Video

Token economics and cost control — first-class

Every prompt, every response, every embedding — counted, attributed, and capped. Set token budgets per project, team, user, or API key, in any time window. The same three-tier permission model (System / Project / Team) that owns REST quotas owns AI spend, so the people who own the workload also own the bill.

  • Live token tracking — input, output, cached, and total — per request
  • Per-window ceilings: minute, hour, day, month, or contract period
  • Per-scope ceilings: user, API key, team, project, or model class
  • Hard caps, soft caps with warnings, and burst windows for short spikes
  • Cost attribution back to a project or cost center — finance gets a line item, not a mystery
  • Automatic fall-back to a cheaper model or a cached answer when a budget ceiling is reached

Tracked per request: input · output · cached · total tokens
Quota windows: minute · hour · day · month · custom
Quota scopes: user · API key · team · project · model
Enforcement: hard cap · soft warning · graceful fallback
Reporting: cost by project · model · team · time range
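
The enforcement modes listed above can be pictured with a small sketch. It is illustrative only: `TokenBudget`, `check`, and the 80% soft-warning threshold are hypothetical names and values chosen for the example, not Apinizer's API or configuration.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical per-scope, per-window token ceiling (not Apinizer's API)."""
    scope: str                # e.g. "team:support"
    window: str               # e.g. "day"
    hard_cap: int             # tokens allowed inside the window
    soft_ratio: float = 0.8   # assumed soft-warning threshold
    used: int = 0

    def check(self, input_tokens: int, output_tokens: int) -> str:
        """Record usage and return 'allow', 'warn', or 'fallback'."""
        projected = self.used + input_tokens + output_tokens
        if projected > self.hard_cap:
            # Ceiling reached: the caller serves a cached answer or a cheaper model.
            return "fallback"
        self.used = projected
        if projected >= self.hard_cap * self.soft_ratio:
            return "warn"
        return "allow"


budget = TokenBudget(scope="team:support", window="day", hard_cap=1000)
decisions = [budget.check(300, 200), budget.check(200, 150), budget.check(100, 100)]
print(decisions)
```

A request that would cross the hard cap is not counted against the budget here, which is one possible design; a real gateway might instead meter it and bill the overage.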

Cost-aware multi-LLM routing

Write the application once against an OpenAI-compatible endpoint. The gateway decides which model actually answers, based on cost, latency, model class, or a per-prompt classifier. Drop in a cheaper model for summarisation, send tier-one customers to the frontier model, and fall back to a self-hosted model when a provider degrades.

  • OpenAI-compatible request and response shape — no client rewrites
  • Weighted routing across cost, latency, success rate, or model class
  • Per-prompt classifier route — cheap model for the easy 80%, frontier for the hard 20%
  • Provider fall-back chains — degrade gracefully when one provider stalls or rate-limits
  • Streaming, function calling, tool use, batch, and file upload — preserved across providers
  • Pinned model versions and canary releases — promote a cheaper model behind a 5% slice first
  • Frontier: OpenAI · Anthropic · Gemini · Vertex AI
  • Cloud-hosted: Bedrock · Azure OpenAI · Databricks · DashScope
  • Open-weights: Cohere · Mistral · Llama · DeepSeek · xAI · Cerebras
  • Self-hosted: Ollama · vLLM · Hugging Face · in-cluster
  • Embeddings & vision: the matching family on each provider
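
A provider fall-back chain with a simple easy/hard split might look like the sketch below. The route table, model names, prices, and the `ProviderUnavailable` exception are all assumptions made for illustration; they do not describe how Apinizer configures or implements routing.

```python
from typing import Callable

# Hypothetical route table: model names, prices, and chain order are assumptions.
ROUTES = [
    {"model": "frontier-large", "cost_per_1k": 0.010},
    {"model": "hosted-small", "cost_per_1k": 0.002},
    {"model": "self-hosted-llama", "cost_per_1k": 0.0},
]


class ProviderUnavailable(Exception):
    """Raised by a provider call that stalls or is rate-limited."""


def route(prompt: str, call: Callable[[str, str], str], hard: bool) -> tuple[str, str]:
    """Try models in chain order; easy prompts skip the frontier model."""
    chain = ROUTES if hard else ROUTES[1:]
    for entry in chain:
        try:
            return entry["model"], call(entry["model"], prompt)
        except ProviderUnavailable:
            continue  # degrade to the next provider in the chain
    raise ProviderUnavailable("all providers in the chain failed")


def fake_call(model: str, prompt: str) -> str:
    # Stand-in for a real provider call; simulates one degraded provider.
    if model == "hosted-small":
        raise ProviderUnavailable(model)
    return f"{model} answered"


chosen, answer = route("Summarise this ticket", fake_call, hard=False)
print(chosen, "->", answer)
```

Because the client only ever sees one endpoint, the choice of `chosen` stays invisible to application code; in this sketch the `hard` flag stands in for the per-prompt classifier.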

Response cache today — semantic cache coming soon

Skip the token bill on repeat prompts. Apinizer's two-tier cache serves LLM responses on an exact-prompt match: a local in-pod tier for sub-millisecond hits, backed by a Hazelcast cluster that shares entries across pods. It is the same cache the gateway already runs for REST responses, with the same invalidation and the same operator dashboard. Semantic cache, with embedding-similarity matching, is on the roadmap for an upcoming release.

  • Two-tier response cache — local in-pod tier plus the Hazelcast cluster shared with the REST gateway
  • Exact-match keying on prompt + system message + model + tools — never blends user contexts
  • Configurable TTL per route, with stampede protection for hot prompts
  • Per-project, per-model, and per-template invalidation — atomic on redeploy
  • Live hit-rate, token-spend-saved, and latency-saved reports per cache bucket
  • Coming soon: semantic cache with embedding-similarity matching for prompts that mean the same thing

Match (today): exact prompt + system + model + tools
Tiers: local in-pod · Hazelcast cluster
TTL: per route · stampede-safe
Invalidation: project · model · template · atomic on redeploy
Reports: hit rate · tokens saved · latency saved
Roadmap: semantic similarity match (coming soon)
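
The exact-match keying and the two-tier lookup can be sketched as follows. This is a minimal illustration under stated assumptions: `cache_key` and `TwoTierCache` are hypothetical names, a plain dict stands in for the Hazelcast tier, tool order is assumed not to matter, and stampede protection and invalidation are left out.

```python
import hashlib
import json
import time


def cache_key(model: str, system: str, prompt: str, tools: list[str]) -> str:
    """Exact-match key over model + system message + prompt + tools."""
    payload = json.dumps(
        {"model": model, "system": system, "prompt": prompt, "tools": sorted(tools)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TwoTierCache:
    """Local dict in front of a shared store (a dict stands in for the cluster tier)."""

    def __init__(self, shared: dict, ttl_seconds: float):
        self.local: dict = {}
        self.shared = shared
        self.ttl = ttl_seconds

    def get(self, key: str):
        for tier in (self.local, self.shared):
            entry = tier.get(key)
            if entry and entry[1] > time.monotonic():
                self.local[key] = entry  # promote shared hits to the local tier
                return entry[0]
        return None

    def put(self, key: str, value: str) -> None:
        entry = (value, time.monotonic() + self.ttl)
        self.local[key] = entry
        self.shared[key] = entry


shared: dict = {}
pod_a, pod_b = TwoTierCache(shared, 60), TwoTierCache(shared, 60)
key = cache_key("m1", "You are terse.", "What is MCP?", ["kb.search"])
pod_a.put(key, "cached answer")
hit = pod_b.get(key)  # served from the shared tier, then promoted locally
miss = pod_b.get(cache_key("m2", "You are terse.", "What is MCP?", ["kb.search"]))
```

Changing any keyed field, here the model, yields a different hash, so a response cached for one context is never returned for another.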

Prompt engineering with guardrails

Prompt templates, system messages, decorators, and few-shot examples live on the gateway — versioned, reviewed, and promoted across environments like any other artifact. Application code stops carrying the prompt; engineering owns the prompt the way it owns the schema.

  • Prompt templates with parameter binding from request, identity, and project context
  • System prompts and prompt decorators applied at the gateway, not in the client
  • Prompt versioning with rollback, A/B canary, and reviewer sign-off
  • Few-shot examples and tool descriptions managed as artifacts
  • Per-environment promotion through APIops — dev to test to prod
yaml
# Prompt template: versioned, reviewed, deployed via APIops
name: support-summary
version: 3
system: |
  You are a support summariser for {{project.name}}.
  Keep it under 80 words. Cite the ticket id.
tools: [ticket.lookup, kb.search]
guardrails: [no-pii, no-credentials, on-topic]
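
Parameter binding for a template like the one above could work roughly as in this sketch. The `render` helper and the `Helpdesk` project name are hypothetical; the source does not describe how the gateway actually resolves `{{project.name}}`.

```python
import re

SYSTEM_TEMPLATE = (
    "You are a support summariser for {{project.name}}.\n"
    "Keep it under 80 words. Cite the ticket id."
)


def render(template: str, context: dict) -> str:
    """Replace {{dotted.path}} placeholders with values from a nested dict."""

    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            value = value[part]  # KeyError on a missing binding, by design
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# "Helpdesk" is an example value; in practice it would come from project context.
system_message = render(SYSTEM_TEMPLATE, {"project": {"name": "Helpdesk"}})
print(system_message)
```

Failing loudly on a missing binding is a deliberate choice in this sketch: a prompt sent with an unresolved placeholder is usually worse than a rejected request.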

Prompt firewall — injection, jailbreak, and data loss

Block the patterns that put regulated AI projects on hold: prompt injection, jailbreak chains, credential and PII exfiltration, off-topic prompts that waste budget, and tool-use abuse. Apinizer runs the guards inline — the bad request never reaches the model.

  • Inline prompt guards — injection, jailbreak, role override, system-prompt extraction
  • Off-topic and cost guards — refuse essays on company time
  • PII detection and redaction on both the prompt and the response
  • Credential and secret patterns blocked before they reach the provider
  • Tool-use guard — only the tools allowed for the calling identity are exposed
  • Custom redaction policies per project — the team owns its own definitions

Inline guards: injection · jailbreak · role override
Data guards: PII · credentials · secrets
Budget guards: off-topic · oversize · loop detection
Tool guard: scoped per identity
Decision: block · redact · downgrade · alert
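
As a rough picture of how an inline guard reaches a block or redact decision, consider the sketch below. The three regular expressions are deliberately simplistic examples, and `inspect` is a hypothetical function; production guards need far broader coverage than pattern matching, and nothing here reflects Apinizer's actual rule set.

```python
import re

# Illustrative patterns only; real guards need far broader coverage.
INJECTION = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)
SECRET = re.compile(r"\b(sk-[A-Za-z0-9]{16,}|AKIA[0-9A-Z]{16})\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def inspect(prompt: str) -> tuple[str, str]:
    """Return (decision, text) where decision is 'block', 'redact', or 'allow'."""
    if INJECTION.search(prompt) or SECRET.search(prompt):
        # Blocked prompts never reach the provider, so no text is forwarded.
        return "block", ""
    redacted = EMAIL.sub("[REDACTED_EMAIL]", prompt)
    if redacted != prompt:
        return "redact", redacted
    return "allow", prompt


print(inspect("Contact jane@example.com about ticket 42"))
```

The same function could run on the model's response before it is returned, which is how redaction on both directions would fit this shape.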

AI observability — every prompt, every token, every model

The Analytics Engine ingests AI traffic next to REST traffic. One query answers cost-by-team, latency-by-provider, error-rate-by-model, and which prompt template tripped the firewall last night. Operators see token spend in the same dashboard where they see request rate.

  • Token spend by user, project, team, model, and time window
  • Latency, time-to-first-token, and throughput per provider and per model
  • Error rate, retry rate, and timeout rate per provider
  • Tool-use and function-call traces that show each step of an agent's call chain
  • Prompt firewall hits with the decision, the rule, and the offending substring
  • Real-time anomaly detection on token spend and latency — the bill never surprises you twice
  • Cost dashboards: tokens · dollars · by project, team, model
  • Latency dashboards: TTFT · p50 · p99 · by provider
  • Reliability: error rate · retry rate · timeout rate
  • Security: firewall hits · redaction count · injection attempts
  • Anomalies: spend spikes · latency spikes · cache miss surges
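
The cost-by-team and spend-anomaly reports reduce to simple aggregations over request records. The sketch below assumes a hypothetical row shape, example prices, and a three-sigma rule for spikes; the Analytics Engine's real schema and detection method are not described in the source.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical request log rows: (team, model, total_tokens, usd_per_1k_tokens)
rows = [
    ("support", "frontier-large", 1200, 0.010),
    ("support", "hosted-small", 4000, 0.002),
    ("search", "hosted-small", 2500, 0.002),
]


def cost_by(rows, index: int) -> dict:
    """Sum dollar cost grouped by the column at `index` (0 = team, 1 = model)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[index]] += row[2] / 1000 * row[3]
    return {key: round(value, 4) for key, value in totals.items()}


def is_spike(history: list[float], latest: float, sigmas: float = 3.0) -> bool:
    """Flag `latest` when it exceeds the historical mean by `sigmas` deviations."""
    return latest > mean(history) + sigmas * pstdev(history)


by_team = cost_by(rows, 0)
by_model = cost_by(rows, 1)
spike = is_spike([100.0, 110.0, 90.0, 105.0], 400.0)
```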

MCP servers and agent-to-agent governance

Agents talk through the gateway like any other client. Generate Model Context Protocol servers from the APIs you already published, decide which agent can see which tool, and audit every agent-to-agent message — with the same identity surface that fronts your REST and AI traffic.

  • Auto-generate MCP servers from existing REST, SOAP, and gRPC APIs
  • Per-agent identity provisioned in Identity Manager — scoped tokens, not shared API keys
  • Tool-level RBAC — which agent can call which tool, on which project
  • Agent-to-agent (A2A) message audit at the persistence layer — same record as REST
  • Context Mesh — agents consume API data and event streams through one governed surface
  • Per-agent quotas and rate limits, so a runaway agent is capped at the gateway

MCP: auto-generated from API catalog
Identity: per-agent, scoped, audited
RBAC: tool-level · project-level
A2A: audit + replay at persistence layer
Quotas: per agent · per tool · per task
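
Tool-level RBAC comes down to a deny-by-default lookup keyed on agent and project. In the sketch below the grant table, agent ids, and function names are hypothetical; they only illustrate the idea that an agent sees and calls nothing it has not been granted.

```python
# Hypothetical grants: (agent id, project) -> set of tools the agent may call.
GRANTS = {
    ("agent:summariser", "helpdesk"): {"ticket.lookup", "kb.search"},
    ("agent:billing-bot", "finance"): {"invoice.read"},
}


def visible_tools(agent: str, project: str, catalog: list[str]) -> list[str]:
    """Expose only the catalog tools granted to this agent on this project."""
    allowed = GRANTS.get((agent, project), set())
    return [tool for tool in catalog if tool in allowed]


def authorize(agent: str, project: str, tool: str) -> bool:
    """Deny by default: unknown agents or projects get no tools."""
    return tool in GRANTS.get((agent, project), set())


catalog = ["ticket.lookup", "kb.search", "invoice.read"]
print(visible_tools("agent:summariser", "helpdesk", catalog))
```

Filtering the tool list before it is advertised, and checking again at call time, keeps an agent from invoking a tool it guessed the name of.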

One gateway, one audit, one runtime

AI Gateway is not a side-car. It is a layer of policies on the same gateway that runs your REST, gRPC, WebSocket, SOAP, and GraphQL traffic. Same identity, same audit, same observability, same operators. There is no second control plane to learn, no second pager rotation, no second invoice.

  • Same gateway runtime — REST, gRPC, WebSocket, SOAP, GraphQL, and AI on one process
  • Same identity surface — OAuth 2.0, OIDC, JWT, mTLS, SAML for humans, agents, and partners
  • Same audit at the persistence layer — bypass rejected at compile time
  • Same three-tier permission model — System, Project, Team — across API and AI
  • Same hot-deploy path — change a prompt or a route without restarting a pod
  • Same Kubernetes posture — self-hosted, air-gap friendly, no data leaves the cluster
  • Runtime: one gateway process · API + AI
  • Identity: humans · agents · partners on one surface
  • Audit: persistence-layer, immutable, replayable
  • RBAC: System / Project / Team, everywhere
  • Deploy: hot for prompts, routes, models
  • Posture: Kubernetes-native · air-gap friendly

Use cases

Where teams put it to work.

Stop the AI bill from running away

Token budgets per project. Response cache on the hot path. Cheap-model fall-back when a budget ceiling is reached. Cost attribution by team and project. The AI line item stops being a surprise, and finance gets a chargeback report instead of a Slack message.

  • Per-project, per-team, and per-user token ceilings
  • Response cache with hit-rate and savings reports (semantic cache on the roadmap)
  • Cheaper-provider fall-back inside a latency target
  • Cost attribution back to the project that ran the workload
  • Monthly chargeback exports for finance

In the box

What's included

The capabilities below are part of the standard install — no add-on SKUs and no separate licenses.

AI traffic types

  • Chat completions — streaming and batch
  • Embeddings — single and batch
  • Audio — transcription, text-to-speech, translation
  • Image — generation, edit, variation
  • Video — generation across supported providers
  • Function calling and tool use
  • Agent-to-Agent (A2A) messages
  • Model Context Protocol (MCP) interactions

Cost & token governance

  • Live token tracking — input, output, cached, total
  • Ceilings per window — minute, hour, day, month, custom
  • Ceilings per scope — user, API key, team, project, model
  • Soft warnings, hard caps, burst windows
  • Automatic fall-back to a cached answer or a cheaper model when a budget ceiling is reached
  • Chargeback exports back to the project that ran the workload

Security & guardrails

  • Prompt injection and jailbreak guards
  • PII detection and redaction on prompts and responses
  • Credential and secret pattern blocking
  • Off-topic and oversize prompt guards
  • Tool-use RBAC per identity
  • Per-project redaction policies
  • Audit trail at the persistence layer

Operability

  • Same gateway runtime as REST, gRPC, WebSocket, SOAP, and GraphQL
  • Same identity, same audit, same RBAC across API and AI
  • Hot deploy for prompts, routes, and model selection
  • Three-tier permission model (System / Project / Team)
  • Live cost, latency, and reliability dashboards
  • Kubernetes-native, air-gap-friendly deployment

Govern every AI request

Bring tokens, agents, and risk under one control plane.

A 30-minute walkthrough of the Apinizer AI Gateway, covering token budgets, multi-LLM routing, response cache, prompt firewall, MCP, and AI observability, on a Kubernetes cluster of your choice.