API gateway pattern for AI applications
Traditional API gateways count requests and measure response times, but AI applications need fundamentally different capabilities. Token-based rate limiting, multi-model routing with automatic fallbacks, granular cost attribution, and specialized observability become mandatory. Learn how the API gateway pattern adapts for production AI workloads and why traditional approaches fail.

Key takeaways
- Traditional gateways miss AI requirements - Request counting fails when costs vary by tokens, models charge different rates, and responses arrive at unpredictable speeds
- Token tracking is non-negotiable - Without token-based rate limiting and cost attribution, you'll burn through your budget before you notice anything is wrong
- Multi-model fallback keeps production running - When your primary model hits limits or fails, automatic routing to backup models prevents user-facing errors
- Observability needs differ fundamentally - AI gateways track token usage, model performance, cache hit rates, and cost per user instead of traditional API metrics
API gateways work fine for REST APIs. Connect one to an LLM and watch the costs spiral while your monitoring shows nothing useful.
That’s the problem. Most teams discover it after spending money they didn’t plan to spend.
Enterprise AI agent adoption is surging, but the industry forecast is sobering: over 40% of those projects face cancellation by 2027 due to runaway costs and complexity. Most of those teams will spend real money learning what traditional gateways can’t handle.
Why traditional gateways fail with AI
Traditional API gateways count requests. They rate limit by calls per minute, track response times, and flag error rates. That made sense when each request cost roughly the same and took similar time to process.
AI breaks all of these assumptions.
A single LLM request can consume 10 tokens or 10,000 tokens. The first request might cost a fraction of a cent. The second? Several dollars. Token consumption can vary by over 1000x between requests to the same endpoint.
Response times are just as unpredictable. A short answer takes 200 milliseconds. A detailed analysis runs 30 seconds or more. Traditional timeout settings either fail too fast or wait forever.
One user makes 10 requests, hits your rate limit, but consumed only 100 tokens total. Another makes 3 requests and burns through your entire daily budget with massive context windows. Your gateway can’t tell the difference, and it won’t try to.
The cost tracking problem
Running AI without token tracking is like paying for electricity without a meter. You find out the damage when the bill arrives.
Organizations try to manage LLM costs with traditional monitoring. It tells them nothing useful about who’s spending what, or why. TrueFoundry’s breakdown captures this problem well.
So what does proper tracking actually require? The AI gateway pattern needs to count tokens, not requests. It needs to track costs per user, per feature, per team. That means intercepting every request, parsing the prompt to count input tokens, reading the response to count output tokens, and multiplying by the current rate for that specific model.
Different models charge different rates. GPT-4o costs more than GPT-4o-mini. Claude Opus costs more than Claude Haiku. Your gateway needs to know which model handled each request and apply the right pricing.
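In code, the attribution logic is small; what matters is running it on every request. The rates below are hypothetical placeholders (real prices change, so a production gateway would load them from configuration), and the `record`/`ledger` names are mine, not from any vendor.

```python
# Hypothetical per-million-token rates; load real prices from config in production.
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, using that specific model's rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

ledger = {}  # (user, feature) -> accumulated dollars

def record(user: str, feature: str, model: str, tin: int, tout: int) -> None:
    key = (user, feature)
    ledger[key] = ledger.get(key, 0.0) + request_cost(model, tin, tout)

record("alice", "search", "gpt-4o-mini", 500, 200)
record("alice", "report", "gpt-4o", 8_000, 2_000)
# Same user, two features, costs differing by orders of magnitude --
# exactly the breakdown showback and chargeback reports need.
```

Aggregating this ledger by feature is what lets you answer "which feature is burning the budget" without guessing.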
Langfuse’s token tracking shows what production-ready cost management looks like. With 7M+ monthly SDK installs and status as the most-used open-source LLM observability tool, they’ve become a solid reference. They track tokens at the request level, aggregate by user and feature, and provide daily metrics for showback and chargeback. Without this, you can’t answer the basic question: which feature is burning through your AI budget?
Multi-model routing and fallbacks
Without fallbacks, the failure path is short: you call OpenAI’s API, it returns a rate limit error, and your application shows an error to the user.
With proper multi-model routing, the same scenario plays out differently. Your gateway automatically retries with Anthropic. User sees nothing. You stay online.
Portkey’s fallback patterns explain the implementation. Define your primary model, list fallback options in order, set retry logic and circuit breakers, and let the gateway handle failures automatically.
This gets more useful when you optimize for cost and performance at the same time. Route simple queries to faster, cheaper models. Send complex requests to more capable ones. If the expensive model is unavailable, fall back to the cheaper option instead of failing.
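The core of a fallback chain fits in a few lines. This sketch assumes each provider is wrapped in a callable; the function names and the single `RateLimitError` type are simplifications of what a real gateway (with retries and circuit breakers) would handle.

```python
class RateLimitError(Exception):
    pass

def call_with_fallbacks(prompt, providers):
    """Try each (name, callable) provider in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimitError as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"All providers failed: {errors}")

def flaky_openai(prompt):
    raise RateLimitError("429 Too Many Requests")

def anthropic_ok(prompt):
    return f"answer to: {prompt}"

provider_chain = [("openai", flaky_openai), ("anthropic", anthropic_ok)]
used, answer = call_with_fallbacks("hello", provider_chain)
assert used == "anthropic"  # the user never saw the 429
```

A production version adds per-provider retry budgets and a circuit breaker so a failing provider is skipped outright after repeated errors, but the ordering logic stays this simple.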
The Apache APISIX team documented how they handle multi-provider routing: proxying requests to OpenAI, Anthropic, Mistral, and self-hosted models through a single endpoint, with consistent authentication and rate limiting across all providers, and unified observability regardless of which model processed the request. Load balancing also helps when you’re hitting rate limits. Split traffic across multiple API keys for the same provider, distribute requests across models with similar capabilities, and route to different regions based on latency.
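The key-splitting idea from that last paragraph can be sketched as a simple round-robin rotator; the class and key names here are hypothetical, and real deployments often weight keys by their remaining quota rather than cycling evenly.

```python
import itertools

class KeyRotator:
    """Round-robin across multiple API keys for the same provider."""

    def __init__(self, keys):
        self._cycle = itertools.cycle(keys)

    def next_key(self):
        return next(self._cycle)

rotator = KeyRotator(["sk-key-1", "sk-key-2", "sk-key-3"])
picked = [rotator.next_key() for _ in range(6)]
# Traffic spreads evenly, so no single key hits its provider-side limit first.
assert picked == ["sk-key-1", "sk-key-2", "sk-key-3"] * 2
```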
Security and audit requirements
API key management for AI gets complicated fast. Each developer needs keys for testing. Each environment needs different keys. Each customer might need isolated keys for compliance.
I think most teams underestimate this part until something actually goes wrong. Storing keys in application code is obviously wrong. Environment variables are barely better. API gateway security patterns show that proper key management means storing credentials in a secure vault, rotating them regularly, using the gateway to inject keys at request time, and never exposing raw keys to client applications.
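Injection at request time looks roughly like this. The `vault` dict stands in for a real secret store (Vault, AWS Secrets Manager, etc.), and the request shape is invented for illustration — the point is that the client request never contains the provider key.

```python
def inject_auth(request: dict, vault: dict) -> dict:
    """Gateway attaches the provider key server-side; clients never see it."""
    provider = request["provider"]
    headers = dict(request.get("headers", {}))
    headers["Authorization"] = f"Bearer {vault[provider]}"
    return {**request, "headers": headers}

vault = {"openai": "sk-from-vault"}  # hypothetical secret-store lookup
client_request = {"provider": "openai", "body": {"prompt": "hi"}}
outbound = inject_auth(client_request, vault)
assert outbound["headers"]["Authorization"] == "Bearer sk-from-vault"
assert "Authorization" not in client_request.get("headers", {})  # client stays key-free
```

Rotating a key then becomes a vault update, with no client redeploys.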
Data privacy matters more with AI than with traditional APIs. Every prompt you send potentially contains sensitive information. Every response might include data you shouldn’t cache or log. The gateway needs to sanitize logs, remove personally identifiable information before storage, enforce data residency rules, and support compliance requirements like GDPR and HIPAA without making developers implement these controls in every application.
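Log sanitization can be sketched with a couple of redaction patterns. These two regexes are deliberately naive examples — a real deployment would use a tested PII-detection library rather than hand-rolled patterns.

```python
import re

# Illustrative patterns only; production systems need a proper PII library.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def sanitize(text: str) -> str:
    """Redact obvious PII before the prompt is written to gateway logs."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)

logged = sanitize("Contact jane.doe@example.com, SSN 123-45-6789")
assert logged == "Contact [EMAIL], SSN [SSN]"
```

Because the gateway sits in front of every model call, this runs once, centrally, instead of being reimplemented in each application.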
Audit logging becomes critical here. Who made which request? What data did they send? Which model processed it? How long was the response cached? These questions come up during security reviews and compliance audits. Your gateway should answer them without you digging through application logs.
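A structured audit record per request is enough to answer those questions mechanically. The field names below are a plausible minimal schema, not a standard.

```python
import json
import time

def audit_record(user: str, model: str, input_tokens: int,
                 output_tokens: int, cached: bool) -> str:
    """One structured log line per request: who, which model, how many tokens."""
    return json.dumps({
        "ts": time.time(),
        "user": user,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cached": cached,
    })

line = audit_record("alice", "gpt-4o", 812, 240, cached=False)
entry = json.loads(line)
assert entry["user"] == "alice" and entry["model"] == "gpt-4o"
```

Structured lines like this feed straight into a log store, so a compliance query becomes a filter instead of a grep through application logs.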
What the production pattern looks like
Real implementations from Apache APISIX users show a consistent approach. Companies like Zoom, Lenovo, and Amber Group use API gateways to manage AI traffic alongside traditional APIs, but configure them differently for AI workloads.
Kong Gateway offers token-based rate limiting that counts tokens instead of requests. Their implementation pulls token data directly from LLM provider responses, supports limits by hour, day, week, or month, and handles different limits for different models.
Azure API Management shows what an enterprise-grade AI gateway looks like in practice. They manage multiple AI backends from a single gateway, implement semantic caching to reduce duplicate requests, provide built-in token metrics and cost tracking, and integrate with existing API management workflows.
Semantic caching is worth prioritizing. Semantic caching systems demonstrate 61-69% cache hit rates in production deployments, which makes caching a real cost opportunity. Multi-tier caching architectures combining semantic caching with provider-level prompt caching can reduce costs by 80% or more. Anthropic’s prompt caching alone offers up to 90% cost reduction for long prompts. OpenAI’s automatic caching delivers 50% savings by default.
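The mechanism behind semantic caching is an embedding-similarity lookup: if a new prompt's embedding is close enough to a cached one, return the cached response. This sketch uses tiny hand-made vectors and a linear scan; real systems use an embedding model and a vector index, and the 0.95 threshold is an arbitrary illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached answer when a new prompt's embedding is close enough."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def get(self, embedding):
        for emb, response in self.entries:
            if cosine(embedding, emb) >= self.threshold:
                return response
        return None  # cache miss -> call the model

    def put(self, embedding, response):
        self.entries.append((embedding, response))

cache = SemanticCache()
cache.put([0.9, 0.1, 0.0], "Paris")              # e.g. "capital of France?"
assert cache.get([0.89, 0.12, 0.01]) == "Paris"  # near-duplicate phrasing hits
assert cache.get([0.0, 0.2, 0.9]) is None        # unrelated question misses
```

Every hit is a model call you never pay for, which is why the hit rates quoted above translate so directly into cost savings.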
The self-hosted versus managed decision depends on your priorities. Self-hosted gives you complete control over data routing and security policies. Managed solutions cut operational overhead but lock you into a single vendor’s platform.
Observability comes first. If you can’t see token usage, model performance, and cost attribution, you can’t optimize anything. 89% of teams running AI agents have implemented observability, outpacing evaluation adoption at 52%. What good observability for AI gateways covers: token consumption per request, cost per user and feature, cache hit rates, model latency and error rates, and how often fallbacks actually trigger.
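The metrics listed above reduce to a handful of counters at the gateway. This is a deliberately minimal in-memory sketch (the class and method names are mine); production setups export the same counters to Prometheus or an equivalent backend.

```python
from collections import Counter

class GatewayMetrics:
    """Aggregate the core counters an AI gateway dashboard needs."""

    def __init__(self):
        self.tokens_by_user = Counter()
        self.cache = Counter()
        self.fallbacks = Counter()

    def observe(self, user, tokens, cache_hit, fell_back):
        self.tokens_by_user[user] += tokens
        self.cache["hit" if cache_hit else "miss"] += 1
        self.fallbacks["yes" if fell_back else "no"] += 1

    def cache_hit_rate(self):
        total = sum(self.cache.values())
        return self.cache["hit"] / total if total else 0.0

m = GatewayMetrics()
m.observe("alice", 1200, cache_hit=True, fell_back=False)
m.observe("alice", 300, cache_hit=False, fell_back=True)
m.observe("bob", 50, cache_hit=True, fell_back=False)
assert m.tokens_by_user["alice"] == 1500   # per-user token attribution
assert round(m.cache_hit_rate(), 2) == 0.67
assert m.fallbacks["yes"] == 1             # how often fallbacks actually fired
```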
Skip the gateway and you’ll spend six months recovering from a bill that could have been a dashboard alert.
About the Author
Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.
Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.