The introduction of ChatGPT two years ago sparked a sharp increase in interest in (and use of) large language models (LLMs), followed by a wave of commercial and open source competition. Now, companies are rushing to develop and deliver applications that use LLM APIs to provide AI functionality.
Companies are finding that AI-based applications, just like conventional applications, have deliverability concerns. For most conventional applications, an API gateway is a vital tool for deliverability. Something very similar is needed for AI applications, but some of the specifics are different. So, a new form of API gateway, called an AI gateway, is coming to the fore. HAProxy is one of the companies pioneering the development and delivery of this new type of gateway.
What’s new with AI gateways?
One key difference between a conventional API gateway and an AI gateway is in the area of rate limiting. API gateways implement rate limiting for requests based on the number of requests per IP address, which contributes to balanced service delivery and is an initial step in limiting the impact of malfunctioning or malicious software.
While AI gateways also require rate limiting, the limits imposed should be based on the API key and the number of tokens used rather than on the requests per IP address. This actually provides a higher level of control than is possible with conventional rate limiting, since an API key identifies a specific client or application, whereas an IP address doesn't always represent a single machine.
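To make the distinction concrete, here's a minimal Python sketch of the idea, assuming an in-memory counter and an illustrative per-minute token budget (the class and its limits are hypothetical; the rest of this post implements the real thing with HAProxy stick tables):

import time
from collections import defaultdict

class TokenRateLimiter:
    # Illustrative only: per-API-key token budgets over a sliding one-minute window.
    def __init__(self, tokens_per_minute: int):
        self.tokens_per_minute = tokens_per_minute
        self.usage = defaultdict(list)  # api_key -> list of (timestamp, tokens)

    def allow(self, api_key: str) -> bool:
        # Drop entries older than 60 seconds, then compare usage against the budget.
        cutoff = time.time() - 60
        self.usage[api_key] = [(t, n) for t, n in self.usage[api_key] if t > cutoff]
        return sum(n for _, n in self.usage[api_key]) < self.tokens_per_minute

    def record(self, api_key: str, tokens_used: int) -> None:
        # Called once the LLM response reports how many tokens were consumed.
        self.usage[api_key].append((time.time(), tokens_used))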
Other needs such as data loss prevention, API key management, retry support, and caching are more or less the same. However, the way these similar requirements are implemented introduces some differences, which we'll discuss later.
Implementing an AI gateway in HAProxy Enterprise
In this post, we'll build an AI gateway using HAProxy Enterprise. We'll showcase the steps using OpenAI APIs.
In part 1 of this guide, we'll implement the basics:
Creating a gateway in front of OpenAI and providing rate limiting per API key
In part 2 (coming later), we'll tackle more advanced steps:
Enhancing the configuration for a gateway using multiple API keys, in front of vLLM
Adding API key encryption
Implementing security for personally identifiable information (PII) protection and data quality at the gateway level
To make all this happen, we'll use HAProxy Enterprise with HAProxy Fusion Control Plane (included with HAProxy Enterprise). Together, these products make it easy to implement an AI gateway that's performant and scalable. Let's first review a few details to show you how everything will fit together.
Challenges facing AI applications
We won’t have a full picture of the unique challenges AI applications face until the industry has more experience creating and delivering them at scale. However, at this early stage, some important concerns have already arisen:
Cost control is paramount – Developers usually aren’t (and maybe shouldn’t be) aware of the full expense and large per-token costs of running an LLM.
API keys get compromised – LLM platforms such as OpenAI let you create different API keys per developer—a basic security measure. These keys, like any other API keys, can still be compromised or stolen. However, the urgency to enforce sophisticated key management and protection isn't quite keeping pace with today's usage trends. HAProxy can help you to bridge the gap.
API key quotas get used up – Some LLM platforms let you set rate limits for tokens. However, these limits are set globally and not per API key. Even where developer-specific API keys are available, a single developer can use up most or all of your daily token quota.
Security and PII concerns must be addressed – Users' prompts must not include PII, such as social security numbers or credit card information.
You can begin to address these concerns effectively using the AI gateway that we'll show you how to create here, improving your delivery of LLM-powered AI applications.
New to HAProxy Enterprise?
HAProxy is the world’s fastest and most widely used software load balancer and the G2 category leader in API management, container networking, DDoS protection, web application firewall (WAF), and load balancing. HAProxy Enterprise elevates the experience with authoritative support, robust multi-layered security, and centralized management, monitoring, and automation with HAProxy Fusion. HAProxy Enterprise and HAProxy Fusion provide a secure application delivery platform for modern enterprises and applications.
To learn more, contact our sales team for a demonstration or request a free trial.
Key concepts before getting started
The following concepts are necessary to fully understand this guide. You may need to adapt these concepts and the capabilities of HAProxy Enterprise and HAProxy Fusion to your specific use case.
Storing and encrypting API keys
Storing unencrypted API keys on a load balancer is never a good idea, even though they make convenient lookup keys for rate limits and more.
Since we're just implementing the basics needed for an AI gateway in this guide, we'll simply accept a key for OpenAI, hash it, and use the result as a key for rate limiting.
In part 2, we'll demonstrate how to encrypt your keys and create an intermediate key, so even your application developers can't access real API key values—a production-ready approach.
Quotas and global rate limiting
HAProxy includes stick tables that can be used as counters for rate limiting. Our use case requires a scalable active/active AI gateway, so we need all HAProxy instances to aggregate—and be aware of—each other's rates.
Our Global Profiling Engine (GPE), which comes pre-installed and integrated with HAProxy Fusion, provides exactly that. This feature automatically aggregates all token rates across every HAProxy Enterprise instance. GPE will later include ranking capabilities out of the box, enabling it to determine the most-used keys within your organization, as well as a convenient web endpoint for integration with any number of other systems. GPE is also available as a standalone module. Future HAProxy Fusion Control Plane releases will also provide similar metrics within customizable dashboards, enhancing observability for your AI-powered applications.
Uniquely, GPE can be configured to aggregate historical rates. Static token limits are really hard to define correctly in unpredictable environments, so we'll dynamically rate limit users based on their usage. In our example, we will impose rate limits when current usage exceeds twice the 90th percentile of the general usage during the same time period on the previous day. You can implement different and more sophisticated controls in your own AI gateway implementation.
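As a rough illustration of that policy (this is not how GPE computes it internally; the function and its inputs are hypothetical), the comparison could look like this in Python:

import statistics

def over_dynamic_limit(current_rate: float, yesterday_rates: list[float]) -> bool:
    # Flag a key when its current token rate exceeds twice the 90th percentile
    # of the rates observed during the same period on the previous day.
    if len(yesterday_rates) < 2:
        return False  # not enough history yet; fall back to static limits
    p90 = statistics.quantiles(yesterday_rates, n=10)[-1]  # 90th percentile
    return current_rate > 2 * p90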
Metrics and statistics
We also want to use HAProxy's extensive logging capabilities to collect API usage metrics. Specifically, we'll log the total amount of prompt tokens and completion tokens consumed by all users.
We have two flexible collection options, which can be used separately or jointly:
Logging the statistics into the standard HAProxy log, then parsing the logs (a minimal parsing sketch follows this list)
Asynchronous, real-time funneling of token metrics into an external endpoint such as Grafana, an HTTP endpoint for TimescaleDB, or others (not covered in part 1)
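As a sketch of the first option, the snippet below assumes a custom log-format that appends prompt_tokens=<n> and completion_tokens=<n> to each log line; both the format and the field names are assumptions rather than HAProxy defaults:

import re
from collections import Counter

# Assumed (non-default) log-format suffix: "... prompt_tokens=123 completion_tokens=456"
TOKEN_FIELDS = re.compile(r"prompt_tokens=(\d+)\s+completion_tokens=(\d+)")

def total_token_usage(log_path: str) -> Counter:
    # Sum prompt and completion tokens across an HAProxy log file.
    totals = Counter()
    with open(log_path) as log_file:
        for line in log_file:
            match = TOKEN_FIELDS.search(line)
            if match:
                totals["prompt_tokens"] += int(match.group(1))
                totals["completion_tokens"] += int(match.group(2))
    return totals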
PII protection
We want to detect social security numbers, credit card numbers, and other types of potentially sensitive data.
How to implement HAProxy Enterprise as an AI gateway
This guide describes how to use HAProxy Enterprise, which has built-in API gateway functionality, as an AI gateway instead. The concept is the same, and we'll explore tokens, API key-based limiting, and other differences as they arise.
Step 1: API key implementation
Let’s look at implementing the API key authorization. In this first version of our AI gateway, your users will continue using standard OpenAI keys, but we'll implement additional controls on top of those keys. These include the following:
Denylists to outright block any compromised keys
A quota or a rate limit per key
We never recommend storing unencrypted OpenAI keys (or keys of any kind) on your load balancer, so we'll use hashing and store only the hashes. On each request, HAProxy will receive the OpenAI key, hash it using SHA-256, and compare the hash to the stored data.
Hashing API keys
Let’s say your requests contain an HTTP Authorization header with the OpenAI key. We can get the value of the key, apply a SHA-256 digest, and store the result in a variable inside HAProxy:
http-request set-var(txn.openai_key_hash) http_auth_bearer(Authorization),sha2(256),hex
Denying compromised API keys
Let’s say you have a client with an OpenAI key (hashed as 5fd924625a10e0baacdb8) that’s been compromised and must be blocked at the load balancer layer. We're going to create a file called denylist.acl with one compromised API key hash per line:
# example denylist.acl
5fd924625a10e0baacdb8
You can generate the hashes for your file with a variety of tools, such as this hashing calculator on GitHub.
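If you'd rather script it, a few lines of Python produce an SHA-256 hex digest equivalent to HAProxy's sha2(256),hex converter chain. The example key is made up; note that Python's hexdigest() is lowercase, so keep the case consistent with what HAProxy produces (the ACL below uses -i for case-insensitive matching):

import hashlib

def openai_key_hash(api_key: str) -> str:
    # SHA-256 hex digest of an API key, mirroring HAProxy's sha2(256),hex chain.
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()

# Example with a made-up key; paste the output into denylist.acl
print(openai_key_hash("sk-example-key"))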
For security reasons, HAProxy Enterprise is designed not to perform any disk I/O once loaded. HAProxy Fusion also ensures that HAProxy Enterprise always has an up-to-date denylist.acl file in memory. When needed, it’s easy to write a script to make changes to this file using HAProxy Fusion’s native API.
You can now block any key in denylist.acl as follows in your HAProxy configuration:
acl blocked_key var(txn.openai_key_hash) -m str -i -f denylist.acl
http-request deny deny_status 403 if blocked_key
And that’s it! Every time you add a hashed key to the denylist, HAProxy will block it from using your service. You can effectively block OpenAI API keys without actually storing unencrypted copies on your instance.
Step 2: Quotas and rate limits
Next, we want to implement quotas or rate limits per key. Enterprise-grade quotas require the following properties:
Quotas or rate limits should be persisted across all load balancer instances, whether they are configured as Active/Active or Active/Passive.
Quotas or rate limits should be flexible and configurable to any specification. In this example, we'll use per-minute limits and daily limits for both prompt tokens and completion tokens. This matches OpenAI's implementation while also letting you control usage per individual API key, instead of only at the account level.
We'll start with a .map file to define our limits. I’ve chosen the following format:
<key hash> <per minute prompt limit>:<per day prompt limit>:<per minute completion limit>:<per day completion limit>
Next, let’s look at an example .map file named rate-limits.map:
5fd924625a10e0baacdb8 100:200:1000:50000
813490e4ba67813490e4 300:600:2000:30000
For illustration, this file really represents a table of limits that looks like this:
| Key Hash | Per-Minute Prompt Token Limit | Per-Day Prompt Token Limit | Per-Minute Completion Token Limit | Per-Day Completion Token Limit |
| --- | --- | --- | --- | --- |
| 5fd924625a10e0baacdb8 | 100 | 200 | 1000 | 50000 |
| 813490e4ba67813490e4 | 300 | 600 | 2000 | 30000 |
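For completeness, here's a small hypothetical helper that turns a plaintext key and its limits into a rate-limits.map line using the same hashing approach:

import hashlib

def map_line(api_key: str, min_prompt: int, day_prompt: int, min_completion: int, day_completion: int) -> str:
    # Build one rate-limits.map entry: "<key hash> <limits joined by colons>".
    key_hash = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return f"{key_hash} {min_prompt}:{day_prompt}:{min_completion}:{day_completion}"

# Example with a made-up key and the limits from the first row above
print(map_line("sk-example-key", 100, 200, 1000, 50000))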
We absolutely want to support quotas or rate limits across all load balancers in an Active/Active configuration. By default, HAProxy Enterprise's stick tables handle rate limiting per instance: each instance of HAProxy maintains its own local stick table. To comply with the above limits (for example, 100 prompt tokens per minute), you'd need to set the actual limit to 100 divided by the number of HAProxy instances in your cluster. With that approach, however, you can't easily autoscale, and you'd need to constantly update your calculations. Mathematical errors and other issues would also be hard to detect.
It's Global Profiling Engine (GPE) to the rescue! In an Active/Active load balancer configuration, where traffic is spread across two or more load balancers in a round-robin rotation, GPE ensures that each load balancer receives a client's aggregated requests. This is true even when those requests were routed to another load balancer in the cluster.
Here's the short version: if you're using HAProxy Fusion, you're already using GPE by default. It's automatically configured for each cluster of load balancers you create. If you aren't using HAProxy Fusion, you can easily install the GPE module by following the instructions in our documentation.
Defining stick tables
Let’s focus solely on the configuration and define the HAProxy stick tables that will hold our token rates. We'll need four tables with two variants each, or eight total tables. The variants will be local and aggregate. It's important that each HAProxy instance writes (or tracks) its data into the local tables, yet reads from the aggregate tables. The Global Profiling Engine supplies these aggregate tables, which contain the aggregated rates across all instances.
To get started, we'll need two of each of the following:
Per-minute prompt tokens tables
Per-day prompt tokens tables
Per-minute completion tokens tables
Per-day completion tokens tables
Here's a sample definition of all of them:
peers rates
    bind 0.0.0.0:10000
    server lb1
    server gpe 192.168.50.40:10000
    table rates_prompt_minute.local type string len 128 size 50k expire 1m store gpc(1),gpc_rate(1,60s)
    table rates_prompt_minute.aggregate type string len 128 size 50k expire 1m store gpc(1),gpc_rate(1,60s)
    table rates_prompt_day.local type string len 128 size 50k expire 24h store gpc(1),gpc_rate(1,1d)
    table rates_prompt_day.aggregate type string len 128 size 50k expire 24h store gpc(1),gpc_rate(1,1d)
    table rates_completion_minute.local type string len 128 size 50k expire 1m store gpc(1),gpc_rate(1,60s)
    table rates_completion_minute.aggregate type string len 128 size 50k expire 1m store gpc(1),gpc_rate(1,60s)
    table rates_completion_day.local type string len 128 size 50k expire 24h store gpc(1),gpc_rate(1,1d)
    table rates_completion_day.aggregate type string len 128 size 50k expire 24h store gpc(1),gpc_rate(1,1d)
This set of tables may look complex, but it's actually quite simple. We're intentionally using more tables to support all notable limits, since OpenAI itself supports both per-minute and per-day limits. Prompt and completion tokens are handled separately.
Two differences exist between our per-minute and per-day tables:
The per-minute tables store per-minute rates, and their records expire after 60 seconds.
The per-day tables store daily rates, and their records expire after 24 hours.
Fetching rate limits from the map file
Next, we'll fetch the rate limits for each OpenAI key by looking up each hash in the .map file and requesting the fields that we can use to set the maxrate variables.
The field(1,:) converter will return the first number from the .map file for the key, delimited by a colon:
frontend mysite
    http-request set-var(txn.maxrate_min_prompt) var(txn.openai_key_hash),map(rate-limits.map,0),field(1,:)
    http-request set-var(txn.maxrate_day_prompt) var(txn.openai_key_hash),map(rate-limits.map,0),field(2,:)
    http-request set-var(txn.maxrate_min_completion) var(txn.openai_key_hash),map(rate-limits.map,0),field(3,:)
    http-request set-var(txn.maxrate_day_completion) var(txn.openai_key_hash),map(rate-limits.map,0),field(4,:)
If the key isn't in the .map file, the default result will be zero. That means no requests will be allowed.
Tracking current requests
Once we've fetched our rate limits from the .map file, we must carefully track the current requests using sticky counters. Note that tracking eight counters requires raising HAProxy's default of three stick counters via the global tune.stick-counters directive:
http-request track-sc0 var(txn.openai_key_hash) table rates/rates_prompt_minute.local
http-request track-sc1 var(txn.openai_key_hash) table rates/rates_prompt_minute.aggregate
http-request track-sc2 var(txn.openai_key_hash) table rates/rates_prompt_day.local
http-request track-sc3 var(txn.openai_key_hash) table rates/rates_prompt_day.aggregate
http-request track-sc4 var(txn.openai_key_hash) table rates/rates_completion_minute.local
http-request track-sc5 var(txn.openai_key_hash) table rates/rates_completion_minute.aggregate
http-request track-sc6 var(txn.openai_key_hash) table rates/rates_completion_day.local
http-request track-sc7 var(txn.openai_key_hash) table rates/rates_completion_day.aggregate
Getting current request rates
Next, request the current rates from the associated aggregated tables:
http-request set-var(txn.rate_prompt_minute) sc_gpc_rate(0,1)
http-request set-var(txn.rate_prompt_day) sc_gpc_rate(0,3)
http-request set-var(txn.rate_completion_minute) sc_gpc_rate(0,5)
http-request set-var(txn.rate_completion_day) sc_gpc_rate(0,7)
The second number in sc_gpc_rate(0,X) refers to the corresponding track-scX statement from the previous sticky counters. For example, sc_gpc_rate(0,1) and track-sc1 are coupled.
Calculating if over the limit
Finally, let's determine if the current request exceeds the rate limit we've set. We're effectively making this comparison in pseudo-code:
if (Current rate - Maximum rate > 0) then
    Over the limit
We'll use this calculation for two purposes:
To initially deny the request if the customer exceeds their limit
To add the amount of tokens returned into the stick table after getting a response
Here's how we'll deny requests if they exceed the token limit, which subsequently triggers the 429 Too Many Requests error code:
http-request deny status 429 if { var(txn.rate_prompt_minute),sub(txn.maxrate_min_prompt) gt 0 }
http-request deny status 429 if { var(txn.rate_prompt_day),sub(txn.maxrate_day_prompt) gt 0 }
http-request deny status 429 if { var(txn.rate_completion_minute),sub(txn.maxrate_min_completion) gt 0 }
http-request deny status 429 if { var(txn.rate_completion_day),sub(txn.maxrate_day_completion) gt 0 }
Crucially, we don’t actually know yet if the current request will exceed the limit. For performance reasons, we're relying on the OpenAI response itself to tell us how many prompt and completion tokens are consumed. This means these limits are actually eventually consistent.
You could run a tokenizer on each request to improve consistency, but doing so would be slow. Eventually consistent Active/Active rate limits will work best.
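For context, estimating prompt tokens on the request path would look roughly like this with the tiktoken library; this sketch illustrates the trade-off discussed above and is not part of the gateway configuration:

import tiktoken

def estimate_prompt_tokens(prompt: str, encoding_name: str = "cl100k_base") -> int:
    # Rough client-side token estimate; the gateway instead trusts the exact
    # counts reported in the API response's usage field.
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(prompt))

print(estimate_prompt_tokens("Summarize the latest HAProxy release notes."))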
Getting token usage from JSON responses
We're finally nearing the finish line! The OpenAI HTTP response will contain information about per-request token consumption for prompt and completion. We can use HAProxy’s JSON parser to get the details:
http-response set-var(txn.prompt_tokens) res.body,json_query('$.usage.prompt_tokens','int')
http-response set-var(txn.completion_tokens) res.body,json_query('$.usage.completion_tokens','int')
There are some potential limitations to be mindful of. The above code will only inspect the allocated buffer size (tune.bufsize in HAProxy) in bytes of the response. If your prompts generate huge responses, you’ll need to increase your tune.bufsize to capture the whole body.
There's a direct correlation between the buffer size and the amount of memory your load balancer will consume. Keep a close eye on your resource availability.
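To see what the gateway is parsing, here's a minimal client-side sketch that sends a chat completion request through the gateway (the gateway URL is a placeholder) and prints the same usage fields the json_query converters extract:

import os
import requests

# Placeholder address; in production this would point at your HAProxy Enterprise frontend.
GATEWAY_URL = "https://ai-gateway.example.com/v1/chat/completions"

response = requests.post(
    GATEWAY_URL,
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]},
    timeout=30,
)

if response.status_code == 429:
    print("Denied by the gateway: token rate limit exceeded")
else:
    usage = response.json().get("usage", {})
    print("prompt_tokens:", usage.get("prompt_tokens"))
    print("completion_tokens:", usage.get("completion_tokens"))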
Finally, we can add the prompt and completion tokens to our counters—but only if we adhere to the limit, since we'd otherwise never stop counting:
http-response sc-add-gpc(0,0) var(txn.prompt_tokens) if { var(txn.rate_prompt_minute),sub(txn.maxrate_min_prompt) le 0 }
http-response sc-add-gpc(0,2) var(txn.prompt_tokens) if { var(txn.rate_prompt_day),sub(txn.maxrate_day_prompt) le 0 }
http-response sc-add-gpc(0,4) var(txn.completion_tokens) if { var(txn.rate_completion_minute),sub(txn.maxrate_min_completion) le 0 }
http-response sc-add-gpc(0,6) var(txn.completion_tokens) if { var(txn.rate_completion_day),sub(txn.maxrate_day_completion) le 0 }
Conclusion
In this blog, we've introduced the concept of an AI gateway (quite similar to widely used API gateways) while implementing API key checking and token-based rate limiting. Token-based rate limiting is what sets the AI gateway apart, since we care much more about limiting tokens than requests. This includes both incoming (prompt) and outgoing (completion) tokens.
In part 2 (coming soon), we'll implement an AI gateway in front of vLLM and/or Ray, while enforcing PII protection and more. Subscribe to our blog to make sure you stay updated!