The previous piece in this series argued that prohibition is not a strategy. It moves risk into the shadows; it does not remove it. The conclusion was uncomfortable but simple: if you don’t know which models your employees are using, with what data, under which constraints, you don’t have a policy — you have a liability waiver.

That leaves an obvious question. What does the alternative actually look like?

Not as a slogan (“governance”), not as a vendor category (“AI gateway”), but as architecture. What boxes do you draw, what sits in front of what, what does the request path actually look like the moment a developer types something into Cursor or a marketer pastes a contract into a chat?

This piece is the architecture sketch. It is opinionated because the choices behind a control plane are not interchangeable, and most of the trade-offs only become visible once you’ve tried to run one in production. It is also light on code on purpose — the value is the shape of the system, not the lines that happen to implement it today.


What the infrastructure has to do

Strip the marketing off and the requirement is short. A usable control plane for enterprise AI must guarantee four properties at the same time:

1. Coverage. Every LLM request from any sanctioned tool — chat, IDE plugin, internal script, agent — passes through the same code path. If there is a way around it, that way will be the only one used.

2. Identity. Every request is bound to a real human (or a clearly identified service), authenticated against the same identity provider as the rest of the company. No anonymous tokens, no shared keys, no “team accounts”.

3. Inspection. The content of the request is observable to a policy engine before it leaves the perimeter, and the decisions of that engine are recorded in a way that survives an audit.

4. Isolation and accounting. Each user’s context, budget and history is separated from every other user’s, and the cost of every call is attributed in real time to the human who triggered it.

These are not new ideas. They are exactly what enterprise IT figured out for the web in the 2000s — proxy, SSO, DLP, billing. The only thing that has changed is that the payload is now natural language and the cost of a single bad request can be measured in tokens, in privacy violations, or in a regulator’s notice.

Everything below is about how to satisfy those four properties without breaking the developer experience that made shadow AI so attractive in the first place. Because if your sanctioned tool is worse than the unsanctioned one, employees will go back to the Chrome tab and you will have built a museum.


The shape of the system

Before going into the reasoning, here is the picture. Five layers, in the order a request meets them:

   user / IDE / script / agent
              |
              v
   +----------------------+
   |  Identity provider   |   SSO, OIDC, your existing IdP
   +----------------------+
              |
              v
   +----------------------+
   |      The proxy       |   one entry point, OpenAI-compatible API
   | +------------------+ |
   | | auth + identity  | |   who is calling
   | +------------------+ |
   | | budget check     | |   are they allowed to spend
   | +------------------+ |
   | | the Guard        | |   what is in the request
   | +------------------+ |
   | | router           | |   which model gets it
   | +------------------+ |
   +----------------------+
              |
              v
   +----------------------+
   |   Model gateway      |   OpenAI-compatible multiplexer
   +----------------------+
              |
        +-----+------+-----------+
        v            v           v
   sovereign     EU-hosted    open-source
   open-source   frontier     in TEE
   (Mistral,     (Claude,     (tomorrow)
    Qwen...)      GPT...)

Five things happen on that path for every request: authenticate, check budget, inspect, route, forward. In that order. Always. Whether the caller is a human in a browser or a script running in CI, the sequence is identical.
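To make the fixed ordering concrete, here is a minimal sketch of the request path as an ordered list of steps. Everything in it is an assumption made for illustration: the `Request` fields, the step names, the in-memory user and budget tables, and the single string check standing in for the Guard. It shows the shape, not a real proxy.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    credential: str
    model: str
    body: str
    user: Optional[str] = None                      # set by authenticate()
    trace: list[str] = field(default_factory=list)  # what happened, in order


class Rejected(Exception):
    """Raised by any step to stop the request before it goes upstream."""


def authenticate(req: Request) -> None:
    users = {"key-alice": "alice"}  # hypothetical credential -> user table
    if req.credential not in users:
        raise Rejected("unknown credential")
    req.user = users[req.credential]


def check_budget(req: Request) -> None:
    remaining = {"alice": 10.0}  # hypothetical remaining budget per user
    if remaining.get(req.user, 0.0) <= 0:
        raise Rejected("budget exhausted")


def run_guard(req: Request) -> None:
    # Stand-in for the Guard: one hard-coded check instead of real detectors.
    if "BEGIN PRIVATE KEY" in req.body:
        raise Rejected("secret detected")


def route(req: Request) -> None:
    req.trace.append(f"routed:{req.model}")


def forward(req: Request) -> None:
    req.trace.append("forwarded")  # a real proxy would call the gateway here


# The order is data, so it can be read and reviewed in one place.
PIPELINE = [authenticate, check_budget, run_guard, route, forward]


def handle(req: Request) -> Request:
    for step in PIPELINE:
        step(req)  # any step may raise Rejected and end the request
    return req
```

Because the sequence lives in one list, the claim that every caller goes through the same five steps can be checked by reading `PIPELINE` rather than by tracing call sites.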

The rest of the article is the why of each layer. None of the choices are obvious; most of them have a wrong version that looks attractive at first.


The interface decision: speak OpenAI, even if you don’t use OpenAI

The first design choice is also the most consequential, and it is almost always made wrong.

The temptation, when you build a corporate AI platform, is to expose your own SDK. A custom client, a custom endpoint, a custom auth dance, a custom streaming format. It feels professional. It is, in practice, the fastest way to guarantee no one will use it.

The right choice is to expose an OpenAI-compatible API. Not because OpenAI deserves to be a standard — they happen to be one. Every IDE plugin, every framework, every script, every agent runtime, every evaluation harness already speaks POST /v1/chat/completions. The base URL and the API key are the only two things a developer needs to change. That’s the entire onboarding.

# before
$ export OPENAI_BASE_URL=https://api.openai.com/v1

# after
$ export OPENAI_BASE_URL=https://api.your-company.ai/v1
One environment variable. That is the price of governance, paid by the user. If it is one line more, you have already lost.

The same applies to model names. Inside the gateway you can route mistral-large to a Mistral cluster in France, claude-sonnet to an EU-hosted Claude endpoint, and llama-3-70b to a self-hosted machine, but the caller switches from one to the other by changing one string in the request. Same key. Same audit trail. Same Guard rules. The router is invisible.
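A minimal sketch of what "one string in the request" means on the gateway side, assuming a static routing table. The upstream URLs, the `Upstream` type and the `hosting` labels are hypothetical placeholders; only the three model names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upstream:
    base_url: str  # where the gateway forwards the call (placeholder URLs)
    hosting: str   # free-form label, useful later for policy decisions


# Hypothetical routing table: the model name is the only thing the caller sees.
ROUTES = {
    "mistral-large": Upstream("https://mistral.internal.example/v1", "sovereign"),
    "claude-sonnet": Upstream("https://frontier-eu.example/v1", "eu-frontier"),
    "llama-3-70b": Upstream("http://gpu-01.internal.example:8000/v1", "self-hosted"),
}


def resolve(payload: dict) -> Upstream:
    """Pick the upstream from the `model` field of a chat completion payload."""
    model = payload.get("model")
    try:
        return ROUTES[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None
```

The caller's payload keeps the standard chat completion shape; switching provider is a change to `payload["model"]` and nothing else.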

This is the part vendors who sell “AI gateways” as drag-and-drop wizards consistently get wrong. The interface is not a feature. It is the contract that decides whether your platform exists or not.


A reverse proxy, not a library

If coverage is the first requirement, then governance has to live somewhere a request cannot avoid. The only place that satisfies that constraint is the network.

There are roughly three ways to inspect AI traffic:

1. A client-side library that wraps every call. Coverage depends on every developer remembering to install it.

2. A sidecar process running next to each application. Better, but breaks for anything you don’t control.

3. A reverse proxy in front of the model providers. Coverage by construction: if you want the model, you go through the proxy. There is no other route.

Only the proxy delivers the property “every call is inspected”. Everything else is a wishful subset.

Concretely: a small service sits between the rest of the world and a model gateway. It accepts traffic on two interfaces — a public one for users and personal scripts, an internal one for trusted services running inside the cluster. It does the five things in the diagram above, in order, and forwards the request upstream.

The two-interface split is the cheapest defense in the architecture. Internal services use long-lived service credentials that are useless from the outside. Users and their personal scripts use short-lived personal credentials that cannot impersonate a service. If either kind of credential leaks, the blast radius is contained to its half of the platform.
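The two-interface rule fits in a few lines. The sketch below assumes the proxy already knows which interface a request arrived on and which kind of credential it carries; the enum names and the `accept` function are invented for the example.

```python
from enum import Enum


class Interface(Enum):
    PUBLIC = "public"      # users and their personal scripts
    INTERNAL = "internal"  # trusted services inside the cluster


class CredentialKind(Enum):
    PERSONAL = "personal"  # short-lived, created by a user after SSO
    SERVICE = "service"    # long-lived, rotated by ops


# Each interface accepts exactly one kind of credential.
ALLOWED = {
    Interface.PUBLIC: {CredentialKind.PERSONAL},
    Interface.INTERNAL: {CredentialKind.SERVICE},
}


def accept(interface: Interface, kind: CredentialKind) -> bool:
    """Return True only when the credential kind matches the interface."""
    return kind in ALLOWED[interface]
```

A leaked service credential presented on the public interface fails this check before any identity lookup happens, which is the containment the paragraph above describes.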


Identity: the only auth model that scales

Authentication in an LLM gateway is where most homegrown systems fail silently. The failure mode is always the same: someone creates a single API key, prints it on the wiki, and from that moment on you have lost the ability to answer the question “who made this call?”.

Three classes of caller need to coexist on the platform, and they need different credentials with different lifecycles: humans in a browser (corporate SSO), trusted internal services (long-lived service credentials, rotated by ops), and humans with scripts or IDE plugins (personal credentials created after SSO).

The non-obvious part is the third one. If you only support SSO, your developers can’t use the API from a CLI. If you only support API keys, you’ve recreated the wiki problem. The answer is to make the personal credential self-service after SSO: the user logs into the web app with their corporate identity, clicks “create key”, gets a credential shown once, and from that moment on the credential is bound to their identity in the platform’s database.

That binding is what matters. Every request the proxy forwards upstream carries the identity of the calling user — not the credential, but the human behind it. This is what makes per-user spend tracking, per-user budgets, and per-user audit trails possible.

The principle generalizes: every layer above the proxy works in terms of user identity, never in terms of credentials. Credentials are an implementation detail of the front door. The rest of the system never sees them.


The Guard: enforcing policy at the wire

Authentication tells you who is calling. It does not tell you what they are sending.

That is the Guard’s job, and it is the part of the architecture where most companies stop and call it “future work”. They shouldn’t. The Guard is the entire reason the proxy exists. Everything else (auth, routing, billing) is undifferentiated infrastructure. The Guard is what distinguishes “we have an LLM gateway” from “we have an AI control plane”.

Architecturally, the Guard is a step in the request path of the proxy itself, not a separate service. The reasons matter:

1. Latency. A network hop to a separate inspection service adds 5–20 ms to every request. An in-process step costs essentially nothing.

2. Atomicity. The Guard’s decision and the audit record have to be in the same code path.

3. Single point of governance. “Every request goes through the Guard” is a claim you can prove by reading the proxy’s request handler, not by drawing arrows on a diagram.

What the Guard actually does: it scans the request body for two broad classes of content that should not leave the perimeter unaccompanied — secrets (API keys, private keys, OAuth tokens) and personal data (emails, phone numbers, national IDs, IBANs, payment cards). On a hit, it produces a finding. On a finding, it takes one of three actions: log it, redact the sensitive part, or block the request entirely.
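The scan, finding and action loop can be sketched as follows. The three regular expressions are deliberately simplistic placeholders chosen for the example; a real deployment would plug in established detectors rather than these patterns. All type and function names are assumptions.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LOG = "log"
    REDACT = "redact"
    BLOCK = "block"


@dataclass(frozen=True)
class Detector:
    kind: str
    pattern: re.Pattern
    action: Action


@dataclass(frozen=True)
class Finding:
    kind: str
    action: Action


# Toy patterns for illustration only.
DETECTORS = [
    Detector("private_key", re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"), Action.BLOCK),
    Detector("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), Action.REDACT),
    Detector("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b"), Action.LOG),
]


class Blocked(Exception):
    def __init__(self, findings: list[Finding]):
        super().__init__("request blocked by the Guard")
        self.findings = findings  # kept so the audit trail can record them


def guard(body: str) -> tuple[str, list[Finding]]:
    """Return the (possibly redacted) body and the findings, or raise Blocked."""
    findings: list[Finding] = []
    for det in DETECTORS:
        if not det.pattern.search(body):
            continue
        findings.append(Finding(det.kind, det.action))
        if det.action is Action.REDACT:
            body = det.pattern.sub(f"[{det.kind.upper()}]", body)
    if any(f.action is Action.BLOCK for f in findings):
        raise Blocked(findings)
    return body, findings
```

Note that a finding carries the kind of pattern and the action, never the matched text, so whatever consumes it cannot leak the value it flagged.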

On secrets, do not reinvent the wheel. Detecting credentials in arbitrary text is a well-defined problem with one obviously correct answer: use the open-source detectors that have absorbed years of bug reports and edge cases. The temptation to write your own should be ignored.

On personal data, you have to do much of it yourself, at least for European formats. The serious open-source PII libraries are English-centric. Few of them reliably know what a French phone number looks like, or how to validate a numéro de sécurité sociale, or how to recognize an IBAN starting with FR76. So you write the patterns yourself, you test them on real corpora, and you accept that this is part of the price of taking European compliance seriously.
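As an illustration of what such patterns look like, here is a sketch for two of the formats mentioned: French phone numbers and French IBANs. The regular expressions are assumptions tuned for readability, not tested against real corpora. The checksum is the standard IBAN mod-97 check.

```python
import re

# French phone number: +33, 0033 or a leading 0, then nine digits,
# optionally grouped in pairs with spaces, dots or hyphens.
FR_PHONE = re.compile(
    r"(?<!\d)(?:\+33[ .-]?|0033[ .-]?|0)[1-9](?:[ .-]?\d{2}){4}(?!\d)"
)

# French IBAN: "FR", two check digits, then 23 alphanumeric characters,
# with optional single spaces between characters.
FR_IBAN = re.compile(r"\bFR\d{2}(?: ?[0-9A-Z]){23}\b")


def iban_checksum_ok(iban: str) -> bool:
    """Standard mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test remainder == 1."""
    compact = iban.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]+", compact):
        return False
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_french_ibans(text: str) -> list:
    """Return IBAN-shaped matches that also pass the checksum."""
    return [m.group(0) for m in FR_IBAN.finditer(text) if iban_checksum_ok(m.group(0))]
```

The checksum step is what keeps false positives down: a random 27-character string that happens to start with FR will almost never pass it.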

Log, then enforce. Never the other way around.

Every detector ships in log mode first. It records findings without modifying the request. You watch the audit trail for a week or two. You discover the false positives nobody anticipated. You tune. Then, and only then, you graduate the detector to redact. Eventually, for the highest-confidence rules, you graduate to block.

The most common failure mode of corporate Guard deployments is to skip the log phase and start in block mode, which produces a wave of false positives on legitimate requests and an instant credibility crisis. The Guard never recovers from that first week.

The audit trail is the product

The output of the Guard is not the action it takes on a single request. The output of the Guard is the searchable history of decisions that grows from every request the platform has ever seen.

That history records, for each finding: which user, which model, which kind of pattern, which action was taken, and when. It does not record the matched value (privacy) and it does not record the prompt (volume). It records the fact that something happened.
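The record described above is small enough to write down. This sketch assumes one JSON line per finding. The field names and the `record_finding` helper are hypothetical, but the set of fields is exactly the one listed: user, model, kind of pattern, action, time.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class AuditRecord:
    user_id: str
    model: str
    finding_kind: str  # e.g. "iban" or "private_key", never the matched value
    action: str        # "log", "redact" or "block"
    at: str            # ISO 8601 timestamp in UTC


def record_finding(user_id, model, finding_kind, action, now=None) -> str:
    """Serialize one finding as a JSON line. No prompt, no matched text."""
    now = now or datetime.now(timezone.utc)
    rec = AuditRecord(user_id, model, finding_kind, action, now.isoformat())
    return json.dumps(asdict(rec), sort_keys=True)
```

Leaving the prompt and the matched value out of the type itself, rather than filtering them later, means no code path can write them to the trail by accident.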

This is more important than the scanners themselves. Scanners are replaceable. The audit trail is the contract with the rest of the organization — with legal, with compliance, with the security team, with the regulator who will eventually ask. A control plane without a real audit trail is a chat tool with extra steps.


The sovereignty cursor

The hardest design decision in the platform is the one that looks like a marketing slogan: sovereignty is not all-or-nothing.

The temptation is to ship two products. A “sovereign” tier that only routes to French-hosted open-source models, and a “frontier” tier that routes to US-based commercial providers. The user picks one and lives with the consequences.

This does not survive contact with reality. A typical employee needs frontier-grade reasoning for a strategic memo on Monday and is happy with a self-hosted Mistral for an HR template on Tuesday. Forcing a company-wide tier choice means either over-paying for sovereignty on traffic that doesn’t need it, or leaking sensitive data through a frontier model because nobody wanted to switch tools.

The architectural answer is to make sovereignty a property of the request, not of the user, not of the workspace. The Guard already inspects the body. If it finds personal data or a contract clause, the routing layer can redirect the call to a sovereign endpoint, transparently, while the user is typing. The user does not have to know the rules. The rules know the request.

This is also why the Guard and the router live in the same process. The decision “this request must be sovereign” is taken by the Guard one step before the routing call that applies it. There is no IPC, no queue, no eventual consistency. The architecture makes the policy enforceable rather than aspirational.
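A sketch of that hand-off, assuming the Guard returns a set of finding kinds and the router owns a list of sovereign models. The specific kinds, model names and fallback choice are placeholders.

```python
# Finding kinds that force sovereign routing (hypothetical policy).
SOVEREIGN_KINDS = {"email", "phone", "national_id", "iban", "contract_clause"}

# Models considered sovereign, and the one used when a request is redirected.
SOVEREIGN_MODELS = {"mistral-large", "llama-3-70b"}
SOVEREIGN_FALLBACK = "mistral-large"


def choose_model(requested: str, finding_kinds: set) -> str:
    """Apply the Guard's findings to the routing decision, in-process."""
    if requested in SOVEREIGN_MODELS:
        return requested           # already sovereign, nothing to change
    if finding_kinds & SOVEREIGN_KINDS:
        return SOVEREIGN_FALLBACK  # sensitive content: redirect the call
    return requested               # clean request: honor the caller's choice
```

For example, `choose_model("claude-sonnet", {"iban"})` returns the sovereign fallback, while the same call with an empty set of findings keeps the requested model.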


Budgets at the perimeter

The last piece to cover is unglamorous and load-bearing.

Every model provider charges per token. Every internal team eventually writes an agent loop that forgets to terminate. Without a budget enforced below the application layer, the first time you find out is when the invoice arrives.

The platform handles this by checking, on every authenticated request, whether the calling user has a budget and whether they are within it. If the answer is no, the request is refused at the proxy with a clean error, before the model is contacted. A user with no budget never burns a single token.

The point is the order. The check happens at the perimeter, not at the model. This is the same reason API gateways do rate limiting at the edge instead of inside each microservice. AI is not an exception, except in scale: the cost of an accidental call can be orders of magnitude higher than that of a normal API request.
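The perimeter check reduces to a lookup and a comparison. The sketch below assumes budgets are tracked in euros in an in-memory table; the names, amounts and exception type are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    limit_eur: float
    spent_eur: float = 0.0


class BudgetExceeded(Exception):
    """Maps to a clean client error at the proxy; the model is never called."""


# Hypothetical per-user budgets; a real system would read these from storage.
BUDGETS = {
    "alice": Budget(limit_eur=50.0, spent_eur=12.5),
    "bob": Budget(limit_eur=20.0, spent_eur=20.0),
}


def enforce_budget(user_id: str) -> None:
    """Run before the request leaves the proxy. Raises if the user may not spend."""
    budget = BUDGETS.get(user_id)
    if budget is None:
        raise BudgetExceeded(f"{user_id}: no budget assigned")
    if budget.spent_eur >= budget.limit_eur:
        raise BudgetExceeded(f"{user_id}: budget exhausted")


def record_spend(user_id: str, cost_eur: float) -> None:
    """Run after the response, once the actual token cost is known."""
    BUDGETS[user_id].spent_eur += cost_eur
```

A user missing from the table is refused just like a user who has spent everything, which matches the rule that a user with no budget never burns a single token.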


What the operator actually sees

A control plane that no one can run is not a control plane. The architecture above produces three operator-visible surfaces, one per role:

The developer sees an OpenAI-compatible endpoint, a self-service page to create and revoke personal credentials, and a per-key spend view. Their workflow does not change. Their existing tools work without modification.

The admin sees a list of users, the budgets they hold, the models they can access, and the rules the Guard is enforcing. They can change all of these without redeploying anything.

The security officer sees the audit trail, filtered by user, by model, by finding type, by time window. They get exports for compliance. They get a digest in their inbox when something deserves attention.

These three surfaces are what an organization actually buys when it adopts the platform. The proxy is the engine. The dashboards are the steering wheel. Both are required. A control plane with no UI is a research project; a UI with no control plane is theater.


Taking it further: closing the inference gap

There is one limit to the architecture above that is worth being honest about.

The proxy controls the request path. It sees the prompt, it inspects it, it can refuse or rewrite it, and it knows exactly which model received it. But the moment the prompt leaves the proxy and reaches the model — even a model hosted in Europe by a European provider — the content of the prompt is, briefly, in plaintext on someone else’s hardware.

This is acceptable for most enterprise traffic. But there are workloads — financial reconciliations, medical records, classified material — where “plaintext on someone else’s hardware, even briefly” is not acceptable, no matter the jurisdiction.

Path 1 — End-to-end encrypted inference

The first path keeps the architecture intact and changes only the model client. Providers expose an inference API where the prompt is encrypted by the caller, decrypted only inside a hardware enclave on the inference side, processed by an open-source model running inside that enclave, and re-encrypted before the response leaves. The provider never sees the plaintext, by construction.

For a control plane built around a clean OpenAI-compatible interface, swapping in an end-to-end encrypted backend is a plumbing change in one place. The router learns about a new model name, the model client learns about a new transport, and the rest of the architecture does not move.

Path 2 — GPU TEEs

The second path goes one step further: run the model itself on a GPU with hardware-enforced confidentiality. NVIDIA H100 Confidential Computing, and similar designs, let a GPU join an attested enclave, encrypt the data it exchanges with the host, and refuse to expose its memory to anything outside that enclave. An open-source model running on such a GPU never has its weights, activations or prompts in cleartext outside the attested boundary.

Both paths matter for the same reason: they let the control plane offer a sovereignty cursor that goes further than EU hosting. Today the cursor has two notches: “frontier model in EU” and “open-source model in EU”. Tomorrow it will have a third: “open-source model in EU under hardware-attested isolation”. The architecture does not need to change to support that third notch.

This is the test of whether you have built a control plane or just a proxy. A real control plane outlives the model providers it routes to. It survives the next generation of inference, and the one after.


The point worth keeping

The four properties — coverage, identity, inspection, isolation — are not exotic. They are what your network team has been doing for the public web since the 2000s. The reason they have to be re-implemented for AI is not that AI is special; it is that AI traffic does not pass through the same equipment.

A control plane is the answer to a question the previous article posed: do you know what your AI is doing? Without one, the answer is no, no matter how convincing your policy document is. With one, the answer is “look at the audit trail”.

The architecture is not particularly clever. It is a reverse proxy in front of a model gateway, with an SSO-backed identity layer, an inspection step that scans request bodies for things you don’t want to leak, an audit trail that records what the inspection found, and a budget check that runs before any request reaches a model. Every piece is boring on its own. Together they produce the only thing a regulator, a CISO or a CFO actually wants: the ability to answer the question.

The interesting part is not the boxes. It is what happens once you have them. You stop arguing about whether to allow ChatGPT. You allow it on the requests where it is appropriate, you route the others to a sovereign model, you log every decision, you bill it back to the team that made it, and the conversation moves from “AI is too risky” to “which workloads benefit most from which models”. Which is the conversation enterprises were supposed to be having two years ago.

If you take only one idea away from this piece: the control plane is not an AI feature. It is the unglamorous network infrastructure that turns “we have AI” into “we know what our AI is doing”. It is the difference between a policy and a fact.

And it is, finally, the only honest response to the previous article’s question. Saying no to AI does not protect anyone. Saying yes, and here is exactly what is happening every time someone uses it — that’s the answer.

This piece is part of a series on building a sovereign control layer for enterprise AI. The previous article, “Enterprise AI: The All-or-Nothing Trap”, argued why prohibition fails. This one describes what to build instead.