Insight · 03 May 2026 · 10 min read · #agentic #runtime #operations

Running AI agents on Red Hat OpenShift AI: lessons from a sovereign deployment

Running AI agents in production is a runtime engineering problem before it is an agent engineering problem — and the runtime layer determines what governance is feasible above.

For: Platform engineering · Runtime engineers · MLOps leads · Infrastructure architects
From the team · Alquimia

The platform engineering lead gets a call. The compliance team has been asked, by a regulator, to explain a specific decision an AI agent made three months ago — a credit application that was auto-declined at 02:47 on a Tuesday. The team pulls the audit trail. Prompt: captured. Tools called: captured. Model and version: captured. Output: captured. Outcome: captured.

The regulator reads the trace and asks the next question. “I see what the agent did. I need to understand why the model reached that conclusion.” The platform engineering lead pauses. The platform captured every decision the agent made. What the model did inside the inference is on no dashboard the team has ever built — because that information lives in a different layer.

The story is becoming common in regulated industries, and it points to a structural realization. Running AI agents in production is a runtime engineering problem before it is an agent engineering problem. The platform sits on top of a runtime, and the runtime is where the inference actually executes. The runtime layer is where GPU economics, hardware-level observability, and model-level explainability live — and the choices an organization makes there determine what governance is feasible above.

01 · The shift: AI agents are inference workloads now

A year ago, “AI agent” was a phrase that mostly meant “let's call an API”. The model lived behind a vendor endpoint; the team's job was to construct the prompt, parse the response, and ship the result. The inference itself was someone else's problem.

That approach still works for experiments. It works less well for production AI at the scale and governance posture that regulated organizations require. The agent that matters today is more than a single API call. It is a bundle of inference workloads running together — a primary language model that produces the response, smaller guardrail models that screen input and output for safety and compliance, embedding models for retrieval, sometimes a reasoning model for steps that need more horsepower. Each of these is an inference workload with its own hardware footprint, latency profile, and observability needs.

The MLOps teams that have been running production ML for years know this. The agent teams now meeting them in production are arriving from a different direction — they built the agent in a notebook, and they assumed the inference layer was somebody else's concern.

That changes when the agent moves to production traffic. The inference layer becomes the operational substrate of the agent. GPU utilization, model loading times, batch throughput, and memory pressure are now properties the platform team has to understand — because if the runtime layer fails, the agent fails, and the customer is the one who notices.

02 · What the runtime layer has to do

For an organization-grade AI agent deployment, the runtime layer has to provide three things that the agent platform above it cannot.

GPU economics that survive the second agent. A single AI agent is rarely a single model. A production-grade agent typically calls a primary LLM, runs the input and output through one or two guardrail models, and uses an embedding model for retrieval — three to five models per request. Multiply that by ten agents and the naive answer (dedicate a GPU to each model) hits the cost wall in the first month. The runtime layer has to manage GPU resources as a shared, scheduled, multi-tenant surface — so that GPU spend scales with inference volume, regardless of how many agents the organization runs on top.
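
To make that footprint concrete, the sketch below shows two of those workloads expressed as KServe InferenceService resources, the model-serving primitive OpenShift AI builds on. Names, runtimes, and storage URIs here are illustrative assumptions, not a prescribed layout; each workload requests one nvidia.com/gpu, which under the time-slicing configuration shown in the next section resolves to a shared slice rather than a dedicated card.

```yaml
# Illustrative sketch: two of one agent's inference workloads as KServe
# InferenceServices, the model-serving primitive OpenShift AI builds on.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: primary-llm                        # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM                         # assumes a vLLM ServingRuntime is installed
      storageUri: pvc://models/primary-llm # illustrative model location
      resources:
        limits:
          nvidia.com/gpu: "1"              # one slice under time-slicing, not a whole card
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: guardrail-screener                 # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: pvc://models/guardrail-screener
      resources:
        limits:
          nvidia.com/gpu: "1"
```

Add an embedding model and an occasional reasoning model, and the three-to-five-workloads-per-request shape of a production agent becomes visible in the cluster inventory itself.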

Hardware-level observability, alongside agent-level observability. Alquimia Agentic Platform observes the agent: every prompt, every tool call, every decision. That is the right scope for the agent layer. It is also incomplete. The runtime layer has to expose the view beneath — GPU utilization per inference, model loading and warm-up times, queue depth at the inference server, and the resource competition between concurrent workloads. When latency starts to drift in production, the team cannot debug it from prompts alone. They need to see what the GPU was doing at the moment the latency hit.

Model explainability that an auditor will accept. This is the property that has changed most recently. Compliance posture in regulated environments has shifted from “did the agent give the right answer?” to “why did the model reach this conclusion?” Producing the prompt, tools, and output is no longer sufficient for high-stakes decisions. The runtime layer has to expose the chain of reasoning inside the model — feature attributions, counterfactual explanations, decision boundaries — in a form an external auditor can interpret.

These three properties — GPU economics, hardware observability, model explainability — are runtime-layer concerns. None of them belong inside an individual agent. None of them belong inside the platform's agent-level observability either. They live one layer below, on the inference substrate.

03 · Why Red Hat OpenShift AI as the runtime layer

Red Hat OpenShift AI is the runtime layer we and our customers have chosen for production-grade AI agent deployments. The choice is specific. Three capabilities of the platform map directly to the three runtime-layer concerns above.

Time-Slicing for inference workloads. OpenShift AI supports NVIDIA's GPU time-slicing capability, which lets multiple inference workloads share GPU resources concurrently — primary model, guardrail models, embedding model, all on the same hardware, scheduled in time slices. The practical effect for an agent fleet: the same GPU cluster that ran one agent at 60% utilization can run five agents at 90% utilization, without provisioning dedicated cards per workload. GPU spend stops scaling with agent count and starts scaling with inference volume.
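
Concretely, time-slicing is configured through the NVIDIA GPU Operator that OpenShift AI relies on: a ConfigMap tells the operator's device plugin to advertise each physical GPU as several schedulable replicas. A minimal sketch, with the namespace and the replica count as illustrative assumptions:

```yaml
# Minimal time-slicing sketch for the NVIDIA GPU Operator's device plugin.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config        # illustrative name
  namespace: nvidia-gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # each physical GPU is advertised as 4 slices
```

The operator's ClusterPolicy is then pointed at this ConfigMap (its spec.devicePlugin.config field), after which a node with one physical card reports nvidia.com/gpu: 4 and the scheduler packs four inference workloads onto it. One caveat worth stating: time-slicing shares compute, not memory, so the co-scheduled models must fit in GPU memory together. Small guardrail and embedding models make good co-tenants for a primary LLM for exactly this reason.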

Native model and hardware observability. OpenShift AI ships with model serving primitives that expose GPU utilization, model loading metrics, inference latency distributions, and queue depths — all integrated with the Kubernetes-native observability surface the platform engineering team already operates. The agent-level traces from Alquimia Agentic Platform and the model-level metrics from OpenShift AI compose into one operational picture. When a customer interaction is slower than expected on Tuesday afternoon, the team can trace the latency from the agent invocation, down through the platform, into the inference server, onto the GPU — in a single drill-down.
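
Because those metrics land in the cluster's Prometheus stack, the alerting surface is ordinary Kubernetes tooling. The sketch below defines an alert over the DCGM exporter's GPU utilization metric; the threshold, duration, and namespace are illustrative assumptions, and it presumes the GPU Operator's DCGM exporter is being scraped by cluster monitoring:

```yaml
# Illustrative alert on sustained GPU saturation, built on the
# DCGM_FI_DEV_GPU_UTIL metric exported by the NVIDIA DCGM exporter.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-inference-alerts       # illustrative name
  namespace: nvidia-gpu-operator
spec:
  groups:
    - name: gpu-runtime
      rules:
        - alert: GPUSaturatedUnderInference
          expr: avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL) > 90
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: >-
              GPU {{ $labels.gpu }} on {{ $labels.Hostname }} has averaged over
              90% utilization for 15 minutes; inference queueing is likely.
```

When an alert like this fires during the Tuesday-afternoon latency drift, the team knows which card to look at before opening a single agent trace.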

TrustyAI for model explainability. OpenShift AI ships TrustyAI, an open-source explainability toolkit that runs inside the same on-prem deployment. TrustyAI provides feature attributions and counterfactual explanations for the decisions made by models running on the platform — the kind of evidence an external auditor will recognize. And because TrustyAI is open source, the explainability pipeline is reproducible: the audit committee gets evidence built on tooling whose code they can inspect.
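
Deploying TrustyAI next to the models is itself a Kubernetes-native step: the TrustyAI operator watches for a TrustyAIService resource in the data-science project. A minimal sketch, with the project name and storage sizing as illustrative assumptions:

```yaml
# Minimal TrustyAIService sketch; the TrustyAI operator provisions the
# service and connects it to the model servers in the same project.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: TrustyAIService
metadata:
  name: trustyai-service
  namespace: credit-agents        # illustrative project name
spec:
  storage:
    format: PVC                   # persist inference payloads for later analysis
    folder: /inputs
    size: 1Gi
  data:
    filename: data.csv
    format: CSV
  metrics:
    schedule: 5s                  # how often scheduled metrics are recomputed
```

Explanations and fairness metrics are then requested through the service's REST API, against the inference payloads it has persisted, which is what makes the months-later regulator question answerable at all.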

All three capabilities are open-source-rooted and run on the infrastructure the organization controls. That matters because the same sovereignty argument that brought the organization to a sovereign agent platform applies one layer below. There is no sovereign AI agent if the runtime underneath it is a black box.

04 · The architecture pattern that emerges

When the three layers are stacked correctly, the pattern is clean.

Agent layer: Alquimia Agentic Platform (Studio · Runtime · Registry · Observability · Governance · SDK + CLI)
Runtime layer: Red Hat OpenShift AI (Time-Slicing · model serving · TrustyAI · Kubernetes-native observability)
Foundation: the organization's infrastructure (on-prem · private cloud · hybrid)

Three layers — independently replaceable.

This pattern is the basis of the architecture ArSAT chose for their production AI agent deployment, documented in the public April 2026 press release and explored in our customer support modernization use case page. The reason we recommend the pattern to other organizations is that each layer is independently replaceable.

The platform can be replaced without re-architecting the runtime. The runtime can be replaced without re-writing the agents. The foundation — Kubernetes, OpenShift, the underlying hardware — can be replaced without re-tooling the platform above it. Replaceability is core to the design. It is what makes the pattern survive procurement decisions over the years.

Sovereignty becomes a property of the whole stack. Every decision an AI agent makes is traceable from the agent invocation, through the platform's audit trail, into the runtime's inference metrics, onto the GPU that produced the response — all on infrastructure the organization controls. The audit committee's question about explainability has an answer at every layer.

05 · What to do this quarter

Three diagnostic items for the platform engineering team.

First, audit the runtime layer underneath your current AI agents. Most teams have not done this. List, for each agent in production, what runs the inference (vendor endpoint, in-cluster model server, dedicated GPU cluster). Where the answer is “we are not sure”, the runtime layer has not yet been chosen — it has been improvised.

Second, check the explainability surface. Can the team produce, on demand, the chain of reasoning behind a specific model decision from the last ninety days — beyond the prompt and the output, the feature attribution or counterfactual that explains why the model reached its conclusion? Where the answer is “no”, the regulator's next question has no answer.

Third, decide whether the runtime layer is part of the organization's procurement conversation, or an afterthought. GPU economics, hardware observability, and model explainability are runtime-layer choices, and they shape what governance is feasible above. The audit committee will care.

The agent layer makes the case for AI in production. The platform layer makes it operable. The runtime layer is what makes it sustainable — economically, observationally, and from an audit posture. At Alquimia we craft Agentic Platform for the agent layer, integrate with Red Hat OpenShift AI for the runtime layer, and publish Gaussia for the eval layer that spans them both. If your team is building toward production-grade AI agents and the runtime layer is on this quarter's roadmap, we would be glad to walk you through how we approach it.
