Running AI Agents on Kubernetes: Agent Sandbox, AI Gateway, and the Platform Gaps They Fix

Written by Bits Lovers

Kubernetes is finally getting serious about the parts of AI systems that do not fit a normal Deployment. On March 9, 2026, the Kubernetes community announced the AI Gateway Working Group. Eleven days later, it published a deep look at Agent Sandbox. Those two signals belong together. One is about how requests reach AI workloads. The other is about how long-lived, isolated agent runtimes actually live on the cluster.

That split matters because platform teams keep trying to solve both problems with one tool. An ingress controller cannot decide which model replica is least saturated based on KV-cache pressure. A plain Deployment is also a bad fit for a stateful agent runtime that needs a stable identity, persistent working files, and an isolated lifecycle that looks more like a lightweight VM than a stateless web pod.

If you want the current AWS framing first, start with the AI on EKS guide. If you are building more AWS-native agent runtimes, the Bedrock AgentCore shell and session storage guide is the right comparison point. This post is about the Kubernetes-native side: AI Gateway for inference traffic, Agent Sandbox for runtime isolation, and why platform teams should treat them as complementary layers instead of competitors.

The Problem AI Gateway Is Trying To Solve

Normal Gateway API works well for ordinary HTTP routing. Match the host, match the path, send traffic to a backend. Generative AI traffic is uglier. The request does not just need a pod that is healthy. It needs a pod that is a good fit for the model, the LoRA adapter, the queue state, and the actual capacity left on the accelerator.

The Gateway API Inference Extension docs describe the flow clearly. Gateway API still does the first routing decision using Gateway and HTTPRoute. But if the backend is an InferencePool, the gateway hands request context to an endpoint selection extension, which can look at model-server metrics such as KV-cache utilization, queue length, and active adapters before telling the gateway which endpoint should receive the request.
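Concretely, that flow starts with an ordinary HTTPRoute whose backend is an InferencePool instead of a Service. As a hedged sketch (the group and kind follow the Inference Extension docs, but the exact apiVersion and field names depend on the release you install, and the names here are hypothetical):

```yaml
# Hedged sketch: an HTTPRoute whose backendRef is an InferencePool rather
# than a Service. The gateway matches the route as usual, then delegates
# endpoint choice to the pool's selection extension.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway       # hypothetical Gateway name
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: vllm-llama-pool     # hypothetical pool name
```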

That is a different class of routing logic. It is closer to scheduler-assisted request placement than to normal L7 load balancing.

The Kubernetes AI Gateway Working Group exists because the ecosystem needs common behavior here. Otherwise every vendor builds its own ad hoc inference router, its own request metadata contract, and its own failure semantics. That becomes platform debt fast.

What Agent Sandbox Solves Instead

Agent Sandbox is not an inference router. It is a runtime pattern for workloads that need stable identity, persistence, and isolation, but do not map well to Deployments or StatefulSets.

The project defines a Sandbox CRD and a controller that manages a single, stateful pod with stable network identity and persistent storage. The extensions layer adds three especially useful pieces:

  • SandboxTemplate for reusable definitions
  • SandboxClaim so users request a sandbox without knowing the lower-level details
  • SandboxWarmPool so pre-warmed sandboxes are ready before a user or agent claims one
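The claim flow those pieces enable can be sketched like this. The kinds come from the extensions layer described above, but the apiVersion and field names here are assumptions and may differ from the installed CRDs:

```yaml
# Hedged sketch: a user claims a sandbox from a template published by the
# platform team, without touching the lower-level Sandbox definition.
# apiVersion and field names are assumptions, not the real CRD schema.
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxClaim
metadata:
  name: coding-agent-session       # hypothetical
spec:
  templateRef:
    name: python-agent-template    # a SandboxTemplate defined elsewhere
```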

That last piece is what makes the project feel practical for AI agents rather than merely interesting. Agent runtimes often need to pull tools, warm a model, hydrate state, mount working storage, or restore prior context. Waiting for all of that at claim time is exactly how a clean demo turns into a miserable production experience.

Agent Sandbox gives you a Kubernetes-native way to say: keep some isolated runtimes warm, preserve a stable identity, and hand them out on demand.
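A warm pool declaration, again as a hedged sketch with assumed apiVersion and field names, makes the "keep some warm" part explicit and auditable:

```yaml
# Hedged sketch: keep N pre-warmed sandboxes ready to claim. Field names
# (replicas, templateRef) are assumptions based on the project's description.
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxWarmPool
metadata:
  name: python-agent-warmpool
spec:
  replicas: 5                      # standing warm capacity, and a standing bill
  templateRef:
    name: python-agent-template
```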

The Better Mental Model: North-South Plane Vs Runtime Plane

This is the cleanest way to think about the two efforts.

AI Gateway is the north-south traffic plane. It decides how an inference request enters the platform and which inference-serving endpoint should handle it.

Agent Sandbox is the runtime plane. It gives you a place for long-lived, stateful, singleton workloads that behave more like agents, coding sandboxes, or per-user workspaces than replicated stateless services.

You can absolutely need both in the same platform.

A request comes through a gateway and gets routed to the right inference pool for model execution. That model or controller then hands a tool task to an isolated sandbox where the agent can write files, run commands, or maintain long-lived session state. The gateway is not the sandbox. The sandbox is not the router.

This is also why the comparison to Bedrock AgentCore Gateway's server-side tool execution is useful. Bedrock Gateway centralizes tool access. Kubernetes AI Gateway is about request steering to inference backends. They solve related but different coordination problems.

Where The Gateway API Inference Work Already Looks Useful

The inference extension is already past the vague stage. The project documents an InferencePool resource and a model where HTTPRoute can send traffic to that pool, while an endpoint picker extension makes fine-grained placement choices.
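The pool itself selects the model-server pods and names the endpoint picker. As a hedged sketch, with an apiVersion and field shapes that track whatever Inference Extension release you deploy:

```yaml
# Hedged sketch: an InferencePool selecting model-server pods and delegating
# endpoint choice to a picker extension. Names are hypothetical; check the
# installed CRD for the exact apiVersion and fields.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool
spec:
  selector:
    app: vllm-llama                # labels on the model-server pods
  targetPortNumber: 8000
  extensionRef:
    name: llama-endpoint-picker    # the endpoint selection extension
```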

That buys you several things platform teams actually need.

The first is backend awareness. A generic ingress controller knows health. The inference extension can know queue depth and cache pressure.

The second is model lifecycle control. The project has explicit work for model rewrites, traffic splitting, and pool rollouts. That is the right shape for canarying model versions or aliasing a friendly model name to the real backend version without forcing every client to change requests.
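Canarying a model version then falls out of standard HTTPRoute weighting across two pools. A hedged sketch with hypothetical pool names:

```yaml
# Hedged sketch: weight traffic 90/10 between the current and candidate
# model pools behind one route. The weighting mechanism is standard
# Gateway API; the pool names are hypothetical.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-canary
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-v1-pool
          weight: 90
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-v2-pool
          weight: 10
```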

The third is conformance pressure. The docs call out conformance expectations for gateway implementations, routing extensions, and model server frameworks. That matters because AI platforms already have enough custom glue.

If this stays healthy, platform teams get a better standard layer for self-hosted model serving than “pick a vendor router and hope it still fits next year.”

Where Agent Sandbox Looks Better Than Plain Kubernetes Primitives

A Deployment wants fungible replicas. A StatefulSet wants numbered replicas with ordered rollout. Many agent runtimes want neither.

A coding agent, browser automation runtime, or per-user analysis environment often wants exactly one pod, a stable identity, persistent files, and the ability to pause or prewarm. The Agent Sandbox README is explicit about that goal: a lightweight, single-container VM experience built on Kubernetes primitives.

That framing is honest. Plenty of teams are already abusing StatefulSets or homegrown controllers to approximate the same behavior. Agent Sandbox takes the pattern and makes it declarative.
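The declarative version of that pattern is a single Sandbox object. As a hedged sketch against the README's description, not a copy of the real CRD schema (apiVersion, field names, and image are assumptions):

```yaml
# Hedged sketch of one Sandbox: a single pod with stable identity, managed
# by the sandbox controller rather than a Deployment or StatefulSet.
apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: analysis-workspace-jane    # hypothetical per-user sandbox
spec:
  podTemplate:
    spec:
      containers:
        - name: runtime
          image: registry.example.com/agent-runtime:latest  # hypothetical image
```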

The installation is simple enough to test today:

export VERSION="vX.Y.Z"

kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/manifest.yaml
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/extensions.yaml

Once the controller is installed, the resources you care about operationally are easy to inspect:

kubectl get sandbox
kubectl get sandboxclaim
kubectl get sandboxwarmpool

That is not a full production design, but it is enough to validate whether the runtime model fits your workload better than another round of StatefulSet compromise.

The Real Operational Gaps That Still Belong To You

Neither project removes platform engineering. They just move the work into cleaner boundaries.

For AI Gateway, you still own auth, policy, and budgeting. The router can make smarter placement decisions, but it does not decide who is allowed to call the model, which tenants can burn which GPU pools, or how you meter expensive requests.

For Agent Sandbox, isolation is better than a normal shared pod model, but you still need to design the node strategy carefully. If you are running untrusted code, you should think about runtime class, node isolation, storage policies, network policy, image provenance, and who can claim a warm sandbox in the first place. Warm pools are wonderful for latency and terrible for cost discipline if they are left unconstrained.
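As one concrete piece of that node strategy, a default-deny network policy on the sandbox namespace plus a hardened runtime class is standard Kubernetes, assuming a RuntimeClass such as gVisor is installed on the cluster:

```yaml
# Standard Kubernetes controls for a sandbox namespace. Assumes a
# RuntimeClass named "gvisor" (or similar) already exists on the nodes.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-default-deny
  namespace: sandboxes
spec:
  podSelector: {}                  # every pod in the sandbox namespace
  policyTypes: [Ingress, Egress]   # deny all unless another policy allows it
```

The pod template the sandbox controller stamps out would then set `runtimeClassName: gvisor` so untrusted agent code runs under the hardened runtime rather than the default container runtime.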

There is also a design trap here. Platform teams sometimes hear “AI agents on Kubernetes” and assume the runtime must be Kubernetes-native just because the rest of the platform is. That is not always true. Some teams will still be better off with a managed runtime and a Kubernetes-hosted inference layer. Others will want the full cluster-native stack. The AWS Agent Registry preview post gets at the governance side of this problem from another angle: agent platforms get messy the moment reuse starts.

The Three Mistakes I Would Expect First

The first mistake is putting agent state in the wrong place. The gateway should not own runtime state. The sandbox should not pretend to be the request router.

The second mistake is ignoring warm capacity economics. Pre-warmed sandboxes and always-hot inference pools reduce latency, but they also create a standing bill. Tie them to real service objectives, not just engineering preference.

The third mistake is assuming Kubernetes primitives are enough without workload-specific telemetry. AI Gateway only gets interesting because it can route on inference-specific signals. If your model servers do not expose the right metrics, the fancy routing layer collapses back toward generic load balancing.

When I Would Use This Stack

I would use the gateway work if I were standardizing self-hosted inference on Kubernetes and needed a routing layer that understands model-serving realities. I would use Agent Sandbox if the runtime needed per-user or per-task isolation, a stable identity, persistent files, and warm reuse.

I would not force either project onto a team that only needs a simple model endpoint or a short-lived batch job. The sophistication is justified when the platform actually has AI-routing complexity or agent-runtime complexity. Otherwise you are just inventing new nouns.

The Practical Recommendation

Treat AI Gateway and Agent Sandbox as two building blocks in an emerging Kubernetes AI platform, not as the entire platform.

Use AI Gateway to standardize inference traffic and model-aware routing. Use Agent Sandbox where agent runtimes need singleton lifecycle, warm starts, and stable state. Keep authentication, quota, audit, and cost controls as separate first-class platform concerns.

That is the right bar for platform engineering in this area. Cleaner routing. Cleaner runtime isolation. No pretending those two improvements remove the need for discipline everywhere else.
