Warm-pooled sandboxes for the Kubernetes compute driver #1879

@rmalani-nv

Problem Statement

Creating a Kubernetes sandbox is a cold start: the gateway creates a Sandbox CR, the agent-sandbox controller schedules a Pod, the image is pulled (or read from cache), the supervisor boots, and only then is the sandbox Ready. Measured locally, this takes roughly 4s or more even with the image preloaded. For interactive agent workloads and high-churn "fresh sandbox per task" usage, that latency dominates time-to-first-action.

We want near-instant Kubernetes sandbox provisioning via a warm pool of pre-provisioned, ready Pods.

This issue scopes the design (moved here from PR #1813 per the issue-first RFC process). It supersedes the draft RFC and incorporates the security review on that PR.

Proposed Design

Adopt the upstream agent-sandbox warm-pool extension CRDs (SandboxTemplate, SandboxWarmPool, and SandboxClaim, in extensions.agents.x-k8s.io/v1alpha1), which already ship in the v0.4.6 release that OpenShell pins for the core Sandbox CRD. The gateway pre-declares operator-owned warm pools; on CreateSandbox, when the requested shape matches a pool, the Kubernetes driver creates a SandboxClaim that binds a pre-warmed Pod in ~0.1s. Non-matching requests fall back to the existing cold path.
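The match-or-fall-back decision could be sketched roughly as below. This is an illustration only: `SandboxShape`, `WarmPool`, `ProvisionPath`, and `select_path` are hypothetical names, the field set is an assumption, and exact-equality matching is one possible policy rather than something this issue has settled.

```rust
/// Hypothetical description of a requested sandbox shape. The field set is an
/// assumption for illustration, not the driver's real request type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SandboxShape {
    pub image: String,
    pub runtime_class: Option<String>,
    pub cpu_millis: u32,
    pub memory_mib: u32,
}

/// An operator-declared warm pool (hypothetical type).
#[derive(Debug, Clone)]
pub struct WarmPool {
    pub name: String,
    pub shape: SandboxShape,
}

/// Which provisioning path a CreateSandbox request takes.
#[derive(Debug, PartialEq, Eq)]
pub enum ProvisionPath {
    /// Create a SandboxClaim against the named pool.
    WarmClaim { pool: String },
    /// Create a Sandbox CR directly (the existing cold path).
    Cold,
}

/// Exact match only: any difference in shape falls back to the cold path.
pub fn select_path(requested: &SandboxShape, pools: &[WarmPool]) -> ProvisionPath {
    pools
        .iter()
        .find(|p| p.shape == *requested)
        .map(|p| ProvisionPath::WarmClaim { pool: p.name.clone() })
        .unwrap_or(ProvisionPath::Cold)
}
```

Keeping the decision a pure function of the request and the declared pools makes the fallback easy to unit-test without a cluster.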

Identity re-anchor (security-critical)

Today validate_sandbox_owner_reference() (crates/openshell-server/src/auth/k8s_sa.rs) cross-checks the owning Sandbox CR label openshell.ai/sandbox-id against the Pod annotation openshell.io/sandbox-id. Warm Sandboxes are created generically by the pool controller and carry agents.x-k8s.io/claim-uid + a controlling SandboxClaim ownerReference instead, so identity must re-anchor to the gateway-created SandboxClaim. The IssueSandboxToken warm path must enforce a strict fail-closed chain, with any mismatch rejecting:

  • TokenReview audience/SA/pod-name/pod-UID match the live Pod.
  • Pod has exactly one controlling Sandbox ownerReference (matching UID).
  • That Sandbox has exactly one controlling SandboxClaim ownerReference (matching name + UID); agents.x-k8s.io/claim-uid agrees.
  • The live SandboxClaim exists, has the expected UID, is bound, and status.sandbox.name equals the owning Sandbox.
  • The gateway Store has a durable, gateway-created record for (namespace, claim name, claim UID) containing the expected sandbox ID, and it equals the Pod's openshell.io/sandbox-id annotation.
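A minimal sketch of how the chain above might be checked, assuming the TokenReview step has already passed and the caller has fetched the live objects. All type and function names here (`PodView`, `SandboxView`, `ClaimView`, `StoreRecord`, `validate_warm_chain`) are hypothetical and not taken from k8s_sa.rs; namespace handling is omitted for brevity.

```rust
/// One entry of metadata.ownerReferences, reduced to the fields checked here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OwnerRef {
    pub kind: String,
    pub name: String,
    pub uid: String,
    pub controller: bool,
}

/// Hypothetical snapshots of the live objects, as fetched by the caller.
#[derive(Debug, Clone)]
pub struct PodView {
    pub owner_refs: Vec<OwnerRef>,
    pub sandbox_id_annotation: Option<String>, // openshell.io/sandbox-id
}

#[derive(Debug, Clone)]
pub struct SandboxView {
    pub name: String,
    pub uid: String,
    pub owner_refs: Vec<OwnerRef>,
    pub claim_uid: Option<String>, // value of agents.x-k8s.io/claim-uid
}

#[derive(Debug, Clone)]
pub struct ClaimView {
    pub name: String,
    pub uid: String,
    pub bound: bool,
    pub status_sandbox_name: Option<String>, // status.sandbox.name
}

/// Gateway Store record written when the gateway created the claim.
#[derive(Debug, Clone)]
pub struct StoreRecord {
    pub claim_name: String,
    pub claim_uid: String,
    pub sandbox_id: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Reject { PodOwner, SandboxOwner, ClaimUid, ClaimState, StoreMapping }

/// Returns the controlling ownerReference of `kind` only if there is exactly one.
fn sole_controller<'a>(refs: &'a [OwnerRef], kind: &str) -> Option<&'a OwnerRef> {
    let mut it = refs.iter().filter(|r| r.controller && r.kind == kind);
    match (it.next(), it.next()) {
        (Some(r), None) => Some(r),
        _ => None,
    }
}

/// Walks Pod -> Sandbox -> SandboxClaim -> Store; any mismatch rejects.
pub fn validate_warm_chain(
    pod: &PodView,
    sandbox: &SandboxView,
    claim: &ClaimView,
    record: Option<&StoreRecord>,
) -> Result<String, Reject> {
    let pod_owner = sole_controller(&pod.owner_refs, "Sandbox").ok_or(Reject::PodOwner)?;
    if pod_owner.uid != sandbox.uid {
        return Err(Reject::PodOwner);
    }
    let sb_owner =
        sole_controller(&sandbox.owner_refs, "SandboxClaim").ok_or(Reject::SandboxOwner)?;
    if sb_owner.name != claim.name || sb_owner.uid != claim.uid {
        return Err(Reject::SandboxOwner);
    }
    if sandbox.claim_uid.as_deref() != Some(claim.uid.as_str()) {
        return Err(Reject::ClaimUid);
    }
    if !claim.bound || claim.status_sandbox_name.as_deref() != Some(sandbox.name.as_str()) {
        return Err(Reject::ClaimState);
    }
    let rec = record.ok_or(Reject::StoreMapping)?;
    if rec.claim_name != claim.name || rec.claim_uid != claim.uid {
        return Err(Reject::StoreMapping);
    }
    if pod.sandbox_id_annotation.as_deref() != Some(rec.sandbox_id.as_str()) {
        return Err(Reject::StoreMapping);
    }
    Ok(rec.sandbox_id.clone())
}
```

The function only returns a sandbox ID on the single success path; every other branch is an explicit rejection, which is the fail-closed property the adversarial tests would target.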

The mapping must live in the shared gateway Store (HA-safe; any replica may handle bootstrap), and claim creation must be ordered before a Pod can bootstrap. Users must not be able to set reserved metadata (openshell.io/*, agents.x-k8s.io/*, identity/SPIFFE keys, SandboxClaim.spec.additionalPodMetadata, spec.env).
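The reserved-metadata rule could be enforced with a prefix check along these lines. This is a sketch under assumptions: `is_reserved` and `check_user_metadata` are hypothetical helpers, only the two prefixes named above are covered, and the identity/SPIFFE keys are left out because this issue does not name them.

```rust
/// Key prefixes users must not set, taken from the list above.
/// Identity/SPIFFE keys are not modelled because their names are not given.
const RESERVED_PREFIXES: [&str; 2] = ["openshell.io/", "agents.x-k8s.io/"];

/// True when a label or annotation key falls under a reserved prefix.
pub fn is_reserved(key: &str) -> bool {
    RESERVED_PREFIXES.iter().any(|prefix| key.starts_with(*prefix))
}

/// Returns Ok(()) when no key is reserved, otherwise the offending keys.
pub fn check_user_metadata(keys: &[&str]) -> Result<(), Vec<String>> {
    let bad: Vec<String> = keys
        .iter()
        .copied()
        .filter(|k| is_reserved(k))
        .map(|k| k.to_string())
        .collect();
    if bad.is_empty() { Ok(()) } else { Err(bad) }
}
```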

Workspace isolation: separate the private workspace from shared data

The data-isolation concern is about the writable, per-agent workspace — not the large files sandboxes legitimately read. Conflating the two is the trap. The design uses two volumes with opposite requirements:

1. Writable per-agent workspace (/sandbox) — agent-private state (code, credentials, gateway JWTs, intermediate files). Must never be visible to another agent. Isolation options:

  • Ephemeral (e.g. emptyDir) when reschedule-survival is not required — fail-safe: the kubelet deletes it when the Pod is gone, so there is no object to orphan and no dependence on cleanup being correct.
  • Per-Sandbox PVC, single-use, destroyed on teardown when the workspace must survive pod rescheduling for that agent — this requires reliable destruction on every teardown path and a reclaimPolicy that actually wipes backing storage (Delete, not Retain).
  • Either way: a warm Pod is seeded pristine from the image and never runs user code while pooled; a claimed Sandbox/Pod/(PVC) is single-use and never returned to the pool.
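The two isolation options reduce to a small acceptance rule, sketched below. `WorkspaceModel`, `ReclaimPolicy`, and `isolates_workspace` are hypothetical names used only to make the rule concrete; the real check would read the template and StorageClass.

```rust
/// Reclaim policy of the volume backing a per-Sandbox PVC.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReclaimPolicy {
    Delete,
    Retain,
}

/// The two workspace models under discussion (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkspaceModel {
    /// e.g. emptyDir: removed by the kubelet together with the Pod.
    Ephemeral,
    /// Single-use PVC; only safe if every teardown path deletes it and the
    /// backing storage is actually wiped.
    PerSandboxPvc { reclaim: ReclaimPolicy, deleted_on_teardown: bool },
}

/// True when the model cannot leave one agent's data behind for another.
pub fn isolates_workspace(model: WorkspaceModel) -> bool {
    match model {
        WorkspaceModel::Ephemeral => true,
        WorkspaceModel::PerSandboxPvc { reclaim, deleted_on_teardown } => {
            deleted_on_teardown && reclaim == ReclaimPolicy::Delete
        }
    }
}
```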

2. Shared large-file volume (datasets, models, caches) — mounted read-only into every warm + claimed sandbox via the SandboxTemplate, so big files are available on the filesystem without streaming them through the relay. Sharing here is intended and safe because it is read-only and holds no per-agent secrets — there is nothing private to leak. It is long-lived and untouched by pooling/claiming (not per-agent state), and being pre-attached, a claimed sandbox reads it instantly.

  • Read-sharing across Pods on different nodes requires an access mode / provisioner that supports it (ReadOnlyMany/ReadWriteMany: NFS, CephFS, EFS, Filestore, or a FUSE/object-store CSI). The k3d local-path provisioner (RWO, single-node) does not support multi-reader sharing.
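A pool-declaration check for the shared volume might look like the sketch below. The enum mirrors the Kubernetes PersistentVolume access modes; `supports_multi_node_readers` is a hypothetical helper, not an existing function.

```rust
/// Kubernetes PersistentVolume access modes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccessMode {
    ReadWriteOnce,
    ReadOnlyMany,
    ReadWriteMany,
    ReadWriteOncePod,
}

/// True when a volume offering these modes can be read from Pods on
/// different nodes, which the shared large-file volume requires.
pub fn supports_multi_node_readers(modes: &[AccessMode]) -> bool {
    modes
        .iter()
        .any(|m| matches!(m, AccessMode::ReadOnlyMany | AccessMode::ReadWriteMany))
}
```

Under this rule a ReadWriteOnce-only volume, such as one from the local-path provisioner, is rejected for the shared role.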

Empirical basis (see Agent Investigation): each warm Sandbox already gets its own PVC and the pool replenishes with fresh Sandboxes (claimed ones are not recycled). The remaining risk is purely lifecycle — under the default shutdownPolicy: Retain, deleting a claim deletes the Sandbox/Pod but leaves the workspace PVC orphaned with user data (a written marker survived). That is closed by the single-use + explicit-destruction rule above, or avoided entirely by an ephemeral workspace.

Cross-agent writable sharing (e.g. agent A's output reused by agent B as a build/model cache) is a separate, deliberate decision — not the private workspace — and must be scoped so credentials/secrets never land in it.

Single-use still preserves the latency win: a warm Pod is pre-scheduled, image-pulled, supervisor-booted, and workspace-prepared before claim (~4s → ~0.1s measured).

Open decision for this issue: which workspace model is the default for pooled sandboxes — ephemeral, or per-Sandbox PVC destroyed-on-teardown — noting that large shared files live on the separate read-only volume either way.

Scope guardrails

Initially, only operator-declared pools using trusted templates/images are warm-pooled; user-supplied images or arbitrary per-request templates fall back to the cold path until per-tenant pool isolation and cleanup guarantees are designed.

What bakes vs. late-binds

  • Baked into the shared SandboxTemplate: image, mTLS mount, projected SA-token volume, supervisor sideload, capabilities, host aliases, runtimeClass, resources, and any read-only shared data volume (large files / datasets).
  • Per-claim, isolated: the writable per-agent workspace — ephemeral, or a per-Sandbox PVC destroyed on teardown — never the shared data volume.
  • Injected per-claim (annotation only): openshell.io/sandbox-id (per-claim env is rejected on the warm path).
  • Late-bound over the supervisor relay (already works): policy, providers. Identity is established by the existing token exchange, not Pod env.

Phased plan

  1. Settle this design (this issue).
  2. Driver warm path (flagged): create SandboxClaim instead of Sandbox for pooled shapes; gateway RBAC for extensions.agents.x-k8s.io; durable Store claim mapping. Install extensions.yaml in dev/e2e alongside this consumer (not before).
  3. Auth re-anchor: implement the fail-closed chain in k8s_sa.rs + adversarial tests.
  4. Single-use lifecycle + volume model: isolate the writable workspace (ephemeral or per-Sandbox PVC destroyed on teardown); mount shared large-file data read-only; workspace-isolation e2e (claim → write secret to the writable workspace → delete → re-claim → assert clean + workspace not reused, and assert the shared volume is read-only and carries no per-agent data).
  5. Pool management + surface/docs.
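The workspace-isolation e2e in step 4 can be prototyped against a toy in-memory model before it runs on a cluster. The sketch below is that toy only: `Pool` and its methods are hypothetical and stand in for the real SandboxWarmPool and claim lifecycle, with the single-use rule modelled by destroying the workspace on release.

```rust
use std::collections::HashMap;

/// Files written into one writable workspace.
#[derive(Debug, Default)]
pub struct Workspace {
    pub files: HashMap<String, String>,
}

/// Toy warm pool: every sandbox gets a fresh workspace with a unique id.
pub struct Pool {
    next_id: u64,
    size: usize,
    warm: Vec<u64>,                      // ids still pooled
    workspaces: HashMap<u64, Workspace>, // live workspaces, pooled or claimed
}

impl Pool {
    pub fn new(size: usize) -> Self {
        let mut pool = Pool { next_id: 0, size, warm: Vec::new(), workspaces: HashMap::new() };
        pool.replenish();
        pool
    }

    /// Tops the pool up with brand-new sandboxes; claimed ones never return.
    fn replenish(&mut self) {
        while self.warm.len() < self.size {
            let id = self.next_id;
            self.next_id += 1;
            self.workspaces.insert(id, Workspace::default());
            self.warm.push(id);
        }
    }

    /// Binds the oldest warm sandbox, or None if the pool is empty.
    pub fn claim(&mut self) -> Option<u64> {
        if self.warm.is_empty() {
            return None;
        }
        let id = self.warm.remove(0);
        self.replenish();
        Some(id)
    }

    /// Writes a file; returns false if the workspace no longer exists.
    pub fn write(&mut self, id: u64, path: &str, data: &str) -> bool {
        match self.workspaces.get_mut(&id) {
            Some(ws) => {
                ws.files.insert(path.to_string(), data.to_string());
                true
            }
            None => false,
        }
    }

    pub fn read(&self, id: u64, path: &str) -> Option<&str> {
        self.workspaces.get(&id)?.files.get(path).map(String::as_str)
    }

    /// Single-use teardown: the workspace is destroyed, not returned to the pool.
    pub fn release(&mut self, id: u64) {
        self.workspaces.remove(&id);
    }

    pub fn workspace_exists(&self, id: u64) -> bool {
        self.workspaces.contains_key(&id)
    }
}
```

The e2e sequence then maps to: claim, write a marker, release, claim again, and assert that the second claim has a different workspace with no marker in it.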

Alternatives Considered

  • Patch identity onto the claimed Pod after bind (keep the label cross-check): requires granting the gateway patch pods (deliberately denied for immutability) and is racy.
  • Bare-Pod warm pools (if upstream pools created Pods, not Sandbox CRs; see upstream issue #390): would break the ownerReference auth chain. v0.4.6 creates Sandbox CRs.
  • Do nothing: accept cold-start latency. Viable for low-churn usage, poor for interactive agents.

Agent Investigation

Validated on a local k3s (k3d) cluster with agent-sandbox v0.4.6 (core + extensions):

  • Identity: claim binds in ~0.13s; the warm Pod is owned by a controlling Sandbox CR (ownerRef chain intact); the claim-injected openshell.io/sandbox-id annotation lands on the Pod; per-claim env is rejected on the warm path. The bound Sandbox carries agents.x-k8s.io/claim-uid + a controlling SandboxClaim ownerRef. The current validate_sandbox_owner_reference() fails closed against warm Sandboxes (they lack openshell.ai/sandbox-id), so there is no exploit in the install-only PR.
  • Workspace: SandboxTemplate.volumeClaimTemplates → each warm Sandbox gets its own Bound PVC (2 warm pods → 2 distinct PVCs, each owned by its Sandbox). Claiming replenished the pool with a new Sandbox + new PVC (claimed one not recycled). Deleting the claim deleted the Sandbox but left the workspace PVC Bound holding a written TENANT-A-SECRET marker — shutdownPolicy default is Retain, confirming the orphaned-workspace data risk.
  • Baseline: the cold path is unaffected (sandbox create → Ready, IssueSandboxToken → minted JWT, echo executed inside the sandbox over the supervisor relay).

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request

Labels: state:triage-needed (opened without agent diagnostics and needs triage)
