Warm-pooled sandboxes for the Kubernetes compute driver #1879

## Problem Statement

Creating a Kubernetes sandbox is a cold start: the gateway creates a `Sandbox` CR, the agent-sandbox controller schedules a Pod, the image is pulled (or read from cache), the supervisor boots, and only then is the sandbox `Ready`. Measured locally, this takes roughly 4s or more even with the image preloaded. For interactive agent workloads and high-churn "fresh sandbox per task" usage, that latency dominates time-to-first-action.

We want near-instant Kubernetes sandbox provisioning via a **warm pool** of pre-provisioned, ready Pods.

This issue scopes the design (moved here from PR #1813 per the issue-first RFC process). It supersedes the draft RFC and incorporates the security review on that PR.

## Proposed Design

Adopt the upstream agent-sandbox **warm-pool extension CRDs** — `SandboxTemplate`, `SandboxWarmPool`, `SandboxClaim` (`extensions.agents.x-k8s.io/v1alpha1`) — already shipped in the `v0.4.6` release OpenShell pins for the core `Sandbox` CRD. The gateway pre-declares **operator-owned** warm pools; on `CreateSandbox`, when the requested shape matches a pool, the Kubernetes driver creates a `SandboxClaim` that binds a pre-warmed Pod in ~0.1s. Non-matching requests fall back to the existing cold path.
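The match-or-fall-back decision described above can be sketched as a pure function. This is a minimal illustration, not the driver's code: `Shape`, `WarmPool`, `Provision`, and `select_path` are hypothetical names, and which fields make up a "shape" is an assumption (only image and runtime class are compared here).

```rust
/// Hypothetical request shape. Which fields a real pool match compares is an
/// assumption; this sketch uses image and runtime class only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Shape {
    pub image: String,
    pub runtime_class: Option<String>,
}

/// Hypothetical operator-declared pool: a name plus the shape its template bakes in.
#[derive(Debug, Clone)]
pub struct WarmPool {
    pub name: String,
    pub shape: Shape,
}

/// Which provisioning path `CreateSandbox` takes.
#[derive(Debug, PartialEq, Eq)]
pub enum Provision {
    /// Create a `SandboxClaim` against the named pool.
    Warm { pool: String },
    /// Create a `Sandbox` CR directly (the existing cold path).
    Cold,
}

/// Returns the first pool whose shape equals the request, else the cold path.
pub fn select_path(request: &Shape, pools: &[WarmPool]) -> Provision {
    pools
        .iter()
        .find(|p| p.shape == *request)
        .map(|p| Provision::Warm { pool: p.name.clone() })
        .unwrap_or(Provision::Cold)
}
```

Exact equality keeps the fallback conservative: any request that differs from every declared pool, even slightly, takes the cold path rather than binding a Pod with the wrong shape.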

### Identity re-anchor (security-critical)

Today `validate_sandbox_owner_reference()` (`crates/openshell-server/src/auth/k8s_sa.rs`) cross-checks the owning `Sandbox` CR label `openshell.ai/sandbox-id` against the Pod annotation `openshell.io/sandbox-id`. Warm Sandboxes are created generically by the pool controller and carry `agents.x-k8s.io/claim-uid` + a controlling `SandboxClaim` ownerReference instead, so identity must re-anchor to the **gateway-created `SandboxClaim`**. The `IssueSandboxToken` warm path must enforce a strict fail-closed chain, with any mismatch rejecting:

- TokenReview audience/SA/pod-name/pod-UID match the live Pod.
- Pod has exactly one controlling `Sandbox` ownerReference (matching UID).
- That `Sandbox` has exactly one controlling `SandboxClaim` ownerReference (matching name + UID); `agents.x-k8s.io/claim-uid` agrees.
- The live `SandboxClaim` exists, has the expected UID, is bound, and `status.sandbox.name` equals the owning `Sandbox`.
- The gateway **Store** has a durable, gateway-created record for `(namespace, claim name, claim UID)` containing the expected sandbox ID, and it equals the Pod's `openshell.io/sandbox-id` annotation.
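The chain above can be sketched as a single function that returns the sandbox ID only if every link holds. This is a hedged sketch under stated assumptions, not the contents of `k8s_sa.rs`: all type and function names are hypothetical, the caller is assumed to have already fetched the live objects, and the TokenReview audience and service-account checks are omitted for brevity.

```rust
/// One entry of `metadata.ownerReferences`, reduced to the fields the chain needs.
#[derive(Debug, Clone)]
pub struct OwnerRef {
    pub kind: String,
    pub name: String,
    pub uid: String,
    pub controller: bool,
}

// Hypothetical, already-fetched views of the live objects.
pub struct TokenView { pub pod_name: String, pub pod_uid: String }
pub struct PodView { pub name: String, pub uid: String, pub sandbox_id: Option<String>, pub owners: Vec<OwnerRef> }
pub struct SandboxView { pub name: String, pub uid: String, pub claim_uid: Option<String>, pub owners: Vec<OwnerRef> }
pub struct ClaimView { pub namespace: String, pub name: String, pub uid: String, pub bound: bool, pub status_sandbox_name: Option<String> }
pub struct StoreRecord { pub namespace: String, pub claim_name: String, pub claim_uid: String, pub sandbox_id: String }

/// Which link of the chain failed.
#[derive(Debug, PartialEq, Eq)]
pub enum Reject { Token, PodOwner, SandboxOwner, Claim, Store }

/// Returns the controlling owner of `kind` only if there is exactly one.
fn sole_controller<'a>(owners: &'a [OwnerRef], kind: &str) -> Option<&'a OwnerRef> {
    let mut matches = owners.iter().filter(|o| o.controller && o.kind == kind);
    match (matches.next(), matches.next()) {
        (Some(only), None) => Some(only),
        _ => None,
    }
}

/// Fail-closed: every mismatch is an `Err`; there is no partial success.
pub fn verify_warm_chain(
    token: &TokenView,
    pod: &PodView,
    sandbox: &SandboxView,
    claim: &ClaimView,
    record: Option<&StoreRecord>,
) -> Result<String, Reject> {
    // 1. The token's pod identity matches the live Pod.
    if token.pod_name != pod.name || token.pod_uid != pod.uid {
        return Err(Reject::Token);
    }
    // 2. Pod -> exactly one controlling Sandbox owner, matching UID.
    let s_ref = sole_controller(&pod.owners, "Sandbox").ok_or(Reject::PodOwner)?;
    if s_ref.uid != sandbox.uid {
        return Err(Reject::PodOwner);
    }
    // 3. Sandbox -> exactly one controlling SandboxClaim owner; claim-uid agrees.
    let c_ref = sole_controller(&sandbox.owners, "SandboxClaim").ok_or(Reject::SandboxOwner)?;
    if c_ref.name != claim.name
        || c_ref.uid != claim.uid
        || sandbox.claim_uid.as_deref() != Some(claim.uid.as_str())
    {
        return Err(Reject::SandboxOwner);
    }
    // 4. The live claim is bound to this Sandbox.
    if !claim.bound || claim.status_sandbox_name.as_deref() != Some(sandbox.name.as_str()) {
        return Err(Reject::Claim);
    }
    // 5. A durable Store record exists for (namespace, claim name, claim UID)
    //    and its sandbox ID equals the Pod annotation.
    let rec = record.ok_or(Reject::Store)?;
    if rec.namespace != claim.namespace || rec.claim_name != claim.name || rec.claim_uid != claim.uid {
        return Err(Reject::Store);
    }
    if pod.sandbox_id.as_deref() != Some(rec.sandbox_id.as_str()) {
        return Err(Reject::Store);
    }
    Ok(rec.sandbox_id.clone())
}
```

Returning the sandbox ID from the Store record, rather than from the Pod annotation, keeps the gateway-created record as the source of truth; the annotation is only ever compared against it.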

The mapping must live in the **shared gateway Store** (HA-safe; any replica may handle bootstrap), and claim creation must be ordered before a Pod can bootstrap. Users must **not** be able to set reserved metadata (`openshell.io/*`, `agents.x-k8s.io/*`, identity/SPIFFE keys, `SandboxClaim.spec.additionalPodMetadata`, `spec.env`).
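A request-validation step for the reserved-metadata rule might look like the following. It is a sketch only: the two prefixes come from the text above, the function names are hypothetical, and the identity/SPIFFE keys are left out because their exact names are not given here.

```rust
/// Prefixes named in this proposal. A real list would also cover the
/// identity/SPIFFE keys, whose exact names are not specified here.
const RESERVED_PREFIXES: [&str; 2] = ["openshell.io/", "agents.x-k8s.io/"];

/// True if a user-supplied label or annotation key falls in a reserved namespace.
pub fn is_reserved(key: &str) -> bool {
    RESERVED_PREFIXES.iter().any(|p| key.starts_with(p))
}

/// Rejects the whole request if any key is reserved, reporting the offenders.
pub fn reject_reserved<'a, I>(keys: I) -> Result<(), Vec<String>>
where
    I: IntoIterator<Item = &'a str>,
{
    let bad: Vec<String> = keys
        .into_iter()
        .filter(|k| is_reserved(k))
        .map(str::to_string)
        .collect();
    if bad.is_empty() { Ok(()) } else { Err(bad) }
}
```

Rejecting the request outright, instead of silently dropping reserved keys, matches the fail-closed stance of the token path.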

### Workspace isolation: separate the private workspace from shared data

The data-isolation concern is about the **writable, per-agent workspace** — not the large files sandboxes legitimately read. Conflating the two is the trap. The design uses **two volumes with opposite requirements**:

**1. Writable per-agent workspace (`/sandbox`)** — agent-private state (code, credentials, gateway JWTs, intermediate files). Must never be visible to another agent. Isolation options:

- **Ephemeral** (e.g. `emptyDir`) when reschedule-survival is not required — fail-safe: the kubelet deletes it when the Pod is gone, so there is no object to orphan and no dependence on cleanup being correct.
- **Per-Sandbox PVC, single-use, destroyed on teardown** when the workspace must survive pod rescheduling for that agent — this requires reliable destruction on *every* teardown path and a `reclaimPolicy` that actually wipes backing storage (`Delete`, not `Retain`).
- Either way: a warm Pod is seeded **pristine from the image** and never runs user code while pooled; a claimed `Sandbox`/Pod/(PVC) is **single-use** and never returned to the pool.

**2. Shared large-file volume (datasets, models, caches)** — mounted **read-only** into every warm + claimed sandbox via the `SandboxTemplate`, so big files are available on the filesystem without streaming them through the relay. Sharing here is intended and safe **because it is read-only and holds no per-agent secrets** — there is nothing private to leak. It is long-lived and untouched by pooling/claiming (not per-agent state), and being pre-attached, a claimed sandbox reads it instantly.

- Read-sharing across Pods on different nodes requires an access mode / provisioner that supports it (`ReadOnlyMany`/`ReadWriteMany`: NFS, CephFS, EFS, Filestore, or a FUSE/object-store CSI). The k3d `local-path` provisioner (RWO, single-node) does **not** support multi-reader sharing.

**Empirical basis (see Agent Investigation):** each warm `Sandbox` already gets its **own** PVC and the pool replenishes with **fresh** Sandboxes (claimed ones are not recycled). The remaining risk is purely lifecycle — under the default `shutdownPolicy: Retain`, deleting a claim deletes the `Sandbox`/Pod but **leaves the workspace PVC orphaned with user data** (a written marker survived). That is closed by the single-use + explicit-destruction rule above, or avoided entirely by an ephemeral workspace.
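The orphaned-workspace condition described above can be stated as a simple check, useful in an e2e assertion or a cleanup sweep. This is an illustrative sketch with hypothetical names (`WorkspacePvc`, `orphaned_workspaces`); it assumes each workspace PVC records the UID of the `Sandbox` that owns it and that the caller supplies the set of live `Sandbox` UIDs.

```rust
use std::collections::HashSet;

/// Hypothetical view of a workspace PVC and the Sandbox that owned it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WorkspacePvc {
    pub name: String,
    pub owner_sandbox_uid: String,
}

/// Returns the PVCs whose owning Sandbox no longer exists. Under the
/// single-use rule this list should always be empty after teardown.
pub fn orphaned_workspaces<'a>(
    pvcs: &'a [WorkspacePvc],
    live_sandbox_uids: &HashSet<String>,
) -> Vec<&'a WorkspacePvc> {
    pvcs.iter()
        .filter(|p| !live_sandbox_uids.contains(&p.owner_sandbox_uid))
        .collect()
}
```

With an ephemeral workspace there are no PVCs to feed into this check, which is the sense in which that option is fail-safe.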

> Cross-agent *writable* sharing (e.g. agent A's output reused by agent B as a build/model cache) is a separate, deliberate decision — not the private workspace — and must be scoped so credentials/secrets never land in it.

Single-use still preserves the latency win: a warm Pod is pre-scheduled, image-pulled, supervisor-booted, and workspace-prepared before claim (~4s → ~0.1s measured).

**Open decision for this issue:** which workspace model is the default for pooled sandboxes — ephemeral, or per-Sandbox PVC destroyed-on-teardown — noting that large shared files live on the separate read-only volume either way.

### Scope guardrails

Initially, only **operator-declared pools using trusted templates/images** are warm-pooled; user-supplied images or arbitrary per-request templates fall back to the cold path until per-tenant pool isolation and cleanup guarantees are designed.

### What bakes vs. late-binds

- **Baked into the shared `SandboxTemplate`:** image, mTLS mount, projected SA-token volume, supervisor sideload, capabilities, host aliases, runtimeClass, resources, and any **read-only shared data volume** (large files / datasets).
- **Per-claim, isolated:** the writable per-agent workspace — ephemeral, or a per-Sandbox PVC destroyed on teardown — never the shared data volume.
- **Injected per-claim (annotation only):** `openshell.io/sandbox-id` (per-claim `env` is rejected on the warm path).
- **Late-bound over the supervisor relay (already works):** policy, providers. Identity is established by the existing token exchange, not Pod env.

### Phased plan

1. **Settle this design** (this issue).
2. **Driver warm path (flagged):** create `SandboxClaim` instead of `Sandbox` for pooled shapes; gateway RBAC for `extensions.agents.x-k8s.io`; durable Store claim mapping. Install `extensions.yaml` in dev/e2e **alongside this consumer** (not before).
3. **Auth re-anchor:** implement the fail-closed chain in `k8s_sa.rs` + adversarial tests.
4. **Single-use lifecycle + volume model:** isolate the writable workspace (ephemeral or per-Sandbox PVC destroyed on teardown); mount shared large-file data **read-only**; workspace-isolation e2e (claim → write secret to the writable workspace → delete → re-claim → assert clean + workspace not reused, and assert the shared volume is read-only and carries no per-agent data).
5. **Pool management + surface/docs.**

## Alternatives Considered

- **Patch identity onto the claimed Pod after bind** (keep the label cross-check): requires granting the gateway `patch pods` (deliberately denied for immutability) and is racy.
- **Bare-Pod warm pools** (if upstream pools created Pods, not `Sandbox` CRs — see upstream issue #390): would break the ownerReference auth chain. v0.4.6 creates `Sandbox` CRs.
- **Do nothing:** accept cold-start latency. Viable for low-churn usage, poor for interactive agents.

## Agent Investigation

Validated on a local k3s (k3d) cluster with agent-sandbox `v0.4.6` (core + extensions):

- **Identity:** claim binds in ~0.13s; the warm Pod is owned by a controlling `Sandbox` CR (ownerRef chain intact); the claim-injected `openshell.io/sandbox-id` annotation lands on the Pod; per-claim `env` is rejected on the warm path. The bound `Sandbox` carries `agents.x-k8s.io/claim-uid` + a controlling `SandboxClaim` ownerRef. The current `validate_sandbox_owner_reference()` **fails closed** against warm Sandboxes (they lack `openshell.ai/sandbox-id`), so there is no exploit in the install-only PR.
- **Workspace:** `SandboxTemplate.volumeClaimTemplates` → each warm `Sandbox` gets its **own** `Bound` PVC (2 warm pods → 2 distinct PVCs, each owned by its `Sandbox`). Claiming replenished the pool with a **new** `Sandbox` + **new** PVC (claimed one not recycled). Deleting the claim deleted the `Sandbox` but **left the workspace PVC `Bound`** holding a written `TENANT-A-SECRET` marker — `shutdownPolicy` default is `Retain`, confirming the orphaned-workspace data risk.
- **Baseline:** the cold path is unaffected — `sandbox create` → `Ready`, `IssueSandboxToken` → minted JWT, `echo` executed inside the sandbox over the supervisor relay.

## Checklist

- [x] I've reviewed existing issues and the architecture docs
- [x] This is a design proposal, not a "please build this" request


