fix(execution): size the wasm runner V8 heap so warmup stops OOMing#129

Merged
NathanFlurry merged 1 commit into main from fix/wasm-runner-heap-oom on Jun 25, 2026

NathanFlurry (Member) commented Jun 25, 2026

Problem

WASM command warmup fails with the opaque ERR_AGENTOS_NODE_SYNC_RPC: WebAssembly warmup exited with status 1 (Error: null). Root cause:

  • The wasm runner isolate is started with JavascriptExecutionLimits::default() (crates/execution/src/wasm.rs), so its V8 heap falls back to isolate::DEFAULT_HEAP_LIMIT_MB = 128 MiB — the per-guest-isolate budget (mirrors Cloudflare Workers).
  • But the runner is trusted infrastructure that must compile the WASI runtime + the guest's wasm module (e.g. bash.wasm) into its own heap before the guest runs. That routinely exceeds 128 MiB.
  • The near_heap_limit_callback then terminates the isolate with an uncatchable, message-less exception → warmup dies → surfaces as status 1 (Error: null). Sidecar log shows warn "bounded limit exhausted" limit=v8_heap_bytes observed=131072000 capacity=131072000 fill_percent=100. A clean release hits this too.

Fix

Size the runner heap explicitly — default 2048 MiB, operator-tunable via AGENTOS_WASM_RUNNER_HEAP_LIMIT_MB — instead of leaving it on the per-guest default:

limits: JavascriptExecutionLimits {
    v8_heap_limit_mb: Some(wasm_runner_heap_limit_mb(request)),
    ..JavascriptExecutionLimits::default()
},

This does not weaken guest isolation. The guest module's memory/fuel/stack stay bounded separately, Rust-side, from request.limits (AGENTOS_WASM_MAX_MEMORY_BYTES etc.). The value is a V8 heap ceiling (heap_limits(0, cap)), committed only as used.
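
The call-site change above relies on Rust's struct update syntax: only `v8_heap_limit_mb` is overridden, and every other field keeps its default value. A minimal sketch of that pattern follows. It uses a hypothetical stand-in struct, because this PR shows only the `v8_heap_limit_mb` field; the `u64` type and the `cpu_time_limit_ms` field are assumptions made for illustration, and `runner_limits` is an invented helper name.

```rust
/// Hypothetical stand-in for the real `JavascriptExecutionLimits`.
/// Only `v8_heap_limit_mb` appears in the PR; its `u64` type and the
/// `cpu_time_limit_ms` field are assumptions for this sketch.
#[derive(Debug, Clone, PartialEq, Default)]
struct JavascriptExecutionLimits {
    v8_heap_limit_mb: Option<u64>,
    cpu_time_limit_ms: Option<u64>,
}

/// Builds limits with an explicit heap ceiling and leaves every other
/// field at whatever `Default` provides (here: `None`).
fn runner_limits(heap_limit_mb: u64) -> JavascriptExecutionLimits {
    JavascriptExecutionLimits {
        v8_heap_limit_mb: Some(heap_limit_mb),
        ..JavascriptExecutionLimits::default()
    }
}
```

With this shape, `runner_limits(2048)` sets the heap field to `Some(2048)` while `cpu_time_limit_ms` stays `None`, so a field added to the struct later picks up its default without touching the call site.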

Test

wasm_runner_heap_limit_defaults_and_honors_operator_override covers the bounded default (asserts > 128), a positive operator override, and zero/non-numeric fallback to default. (A live OOM-repro needs a >128 MiB module + wasm artifacts, so it's not a deterministic unit test; the resolver + call-site change are covered.)
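
The three cases the test describes (bounded default, positive override, zero or non-numeric fallback) can be sketched as a small resolver. This is not the PR's `wasm_runner_heap_limit_mb(request)`: the sketch takes the raw value of `AGENTOS_WASM_RUNNER_HEAP_LIMIT_MB` as an `Option<&str>` because the PR does not show where the real function reads it from. The function name `resolve_runner_heap_limit_mb` and the `u64` type are assumptions; the constant name comes from the follow-up commit and the value 2048 from the PR text.

```rust
/// Default runner heap ceiling in MiB (2048 per the PR description).
const DEFAULT_WASM_RUNNER_HEAP_LIMIT_MB: u64 = 2048;

/// Hypothetical resolver. `raw` is the value of
/// `AGENTOS_WASM_RUNNER_HEAP_LIMIT_MB` if the operator set it.
/// Unset, non-numeric, and zero values all fall back to the default.
fn resolve_runner_heap_limit_mb(raw: Option<&str>) -> u64 {
    raw.and_then(|value| value.parse::<u64>().ok()) // non-numeric -> None
        .filter(|&mb| mb > 0) // zero is treated as "not configured"
        .unwrap_or(DEFAULT_WASM_RUNNER_HEAP_LIMIT_MB)
}
```

Falling back to the default on a bad value, rather than returning an error, matches the behavior the test name describes: a mistyped override should not leave the runner back on the 128 MiB per-guest budget.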

@NathanFlurry NathanFlurry force-pushed the fix/wasm-runner-heap-oom branch from 9a3cf86 to 28c6540 Compare June 25, 2026 19:37
The wasm runner isolate is started with JavascriptExecutionLimits::default(), so
its V8 heap falls back to isolate::DEFAULT_HEAP_LIMIT_MB (128 MiB -- the per-GUEST
isolate budget). But the runner is trusted infrastructure that must compile the
WASI runtime + the guest's wasm module (e.g. bash.wasm) into its own heap before
the guest runs, and that routinely exceeds 128 MiB. The near-heap-limit guard then
terminates the isolate with an uncatchable, message-less exception, so warmup dies
and surfaces as the opaque 'WebAssembly warmup exited with status 1 (Error: null)'
(ERR_AGENTOS_NODE_SYNC_RPC). A clean release hits this too.

Size the runner heap explicitly (default 2048 MiB, operator-tunable via
AGENTOS_WASM_RUNNER_HEAP_LIMIT_MB) instead of leaving it on the per-guest default.
This does NOT weaken guest isolation: the guest module's memory/fuel/stack stay
bounded separately, Rust-side, from request.limits. The value is a V8 heap ceiling
(heap_limits(0, cap)), committed only as used.

Unit test wasm_runner_heap_limit_defaults_and_honors_operator_override covers the
default (> 128), a positive override, and zero/non-numeric fallback.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NathanFlurry NathanFlurry force-pushed the fix/wasm-runner-heap-oom branch from 28c6540 to 892cb8e Compare June 25, 2026 20:26
railway-app Bot commented Jun 25, 2026

🚅 Deployed to the secure-exec-pr-129 environment in rivet-frontend

Service Status Web Updated (UTC)
secure-exec 😴 Sleeping (View Logs) Jun 25, 2026 at 8:33 pm

🚅 Deployed to the secure-exec-pr-129 environment in secure-exec

Service Status Web Updated (UTC)
secure-exec 😴 Sleeping (View Logs) Web Jun 25, 2026 at 8:32 pm

@NathanFlurry NathanFlurry merged commit 02564e2 into main Jun 25, 2026
2 of 3 checks passed
NathanFlurry added a commit that referenced this pull request Jun 26, 2026
…sponses, fix service-test build (#133)

Fixes surfaced while syncing agent-os against latest secure-exec main:

1. limits: classify DEFAULT_WASM_RUNNER_HEAP_LIMIT_MB (#129) and MAX_TIMER_DELAY_MS
   (#131) — both added without inventory entries, so limits_audit failed on main.
2. sidecar: accept_sidecar_response drops a stale sidecar_response with no matching
   pending request (UnmatchedResponse) or already completed (DuplicateResponse)
   instead of failing the whole sidecar — a per-VM callback can be answered by the
   host after that VM is disposed on the shared sidecar process. Real protocol
   violations stay fatal.
3. tests: re-export crate::EventSinkTransport into the source-included service test
   crate (#132 added the use in src/service.rs without the matching test re-export,
   breaking 'cargo test -p secure-exec-sidecar --test service' compilation).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
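
Item 2 of the commit message above describes dropping stale responses instead of failing the whole sidecar. A minimal sketch of that classification follows. Only `accept_sidecar_response`, `UnmatchedResponse`, and `DuplicateResponse` are names from the commit message; the `Sidecar` struct, the request-state map, the `Outcome` and `ProtocolError` types, and the use of an empty payload as the example of a real protocol violation are all assumptions made for illustration.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum RequestState {
    Pending,
    Completed,
}

/// Why a response was dropped (variant names from the commit message).
#[derive(Debug, PartialEq)]
enum DropReason {
    UnmatchedResponse,
    DuplicateResponse,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Delivered,
    Dropped(DropReason),
}

/// Hypothetical fatal error; the real set of violations is not shown.
#[derive(Debug, PartialEq)]
enum ProtocolError {
    MalformedResponse,
}

/// Hypothetical sidecar state: request id -> state.
struct Sidecar {
    requests: HashMap<u64, RequestState>,
}

impl Sidecar {
    /// Creates a sidecar with the given request ids marked as pending.
    fn with_pending(ids: &[u64]) -> Self {
        let requests = ids.iter().map(|&id| (id, RequestState::Pending)).collect();
        Sidecar { requests }
    }

    /// Stale responses are dropped; only real violations return `Err`.
    fn accept_sidecar_response(
        &mut self,
        request_id: u64,
        payload: &str,
    ) -> Result<Outcome, ProtocolError> {
        // Assumed stand-in for a real protocol violation: stays fatal.
        if payload.is_empty() {
            return Err(ProtocolError::MalformedResponse);
        }
        match self.requests.get_mut(&request_id) {
            // No such request, e.g. the VM was already disposed.
            None => Ok(Outcome::Dropped(DropReason::UnmatchedResponse)),
            // The request was already answered once.
            Some(RequestState::Completed) => {
                Ok(Outcome::Dropped(DropReason::DuplicateResponse))
            }
            Some(state) => {
                *state = RequestState::Completed;
                Ok(Outcome::Delivered)
            }
        }
    }
}
```

Returning `Ok(Outcome::Dropped(..))` for the stale cases keeps them visible to the caller for logging while reserving `Err` for conditions that should still tear down the shared sidecar process.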