fix(sidecar): shadow-walk skip + bound undici pool (net-bridge listener leak)#128

Merged
NathanFlurry merged 1 commit into main from fix/sidecar-fs-net-perf
Jun 25, 2026

@NathanFlurry NathanFlurry commented Jun 25, 2026

Three native-sidecar fixes behind the long-lived-VM latency/stall reports.

1. Read-side fs ops re-walk the entire shadow tree

Every read-side guest fs op (Exists/Stat/Lstat/ReadFile/Pread) reconciles the host shadow tree into the kernel VFS, re-reading and re-writing every file on every op (O(whole tree), super-linear). Fix: an rsync-style (size, mode, mtime) lstat skip for unchanged files (self-correcting, no cache). Test read_side_ops_skip_unchanged_shadow_files: a warm read over an unchanged 800-file tree is ≥4× cheaper than a cold one (cold = 22.7 s, warm = 28 ms, debug build).
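As a rough illustration of the skip described above, here is a minimal sketch. It assumes the reconcile step can see an lstat-style (size, mode, mtime) triple for both the host file and the copy already in the VFS. All names (Fingerprint, HostFile, isUnchanged, reconcile) are hypothetical and do not come from the sidecar's source, which this PR page does not show.

```typescript
// Hypothetical types and names; the sidecar's real code is not shown in this PR.
interface Fingerprint {
  size: number;
  mode: number;
  mtimeMs: number;
}

interface HostFile {
  path: string;
  fp: Fingerprint; // assumed to come from an lstat of the host shadow file
}

// True when the VFS already holds a copy with the same (size, mode, mtime).
function isUnchanged(host: Fingerprint, vfs: Fingerprint | undefined): boolean {
  return (
    vfs !== undefined &&
    host.size === vfs.size &&
    host.mode === vfs.mode &&
    host.mtimeMs === vfs.mtimeMs
  );
}

// One reconcile pass: copy only files whose fingerprint differs.
// Returns how many files were actually copied.
function reconcile(
  hostFiles: HostFile[],
  vfs: Record<string, Fingerprint | undefined>,
  copyFile: (path: string) => void,
): number {
  let copied = 0;
  for (const file of hostFiles) {
    if (isUnchanged(file.fp, vfs[file.path])) continue; // skip: nothing to re-read
    copyFile(file.path);
    vfs[file.path] = { size: file.fp.size, mode: file.fp.mode, mtimeMs: file.fp.mtimeMs };
    copied++;
  }
  return copied;
}
```

In this sketch the comparison runs against the VFS's own metadata rather than a side cache, so a second pass over an unchanged tree copies nothing, and a file whose size, mode, or mtime changes is picked up again on the next pass.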

2. EventEmitter shim fires a spurious MaxListenersExceededWarning

ensureEventEmitterInitialized() defaulted _maxListenersWarned but not _maxListeners, so emitters that acquire _events outside our constructor (undici's Client/Pool/Agent) had _maxListeners === undefined. Since total <= undefined is false, the shim warned on the first listener (count "1"). Fix: default _maxListeners.
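The comparison bug can be reproduced in a few lines. This is a simplified sketch, not the shim's actual code: the field names follow the description above, but ShimEmitter, ensureInitialized, shouldWarn, and the applyFix flag are hypothetical, and the default of 10 is an assumption borrowed from Node's own EventEmitter default.

```typescript
// Hypothetical, simplified shim state; field names follow the PR description.
interface ShimEmitter {
  _events?: Record<string, Array<() => void>>;
  _maxListeners?: number;
  _maxListenersWarned?: boolean;
}

// Assumption: mirror Node's default limit of 10 listeners per event.
const DEFAULT_MAX_LISTENERS = 10;

// Lazily initialise an emitter that never ran the shim's constructor.
// `applyFix` toggles between the old and the fixed behaviour.
function ensureInitialized(e: ShimEmitter, applyFix: boolean): void {
  if (e._events === undefined) e._events = {};
  if (e._maxListenersWarned === undefined) e._maxListenersWarned = false;
  if (applyFix && e._maxListeners === undefined) {
    e._maxListeners = DEFAULT_MAX_LISTENERS; // the fix: give the threshold a value
  }
}

// True when adding a listener (bringing the count to `total`) should warn.
function shouldWarn(e: ShimEmitter, total: number): boolean {
  // With _maxListeners undefined, `total <= undefined` is false for every total.
  const withinLimit = total <= (e._maxListeners as number);
  return !withinLimit && e._maxListenersWarned !== true;
}
```

Without the default, shouldWarn reports a warning for a listener count of 1. With it, the warning only appears once the count exceeds the threshold.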

3. undici client-per-request leak — the real net-bridge leak

The bridge's UndiciAgent was created with an unbounded per-origin pool. Requests that overlap while clients are still connecting each find every client kNeedDrain and spawn a fresh Client+socket — and the HTTPS LLM path is HTTP/2 (ALPN), so each spawn is a whole new h2 session. The synchronous bridge reads widen that connect window (the #122 per-payload macrotask yield only helps h1 same-socket reuse, nothing for the h2 connect-window herd). Over a long many-call turn the abandoned clients accumulate connect/close/drain/error/finish/readable/end/terminated listeners without bound → http2 degradation → "Request was aborted."

Fix: bound connections (6, browser-like; HTTP/2 multiplexes within each) so excess requests queue on existing clients instead of spawning new ones.
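To make the queue-versus-spawn difference concrete, here is a toy model of pool dispatch during the connect window. It is not undici's implementation: ToyPool, dispatchDuringConnectWindow, and herd are hypothetical names, and the model assumes every existing client is still connecting (so none can accept another request) for the whole burst.

```typescript
// Toy model only; it does not reproduce undici's dispatch logic.
class ToyPool {
  clients = 0; // clients (and sockets) created so far
  queued = 0; // requests left waiting on an existing client
  maxConnections: number | null; // null models an unbounded pool

  constructor(maxConnections: number | null) {
    this.maxConnections = maxConnections;
  }

  // Dispatch one request while every existing client is still connecting.
  dispatchDuringConnectWindow(): void {
    const mayGrow =
      this.maxConnections === null || this.clients < this.maxConnections;
    if (mayGrow) {
      this.clients++; // spawn a fresh client + socket
    } else {
      this.queued++; // bounded: wait for an existing client instead
    }
  }
}

// Send `overlapping` requests that all land inside the connect window.
function herd(pool: ToyPool, overlapping: number): ToyPool {
  for (let i = 0; i < overlapping; i++) pool.dispatchDuringConnectWindow();
  return pool;
}

const unbounded = herd(new ToyPool(null), 20);
const bounded = herd(new ToyPool(6), 20);
console.log(unbounded.clients, unbounded.queued); // 20 0
console.log(bounded.clients, bounded.queued); // 6 14
```

The PR page does not show the exact option that was changed. In undici, the documented setting for this is the connections option on Pool (and on Agent, which forwards it to its pools); its default of null means no limit.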

Note: fix 2 alone (shipped in an earlier rev of this PR) only silenced the spurious warning; the deeper accumulation in fix 3 is the actual stall, found in review when a strengthened test surfaced it.

Test

keepalive_no_listener_leak now drives concurrent requests (the trigger) and asserts that the host-side connection count stays bounded: 160 requests go over 6 connections with the cap, versus 16 connections (≈2× concurrency) without it. A guest process.on("warning") handler surfaces any MaxListenersExceededWarning to stderr. The warning alone is insufficient: leaked listeners spread across per-client emitters, so no single emitter need cross the threshold while sockets grow.
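The arithmetic behind "the warning alone is insufficient" can be sketched as follows. The event names are the eight listed in the description above; the per-event limit of 10 is Node's default, and the client count and one-listener-per-event figure are illustrative assumptions, not measurements from this PR. leakSummary is a hypothetical helper.

```typescript
// Event names taken from the PR description.
const LEAKED_EVENTS = [
  "connect", "close", "drain", "error",
  "finish", "readable", "end", "terminated",
];

// Assumption: Node's default limit, applied per emitter and per event name.
const WARN_THRESHOLD = 10;

interface LeakSummary {
  totalListeners: number; // listeners retained across all abandoned clients
  anyEmitterWarns: boolean; // would any single emitter trip the warning?
}

// Each abandoned client is its own emitter holding `perEvent` listeners
// for each of the events above.
function leakSummary(abandonedClients: number, perEvent: number): LeakSummary {
  return {
    totalListeners: abandonedClients * LEAKED_EVENTS.length * perEvent,
    anyEmitterWarns: perEvent > WARN_THRESHOLD,
  };
}

const summary = leakSummary(16, 1);
console.log(summary.totalListeners, summary.anyEmitterWarns); // 128 false
```

Because the limit is checked per emitter, the total can keep growing with the number of abandoned clients while every individual emitter stays under the threshold, which is why the test asserts on the connection count directly.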

agent-os client-side companion (recursive mkdir): rivet-dev/agentos#1532. Separate sidecar fix (wasm runner heap OOM): #129.

🤖 Generated with Claude Code

@railway-app railway-app Bot temporarily deployed to secure-exec / secure-exec-pr-128 June 25, 2026 12:17 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / secure-exec-pr-128 June 25, 2026 12:17 Destroyed
@NathanFlurry NathanFlurry force-pushed the fix/sidecar-fs-net-perf branch from 3b400b8 to 119415e Compare June 25, 2026 12:38
@railway-app railway-app Bot temporarily deployed to secure-exec / secure-exec-pr-128 June 25, 2026 12:39 Destroyed
@railway-app railway-app Bot temporarily deployed to rivet-frontend / secure-exec-pr-128 June 25, 2026 12:39 Destroyed
@NathanFlurry NathanFlurry changed the title fix(sidecar): eliminate per-read shadow-tree re-walk (+ net-bridge listener leak) fix(sidecar): eliminate per-read shadow-tree re-walk (+ net-bridge listener-leak regression guard) Jun 25, 2026

railway-app Bot commented Jun 25, 2026

🚅 Deployed to the secure-exec-pr-128 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| secure-exec | ✅ Success (View Logs) | | Jun 25, 2026 at 8:09 pm |

🚅 Deployed to the secure-exec-pr-128 environment in secure-exec

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| secure-exec | 😴 Sleeping (View Logs) | Web | Jun 25, 2026 at 8:15 pm |

@NathanFlurry NathanFlurry force-pushed the fix/sidecar-fs-net-perf branch from 119415e to 20ec8bc Compare June 25, 2026 18:59
@railway-app railway-app Bot temporarily deployed to rivet-frontend / secure-exec-pr-128 June 25, 2026 18:59 Destroyed
@railway-app railway-app Bot temporarily deployed to secure-exec / secure-exec-pr-128 June 25, 2026 18:59 Destroyed
@NathanFlurry NathanFlurry changed the title fix(sidecar): eliminate per-read shadow-tree re-walk (+ net-bridge listener-leak regression guard) fix(sidecar): shadow-walk skip + EventEmitter-shim spurious listener-leak warning Jun 25, 2026
@NathanFlurry NathanFlurry force-pushed the fix/sidecar-fs-net-perf branch from 20ec8bc to e2185d6 Compare June 25, 2026 20:09
@railway-app railway-app Bot temporarily deployed to rivet-frontend / secure-exec-pr-128 June 25, 2026 20:09 Destroyed
@railway-app railway-app Bot temporarily deployed to secure-exec / secure-exec-pr-128 June 25, 2026 20:09 Destroyed
@NathanFlurry NathanFlurry changed the title fix(sidecar): shadow-walk skip + EventEmitter-shim spurious listener-leak warning fix(sidecar): shadow-walk skip + bound undici pool (net-bridge listener leak) Jun 25, 2026
…idge leak

Three native-sidecar fixes behind long-lived VM latency/stall.

1) Read-side shadow-tree re-walk
Read-side guest fs ops (Exists/Stat/Lstat/ReadFile/Pread) reconcile the host
shadow tree into the kernel VFS, re-reading+re-writing EVERY file on EVERY op
(O(whole tree), super-linear). Add an rsync-style (size,mode,mtime) lstat skip for
unchanged files. Test read_side_ops_skip_unchanged_shadow_files: warm read over an
unchanged 800-file tree >4x cheaper than cold (cold=22.7s, warm=28ms debug).

2) EventEmitter shim: spurious MaxListenersExceededWarning
ensureEventEmitterInitialized() defaulted _maxListenersWarned but not
_maxListeners, so emitters that acquire _events outside our ctor (undici's
Client/Pool/Agent) had _maxListeners=undefined and 'total <= undefined' warned on
the FIRST listener. Default _maxListeners so the threshold is meaningful.

3) undici client-per-request leak (the real net-bridge leak)
The bridge's UndiciAgent was created with an UNBOUNDED per-origin pool. Requests
that overlap while clients are still connecting each find every client kNeedDrain
and spawn a fresh Client+socket -- and for HTTPS the LLM path is HTTP/2 (ALPN), so
each spawn is a whole new h2 session. The synchronous bridge reads widen that
connect window (the #122 per-payload macrotask yield only helps h1 same-socket
reuse, nothing for the h2 connect-window herd). Over a long many-call turn the
abandoned clients accumulate connect/close/drain/error/finish/readable/end/
terminated listeners without bound -> http2 degradation -> 'Request was aborted.'
Bound connections (6, browser-like; h2 multiplexes within each) so excess requests
queue on existing clients instead of spawning new ones.

Test keepalive_no_listener_leak now drives CONCURRENT requests (the trigger) and
asserts the host-side connection count stays bounded: 160 requests over 6
connections with the cap vs 16 (2x concurrency) without it. A process.on('warning')
handler surfaces any MaxListenersExceededWarning to stderr (the warning alone is
insufficient -- leaked listeners spread across per-client emitters).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@NathanFlurry NathanFlurry force-pushed the fix/sidecar-fs-net-perf branch from e2185d6 to c1a67a7 Compare June 25, 2026 20:25
@railway-app railway-app Bot temporarily deployed to rivet-frontend / secure-exec-pr-128 June 25, 2026 20:25 Destroyed
@railway-app railway-app Bot temporarily deployed to secure-exec / secure-exec-pr-128 June 25, 2026 20:25 Destroyed
@NathanFlurry NathanFlurry merged commit 1225be9 into main Jun 25, 2026
2 of 3 checks passed