fix(sidecar): shadow-walk skip + bound undici pool (net-bridge listener leak)#128
Merged
Conversation
3b400b8 to
119415e
Compare
|
🚅 Deployed to the secure-exec-pr-128 environment in rivet-frontend
🚅 Deployed to the secure-exec-pr-128 environment in secure-exec
|
119415e to
20ec8bc
Compare
20ec8bc to
e2185d6
Compare
…idge leak Three native-sidecar fixes behind long-lived VM latency/stall. 1) Read-side shadow-tree re-walk Read-side guest fs ops (Exists/Stat/Lstat/ReadFile/Pread) reconcile the host shadow tree into the kernel VFS, re-reading+re-writing EVERY file on EVERY op (O(whole tree), super-linear). Add an rsync-style (size,mode,mtime) lstat skip for unchanged files. Test read_side_ops_skip_unchanged_shadow_files: warm read over an unchanged 800-file tree >4x cheaper than cold (cold=22.7s, warm=28ms debug). 2) EventEmitter shim: spurious MaxListenersExceededWarning ensureEventEmitterInitialized() defaulted _maxListenersWarned but not _maxListeners, so emitters that acquire _events outside our ctor (undici's Client/Pool/Agent) had _maxListeners=undefined and 'total <= undefined' warned on the FIRST listener. Default _maxListeners so the threshold is meaningful. 3) undici client-per-request leak (the real net-bridge leak) The bridge's UndiciAgent was created with an UNBOUNDED per-origin pool. Requests that overlap while clients are still connecting each find every client kNeedDrain and spawn a fresh Client+socket -- and for HTTPS the LLM path is HTTP/2 (ALPN), so each spawn is a whole new h2 session. The synchronous bridge reads widen that connect window (the #122 per-payload macrotask yield only helps h1 same-socket reuse, nothing for the h2 connect-window herd). Over a long many-call turn the abandoned clients accumulate connect/close/drain/error/finish/readable/end/ terminated listeners without bound -> http2 degradation -> 'Request was aborted.' Bound connections (6, browser-like; h2 multiplexes within each) so excess requests queue on existing clients instead of spawning new ones. Test keepalive_no_listener_leak now drives CONCURRENT requests (the trigger) and asserts the host-side connection count stays bounded: 160 requests over 6 connections with the cap vs 16 (2x concurrency) without it. A process.on('warning') handler surfaces any MaxListenersExceededWarning to stderr (the warning alone is insufficient -- leaked listeners spread across per-client emitters). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
e2185d6 to
c1a67a7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three native-sidecar fixes behind the long-lived-VM latency/stall reports.
1. Read-side fs ops re-walk the entire shadow tree
Every read-side guest fs op (
Exists/Stat/Lstat/ReadFile/Pread) reconciles the host shadow tree into the kernel VFS, re-reading + re-writing every file on every op → O(whole tree), super-linear. Fix: an rsync-style (size, mode, mtime)lstatskip for unchanged files (self-correcting, no cache). Testread_side_ops_skip_unchanged_shadow_files: warm read over an unchanged 800-file tree ≥4× cheaper than cold (cold=22.7s, warm=28ms debug).2. EventEmitter shim fires a spurious
MaxListenersExceededWarningensureEventEmitterInitialized()defaulted_maxListenersWarnedbut not_maxListeners, so emitters that acquire_eventsoutside our constructor (undici'sClient/Pool/Agent) had_maxListeners === undefined→total <= undefinedis false → warned on the first listener (count "1"). Fix: default_maxListeners.3. undici client-per-request leak — the real net-bridge leak
The bridge's
UndiciAgentwas created with an unbounded per-origin pool. Requests that overlap while clients are still connecting each find every clientkNeedDrainand spawn a fresh Client+socket — and the HTTPS LLM path is HTTP/2 (ALPN), so each spawn is a whole new h2 session. The synchronous bridge reads widen that connect window (the #122 per-payload macrotask yield only helps h1 same-socket reuse, nothing for the h2 connect-window herd). Over a long many-call turn the abandoned clients accumulateconnect/close/drain/error/finish/readable/end/terminatedlisteners without bound → http2 degradation → "Request was aborted."Fix: bound
connections(6, browser-like; HTTP/2 multiplexes within each) so excess requests queue on existing clients instead of spawning new ones.Test
keepalive_no_listener_leaknow drives concurrent requests (the trigger) and asserts the host-side connection count stays bounded: 160 requests over 6 connections with the cap vs 16 (≈2× concurrency) without it. A guestprocess.on("warning")handler surfaces anyMaxListenersExceededWarningto stderr (the warning alone is insufficient — leaked listeners spread across per-client emitters, so no single one need cross the threshold while sockets grow).🤖 Generated with Claude Code