[Bug] MCP server connect leak: stuck stdio servers spawn unbounded child processes (CPU/lag) #3698

@Mcrewe

Describe the bug

When a configured stdio MCP server is slow or its upstream is unreachable, Copilot CLI spawns its child process but never reaps it, then repeatedly re-spawns on restart/reconnect. Child processes accumulate without bound, pinning CPU and degrading the whole machine. Observed in an autopilot session: a single copilot.exe had spawned 180+ children and was still climbing; the system reached 1,135 total processes. Killing the process tree dropped it to 416 processes and CPU from 66% to 12%.

Affected version

CLI 1.0.59

Steps to reproduce the behavior

  1. Configure a stdio MCP server whose upstream is slow/unreachable but still emits progress during initialize/tools/list (e.g. a kusto-style proxy pointed at an unreachable cluster).
  2. Run a long autopilot session so restart/reconnect fires repeatedly.
  3. Watch child process count of copilot.exe climb without bound; CPU rises and the machine lags.

Expected behavior

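A stalled or failed stdio MCP server's child process is reaped before any re-spawn, so the child process count of copilot.exe stays bounded and CPU usage stays normal.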

Additional context

I diagnosed the bug in a separate CLI session; the results and suggested fix follow.

Root cause (from source, src/mcp-client/)

  • mcp-registry.ts getConnectOptions() sets resetTimeoutOnProgress: true with no maxTotalTimeout: progress keeps resetting the per-request timer, so client.connect() never rejects and the SDK's reap-on-failure (void this.close()) never fires.
  • mcp-host.ts isServerRunning() returns name in transports && name in clients: a server stuck mid-connect has a transport (+ child) but no client → reports "not running" → startServer skips stopServer → next start overwrites the transport ref and orphans the prior child.
  • mcp-registry.ts reconnect (attemptReconnect) is isRemote-only; local stdio servers get raw re-spawn with no bounded backoff.
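The first two bullets can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the CLI's or the SDK's actual code: `timeoutAt`, `HostModel`, `Child`, and the `connectCompletes` flag are hypothetical names invented for illustration, and only the `name in transports && name in clients` check and the option names are taken from the diagnosis above.

```typescript
// Simplified timer model (assumption): each progress event that arrives before
// the current deadline pushes the deadline out by timeoutMs. An optional
// maxTotalTimeoutMs caps the total wait. Returns the time at which the
// request would finally time out.
function timeoutAt(
  progressTimesMs: number[],
  timeoutMs: number,
  maxTotalTimeoutMs?: number,
): number {
  let deadline = timeoutMs;
  for (const t of [...progressTimesMs].sort((a, b) => a - b)) {
    if (t >= deadline) break; // timer already fired before this event
    if (maxTotalTimeoutMs !== undefined && t >= maxTotalTimeoutMs) break;
    deadline = t + timeoutMs;
  }
  return maxTotalTimeoutMs === undefined
    ? deadline
    : Math.min(deadline, maxTotalTimeoutMs);
}

type Child = { pid: number; killed: boolean };

// Simplified host model (assumption): only the two maps and the running
// check described in the issue are reproduced here.
class HostModel {
  transports: Record<string, Child> = {};
  clients: Record<string, object> = {};
  spawned: Child[] = [];
  private nextPid = 1;

  // Mirrors the reported check: both entries must exist.
  isServerRunning(name: string): boolean {
    return name in this.transports && name in this.clients;
  }

  stopServer(name: string): void {
    const child = this.transports[name];
    if (child) child.killed = true;
    delete this.transports[name];
    delete this.clients[name];
  }

  // connectCompletes = false models a connect that stalls forever.
  startServer(name: string, connectCompletes: boolean): void {
    if (this.isServerRunning(name)) this.stopServer(name);
    const child: Child = { pid: this.nextPid++, killed: false };
    this.spawned.push(child);
    this.transports[name] = child; // overwrites any earlier reference
    if (connectCompletes) this.clients[name] = {};
  }

  liveChildren(): number {
    return this.spawned.filter((c) => !c.killed).length;
  }
}

// Progress every second for a minute, 5 s per-request timeout.
const progress = Array.from({ length: 60 }, (_, i) => (i + 1) * 1000);
console.log(timeoutAt(progress, 5000)); // 65000: lives as long as progress continues
console.log(timeoutAt(progress, 5000, 30000)); // 30000: capped

const host = new HostModel();
for (let i = 0; i < 5; i++) host.startServer("kusto", false);
console.log(host.liveChildren()); // 5: every stalled start leaves a live child
```

In this model a start whose connect never completes leaves a transport without a client, so the next start skips `stopServer` and the earlier child is never killed, which matches the unbounded growth described above.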

Suggested fix

  • Add a maxTotalTimeout to getConnectOptions() so a stalled connect eventually rejects and the child is reaped.
  • Treat pending/stuck connections as reapable: close the transport (kill the child) for a server that is mid-connect before re-spawning.
  • Apply bounded backoff to local stdio reconnects, not just remote.
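A minimal sketch of the second and third suggestions, again as a standalone model rather than a patch against the real sources. `ReapingHost`, `Proc`, and `backoffDelayMs` are hypothetical names, and the 1 s base delay and 60 s cap are illustrative values, not settings taken from the CLI.

```typescript
type Proc = { pid: number; killed: boolean };

// Hypothetical host that treats any existing transport as reapable,
// whether or not a client was ever registered for it.
class ReapingHost {
  private transports = new Map<string, Proc>();
  private nextPid = 1;
  readonly spawned: Proc[] = [];

  startServer(name: string): Proc {
    // Kill the previous child before replacing its transport reference.
    const stale = this.transports.get(name);
    if (stale) stale.killed = true;
    const proc: Proc = { pid: this.nextPid++, killed: false };
    this.spawned.push(proc);
    this.transports.set(name, proc);
    return proc;
  }

  liveChildren(): number {
    return this.spawned.filter((p) => !p.killed).length;
  }
}

// Exponential backoff with a hard cap; attempt is zero-based.
function backoffDelayMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const fixed = new ReapingHost();
for (let i = 0; i < 5; i++) fixed.startServer("kusto");
console.log(fixed.liveChildren()); // 1: only the most recent child survives

console.log([0, 1, 2, 10].map((a) => backoffDelayMs(a))); // 1000, 2000, 4000, 60000
```

Keying the reap on the transport alone, rather than on transport plus client, is what closes the mid-connect gap in this model; the capped delay keeps a permanently unreachable upstream from driving a tight re-spawn loop.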

OS: Windows (AMD EPYC 7763, 16 cores, 64 GB). CLI version: 1.0.59 (latest 1.0.60). Full per-spawn logs available: 216 kusto / 118 msft-learn / 114 icm proxy spawns in a single session; only ~79/216 logged shutdown. Happy to attach session logs.

Labels: area:mcp (MCP server configuration, discovery, connectivity, OAuth, policy, and registry)
