Skip to content

feat: Support plugin as a comparison axis in evals#2579

Open
pulpdrew wants to merge 3 commits into
mainfrom
drew/evals-plugin-axis
Open

feat: Support plugin as a comparison axis in evals#2579
pulpdrew wants to merge 3 commits into
mainfrom
drew/evals-plugin-axis

Conversation

@pulpdrew

@pulpdrew pulpdrew commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR adds support for plugin configs in the MCP evals framework, so that claude plugin versions can be A/B tested (including the no-plugin case). This follows the pattern established by model configuration / comparison.

Screenshots or video

Screenshot 2026-07-02 at 9 53 38 AM Screenshot 2026-07-02 at 10 01 56 AM

How to test on Vercel preview

Testing:

  1. Clone the plugin

  2. Add the plugin to your evals config file

     "plugins": {
       "clickstack": {
         "label": "ClickStack",
         "dir": "/Users/pulpdrew/Documents/repos/clickstack-claude-plugin/plugin"
       }
     },
  3. Run the evals

    yarn workspace @hyperdx/hdx-eval dev run error-root-cause --mcp hyperdx-main,hyperdx-mcp-integration --plugin none,clickstack

References

  • Linear Issue: Closes HDX-4692
  • Related PRs:

@changeset-bot

changeset-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

⚠️ No Changeset found

Latest commit: 2f7e64e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@vercel

vercel Bot commented Jul 2, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
Project Deployment Actions Updated (UTC)
hyperdx-oss Ignored Ignored Preview Jul 2, 2026 2:46pm
hyperdx-storybook Skipped Skipped Jul 2, 2026 2:46pm

Request Review

@github-actions github-actions Bot added the review/tier-3 Standard — full human review required label Jul 2, 2026
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

  • Diff size: 751 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats
  • Production files changed: 15
  • Production lines changed: 751 (+ 763 in test files, excluded from tier calculation)
  • Branch: drew/evals-plugin-axis
  • Author: pulpdrew

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds plugin as a first-class comparison axis in the MCP eval framework, mirroring the existing --model and --mcp flags. When --plugin none,myplugin is passed, every cell is run in both the no-plugin and plugin variants; the resulting runs are stored one level deeper (<model>/<plugin>/) and the aggregate reports gain a plugin dimension alongside the existing model/MCP dimensions.

  • CLI & config: adds --plugin <names> to the run command, plugins config section validation (including the "none" reserved-name guard), and parsePluginFlag/configPluginNames/getPluginDefinition helpers.
  • Harness: pluginCliArgs converts a PluginDefinition (url/dir) into --plugin-url/--plugin-dir args passed to the claude spawn call; the new plugin field is written into every RunRecord.
  • Reports & viewer: columnKeyFor extends to include the plugin dimension; getRunFilesInBatch/server.js are updated to walk the new <model>/<plugin>/ directory level while remaining backward-compatible with legacy batches that lack the plugin level.

Confidence Score: 5/5

Safe to merge — the plugin axis is additive and backward-compatible; existing batches without a plugin level are handled gracefully throughout the stack.

The implementation mirrors the established model/MCP comparison patterns faithfully. All three previous review concerns (optional plugin field typing, the 'none' reserved-name guard, and the mixed-layout viewer false-positive) have been addressed. Backward compatibility with old on-disk layouts is handled in every walker. The plugin ?? PLUGIN_NONE guards are consistent across the aggregate, grade, and report paths. Tests cover column-key generation, config validation, grade layout, and viewer heading logic.

No files require special attention.

Important Files Changed

Filename Overview
packages/hdx-eval/src/harness/types.ts Adds PluginDefinition type, PLUGIN_NONE sentinel constant, and plugin?: string optional field to RunRecord — optional typing correctly reflects backward compat with old on-disk records.
packages/hdx-eval/src/hyperdx/config.ts Adds plugins section to EvalConfig, configPluginNames/getPluginDefinition helpers, and full validation in validateConfig including reserved-name guard for "none".
packages/hdx-eval/src/cli.ts Adds --plugin flag, parsePluginFlag helper with validation against config, and wires plugin through the (mcp, model, plugin, runIndex) cell matrix; baseline column validation extended to all three dimensions.
packages/hdx-eval/src/harness/claudeSpawn.ts Adds pluginCliArgs function that converts PluginDefinition to --plugin-url/--plugin-dir args; args are injected into the claude spawn call before --dangerously-skip-permissions.
packages/hdx-eval/src/runs/path.ts Adds plugin level to runFilePath and a getRunFilesInModelDir helper that walks both the new <model>/<plugin>/ layout and the legacy <model>/ flat layout for backward compat.
packages/hdx-eval/src/reports/aggregate.ts Extends columnKeyFor and buildAggregate to handle the plugin dimension; all plugin field reads use ?? PLUGIN_NONE for backward compat with old records.
packages/hdx-eval/scripts/viewer/server.js Updates listCells to emit plugin-arm cells from <model>/<plugin>/ subdirs, and widens the run-detail route regex from [^/]+ to .+ so multi-segment model paths resolve correctly.
packages/hdx-eval/scripts/viewer/public/app.js Adds cellModel/isMultiPlugin helpers and updates cellHeading to match aggregate.ts column-key logic for multi-plugin batches; model extraction correctly strips the plugin suffix from the composite model field.
packages/hdx-eval/src/harness/runRun.ts Threads plugin through RunCellOptionsrunClaude call and captures it in the RunRecord; runId now includes the plugin segment for uniqueness.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as cli.ts (run cmd)
    participant Parser as parsePluginFlag
    participant Config as config.ts
    participant RunCell as runCell (harness)
    participant Claude as claudeSpawn (claude CLI)
    participant FS as File System

    CLI->>Parser: parsePluginFlag(--plugin none,myplugin)
    Parser->>Config: configPluginNames(config)
    Config-->>Parser: ['myplugin', ...]
    Parser->>Config: getPluginDefinition(config, 'myplugin')
    Config-->>Parser: "{ label, url/dir }"
    Parser-->>CLI: ['none', 'myplugin']

    loop each (mcp, model, plugin, runIndex)
        CLI->>RunCell: "runCell({ plugin })"
        RunCell->>Config: getPluginDefinition(config, plugin)
        RunCell->>Claude: "runClaude({ pluginDef })"
        Note over Claude: pluginCliArgs(def) → --plugin-url / --plugin-dir
        Claude-->>RunCell: SpawnResult
        RunCell-->>CLI: RunRecord (plugin field set)
        CLI->>FS: writeRun → batch/scenario/mcp/model/plugin/idx.json
    end

    CLI->>CLI: buildAggregate (columnKeyFor per mcp/model/plugin)
    CLI->>FS: writeBatchSummary → _summary.json + _summary.md
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant CLI as cli.ts (run cmd)
    participant Parser as parsePluginFlag
    participant Config as config.ts
    participant RunCell as runCell (harness)
    participant Claude as claudeSpawn (claude CLI)
    participant FS as File System

    CLI->>Parser: parsePluginFlag(--plugin none,myplugin)
    Parser->>Config: configPluginNames(config)
    Config-->>Parser: ['myplugin', ...]
    Parser->>Config: getPluginDefinition(config, 'myplugin')
    Config-->>Parser: "{ label, url/dir }"
    Parser-->>CLI: ['none', 'myplugin']

    loop each (mcp, model, plugin, runIndex)
        CLI->>RunCell: "runCell({ plugin })"
        RunCell->>Config: getPluginDefinition(config, plugin)
        RunCell->>Claude: "runClaude({ pluginDef })"
        Note over Claude: pluginCliArgs(def) → --plugin-url / --plugin-dir
        Claude-->>RunCell: SpawnResult
        RunCell-->>CLI: RunRecord (plugin field set)
        CLI->>FS: writeRun → batch/scenario/mcp/model/plugin/idx.json
    end

    CLI->>CLI: buildAggregate (columnKeyFor per mcp/model/plugin)
    CLI->>FS: writeBatchSummary → _summary.json + _summary.md
Loading

Reviews (3): Last reviewed commit: "review: Address baseline selection incon..." | Re-trigger Greptile

Comment thread packages/hdx-eval/src/harness/types.ts
Comment thread packages/hdx-eval/src/hyperdx/config.ts
Comment thread packages/hdx-eval/scripts/viewer/public/app.js
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

E2E Test Results

All tests passed • 224 passed • 3 skipped • 1413s

Status Count
✅ Passed 224
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook July 2, 2026 14:16 Inactive
@pulpdrew pulpdrew requested review from a team and fleon and removed request for a team July 2, 2026 14:27
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Deep Review

No critical issues found. The plugin-axis feature is well-factored: the four duplicated batch-walkers are consolidated into a single tested getRunFilesInBatch, escapeDirSegment cleanly generalizes modelDirName, every RunRecord.plugin read is guarded with ?? PLUGIN_NONE, and the new config/aggregate/path logic carries strong unit coverage. No correctness, security, or data-loss defects tie to the happy path.

🟡 P2 — recommended

  • packages/hdx-eval/src/cli.ts:1063parsePluginFlag has non-trivial branching (default-none, empty/whitespace input, dedup/order-preserve, none passthrough, unknown-name throw, per-name definition validation) but no direct test, and the run command that calls it is not otherwise exercised.
    • Fix: Export parsePluginFlag and add cases covering the default, none,<name>, dedup, unknown-name error, and whitespace trimming.
    • testing
🔵 P3 nitpicks (5)
  • packages/hdx-eval/src/hyperdx/config.ts:240 — The "exactly one of url/dir" rule is implemented twice (validateConfig and getPluginDefinition) with divergent error strings, and the tests pin both spellings, so a change to one silently misses the other.
    • Fix: Extract one validatePluginDefinition(name, def) helper and call it from both sites.
    • kieran-typescript, maintainability
  • packages/hdx-eval/src/grading/grade.ts:114 — The new multiModel/multiPlugin pre-pass calls readRun on every run file, then the grading loop parses each file again, doubling JSON I/O on large batches.
    • Fix: Derive the model/plugin sets from records already parsed in the grading loop, or memoize the parsed records.
    • kieran-typescript
  • packages/hdx-eval/src/reports/aggregate.ts:145 — When no explicit or persisted baseline is present, buildAggregate falls back to the alphabetically-first column, so a bare report on a batch whose _summary.json is absent designates the plugin (not none) as baseline, inverting delta signs relative to the documented default; deltas stay internally correct and the baseline is labeled.
    • Fix: Prefer the no-plugin / first-listed variant as the fallback baseline instead of columnOrder[0].
    • correctness
  • packages/app/src/components/TimelineChart/TimelineChart.stories.tsx:70 — Unrelated to the plugin axis: this PR also deletes a changeset and rewrites the story's sample-trace generator, and the new generator sets childCount on flat rows without emitting matching child rows.
    • Fix: Split the story/changeset change into its own PR and confirm the new fixture still exercises the intended waterfall cases.
    • maintainability
  • packages/hdx-eval/scripts/viewer/server.js:95 — The viewer's listCells walker and cellHeading/isMultiPlugin/cellModel hand-reimplement getRunFilesInBatch and columnKeyFor in browser/Node JS with no tests and a widened greedy (.+) route regex, so a layout or key-format change can silently desync the viewer from the reports.
    • Fix: Add a fixture test asserting the JS walker/heading agree with the TS versions, or have the server expose the computed columnKey per cell so the viewer stops recomputing it.
    • testing, maintainability

Reviewers (4): correctness, kieran-typescript, testing, maintainability.

Testing gaps:

  • parsePluginFlag unknown-name/dedup/whitespace branches are untested (see P2).
  • No test covers escapeDirSegment collisions between distinct model/plugin names that sanitize to the same segment (e.g. my.plugin vs my-plugin), which would silently overwrite same-index runs on disk — pathological input, and the same class already applies to --model.
  • The viewer listCells walker and run-detail route regex have no unit coverage.

@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook July 2, 2026 14:46 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review/tier-3 Standard — full human review required

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant