feat: Support plugin as a comparison axis in evals by pulpdrew · Pull Request #2579 · hyperdxio/hyperdx

pulpdrew · 2026-07-02T14:04:29Z

Summary

This PR adds support for plugin configs in the MCP evals framework, so that claude plugin versions can be A/B tested (including the no-plugin case). This follows the pattern established by model configuration / comparison.

Screenshots or video

How to test on Vercel preview

Testing:

Clone the plugin

Add the plugin to your evals config file

 "plugins": {
   "clickstack": {
     "label": "ClickStack",
     "dir": "/Users/pulpdrew/Documents/repos/clickstack-claude-plugin/plugin"
   }
 },

Run the evals

yarn workspace @hyperdx/hdx-eval dev run error-root-cause --mcp hyperdx-main,hyperdx-mcp-integration --plugin none,clickstack

References

Linear Issue: Closes HDX-4692
Related PRs:

changeset-bot · 2026-07-02T14:04:35Z

⚠️ No Changeset found

Latest commit: 2f7e64e

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

vercel · 2026-07-02T14:04:37Z

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ignored	Preview	Jul 2, 2026 2:46pm
hyperdx-storybook	Skipped		Jul 2, 2026 2:46pm

github-actions · 2026-07-02T14:04:58Z

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

Diff size: 751 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats

Production files changed: 15
Production lines changed: 751 (+ 763 in test files, excluded from tier calculation)
Branch: drew/evals-plugin-axis
Author: pulpdrew

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

greptile-apps · 2026-07-02T14:09:59Z

Greptile Summary

This PR adds plugin as a first-class comparison axis in the MCP eval framework, mirroring the existing --model and --mcp flags. When --plugin none,myplugin is passed, every cell is run in both the no-plugin and plugin variants; the resulting runs are stored one level deeper (<model>/<plugin>/) and the aggregate reports gain a plugin dimension alongside the existing model/MCP dimensions.

CLI & config: adds --plugin <names> to the run command, plugins config section validation (including the "none" reserved-name guard), and parsePluginFlag/configPluginNames/getPluginDefinition helpers.
Harness: pluginCliArgs converts a PluginDefinition (url/dir) into --plugin-url/--plugin-dir args passed to the claude spawn call; the new plugin field is written into every RunRecord.
Reports & viewer: columnKeyFor extends to include the plugin dimension; getRunFilesInBatch/server.js are updated to walk the new <model>/<plugin>/ directory level while remaining backward-compatible with legacy batches that lack the plugin level.

Confidence Score: 5/5

Safe to merge — the plugin axis is additive and backward-compatible; existing batches without a plugin level are handled gracefully throughout the stack.

The implementation mirrors the established model/MCP comparison patterns faithfully. All three previous review concerns (optional plugin field typing, the 'none' reserved-name guard, and the mixed-layout viewer false-positive) have been addressed. Backward compatibility with old on-disk layouts is handled in every walker. The plugin ?? PLUGIN_NONE guards are consistent across the aggregate, grade, and report paths. Tests cover column-key generation, config validation, grade layout, and viewer heading logic.

No files require special attention.

Important Files Changed

Filename	Overview
packages/hdx-eval/src/harness/types.ts	Adds `PluginDefinition` type, `PLUGIN_NONE` sentinel constant, and `plugin?: string` optional field to `RunRecord` — optional typing correctly reflects backward compat with old on-disk records.
packages/hdx-eval/src/hyperdx/config.ts	Adds `plugins` section to `EvalConfig`, `configPluginNames`/`getPluginDefinition` helpers, and full validation in `validateConfig` including reserved-name guard for "none".
packages/hdx-eval/src/cli.ts	Adds `--plugin` flag, `parsePluginFlag` helper with validation against config, and wires plugin through the (mcp, model, plugin, runIndex) cell matrix; baseline column validation extended to all three dimensions.
packages/hdx-eval/src/harness/claudeSpawn.ts	Adds `pluginCliArgs` function that converts `PluginDefinition` to `--plugin-url`/`--plugin-dir` args; args are injected into the `claude` spawn call before `--dangerously-skip-permissions`.
packages/hdx-eval/src/runs/path.ts	Adds plugin level to `runFilePath` and a `getRunFilesInModelDir` helper that walks both the new `<model>/<plugin>/` layout and the legacy `<model>/` flat layout for backward compat.
packages/hdx-eval/src/reports/aggregate.ts	Extends `columnKeyFor` and `buildAggregate` to handle the plugin dimension; all `plugin` field reads use `?? PLUGIN_NONE` for backward compat with old records.
packages/hdx-eval/scripts/viewer/server.js	Updates `listCells` to emit plugin-arm cells from `<model>/<plugin>/` subdirs, and widens the run-detail route regex from `[^/]+` to `.+` so multi-segment model paths resolve correctly.
packages/hdx-eval/scripts/viewer/public/app.js	Adds `cellModel`/`isMultiPlugin` helpers and updates `cellHeading` to match `aggregate.ts` column-key logic for multi-plugin batches; model extraction correctly strips the plugin suffix from the composite `model` field.
packages/hdx-eval/src/harness/runRun.ts	Threads `plugin` through `RunCellOptions` → `runClaude` call and captures it in the `RunRecord`; `runId` now includes the plugin segment for uniqueness.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as cli.ts (run cmd)
    participant Parser as parsePluginFlag
    participant Config as config.ts
    participant RunCell as runCell (harness)
    participant Claude as claudeSpawn (claude CLI)
    participant FS as File System

    CLI->>Parser: parsePluginFlag(--plugin none,myplugin)
    Parser->>Config: configPluginNames(config)
    Config-->>Parser: ['myplugin', ...]
    Parser->>Config: getPluginDefinition(config, 'myplugin')
    Config-->>Parser: "{ label, url/dir }"
    Parser-->>CLI: ['none', 'myplugin']

    loop each (mcp, model, plugin, runIndex)
        CLI->>RunCell: "runCell({ plugin })"
        RunCell->>Config: getPluginDefinition(config, plugin)
        RunCell->>Claude: "runClaude({ pluginDef })"
        Note over Claude: pluginCliArgs(def) → --plugin-url / --plugin-dir
        Claude-->>RunCell: SpawnResult
        RunCell-->>CLI: RunRecord (plugin field set)
        CLI->>FS: writeRun → batch/scenario/mcp/model/plugin/idx.json
    end

    CLI->>CLI: buildAggregate (columnKeyFor per mcp/model/plugin)
    CLI->>FS: writeBatchSummary → _summary.json + _summary.md

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant CLI as cli.ts (run cmd)
    participant Parser as parsePluginFlag
    participant Config as config.ts
    participant RunCell as runCell (harness)
    participant Claude as claudeSpawn (claude CLI)
    participant FS as File System

    CLI->>Parser: parsePluginFlag(--plugin none,myplugin)
    Parser->>Config: configPluginNames(config)
    Config-->>Parser: ['myplugin', ...]
    Parser->>Config: getPluginDefinition(config, 'myplugin')
    Config-->>Parser: "{ label, url/dir }"
    Parser-->>CLI: ['none', 'myplugin']

    loop each (mcp, model, plugin, runIndex)
        CLI->>RunCell: "runCell({ plugin })"
        RunCell->>Config: getPluginDefinition(config, plugin)
        RunCell->>Claude: "runClaude({ pluginDef })"
        Note over Claude: pluginCliArgs(def) → --plugin-url / --plugin-dir
        Claude-->>RunCell: SpawnResult
        RunCell-->>CLI: RunRecord (plugin field set)
        CLI->>FS: writeRun → batch/scenario/mcp/model/plugin/idx.json
    end

    CLI->>CLI: buildAggregate (columnKeyFor per mcp/model/plugin)
    CLI->>FS: writeBatchSummary → _summary.json + _summary.md

_{Reviews (3): Last reviewed commit: "review: Address baseline selection incon..." | Re-trigger Greptile}

github-actions · 2026-07-02T14:14:55Z

E2E Test Results

✅ All tests passed • 224 passed • 3 skipped • 1413s

Status	Count
✅ Passed	224
❌ Failed	0
⚠️ Flaky	3
⏭️ Skipped	3

Tests ran across 4 shards in parallel.

View full report →

github-actions · 2026-07-02T14:28:06Z

Deep Review

✅ No critical issues found. The plugin-axis feature is well-factored: the four duplicated batch-walkers are consolidated into a single tested getRunFilesInBatch, escapeDirSegment cleanly generalizes modelDirName, every RunRecord.plugin read is guarded with ?? PLUGIN_NONE, and the new config/aggregate/path logic carries strong unit coverage. No correctness, security, or data-loss defects tie to the happy path.

🟡 P2 — recommended

packages/hdx-eval/src/cli.ts:1063 — parsePluginFlag has non-trivial branching (default-none, empty/whitespace input, dedup/order-preserve, none passthrough, unknown-name throw, per-name definition validation) but no direct test, and the run command that calls it is not otherwise exercised.
- Fix: Export parsePluginFlag and add cases covering the default, none,<name>, dedup, unknown-name error, and whitespace trimming.
- _testing

🔵 P3 nitpicks (5)

packages/hdx-eval/src/hyperdx/config.ts:240 — The "exactly one of url/dir" rule is implemented twice (validateConfig and getPluginDefinition) with divergent error strings, and the tests pin both spellings, so a change to one silently misses the other.
- Fix: Extract one validatePluginDefinition(name, def) helper and call it from both sites.
- _{kieran-typescript, maintainability}
packages/hdx-eval/src/grading/grade.ts:114 — The new multiModel/multiPlugin pre-pass calls readRun on every run file, then the grading loop parses each file again, doubling JSON I/O on large batches.
- Fix: Derive the model/plugin sets from records already parsed in the grading loop, or memoize the parsed records.
- _{kieran-typescript}
packages/hdx-eval/src/reports/aggregate.ts:145 — When no explicit or persisted baseline is present, buildAggregate falls back to the alphabetically-first column, so a bare report on a batch whose _summary.json is absent designates the plugin (not none) as baseline, inverting delta signs relative to the documented default; deltas stay internally correct and the baseline is labeled.
- Fix: Prefer the no-plugin / first-listed variant as the fallback baseline instead of columnOrder[0].
- _correctness
packages/app/src/components/TimelineChart/TimelineChart.stories.tsx:70 — Unrelated to the plugin axis: this PR also deletes a changeset and rewrites the story's sample-trace generator, and the new generator sets childCount on flat rows without emitting matching child rows.
- Fix: Split the story/changeset change into its own PR and confirm the new fixture still exercises the intended waterfall cases.
- _{maintainability}
packages/hdx-eval/scripts/viewer/server.js:95 — The viewer's listCells walker and cellHeading/isMultiPlugin/cellModel hand-reimplement getRunFilesInBatch and columnKeyFor in browser/Node JS with no tests and a widened greedy (.+) route regex, so a layout or key-format change can silently desync the viewer from the reports.
- Fix: Add a fixture test asserting the JS walker/heading agree with the TS versions, or have the server expose the computed columnKey per cell so the viewer stops recomputing it.
- _{testing, maintainability}

Reviewers (4): correctness, kieran-typescript, testing, maintainability.

Testing gaps:

parsePluginFlag unknown-name/dedup/whitespace branches are untested (see P2).
No test covers escapeDirSegment collisions between distinct model/plugin names that sanitize to the same segment (e.g. my.plugin vs my-plugin), which would silently overwrite same-index runs on disk — pathological input, and the same class already applies to --model.
The viewer listCells walker and run-detail route regex have no unit coverage.

feat: Support plugin as a comparison axis in evals

d05a869

github-actions Bot added the review/tier-3 Standard — full human review required label Jul 2, 2026

greptile-apps Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread packages/hdx-eval/src/harness/types.ts

Comment thread packages/hdx-eval/src/hyperdx/config.ts

Comment thread packages/hdx-eval/scripts/viewer/public/app.js

review: Address review feedback

3cc2219

vercel Bot temporarily deployed to Preview – hyperdx-storybook July 2, 2026 14:16 Inactive

pulpdrew requested review from a team and fleon and removed request for a team July 2, 2026 14:27

review: Address baseline selection inconsistency

2f7e64e

vercel Bot temporarily deployed to Preview – hyperdx-storybook July 2, 2026 14:46 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Support plugin as a comparison axis in evals#2579

feat: Support plugin as a comparison axis in evals#2579
pulpdrew wants to merge 3 commits into
mainfrom
drew/evals-plugin-axis

pulpdrew commented Jul 2, 2026

Uh oh!

changeset-bot Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

vercel Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

pulpdrew commented Jul 2, 2026

Summary

Screenshots or video

How to test on Vercel preview

References

Uh oh!

changeset-bot Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

vercel Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🟡 Tier 3 — Standard

Uh oh!

greptile-apps Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Test Results

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review

🟡 P2 — recommended

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

changeset-bot Bot commented Jul 2, 2026 •

edited

Loading

vercel Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading

greptile-apps Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading