feat: Support plugin as a comparison axis in evals #2579
🟡 Tier 3: Standard
Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.
Why this tier:
Review process: full human review of logic, architecture, and edge cases.
Stats
Greptile Summary
This PR adds
Confidence Score: 5/5
Safe to merge: the plugin axis is additive and backward-compatible, and existing batches without a plugin level are handled gracefully throughout the stack.
The implementation mirrors the established model/MCP comparison patterns faithfully. All three previous review concerns (optional plugin field typing, the 'none' reserved-name guard, and the mixed-layout viewer false-positive) have been addressed. Backward compatibility with old on-disk layouts is handled in every walker. The plugin ?? PLUGIN_NONE guards are consistent across the aggregate, grade, and report paths. Tests cover column-key generation, config validation, grade layout, and viewer heading logic.
No files require special attention.
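The two guards named in the summary (the reserved 'none' name and the plugin ?? PLUGIN_NONE default) could look roughly like the sketch below. Only PLUGIN_NONE and the 'none' value come from the review text. The PluginDefinition, EvalsConfig, and RunRecord shapes and the validatePluginNames and pluginOf helpers are hypothetical names chosen for illustration, not the PR's actual code.

```typescript
// Illustrative sketch only: apart from PLUGIN_NONE, every name here is an assumption.
const PLUGIN_NONE = "none";

// Assumed shape: a plugin is loaded from a URL or from a local directory.
type PluginDefinition = { label: string; url?: string; dir?: string };

interface EvalsConfig {
  plugins?: Record<string, PluginDefinition>;
}

// Records written before the plugin axis existed have no plugin field,
// so the field is optional.
interface RunRecord {
  mcp: string;
  model: string;
  plugin?: string;
}

// Reject configs that define a plugin under the reserved baseline name.
function validatePluginNames(config: EvalsConfig): string[] {
  const names = Object.keys(config.plugins ?? {});
  if (names.includes(PLUGIN_NONE)) {
    throw new Error(`"${PLUGIN_NONE}" is reserved for the no-plugin baseline`);
  }
  return names;
}

// Treat a missing plugin field as the no-plugin baseline.
function pluginOf(record: RunRecord): string {
  return record.plugin ?? PLUGIN_NONE;
}
```

Routing every read through a single helper such as pluginOf is one way to keep the default consistent across aggregate, grade, and report code paths.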
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant CLI as cli.ts (run cmd)
participant Parser as parsePluginFlag
participant Config as config.ts
participant RunCell as runCell (harness)
participant Claude as claudeSpawn (claude CLI)
participant FS as File System
CLI->>Parser: parsePluginFlag(--plugin none,myplugin)
Parser->>Config: configPluginNames(config)
Config-->>Parser: ['myplugin', ...]
Parser->>Config: getPluginDefinition(config, 'myplugin')
Config-->>Parser: "{ label, url/dir }"
Parser-->>CLI: ['none', 'myplugin']
loop each (mcp, model, plugin, runIndex)
CLI->>RunCell: "runCell({ plugin })"
RunCell->>Config: getPluginDefinition(config, plugin)
RunCell->>Claude: "runClaude({ pluginDef })"
Note over Claude: pluginCliArgs(def) → --plugin-url / --plugin-dir
Claude-->>RunCell: SpawnResult
RunCell-->>CLI: RunRecord (plugin field set)
CLI->>FS: writeRun → batch/scenario/mcp/model/plugin/idx.json
end
CLI->>CLI: buildAggregate (columnKeyFor per mcp/model/plugin)
CLI->>FS: writeBatchSummary → _summary.json + _summary.md
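The flow in the diagram can be sketched as three small functions. The function names parsePluginFlag and pluginCliArgs, the --plugin-url and --plugin-dir flags, and the batch/scenario/mcp/model/plugin/idx.json layout are taken from the diagram. The signatures, the config shape, the runPath helper, and the behavior when the flag is omitted are assumptions made for illustration.

```typescript
// Illustrative sketch only: signatures and types are assumptions, not the PR's code.
const PLUGIN_NONE = "none";

type PluginDefinition = { label: string; url?: string; dir?: string };
type EvalsConfig = { plugins?: Record<string, PluginDefinition> };

// Parse a comma-separated --plugin value and check each name against the config.
// Assumption: an omitted or empty flag means "baseline only".
function parsePluginFlag(raw: string | undefined, config: EvalsConfig): string[] {
  const names = (raw ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  if (names.length === 0) return [PLUGIN_NONE];
  const known = new Set(Object.keys(config.plugins ?? {}));
  for (const name of names) {
    if (name !== PLUGIN_NONE && !known.has(name)) {
      throw new Error(`Unknown plugin "${name}"`);
    }
  }
  return [...new Set(names)]; // drop duplicates, keep first-seen order
}

// Translate a plugin definition into CLI arguments; the baseline adds none.
function pluginCliArgs(def: PluginDefinition | undefined): string[] {
  if (def === undefined) return [];
  if (def.url !== undefined) return ["--plugin-url", def.url];
  if (def.dir !== undefined) return ["--plugin-dir", def.dir];
  return [];
}

// Hypothetical helper: build the per-run path with the plugin as its own level.
function runPath(
  batch: string,
  scenario: string,
  mcp: string,
  model: string,
  plugin: string,
  idx: number,
): string {
  return [batch, scenario, mcp, model, plugin, `${idx}.json`].join("/");
}
```

Under these assumptions, parsePluginFlag("none,myplugin", config) yields ['none', 'myplugin'] when myplugin is defined in the config, which matches the value the parser returns to the CLI in the diagram.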
Reviews (3): Last reviewed commit: "review: Address baseline selection incon..."
E2E Test Results
✅ All tests passed • 224 passed • 3 skipped • 1413s
Tests ran across 4 shards in parallel.
Deep Review
✅ No critical issues found. The plugin-axis feature is well-factored: the four duplicated batch-walkers are consolidated into a single tested
🟡 P2: recommended
🔵 P3 nitpicks (5)
Reviewers (4): correctness, kieran-typescript, testing, maintainability.
Testing gaps:
Summary
This PR adds support for plugin configs in the MCP evals framework, so that Claude plugin versions can be A/B tested (including the no-plugin case). This follows the pattern established by model configuration and comparison.
Screenshots or video
How to test on Vercel preview
Testing:
Clone the plugin
Add the plugin to your evals config file
Run the evals
References