Releases: facebookresearch/ProgramBench

v1.2.3

@klieret klieret released this 29 Jun 15:24
2b9b4d8

What's Changed

Minor fixes

These fixes only affect stability, not evaluation results.

  • fix(container): kill orphaned in-container process on execute() timeout by @kunchenguid in #31
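For readers unfamiliar with this class of bug, here is a minimal host-side sketch of the general technique: when a command times out, kill its whole process group so no children are left running. It is not the code from #31. The helper name `run_with_timeout` is hypothetical, and the real fix concerns processes inside a container, which this POSIX-only analogue does not model.

```python
from __future__ import annotations

import os
import signal
import subprocess


def run_with_timeout(cmd: list[str], timeout: float) -> tuple[int, bool]:
    """Run cmd and return (returncode, timed_out).

    Hypothetical helper for illustration; POSIX only.
    """
    # start_new_session=True makes the child the leader of a new process
    # group, so anything it spawns can be signalled together with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        proc.wait(timeout=timeout)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        # Kill the whole group, not just the direct child, so that
        # grandchildren do not survive as orphans.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child to avoid leaving a zombie
        return proc.returncode, True
```

Killing only the direct child would leave any processes it spawned running, which is the orphaning behaviour the bullet above refers to.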

New Contributors

Full Changelog: v1.2.2...v1.2.3

v1.2.2

@klieret klieret released this 23 Jun 15:12
d26c959

What's Changed

  • Fix(eval): dependency_ignored test updates by @klieret in #49

Full Changelog: v1.2.1...v1.2.2

v1.2.1

@john-b-yang john-b-yang released this 23 Jun 14:32
dd8b5a4

What's Changed

  • package: mean_score over full benchmark (not attempted) by @john-b-yang in #48
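One reading of this change is that the mean is now divided by the total number of benchmark instances rather than by the number of attempted ones. The sketch below illustrates that difference. The function signature and the choice to count unattempted instances as 0 are assumptions made for illustration, not taken from #48.

```python
from __future__ import annotations


def mean_score(
    scores: dict[str, float],
    all_instances: list[str],
    over_full_benchmark: bool = True,
) -> float:
    """Average per-instance scores (illustrative sketch only).

    scores maps attempted instance IDs to a score in [0, 1].
    Assumption: unattempted instances contribute 0 to the full-benchmark mean.
    """
    if over_full_benchmark:
        if not all_instances:
            return 0.0
        total = sum(scores.get(name, 0.0) for name in all_instances)
        return total / len(all_instances)
    # Older-style behaviour: average only over instances that have a score.
    attempted = [name for name in all_instances if name in scores]
    if not attempted:
        return 0.0
    return sum(scores[name] for name in attempted) / len(attempted)


example = {"a": 1.0, "b": 0.5}
instances = ["a", "b", "c", "d"]
full = mean_score(example, instances)            # (1.0 + 0.5) / 4
partial = mean_score(example, instances, False)  # (1.0 + 0.5) / 2
```

Averaging over the full benchmark prevents a submission from raising its mean by skipping hard instances.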

Full Changelog: v1.2.0...v1.2.1

v1.2.0

@john-b-yang john-b-yang released this 22 Jun 19:16

What's Changed

  • Add programbench submit (package/verify/publish/register) by @john-b-yang in #39

New Contributors

Full Changelog: v1.1.0...v1.2.0

v1.1.0

@klieret klieret released this 18 Jun 21:34
ede4bdb

What's Changed

This release fixes several issues with the eval harness. If you are evaluating on ProgramBench, we strongly recommend updating. Most fixes should not require rerunning agents. The one exception is a small loophole described in #14 (first raised by suche-ux) and #45, which is fixed by the new Docker images (#46). Annotating existing agent trajectories should make it easy to flag which instances were affected.

  • Fix(eval): block build-script internet for submissions by @klieret in #41
  • Fix(eval): Ignore flaky and otherwise unsuitable tests by @klieret in #40
  • Fix(eval): evaluate in :task_cleanroom images by @klieret in #42
  • Fix(eval): default to v6 docker images by @klieret in #46

New Contributors

Full Changelog: v1.0.2...v1.1.0

v1.0.2

@klieret klieret released this 11 May 16:58
b33e660

This minor release ignores ~30 tests that caused hangs when evaluating incorrect solutions.

Full Changelog: v1.0.1...v1.0.2

v1.0.1

@klieret klieret released this 07 May 12:45
1fe64c8

What's Changed

  • Fix: stderr messages can corrupt XML coverage report (#5), thanks for the report @darshanmakwana412
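A common cause of this kind of corruption is merging stderr into the stream that carries the XML. Whether that was the exact failure mode in #5 is an assumption. The sketch below shows the general remedy of capturing the two streams separately and parsing only stdout. The helper `run_and_parse_xml` is hypothetical and not part of ProgramBench.

```python
from __future__ import annotations

import subprocess
import xml.etree.ElementTree as ET


def run_and_parse_xml(cmd: list[str]) -> tuple[ET.Element, str]:
    """Run cmd, parse its stdout as XML, and return (root, stderr_text).

    Hypothetical helper for illustration only.
    """
    # Separate pipes keep warnings written to stderr out of the XML payload.
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        check=False,
    )
    root = ET.fromstring(result.stdout)
    return root, result.stderr
```

With `stderr=subprocess.STDOUT` instead, any warning printed by the tool would be interleaved with the report and could make it unparseable.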

New Contributors

Full Changelog: v1.0.0...v1.0.1

ProgramBench 🦊

@klieret klieret released this 05 May 14:31
2803dcc

How much of SQLite, FFmpeg, or the PHP compiler can Opus 4.7 rebuild from scratch, given just an executable and no starter code or internet access?

Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end.

Read more: https://programbench.com/
