The AI Replication Engine: Inside the Benchmark Behind the Beta

In our last update we showed the AI Replication Engine moving from a one-off demonstration toward consistent performance across a benchmark. We have completed the systematic evaluation we promised: a new paper, currently under peer review, that tests the Engine on 74 economics and political-science papers across three distinct verification tasks, using two frontier models (GPT-5.5 and Claude Opus 4.7) under identical tools, prompts, and execution settings. Crucially, the evaluation was designed to separate findings the system recovered from existing replication reports from genuinely new outputs, and to have expert annotators judge those new outputs on their merits rather than scoring them as mistakes by default.

Left panel, computational reproducibility: both GPT-5.5 and Claude Opus 4.7 reach about 99% executability agreement and roughly 87 F1 on comparison and correspondence. Right panel, robustness-check proposals: more than 90% are judged valid-novel while hallucination rates stay below 2%.

The clearest result is also the most reassuring: computational reproducibility holds up at scale. Both models agree with the reference on whether a replication package actually runs about 99% of the time, and reach roughly 87 F1 at linking the numbers reported in a paper to the values its code regenerates. Once the Engine binds the right reported and regenerated quantities together, it correctly decides whether they match about 98% of the time. In other words, the remaining gap is mostly about finding and correctly mapping every table comparison, not about executing analyses or recognizing numerical agreement.
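To make the two numbers above concrete, here is a minimal sketch of how a linking F1 and a match decision could be computed. It is an illustration under stated assumptions, not the Engine's scoring code: the names `Link`, `link_f1`, and `values_match`, the identifier strings, and the tolerance are all hypothetical, and the toy data are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical pairing of a reported number with a regenerated one."""
    reported_id: str      # e.g. a table cell in the paper
    regenerated_id: str   # e.g. a value in the rerun output


def link_f1(predicted: set, reference: set) -> float:
    """F1 over links: harmonic mean of precision and recall."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def values_match(reported: float, regenerated: float, abs_tol: float = 5e-4) -> bool:
    """Assumed rule: values agree if they differ by at most abs_tol."""
    return abs(reported - regenerated) <= abs_tol


# Toy data (invented): the reference has four links; the prediction
# gets three of them right and mis-maps the fourth.
reference = {
    Link("T1.a", "out.a"),
    Link("T1.b", "out.b"),
    Link("T2.a", "out.c"),
    Link("T2.b", "out.d"),
}
predicted = {
    Link("T1.a", "out.a"),
    Link("T1.b", "out.b"),
    Link("T2.a", "out.c"),
    Link("T2.b", "out.x"),  # wrong regenerated value bound to T2.b
}

score = link_f1(predicted, reference)
```

In this toy case precision and recall are both 3/4, so the linking F1 is 0.75, even though `values_match` would judge every correctly bound pair without difficulty. That is the pattern described above: the mapping step, not the numerical comparison, accounts for most of the remaining error.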

The harder, and more interesting, tasks are the ones that matter most for real verification work. Detecting mismatches between what a manuscript describes and what its code actually does proved difficult and noisy: the system surfaces real candidate issues, but the two models trade off differently between catching more and staying precise, and a meaningful share of flags still needs expert adjudication. Proposing robustness checks tells a more encouraging story. These proposals rarely match the specific checks in expert reports one-for-one, yet when annotators reviewed the system's own suggestions, more than 90% were judged feasible, methodologically defensible, and non-duplicative, with hallucination rates under 2%. The Engine is not recovering a fixed answer key so much as generating useful new checks a human reviewer can act on.

Taken together, these results point to a system that is promising but supervision-dependent, and that conclusion shapes how we are building the beta. The Engine is deliberately diagnostic rather than interventionist: it executes code, flags numerical mismatches, raises possible implementation issues, and proposes robustness checks, but it leaves the substantive judgments to people. The goal for the months ahead is a practical workflow, useful to journals, researchers, and research organizations alike, that pairs fast, reliable automated checks with targeted human oversight. If you would like to try it or help us test it, the beta will be available soon.