
The AI Replication Engine: First Experiments and What's Next

February 11, 2026 · Bruno Barbarioli

The Institute for Replication has moved its AI Replication Engine from theoretical framework to operational testing. Previously introduced as an autonomous verification system, the project now demonstrates concrete results from initial benchmarking work.

Key Experimental Findings

The team evaluated three open-source language models on reproducing research from "Racial Flux and Voting Behavior," measuring performance against 16 published metrics including coefficients, R² values, sample sizes, and standard errors.
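
The metric-matching evaluation above can be sketched as a simple tolerance check: collect the published and reproduced values under common metric names, then count how many agree. This is a minimal illustration, not the Institute's actual harness, and every metric name and number below is hypothetical.

```python
import math

def count_matches(published, reproduced, rel_tol=1e-3):
    """Count how many reproduced metrics match the published ones
    within a relative tolerance."""
    matches = 0
    for name, target in published.items():
        value = reproduced.get(name)
        if value is not None and math.isclose(value, target, rel_tol=rel_tol):
            matches += 1
    return matches, len(published)

# Hypothetical example values, standing in for the 16 published metrics.
published = {"coef_flux": 0.42, "r_squared": 0.18, "n_obs": 12450, "se_flux": 0.07}
reproduced = {"coef_flux": 0.42, "r_squared": 0.18, "n_obs": 12450, "se_flux": 0.07}

matched, total = count_matches(published, reproduced)
print(f"{matched}/{total} metrics matched")  # → 4/4 metrics matched
```

A relative tolerance matters here because published coefficients and standard errors are typically rounded, so exact equality would penalize correct replications.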

Results showed significant performance variation unrelated to model scale. The smallest model (glm-4.7-flash at 8-bit quantization) achieved perfect accuracy, matching all 16 metrics, in approximately five minutes. Larger models underperformed substantially: qwen3-coder:30b exhausted its iteration limit after completing only a handful of comparisons, while qwen3-next:80b failed entirely, unable to resolve basic file system operations.

The Institute concluded that instruction-following precision and workflow management prove more critical than raw parameter count for replication tasks.

Funding and Future Directions

The project has submitted an SSHRC Insight Development Grant application requesting $95,000 over two years (August 2026–July 2028). Proposed activities include: systematic evaluation using 250+ papers from the I4R Games dataset and approximately 430 World Bank policy research documents; multi-model comparison employing zero-shot, few-shot, and fine-tuning strategies; and open-source release of code, trained weights, and benchmark datasets.

Preliminary results are targeted for NeurIPS 2026 submission, with toolkit availability anticipated later in 2026. The Institute invites contributions of replication data, model testing participation, and collaborative evaluation opportunities.