
A Researcher's Guide to Replication Packages: Episode 3

February 16, 2026 · Lenka Fiala

Episode 3: The Return of the Code

Congratulations, your paper has been accepted! While assembling the replication package, you ask your research assistant to re-run all analyses. Upon inspection, you discover a discrepancy: Figure 2 no longer matches the paper. After some digging, the culprit emerges: execution order matters. Whether the master file runs everything in one go or you execute the scripts one at a time can change the results, for example when a random seed or variables lingering in memory carry over from one script to the next. This is a common failure mode in poorly structured replication code.

Your overconfidence is your weakness

Never assume the code execution process is self-evident. It may seem obvious to you after weeks of final revisions, but most users encountering your code for the first time (data editors, replicators, students) will need clear guidance.

Ideally, provide one master file that requires minimal path adjustments. A single execution should reproduce all results: data cleaning, merging, reshaping, analysis, tables, and figures. Reality often differs: you might work with restricted datasets, rely on multiple software environments, or face computationally intensive procedures. Document everything thoroughly in your README. Number files sequentially (e.g., "0_preprocessing.R", "1_cleaning.R", "2_analysis_tables.R", "3_analysis_figures.R").
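
If your scripts follow such a numbering convention, the master file can be little more than "run the numbered scripts in order." A minimal Python sketch of that idea (the file names and the `Rscript` call are illustrative assumptions, not a prescribed setup):

```python
from pathlib import Path
import re
import subprocess

def ordered_scripts(folder="."):
    """Return numbered pipeline scripts, sorted by their leading number."""
    scripts = [p for p in Path(folder).glob("*.R") if re.match(r"\d+_", p.name)]
    # Sort numerically, not lexicographically, so "10_" comes after "2_".
    return sorted(scripts, key=lambda p: int(p.name.split("_")[0]))

if __name__ == "__main__":
    for script in ordered_scripts():
        print(f"Running {script.name} ...")
        # check=True stops the pipeline at the first failing script.
        subprocess.run(["Rscript", str(script)], check=True)
```

Running every script through one entry point like this removes the ambiguity about order entirely, which is exactly what a replicator needs.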

Example documentation approaches:

"Part A requires restricted-access data from [organization]. This section is commented out; Tables X and Y cannot be reproduced. Replicators with access should name the dataset partA.dta, store it in the Raw folder, and execute Part A."

"Data cleaning occurs in Stata using master file '1_cleaning.do', executed first. Analysis follows in MATLAB using master file '2_analysis.m'. Auxiliary files load automatically; both files require working-directory adjustments on lines 2 and 4, respectively."

"Master file '0_master.do' runs all cleaning and analyses. Part C takes approximately 28 hours; a post-cleaning dataset is provided in the Intermediate folder for convenience. Comment out lines 2–132 to skip Part C."

Data Cleaning: A Few Tips

Data cleaning substantially influences results and deserves transparency.

  • Do separate data cleaning from analysis for easier navigation.
  • Do comment your code, explaining each step and its rationale (e.g., "per pre-registration, we exclude participants without current employment" or "we merge exam scores from 2022 and 2024 for each student"). Such comments prove invaluable when unexpected results occur.
  • Do verify that the code does what you intend, for example by checking its output by hand on a small, manageable sample.
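
Combining the two tips above, a documented cleaning step might look like this Python sketch (the field names and the exclusion rule are hypothetical illustrations, not from any actual study):

```python
# Hypothetical sketch of a documented cleaning step.
def exclude_unemployed(rows):
    """Per (hypothetical) pre-registration: drop participants without current employment."""
    kept = [r for r in rows if r["employed"]]
    # Print how many observations the rule removed, so drops are easy to audit.
    print(f"Excluded {len(rows) - len(kept)} participants without current employment.")
    return kept
```

Stating the rule's rationale next to the code, and logging how many observations it drops, makes both the intent and the effect of each cleaning step easy to check.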

Analysis Code: More Tips

Clarity remains paramount for analysis scripts.

  • Do link code directly to your paper via comments like "Table 4, columns 1–5" or "footnote 7 regarding 91% subject comprehension rates". This expedites locating specific code related to results.
  • Do explain "unused" code segments. For instance, if a reviewer requested robustness testing with specific control variables and results remained stable, leave this code commented with notation: "Reviewer requested re-analysis with [control variable]; results robust to this adjustment. See lines 334–337 (commented out)."
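
In an analysis script, the two tips above might look like this Python sketch (the function, the footnote number, and the commented-out robustness call are all hypothetical placeholders):

```python
# Illustrative sketch of tying analysis code to the paper via comments.

def comprehension_share(answers):
    """Footnote 7: share of subjects passing the comprehension check."""
    return sum(answers) / len(answers)

# Table 4, columns 1-5: main specification.
# run_main_regression(controls=["age", "income"])

# Reviewer requested re-analysis with [control variable]; results robust
# to this adjustment. Kept commented out for transparency:
# run_main_regression(controls=["age", "income", "tenure"])
```

A replicator scanning for "Table 4" or "footnote 7" lands directly on the relevant code, and the commented-out block explains itself rather than looking like dead code.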

Avoiding Oopsies: A Final Tip

Leaving self-directed notes in your code as you work is acceptable and often helpful, such as "TODO: verify this standard error clustering." Just search for all to-do annotations and resolve them before submission; data editors prefer polished code.

Smooth Operation

Your replication code is a dialogue with your future self and with anyone else reproducing your work. Structure it carefully; it is your best opening line in that conversation.