Glossary

Terms used across the registry.

Plain-language definitions of the 26 statistical, methodological, and reporting concepts that appear throughout the registry and evidence syntheses. Written for the curious non-statistician; rigorous enough to share with a colleague.

By category

Study designs (6)
Statistical concepts (6)
Implementation & reporting (7)
Causal inference (7)

Study designs

Cluster-randomized trial

An RCT in which groups (clinics, schools, neighborhoods) — not individuals — are randomly assigned to conditions.

Cluster randomization is necessary when an intervention naturally applies to a whole group (a school-wide curriculum change, a neighborhood policing program) and when contamination between treated and untreated individuals in the same unit would muddy the result.

The statistical cost is real: cluster trials have effectively fewer 'independent observations' than the head count suggests, because outcomes within a cluster are correlated. Required sample sizes can be many times larger than for individual randomization. The intraclass correlation coefficient (ICC) is what quantifies the penalty.

Difference-in-differences (DiD)

Also called: DiD

A quasi-experimental method that compares change-over-time in a treated group to change-over-time in an untreated comparison group.

DiD answers the question: did the treated group's outcome change more (or less) than the comparison group's did, over the same period? The key identifying assumption — 'parallel trends' — is that, absent the treatment, the two groups' outcomes would have moved in parallel. The DiD estimate is the gap between actual treated-group change and the change the comparison group experienced.
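The arithmetic behind a basic two-group, two-period DiD can be sketched in a few lines of Python. The function name and the numbers are hypothetical, chosen only for illustration; a real analysis would typically use regression and report standard errors.

```python
def did_estimate(treated_pre, treated_post, comparison_pre, comparison_post):
    """Difference-in-differences computed from four group means."""
    treated_change = treated_post - treated_pre
    comparison_change = comparison_post - comparison_pre
    # Under parallel trends, the comparison group's change stands in for
    # what the treated group would have experienced without treatment.
    return treated_change - comparison_change


# Hypothetical employment rates (percent) before and after a reform.
estimate = did_estimate(
    treated_pre=60.0, treated_post=66.0,
    comparison_pre=58.0, comparison_post=61.0,
)
print(estimate)  # 3.0
```

Here the treated group rose 6 points and the comparison group 3, so the estimate is 3 points. It is only as credible as the parallel-trends assumption behind it.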

DiD has become the workhorse of state-policy evaluation in the US precisely because variation across states and timing of reform adoption gives researchers many natural experiments to exploit.

Quasi-experimental design

A study that estimates causal effects using natural variation rather than random assignment.

When randomization isn't feasible — for ethical, political, or operational reasons — researchers can still estimate causal effects by exploiting natural variation that's plausibly as-good-as-random. Common designs include difference-in-differences (comparing before/after across treated and untreated jurisdictions), regression discontinuity (comparing units just above and below an eligibility cutoff), and instrumental variables.

Quasi-experimental designs require more assumptions than RCTs to justify a causal interpretation, but those assumptions are often defensible, and when they are, the resulting evidence can be very strong. Many of the registry's most-cited findings (EITC effects, Bolsa Família long-run impacts) rest on quasi-experimental designs.

Randomized controlled trial (RCT)

Also called: RCT

A study that randomly assigns participants to one or more treatment conditions and a control, then compares outcomes between groups.

An RCT is the cleanest available test of whether a specific intervention causes a specific outcome. Random assignment means that, in expectation, the only systematic difference between groups is the intervention itself, so any difference in outcomes beyond what chance would produce can be attributed to it.

In civic experiments, the randomization unit can be individuals (each person gets the message or not), clusters (each city, school, or clinic gets the treatment or not), or time periods (the intervention is rolled out in randomly chosen waves). Each design has tradeoffs for sample-size requirements and contamination risk between conditions.

Regression discontinuity (RDD)

Also called: RDD

A quasi-experimental method that compares units just above and just below an arbitrary eligibility cutoff.

When eligibility for a program depends on a sharp cutoff (a test score, an income threshold, a date of birth), units very close to either side are likely similar in everything but their treatment status. RDD compares these near-cutoff units to estimate the effect of the program at the threshold.

The internal validity at the cutoff is very strong; the catch is that RDD only estimates the effect for marginally eligible units, not the average effect across the whole eligible population.

Stepped-wedge design

A trial in which all clusters eventually receive the treatment, but in randomly assigned waves over time.

Stepped-wedge designs solve a common civic-experiment dilemma: the agency has decided the intervention will roll out everywhere, and ethically can't withhold it from a long-running control group. By randomizing the *order* of rollout across clusters, the design preserves causal identification while ensuring no cluster is permanently denied treatment.

The analysis has to account for time trends, since every cluster transitions from untreated to treated at some point. Done well, stepped-wedge gives strong causal evidence under realistic implementation constraints.
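Randomizing the order of rollout can be sketched as below. This is a minimal illustration, not a full design procedure: the function name, the clinic labels, and the fixed seed are all assumptions, and real trials often stratify clusters before assigning waves.

```python
import random


def assign_waves(clusters, n_waves, seed=0):
    """Randomly order clusters, then deal them into rollout waves."""
    rng = random.Random(seed)  # seeded so the schedule can be reproduced and audited
    order = list(clusters)
    rng.shuffle(order)
    # The cluster at position i in the shuffled order joins wave (i % n_waves) + 1,
    # which keeps wave sizes as equal as possible.
    return {cluster: i % n_waves + 1 for i, cluster in enumerate(order)}


clinics = ["A", "B", "C", "D", "E", "F"]  # hypothetical cluster names
schedule = assign_waves(clinics, n_waves=3)
for clinic in sorted(schedule, key=schedule.get):
    print(f"wave {schedule[clinic]}: clinic {clinic}")
```

Every clinic appears in exactly one wave, so none is permanently left untreated, and which clinics go first is decided by chance rather than by the agency.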

Statistical concepts

Effect size

The magnitude of an intervention's impact — in raw outcome units (e.g. +5 percentage points) or standardized units (e.g. Cohen's d).

Effect size answers 'how much did the intervention move things?' Two effect sizes matter most for policy: the *absolute* change (5 percentage points more people vaccinated) and the *relative* change (a 25% increase off a 20% baseline). Both should be reported.
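The two figures come from the same pair of rates, as this small sketch shows. The function name is hypothetical, and the rates match the vaccination example above.

```python
def absolute_and_relative_change(baseline_rate, treated_rate):
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_pp = (treated_rate - baseline_rate) * 100
    relative_pct = (treated_rate - baseline_rate) / baseline_rate * 100
    return absolute_pp, relative_pct


absolute_pp, relative_pct = absolute_and_relative_change(0.20, 0.25)
print(f"{absolute_pp:.1f} percentage points, {relative_pct:.1f}% relative increase")
# 5.0 percentage points, 25.0% relative increase
```

The same 5-point gain would be a 10% relative increase off a 50% baseline, which is why reporting only one of the two numbers can mislead.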

Standardized effect sizes (Cohen's d, odds ratios, hazard ratios) let researchers compare effects across studies with different outcome scales. Useful for meta-analysis, but absolute effects are usually more meaningful for decision-makers.

Intraclass correlation (ICC)

Also called: ICC

The share of variance in an outcome attributable to between-cluster differences. Drives sample-size requirements in cluster-randomized trials.

If outcomes within a cluster (a school, a clinic, a neighborhood) are highly correlated — students at the same school have similar test scores — then adding more students from the same cluster gives you less new information than adding students from a different cluster.

The ICC quantifies this. An ICC of 0.1 means 10% of variance is between-cluster. With that ICC and 20 students per cluster, the design effect is 1 + (20 - 1) × 0.1 = 2.9, so a cluster trial has an effective sample size roughly 3× smaller than an individually randomized trial with the same head count. Ignore ICC and your power calculation will be wrong.
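The penalty is usually computed with the standard design-effect formula, 1 + (m - 1) × ICC for clusters of equal size m. A minimal sketch follows; the function names and the 2,000-student trial are hypothetical.

```python
def design_effect(icc, cluster_size):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_total, icc, cluster_size):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(icc, cluster_size)


deff = design_effect(icc=0.1, cluster_size=20)
print(round(deff, 2))  # 2.9

# Hypothetical trial: 2,000 students in 100 schools of 20.
print(round(effective_sample_size(2000, icc=0.1, cluster_size=20)))  # 690
```

With unequal cluster sizes the inflation is somewhat larger, so this formula is best read as a lower bound on the penalty.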

Minimum detectable effect (MDE)

Also called: MDE

The smallest true effect size a study has a fair chance of detecting, given its sample size, alpha, and power.

MDE is the question 'how small an effect could we reliably see?' rephrased as a single number. A study with a 10-percentage-point MDE will reliably detect an effect of 10 points or more — and is likely to miss effects smaller than that, even if those smaller effects are real.

MDE matters because it makes the design choices explicit before the trial runs. A study designed to detect huge effects will produce null results for smaller-but-real ones. The MDE should be set in conversation with the policy question — what's the smallest effect that would change a decision?
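For a simple two-arm trial with a binary outcome, the MDE can be approximated as below. This is a sketch under stated assumptions: individual randomization, equal arms, a normal approximation, and the baseline variance used in both arms. The function name and sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_two_proportions(n_per_arm, baseline_rate, alpha=0.05, power=0.80):
    """Approximate MDE (as a proportion) for a two-sided test of two proportions."""
    z = NormalDist()
    # Roughly 1.96 + 0.84 = 2.8 at the conventional alpha and power.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return multiplier * standard_error


for n in (250, 1000, 4000):  # hypothetical sample sizes per arm
    mde_pp = 100 * mde_two_proportions(n, baseline_rate=0.20)
    print(f"n per arm = {n}: MDE of about {mde_pp:.1f} percentage points")
```

Because the MDE shrinks with the square root of the sample size, halving it takes roughly four times as many participants.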

Statistical power

Also called: power, 1 - beta

The probability that a study correctly detects a real effect of a specified size.

Power is conventionally set at 0.80 — that is, an 80% chance of finding the effect if it exists. Higher power requires larger sample sizes; lower power saves money but risks the worst outcome of all, an underpowered null result that the agency then uses to defend the status quo.

The right power for a study depends on the cost of a false negative. If a null result will end a promising program, 0.80 is too low and 0.90 is more honest.
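How power responds to sample size can be sketched with a normal approximation for a two-arm trial with a binary outcome. The function name, rates, and sample sizes below are hypothetical, and dedicated power software will give slightly different figures.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(n_per_arm, control_rate, treated_rate, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z = NormalDist()
    se = sqrt(
        control_rate * (1 - control_rate) / n_per_arm
        + treated_rate * (1 - treated_rate) / n_per_arm
    )
    z_effect = abs(treated_rate - control_rate) / se
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return z.cdf(z_effect - z_crit)


# Hypothetical trial: outcome moves from 20% to 25%.
for n in (500, 1000, 1500):
    print(f"n per arm = {n}: power of about {power_two_proportions(n, 0.20, 0.25):.2f}")
```

In this example, moving from roughly 0.80 to roughly 0.90 power costs several hundred extra participants per arm, which is the price of a more trustworthy null result.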

Statistical significance (α)

Also called: alpha, p-value

Whether a result would be surprising if no true effect existed. The p-value is the probability of observing a difference at least as large as the one in the data under that assumption; 'significant' usually means p < 0.05.

A 'statistically significant' result means the data would be unlikely under the null hypothesis of no effect. The threshold α is set by convention at 0.05, meaning we accept a 5% false-positive rate when no effect exists.

Statistical significance does not measure the size or importance of an effect, only the strength of the evidence that some effect exists. A statistically significant 0.1-percentage-point change is probably real but probably not worth scaling; a non-significant 5-point change in a small trial may be the truth, just unproven.
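The contrast between significance and size can be made concrete with a pooled two-proportion z-test. This is a minimal sketch: the function name and both sets of counts are hypothetical, and the normal approximation is assumed to be adequate.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Tiny effect, huge sample: 20.0% vs 20.1% with two million people per arm.
tiny_effect_p = two_proportion_p_value(400_000, 2_000_000, 402_000, 2_000_000)

# Large effect, small sample: 20% vs 25% with 100 people per arm.
large_effect_p = two_proportion_p_value(20, 100, 25, 100)

print(f"0.1-point effect, huge trial: p = {tiny_effect_p:.3f}")
print(f"5-point effect, small trial:  p = {large_effect_p:.3f}")
```

The 0.1-point difference clears the 0.05 threshold while the 5-point difference does not, even though the second would matter far more if it were real.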

Type I and Type II errors

Also called: Type I error, Type II error, false positive, false negative

Type I = false positive (declaring an effect that isn't real). Type II = false negative (missing a real effect).

Every study trades off these two errors. Lowering α (stricter significance threshold) reduces false positives but increases false negatives. Raising power reduces false negatives but requires larger samples. There's no free lunch — the question is which error is more costly in the specific context.

In high-stakes policy decisions, a false-negative null result that kills a working program may be more harmful than a false-positive that scales a marginal program. Reasonable people can disagree about which error to fear more, but the design choice should be deliberate.

Implementation & reporting

Attrition

The share of enrolled participants who drop out before the outcome is measured.

Attrition is the main threat to the internal validity of an RCT. If 30% of participants drop out, and dropouts differ systematically between treatment and control arms, the remaining groups are no longer comparable, and any effect estimate from them is biased.

Good practice: report attrition by arm, check whether dropouts differ on observable characteristics, and conduct sensitivity analysis assuming the worst-case (or best-case) for dropouts. Studies with >20% differential attrition between arms should be interpreted very cautiously.
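The first and last of these checks can be sketched for a binary outcome as follows. The function names, arm labels, and counts are hypothetical. The bounds simply assume every dropout in an arm succeeded or failed, which is the most extreme sensitivity analysis rather than the only one.

```python
def attrition_by_arm(enrolled, measured):
    """Per-arm attrition rates and the gap between the highest and lowest."""
    rates = {arm: 1 - measured[arm] / enrolled[arm] for arm in enrolled}
    differential = max(rates.values()) - min(rates.values())
    return rates, differential


def worst_and_best_case(successes, measured, enrolled):
    """Bounds on the effect when dropouts' outcomes are imputed at the extremes."""
    t, c = "treatment", "control"
    missing = {arm: enrolled[arm] - measured[arm] for arm in enrolled}
    # Worst case for the program: treated dropouts all fail, control dropouts all succeed.
    worst = successes[t] / enrolled[t] - (successes[c] + missing[c]) / enrolled[c]
    # Best case: treated dropouts all succeed, control dropouts all fail.
    best = (successes[t] + missing[t]) / enrolled[t] - successes[c] / enrolled[c]
    return worst, best


enrolled = {"treatment": 500, "control": 500}   # hypothetical counts
measured = {"treatment": 430, "control": 390}
successes = {"treatment": 215, "control": 156}  # among those measured

rates, differential = attrition_by_arm(enrolled, measured)
worst, best = worst_and_best_case(successes, measured, enrolled)
print({arm: f"{rate:.0%}" for arm, rate in rates.items()}, f"gap {differential:.0%}")
print(f"effect bounds: {worst:+.3f} to {best:+.3f}")
```

Among completers the effect looks like +10 points (50% vs 40%), but the bounds run from about -10 to about +26 points, which shows how much weight the result places on assumptions about who dropped out.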

Compliance

The share of assigned-to-treatment participants who actually received the treatment.

Civic experiments rarely achieve 100% compliance. People assigned to receive a text reminder may have changed phone numbers; people assigned to attend a workshop may not show up. Compliance below ~80% can substantially attenuate ITT effects relative to the true effect on those who actually got the treatment.

Reporting compliance honestly is part of reporting the trial honestly. A null result with 30% compliance tells you the program-as-offered didn't work; it doesn't tell you the program-as-received doesn't work.
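The attenuation can be sketched with simple arithmetic, under two assumptions that should be stated loudly: nobody in the control group receives the treatment, and being assigned affects outcomes only through actually receiving it. The function names and numbers are hypothetical.

```python
def itt_from_complier_effect(complier_effect, compliance_rate):
    """ITT effect implied by an effect on compliers and a compliance rate."""
    return complier_effect * compliance_rate


def complier_effect_from_itt(itt_effect, compliance_rate):
    """Rescale an ITT estimate to the implied effect on those who complied."""
    if not 0 < compliance_rate <= 1:
        raise ValueError("compliance_rate must be in (0, 1]")
    return itt_effect / compliance_rate


# A program that raises the outcome by 10 points for people who receive it...
for compliance in (1.0, 0.8, 0.3):
    itt = itt_from_complier_effect(10.0, compliance)
    print(f"compliance {compliance:.0%}: ITT effect of about {itt:.1f} points")
```

At 30% compliance a 10-point effect on recipients shows up as a 3-point ITT effect, which a trial sized for 10 points could easily miss.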

Intention-to-treat (ITT)

Also called: ITT

Analyze every participant according to the group they were assigned to, regardless of whether they actually complied with the assigned treatment.

ITT is the conservative, honest default. If 100 people are assigned to a program and 40 don't show up, the ITT estimate compares all 100 to their controls — not just the 60 who participated. This reflects the real-world effect of *offering* the program at scale, which is usually what policy decisions need.

Reporting only the per-protocol effect (the 60 who actually participated) typically inflates the apparent effect by removing the non-compliers, who likely differ systematically from compliers. The two estimates are both useful, but ITT is the default and per-protocol the supplement.

Per-protocol analysis

Analyze only the participants who actually received and completed the assigned treatment.

Per-protocol estimates answer 'what was the effect for those who actually took the treatment?' That's a meaningful question, but the answer is subject to selection bias — compliers and non-compliers usually differ in ways that affect the outcome.

Report per-protocol alongside ITT, never instead of it. The gap between the two estimates is itself informative — it tells you what implementation looks like in practice.
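A toy example shows how the two estimates can diverge even when a program does nothing. All counts are hypothetical: motivated people are assumed to find work 50% of the time and unmotivated people 20% of the time in both arms, and only the motivated attend.

```python
def rate(successes, n):
    return successes / n


# Hypothetical trial in which the program has NO true effect.
treatment = {
    "attended": {"n": 60, "employed": 30},  # motivated, showed up
    "no_show": {"n": 40, "employed": 8},    # unmotivated, did not
}
control = {"n": 100, "employed": 38}        # 60 motivated + 40 unmotivated

assigned_n = sum(group["n"] for group in treatment.values())
assigned_employed = sum(group["employed"] for group in treatment.values())
control_rate = rate(control["employed"], control["n"])

# ITT: everyone assigned to treatment vs. everyone assigned to control.
itt = rate(assigned_employed, assigned_n) - control_rate

# Per-protocol: only those who attended vs. the full control group.
attended = treatment["attended"]
per_protocol = rate(attended["employed"], attended["n"]) - control_rate

print(f"ITT: {itt:+.2f}, per-protocol: {per_protocol:+.2f}")
# ITT: +0.00, per-protocol: +0.12
```

The 12-point per-protocol "effect" is pure selection: attendees are being compared with a control group that still contains the unmotivated.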

Preregistration

Publicly committing to a study's hypotheses, sample size, primary outcomes, and analysis plan before data collection begins.

Preregistration solves a quiet but serious problem in social science: a researcher who can pick outcomes, subgroups, and tests after seeing the data can almost always find *some* significant result. Preregistration commits to the analysis upfront so that what's reported is what was tested, not what worked.

The AEA RCT Registry, AsPredicted, and OSF all host preregistrations. Civic experiments should preregister as a matter of routine — both for the credibility of positive results and the credibility of null ones.

Primary outcome

The single pre-specified outcome, or small set of outcomes, in which the study is designed to detect a change.

Studies that report 14 outcomes and headline the two that came out significant are doing it wrong. Preregistering one (or at most a few) primary outcomes — chosen before data collection — keeps the analysis disciplined.

Secondary outcomes are fine to report, but should be labeled as such. The right way to interpret a 'positive secondary outcome with a null primary' is as an interesting hypothesis to be tested in the next trial, not as evidence the intervention worked.

Replication status

Whether a finding has been re-tested in an independent trial. 'Replicated' is strong evidence; 'open' is the default for newer or one-off trials.

A single trial is a starting point, not a conclusion. Replication is what distinguishes a finding that holds from a finding that happened. The registry tags every entry: 'replicated' (independent trial confirmed the result), 'partial' (some confirmation, some divergence), 'open' (no replication attempt yet), or 'na' (the design doesn't admit easy replication).

The replication crisis in psychology and economics is in part the result of treating one significant result as conclusive. Civic experimentation has the advantage of being naturally replicable — different cities can run the same pilot — but only if the field treats replication as the norm rather than the exception.

Causal inference

Confounding

A third variable that affects both the treatment and the outcome, distorting the apparent causal relationship between them.

Classic example: ice cream sales are correlated with drowning rates, but neither causes the other; both are caused by hot weather (the confounder). In observational data, confounding is everywhere. Random assignment in an RCT is what removes it: because assignment is decided by chance, it is unrelated, in expectation, to any other variable.

Much of empirical methodology — propensity score matching, instrumental variables, regression adjustment — is about handling confounding in non-experimental data. None of these methods is as clean as randomization, but each works under specific assumptions.

Control group

The participants who do not receive the intervention, used as the counterfactual for what would have happened without it.

The whole point of a control group is to estimate the counterfactual — what would have happened to the treated group in the absence of treatment. Random assignment is what justifies treating the control group's outcomes as a valid estimate of that counterfactual.

'Control' doesn't always mean 'nothing.' It often means 'whatever the current practice is' (active control) so that the experiment estimates the marginal effect of the new intervention over the status quo.

External validity

The extent to which a study's findings generalize to other populations, settings, or times beyond the one studied.

A trial that works in San Francisco may not work in Birmingham. A 2008 result may not hold in 2026. A finding for low-income mothers may not transfer to working-age men. External validity is the question of when (and why) we should expect transferability — and when we shouldn't.

Replication in different sites is the most reliable way to test it. Theoretical reasoning about why the effect occurred is also necessary; if the mechanism is universal (cost-benefit calculation), generalization is more credible than if it relied on a specific local condition (a particular community organization, a particular labor market).

Garden of forking paths

The proliferation of analysis choices that lets researchers find a 'significant' result almost regardless of the underlying data.

For any single dataset, a researcher could plausibly choose any of dozens of analysis specifications: which outcomes, which subgroups, which covariates, which transformations, which inclusion criteria. Each is defensible. Together they create what Andrew Gelman and Eric Loken called the 'garden of forking paths': enough flexibility that 'something significant' is almost guaranteed even when the underlying effect is zero.

The corrective is preregistration plus honest reporting of the full analysis space. A study that says 'we tested 14 outcomes and 1 was significant' is making a much weaker claim than a study that says 'we tested our preregistered primary outcome and it was significant.'

Multiple comparisons

The increased risk of false positives that comes from testing many hypotheses; addressed by stricter significance thresholds (e.g. Bonferroni) or false-discovery-rate methods.

If you test 20 outcomes at α = 0.05, you expect 1 to come out significant by chance even if no effect exists. A study that reports the one 'significant' outcome from 20 is making a misleading claim.

Standard adjustments include Bonferroni (divide α by the number of tests; strict but easy), Holm-Bonferroni (sequential, somewhat less strict), and false-discovery-rate methods (control the expected share of false positives among the results declared significant). The right choice depends on the study, but ignoring the issue isn't an option.
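The three adjustments can be sketched in plain Python. This is a minimal illustration: the false-discovery-rate method shown is the Benjamini-Hochberg procedure, chosen here as the most common example, and the five p-values are hypothetical. The family-wise error calculation assumes independent tests.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: test smallest p first, stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (m - rank):
            break
        reject[i] = True
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: reject the k smallest, k = largest rank with p <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


print(f"{familywise_error_rate(20):.2f}")  # chance of any false positive in 20 tests

p_values = [0.001, 0.012, 0.014, 0.035, 0.30]  # hypothetical results for five outcomes
print(sum(bonferroni(p_values)), sum(holm(p_values)), sum(benjamini_hochberg(p_values)))
```

On these p-values Bonferroni keeps one finding, Holm three, and Benjamini-Hochberg four, which reflects the ordering from strictest to most permissive described above.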

Selection bias

Bias that arises when the people who end up in the study (or in a particular group within it) differ systematically from the people you wanted to study.

If a job-training program enrolls only motivated applicants, and then measures their earnings against the general population, the apparent 'effect' is mostly selection — the motivated would have earned more anyway. Random assignment fixes selection into treatment. It doesn't fix selection into the study in the first place, which is a separate concern about whom you can generalize to.

Attrition is a form of selection bias that develops *during* a study: dropouts may differ from completers, and analyzing only completers reintroduces the bias randomization eliminated.

Treatment effect

The causal effect of the intervention on an outcome, estimated as the difference between treated and control groups.

The 'average treatment effect' (ATE) is the effect averaged across the population. The 'average treatment effect on the treated' (ATT) is the effect specifically among those who got the treatment — which may differ from ATE if compliance was selective.

Local average treatment effects (LATE) appear in instrumental-variable and regression-discontinuity designs and apply only to the subset of the population whose treatment status was determined by the instrument or cutoff.