IAF Validation Roadmap — EM Foundation

Why this document exists. The EM-IAF Scientific Review (May 2026) identified findings of three types: specification-level, structural, and empirical. All specification-level findings were resolved through revisions to the IAF v1.0 and Pilot Benchmark v1.0. The structural findings have specified remedies, with the remaining development work scheduled for Phases 1 and 2. Three findings require empirical research that the Foundation has not yet conducted and cannot resolve through specification alone. Publishing this roadmap serves two purposes: it prevents the Foundation from representing the framework as more validated than it is, and it creates a documented, trackable commitment to completing the required research. An organization that identifies its own gaps and specifies a plan to close them is more credible than one that avoids discussing gaps. Assessments conducted before the research is complete are valid at their stated confidence level (L1 Provisional for current Pilot Benchmark work). That is a real constraint, not a disqualification.

I. What Has Already Been Resolved

All specification-level findings — closed

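The following table summarizes all findings from the Scientific Review and their current resolution status. All Specification findings are closed. The three Structural findings still require development work in Phase 1 or Phase 2, and the three Empirical Research findings are pending.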

Finding	Type	Status
CRIT-004 — CI computation method unspecified	Specification	Resolved — bootstrap method specified in IAF §IV
MAJOR-001 — Floor threshold undocumented	Specification	Resolved — derivation documented; sensitivity table required in all reports
MAJOR-002 — Confidence levels single-factor	Specification	Resolved — two-factor system (sample × quality) in IAF §IV
MAJOR-003 — Ordinal scale treated as interval	Specification	Resolved — median aggregation specified in formula; IRT implementation deferred to Phase 2
MAJOR-005 — IRR requirements missing for most dimensions	Specification	Resolved — κ ≥ 0.60 specified for all human-review dimensions in weight table
MAJOR-006 — No test-retest protocol	Specification	Resolved — 20% item repetition and ICC requirement in IAF §VI
MOD-001 — Objective classification overclaims automation	Specification	Resolved — "Objective" relabeled "Structured" throughout
MOD-002 — Citation fabrication double-counted	Specification	Resolved — HAL scope excludes citation fabrication; CIT absorbs it
MOD-003 — EMO/DIG aggregation rule absent	Specification	Resolved — item-count-weighted formula specified in weight table and benchmark
MOD-004 — Governance Compatibility in composite	Specification	Resolved — reclassified as Supplemental; 3% redistributed
MINOR-001 — Floor optimization incentive unlabeled	Specification	Resolved — Marginal Floor Compliance label for [40–60] range
CRIT-002 — Benchmark-IAF structural misalignment	Structural	Partially resolved — aggregation rules specified; new Consistency and Domain Caution dimensions require benchmark development work (Phase 1)
MOD-005 — Published items enable targeted fine-tuning	Structural	Protocol specified in roadmap; implementation requires Shadow Track development (Phase 2)
MOD-006 — Rubric enables construct gaming	Structural	Contrast probe framework specified; item writing in progress (Phase 2)
CRIT-001 — Weights lack empirical basis	Empirical Research	Pending — Delphi study required (Research Program 1 below)
CRIT-003 — Sample size inadequate for L2 confidence	Empirical Research	Pending — Standard Benchmark required (Research Program 2 below)
MAJOR-007 — Accuracy/HAL correlation unmeasured	Empirical Research	Pending — first assessment cycle data required (Research Program 3 below)

II. Research Program 1 — Dimension Weight Calibration

Delphi Expert Consensus Study · CRIT-001

Delphi Expert Weight Calibration Study

Empirically grounding the IAF dimension weights through structured expert consensus

Target: 6 months

The IAF's dimension weights (16%, 16%, 13%, 12%, 10%, 8%, 8%, 7%, 6%, 4%) are theoretically derived. They reflect the Foundation's governance judgment but have not been validated against expert consensus. A weighted composite score built on theoretically-derived weights is internally consistent but cannot claim empirical support for the relative importance of each dimension.

The Delphi method is the appropriate tool: it aggregates judgment from a diverse panel of domain experts through structured, iterative rounds of elicitation, produces inter-expert agreement statistics, and is the standard approach for establishing weights in composite governance indices (used by the OECD Better Life Index, the Global Peace Index, and similar instruments).

Panel Composition

15–20 experts across: AI safety research (3–4), public health and consumer protection (3–4), law and legal information (2–3), policymaking and civic governance (2–3), technical AI systems (2–3), media and public information (2–3). Minimum 3 continents. No more than 2 panelists from any single institution.

Method

Two-round Delphi: Round 1 — blind pairwise importance judgments across all dimension pairs using Analytic Hierarchy Process (AHP) format. Round 2 — panelists receive anonymized Round 1 aggregate results and can revise their judgments with brief written rationale. Final weights derived from geometric mean of Round 2 judgments.
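The aggregation step can be illustrated with a minimal sketch. It assumes each panelist's judgments are stored as a reciprocal AHP pairwise matrix, that panelists are combined by an element-wise geometric mean, and that weights are then read off with the row-geometric-mean method. The roadmap specifies only "geometric mean of Round 2 judgments", so these details, the function names, and the example data are illustrative assumptions rather than the Foundation's specified procedure.

```python
import math

def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def aggregate_judgments(matrices):
    """Element-wise geometric mean of the panelists' pairwise matrices."""
    n = len(matrices[0])
    return [
        [geometric_mean([m[i][j] for m in matrices]) for j in range(n)]
        for i in range(n)
    ]

def priority_weights(matrix):
    """Row-geometric-mean priorities, normalized so they sum to 1."""
    row_gm = [geometric_mean(row) for row in matrix]
    total = sum(row_gm)
    return [g / total for g in row_gm]

# Hypothetical data: two panelists, three dimensions (A, B, C).
# Entry [i][j] > 1 means dimension i was judged more important than j.
panelist_1 = [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]
panelist_2 = [[1, 1, 2], [1, 1, 2], [1 / 2, 1 / 2, 1]]

group = aggregate_judgments([panelist_1, panelist_2])
weights = priority_weights(group)
```

The element-wise geometric mean keeps the group matrix reciprocal (entry [i][j] times entry [j][i] stays equal to 1), which is one common reason it is preferred over an arithmetic mean for pairwise judgments.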

Outputs

Empirically calibrated weights with mean and standard deviation per dimension; inter-expert agreement statistics; weight ranges defining the defensible weight space; specific recommendations for weight adjustments by deployment context (medical, civic, general).

Gate Condition

Study complete when: panel assembled; two rounds administered; geometric mean weights calculated; inter-expert agreement ≥ 0.70 (Kendall's W) achieved. If agreement falls below 0.70 after round 2, extend to round 3 or expand panel.
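The agreement check in the gate condition can be sketched as follows. The sketch assumes each panelist's derived weights have first been converted to a ranking of the dimensions (1 = most important) and that there are no tied ranks; neither step is specified in this roadmap, and the example rankings are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance (no tie correction).

    rankings[k][i] is the rank panelist k gives to dimension i.
    """
    m = len(rankings)        # number of panelists
    n = len(rankings[0])     # number of dimensions
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical rankings from three panelists over four dimensions.
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]

w = kendalls_w(rankings)
meets_gate = w >= 0.70  # threshold taken from the gate condition above
```

W ranges from 0 (no agreement) to 1 (identical rankings). If tied ranks occur in practice, a tie-corrected form of the statistic would be needed.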

What Changes After This Research Is Complete

IAF composite formula updated with Delphi-derived weights, published as IAF v1.1
Weight sensitivity range requirement maintained, with the ±20% perturbation range replaced by inter-expert standard deviation ranges
All previously published L1 Provisional assessments remain valid at L1 — weight update does not retroactively invalidate them
The phrase "weights are theoretically derived" is replaced by "weights derived from Delphi study of N experts, [citation]"

Interim practice: Until the Delphi study is complete, all published assessment reports must include the weight sensitivity range (composite score under ±20% weight perturbation). This is the honest representation of what the score means given current weight uncertainty. It does not invalidate assessments — it correctly characterizes their precision.
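One way to compute such a range is sketched below. It assumes the composite is a weighted mean of dimension scores on a 0–100 scale, and that each weight is perturbed one at a time by ±20% of its own value with the weights renormalized afterwards. The perturbation scheme is not specified in this roadmap, so treat this as an illustration of the idea rather than the Foundation's published tool. The dimension scores are hypothetical; the weights are the ten listed above.

```python
BASE_WEIGHTS = [0.16, 0.16, 0.13, 0.12, 0.10, 0.08, 0.08, 0.07, 0.06, 0.04]

def composite(scores, weights):
    """Weighted mean of dimension scores; weights are renormalized."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

def sensitivity_range(scores, weights, perturbation=0.20):
    """Min and max composite when each weight is scaled by 1 -/+ perturbation."""
    results = [composite(scores, weights)]
    for i in range(len(weights)):
        for factor in (1 - perturbation, 1 + perturbation):
            perturbed = list(weights)
            perturbed[i] = weights[i] * factor
            results.append(composite(scores, perturbed))
    return min(results), max(results)

# Hypothetical dimension scores for one assessed system.
scores = [82, 75, 68, 90, 55, 71, 64, 88, 60, 79]

base = composite(scores, BASE_WEIGHTS)
low, high = sensitivity_range(scores, BASE_WEIGHTS)
```

A report would then state the composite as `base` with the range `low` to `high`. A joint perturbation of several weights at once would generally give a wider range than this one-at-a-time scheme.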

III. Research Program 2 — Standard Benchmark Development

300-Item Benchmark with Structural Fixes · CRIT-002 and CRIT-003

Standard Benchmark Development

Expanding from 100 pilot items to 300 standard items with full IAF coverage

Target: Q4 2026

The Pilot Benchmark achieves S1 sample level (4–10 items per category), producing dimensional confidence intervals of ±23–37 points — too wide to support meaningful cross-system comparison. The Standard Benchmark expands to S2 sample level (25 items per dimension, 300 total) and resolves the structural misalignment between benchmark categories and IAF dimensions.
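The effect of sample size on interval width can be illustrated with a generic percentile bootstrap over the median item score. This is a sketch only: the actual procedure is the one specified in IAF §IV, which is not reproduced here, and the item scores, resample count, and function name below are hypothetical.

```python
import random
import statistics

def bootstrap_ci(item_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median item score."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(item_scores)
    medians = sorted(
        statistics.median(rng.choices(item_scores, k=n))
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return medians[lo_idx], medians[hi_idx]

# Hypothetical item scores on a 0-100 scale.
pilot_items = [40, 85, 60, 95, 30, 70]    # S1-sized sample (6 items)
standard_items = pilot_items * 4 + [65]   # S2-sized sample (25 items)

pilot_low, pilot_high = bootstrap_ci(pilot_items)
std_low, std_high = bootstrap_ci(standard_items)
```

With more items per dimension the resampled medians typically vary less, so the interval for the 25-item sample would usually be narrower than for the 6-item sample.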

Two new IAF dimensions must also be formally defined and incorporated before the Standard Benchmark can be complete: Behavioral Consistency (measuring within-system response stability, replacing Governance Compatibility in the composite) and Domain Caution (absorbing Legal Ambiguity and Medical Caution behaviors currently unmapped). These dimension definitions are in scope for this research program.

New Item Count

300 standard items (25 per IAF dimension × 12 dimensions) + 60 unpublished Shadow Track items (20% reserve for anti-gaming injection). Total: 360 items written, 300 published, 60 held in reserve.

New Dimensions Required

Behavioral Consistency: 25 items measuring response stability across equivalent prompt paraphrases, framing variants, and cross-session repetitions. Domain Caution: 25 items covering legal ambiguity, medical caution, and similar high-stakes domain behaviors. Both require formal dimension definitions in IAF before items can be written.

Item Discrimination Analysis

After the first assessment cycle using the Standard Benchmark: compute item-total correlations for all items. Remove or revise items with r < 0.20. For binary-outcome items, flag items with pass rate p > 0.90 (trivially easy) or p < 0.10 (trivially hard) for replacement. This produces a refined Standard Benchmark with documented item quality.
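The screening rules above can be sketched for binary-outcome items as follows. The sketch assumes a corrected item-total correlation (each item is correlated with the total of the other items, so it is not correlated with itself) and skips the correlation check for items with no variance. Both choices, along with the response data, are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation, or None if either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def flag_items(responses, r_min=0.20, p_low=0.10, p_high=0.90):
    """responses[s][i] is 1 if system s passed item i, else 0."""
    flags = {}
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # total without item i
        pass_rate = sum(item) / len(item)
        r = pearson(item, rest)
        reasons = []
        if pass_rate > p_high:
            reasons.append("trivially easy")
        if pass_rate < p_low:
            reasons.append("trivially hard")
        if r is not None and r < r_min:
            reasons.append("low discrimination")
        if reasons:
            flags[i] = reasons
    return flags

# Hypothetical pass/fail results: six systems (rows) by five items (columns).
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]

flags = flag_items(responses)
```

In this example item 0 is passed by every system and item 4 is passed mainly by the systems that do worst on the other items, so both are flagged.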

Contrast Probe Development

5–10 behavioral contrast probe pairs per dimension (60–120 pairs total across the 12 dimensions). Contrast probes are not included in the composite score but are administered alongside the standard items to detect construct gaming. A system that scores substantially better on standard items than on structurally similar contrast probes raises a gaming flag for review.

Gate Conditions for L2 Confidence Level (S2 Sample Level)

All 12 IAF dimension definitions published (including Behavioral Consistency and Domain Caution)
300 standard items written, reviewed for bias and ideological balance by multi-party panel, and piloted on at least 3 AI systems before release
Shadow Track 60 items complete and secured
Assessment-intent registration protocol live
Composite CI and weight sensitivity computation tools published at emfoundation.net/iaf-tools

Until this research is complete: The Pilot Benchmark is labeled L1 Provisional on all materials. Assessment reports using the Pilot Benchmark must display: "L1 PROVISIONAL — INTERNAL USE ONLY. Dimensional confidence intervals ±23–37 points. Not valid for external assessment claims or deployment guidance." This is accurate and appropriate. It does not prevent the Foundation from doing valuable internal work with the Pilot Benchmark — including assessor training, methodology calibration, and developing the expertise needed to conduct rigorous assessments once the Standard Benchmark is available.

IV. Research Program 3 — Accuracy / Hallucination Correlation Study

Inter-Dimensional Correlation Measurement · MAJOR-007

Accuracy–Hallucination Resistance Correlation Study

Measuring whether the two highest-weighted dimensions are redundant

Target: First Assessment Cycle

Accuracy (16%) and Hallucination Resistance (16%) together represent 32% of the IAF composite weight. If these two dimensions are highly correlated (ρ > 0.65 across a sample of assessed systems), the composite score effectively over-represents the construct they share (the "not being factually wrong" cluster) relative to other dimensions. Published AI benchmark research suggests high correlation between these constructs is probable, but this is a prediction, not a measurement.

Required Data

Dimensional scores from at least 5 assessed AI systems, produced under the same IAF version. Assessments can be L1 Provisional — the correlation analysis requires only that the assessments use the same methodology, not that they meet S2 sample standards.

Analysis

Compute Pearson correlation between Accuracy and Hallucination Resistance dimensional scores across all assessed systems. If ρ > 0.65: evaluate merging into a single Factual Integrity dimension OR applying a correlation penalty to the combined weight using: effective_weight = (w₁ + w₂) × (2 / (1 + ρ)) / 2. Present findings and recommendation to the Foundation's Scientific Advisory Panel for decision.

Decision Criteria

ρ < 0.50: no change needed. 0.50 ≤ ρ ≤ 0.65: monitor and document. ρ > 0.65: weight adjustment or dimension merger required before the next IAF major version. ρ > 0.85: strong case for merging into a single Factual Integrity dimension with Accuracy and Hallucination Resistance as sub-scores.
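The analysis and decision bands can be sketched together. The correlation penalty is the formula given under Analysis, which simplifies to (w₁ + w₂) / (1 + ρ). The per-system scores and the function names are hypothetical, and the band labels paraphrase the criteria above.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def effective_weight(w1, w2, rho):
    """Roadmap formula: (w1 + w2) * (2 / (1 + rho)) / 2. Undefined at rho = -1."""
    return (w1 + w2) * (2 / (1 + rho)) / 2

def decision(rho):
    """Map a measured correlation to the roadmap's decision bands."""
    if rho > 0.85:
        return "strong case for merger"
    if rho > 0.65:
        return "weight adjustment or merger required"
    if rho >= 0.50:
        return "monitor and document"
    return "no change needed"

# Hypothetical dimensional scores for five assessed systems.
accuracy = [62, 70, 75, 81, 90]
hallucination_resistance = [58, 73, 70, 85, 88]

rho = pearson(accuracy, hallucination_resistance)
band = decision(rho)
adjusted = effective_weight(0.16, 0.16, 0.65)  # about 0.194, down from 0.32
```

The formula leaves the combined weight at 32% when ρ = 0 and halves it to 16% when ρ = 1, so the penalty scales with how much of the two dimensions' variation is shared.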

Extended Analysis

Also compute correlations between other plausibly-correlated dimension pairs: Uncertainty Disclosure and Citation Integrity; Manipulation Resistance and Civic Responsibility. Document all inter-dimensional correlations as the Foundation's internal validity dataset. Publish the correlation matrix alongside the IAF v2.0 release.

Gate Condition

5+ assessed systems with IAF scores available
Correlation matrix computed and reviewed by Scientific Advisory Panel
Decision on weight adjustment or dimension merger documented and published
Any resulting weight changes incorporated in IAF v1.1 or v2.0 as appropriate

Until this research is complete: The current Accuracy and Hallucination Resistance weights (16%/16%) should be treated as potentially inflated by correlation. This does not invalidate assessments — it means that a composite score difference driven primarily by these two dimensions may not represent as much genuine behavioral diversity as it appears to. Assessment reports should include a note: "Accuracy and Hallucination Resistance inter-dimensional correlation has not yet been empirically measured. These two dimensions together represent 32% of composite weight. See IAF Validation Roadmap for the measurement plan."

V. Summary Gate Conditions by Use Level

What must be true before the framework can be used for each purpose

Use	Current Status	Gate Conditions
Internal assessor training and calibration	Permitted now	None — Pilot Benchmark at L1 Provisional is appropriate for this purpose
Methodology development and documentation	Permitted now	None
Pilot assessments for internal Foundation learning	Permitted now	Label all results L1 Provisional. Include CI ranges. Include weight sensitivity ranges. Do not publish externally.
Published research with explicit L1 limitations	Conditional	Assessors must confirm understanding that CI widths are ±23–37 pts per dimension. Results must be labeled L1 Provisional throughout. No comparative ranking claims between systems.
External assessment with published scores (L2)	Not yet permitted	Standard Benchmark complete (300 items, 25/dimension); IRR protocol operational; bootstrap CI tools published; assessment-intent registration live. All RP2 gate conditions met.
Deployment guidance for moderate-stakes applications (L3)	Not yet permitted	L2 gate conditions + Delphi weight study complete (RP1) + item discrimination analysis complete + IRT pilot calibration conducted
Consequential certification or high-stakes deployment guidance (L4–L5)	Not yet permitted	All three research programs complete + independent replication + peer-reviewed validation publication + known-groups validity evidence

VI. Governance Commitment

The Foundation commits to the following on this roadmap:

Research Program 1 (Delphi weight calibration) will be commissioned within 90 days of the Foundation's first external assessment cycle commencing, with results published within 9 months of commissioning.
Research Program 2 (Standard Benchmark) will be completed before any assessment results are published externally or used in deployment guidance, regardless of any interim demand for published scores. The Pilot Benchmark L1 Provisional restriction is not negotiable until RP2 gate conditions are met.
Research Program 3 (correlation study) will be conducted after the first 5 assessments are complete, with results published within 30 days of analysis completion.
This roadmap will be updated publicly when any gate condition is met, and the IAF version will be updated accordingly.
If the Foundation publishes any external assessment result before RP2 is complete, it will be because the Foundation has made a deliberate decision to publish at L1 Provisional with full disclosure of the limitations described in this document — not because those limitations have been resolved or forgotten.

The Foundation's credibility is not built by claiming the framework is complete. It is built by being specific about what is and is not complete — and by doing the work to close those gaps in the open, with documented commitments that can be checked against our actual behavior.

IAF Validation Roadmap v1.0 · Published May 2026 · Updated when gate conditions are met