Intelligence Assessment Framework — Governance Document

IAF Validation Roadmap

This roadmap is the research plan for the three findings from the EM-IAF Scientific Review Report that still require empirical work. It is published alongside the IAF as a commitment to the specific work required before the framework is suitable for consequential use.

Why this document exists. The EM-IAF Scientific Review (May 2026) identified findings of three types: specification, structural, and empirical. All specification-level findings were resolved through revisions to the IAF v1.0 and Pilot Benchmark v1.0. The structural findings have specified remedies, with remaining implementation work assigned to Phases 1 and 2. Three findings require empirical research that the Foundation has not yet conducted and cannot resolve through specification alone. Publishing this roadmap serves two purposes: it prevents the Foundation from representing the framework as more validated than it is, and it creates a documented, trackable commitment to completing the required research. An organization that identifies its own gaps and specifies a plan to close them is more credible than one that avoids discussing gaps. Assessments conducted before the research is complete are valid at their stated confidence level (L1 Provisional for current Pilot Benchmark work). That is a real constraint, not a disqualification.

I. What Has Already Been Resolved

All specification-level findings — closed

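The following table summarizes all findings from the Scientific Review and their current resolution status. The three Empirical Research findings are still pending, and three structural findings (CRIT-002, MOD-005, MOD-006) have remedies specified but implementation still under way.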

Finding | Type | Status
CRIT-004: CI computation method unspecified | Specification | Resolved: bootstrap method specified in IAF §IV
MAJOR-001: Floor threshold undocumented | Specification | Resolved: derivation documented; sensitivity table required in all reports
MAJOR-002: Confidence levels single-factor | Specification | Resolved: two-factor system (sample × quality) in IAF §IV
MAJOR-003: Ordinal scale treated as interval | Specification | Resolved: median aggregation specified in formula; IRT implementation deferred to Phase 2
MAJOR-005: IRR requirements missing for most dimensions | Specification | Resolved: κ ≥ 0.60 specified for all human-review dimensions in weight table
MAJOR-006: No test-retest protocol | Specification | Resolved: 20% item repetition and ICC requirement in IAF §VI
MOD-001: Objective classification overclaims automation | Specification | Resolved: "Objective" relabeled "Structured" throughout
MOD-002: Citation fabrication double-counted | Specification | Resolved: HAL scope excludes citation fabrication; CIT absorbs it
MOD-003: EMO/DIG aggregation rule absent | Specification | Resolved: item-count-weighted formula specified in weight table and benchmark
MOD-004: Governance Compatibility in composite | Specification | Resolved: reclassified as Supplemental; 3% redistributed
MINOR-001: Floor optimization incentive unlabeled | Specification | Resolved: Marginal Floor Compliance label for [40–60] range
CRIT-002: Benchmark-IAF structural misalignment | Structural | Partially resolved: aggregation rules specified; new Consistency and Domain Caution dimensions require benchmark development work (Phase 1)
MOD-005: Published items enable targeted fine-tuning | Structural | Protocol specified in roadmap; implementation requires Shadow Track development (Phase 2)
MOD-006: Rubric enables construct gaming | Structural | Contrast probe framework specified; item writing in progress (Phase 2)
CRIT-001: Weights lack empirical basis | Empirical Research | Pending: Delphi study required (Research Program 1 below)
CRIT-003: Sample size inadequate for L2 confidence | Empirical Research | Pending: Standard Benchmark required (Research Program 2 below)
MAJOR-007: Accuracy/HAL correlation unmeasured | Empirical Research | Pending: first assessment cycle data required (Research Program 3 below)

II. Research Program 1 — Dimension Weight Calibration

Delphi Expert Consensus Study · CRIT-001

Delphi Expert Weight Calibration Study

Empirically grounding the IAF dimension weights through structured expert consensus

Target: 6 months

The IAF's dimension weights (16%, 16%, 13%, 12%, 10%, 8%, 8%, 7%, 6%, 4%) are theoretically derived. They reflect the Foundation's governance judgment but have not been validated against expert consensus. A weighted composite score built on theoretically derived weights is internally consistent but cannot claim empirical support for the relative importance of each dimension.

The Delphi method is the appropriate tool: it aggregates judgment from a diverse panel of domain experts through structured, iterative rounds of elicitation, produces inter-expert agreement statistics, and is the standard approach for establishing weights in composite governance indices (used by the OECD Better Life Index, the Global Peace Index, and similar instruments).

Panel Composition

15–20 experts across: AI safety research (3–4), public health and consumer protection (3–4), law and legal information (2–3), policymaking and civic governance (2–3), technical AI systems (2–3), media and public information (2–3). Minimum 3 continents. No more than 2 panelists from any single institution.

Method

Two-round Delphi: Round 1 — blind pairwise importance judgments across all dimension pairs using Analytic Hierarchy Process (AHP) format. Round 2 — panelists receive anonymized Round 1 aggregate results and can revise their judgments with brief written rationale. Final weights derived from geometric mean of Round 2 judgments.
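The aggregation step described above can be sketched as follows. This is a minimal illustration, not the Foundation's specified procedure: it assumes each panelist's Round 2 judgments form a reciprocal pairwise comparison matrix, combines panelists by element-wise geometric mean, and then derives weights with the row geometric-mean method (an assumption here; the principal-eigenvector method is another common AHP choice). The function names and the two three-dimension panelist matrices are hypothetical.

```python
import math

def aggregate_judgments(matrices):
    """Element-wise geometric mean of the panelists' pairwise matrices."""
    n = len(matrices[0])
    k = len(matrices)
    return [
        [math.exp(sum(math.log(m[i][j]) for m in matrices) / k) for j in range(n)]
        for i in range(n)
    ]

def weights_from_matrix(matrix):
    """Row geometric-mean prioritization, normalized so weights sum to 1."""
    n = len(matrix)
    row_gm = [math.exp(sum(math.log(v) for v in row) / n) for row in matrix]
    total = sum(row_gm)
    return [g / total for g in row_gm]

# Hypothetical Round 2 judgments from two panelists over three dimensions.
# Entry [i][j] states how many times more important dimension i is than j.
panelist_1 = [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]
panelist_2 = [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]]

weights = weights_from_matrix(aggregate_judgments([panelist_1, panelist_2]))
print([round(w, 3) for w in weights])
```

The geometric mean is used for aggregation because it preserves the reciprocal property of the matrices: if dimension A is judged x times as important as B, the aggregate still treats B as 1/x times as important as A.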

Outputs

Empirically calibrated weights with mean and standard deviation per dimension; inter-expert agreement statistics; weight ranges defining the defensible weight space; specific recommendations for weight adjustments by deployment context (medical, civic, general).

Gate Condition

Study complete when: panel assembled; two rounds administered; geometric mean weights calculated; inter-expert agreement ≥ 0.70 (Kendall's W) achieved. If agreement falls below 0.70 after round 2, extend to round 3 or expand panel.
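As an illustration of the agreement gate, the sketch below computes Kendall's W from panelists' rankings of the dimensions and compares it with the 0.70 threshold. It assumes each panelist's weights have been converted to a strict ranking (1 = most important) with no ties, so the tie-corrected form of W is not needed. The function names and the sample rankings are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's W for m raters who each rank the same n items (no ties)."""
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

def agreement_gate_met(w, threshold=0.70):
    """Gate condition from the roadmap: agreement of at least 0.70."""
    return w >= threshold

# Three hypothetical panelists ranking four dimensions.
rankings = [
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
]
w = kendalls_w(rankings)
print(round(w, 3), agreement_gate_met(w))
```

W ranges from 0 (no agreement) to 1 (identical rankings), so a value below 0.70 after Round 2 would trigger the Round 3 or panel-expansion branch of the gate.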

What Changes After This Research Is Complete
  • IAF composite formula updated with Delphi-derived weights, published as IAF v1.1
  • Weight sensitivity range requirement maintained, with the interim ±20% perturbation ranges replaced by inter-expert standard deviation ranges
  • All previously published L1 Provisional assessments remain valid at L1 — weight update does not retroactively invalidate them
  • The phrase "weights are theoretically derived" is replaced by "weights derived from Delphi study of N experts, [citation]"

Interim practice: Until the Delphi study is complete, all published assessment reports must include the weight sensitivity range (composite score under ±20% weight perturbation). This is the honest representation of what the score means given current weight uncertainty. It does not invalidate assessments — it correctly characterizes their precision.
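One way to compute such a range is sketched below. The roadmap does not define the perturbation procedure, so this is an assumption: each weight is scaled by ±20% one at a time, the weights are renormalized, and the lowest and highest resulting composites are reported. The composite is treated as a plain weighted mean, which omits the floor and median-aggregation rules of the actual IAF formula. The weights are those listed in this document; the dimension scores and function names are hypothetical.

```python
def composite(weights, scores):
    """Weighted mean of dimension scores; dividing by the sum renormalizes."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def sensitivity_range(weights, scores, perturbation=0.20):
    """Min and max composite when each weight is scaled by +/- perturbation."""
    results = [composite(weights, scores)]
    for i in range(len(weights)):
        for factor in (1 - perturbation, 1 + perturbation):
            perturbed = list(weights)
            perturbed[i] = weights[i] * factor
            results.append(composite(perturbed, scores))
    return min(results), max(results)

iaf_weights = [16, 16, 13, 12, 10, 8, 8, 7, 6, 4]      # percentages from the IAF
scores = [72, 65, 80, 58, 90, 77, 61, 84, 70, 66]      # hypothetical scores

low, high = sensitivity_range(iaf_weights, scores)
print(round(composite(iaf_weights, scores), 2), round(low, 2), round(high, 2))
```

A report would then show the composite together with the low and high values, so readers can see how much of the score depends on the current weight choices.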

III. Research Program 2 — Standard Benchmark Development

300-Item Benchmark with Structural Fixes · CRIT-002 and CRIT-003

Standard Benchmark Development

Expanding from 100 pilot items to 300 standard items with full IAF coverage

Target: Q4 2026

The Pilot Benchmark achieves S1 sample level (4–10 items per category), producing dimensional confidence intervals of ±23–37 points — too wide to support meaningful cross-system comparison. The Standard Benchmark expands to S2 sample level (25 items per dimension, 300 total) and resolves the structural misalignment between benchmark categories and IAF dimensions.

Two new IAF dimensions must also be formally defined and incorporated before the Standard Benchmark can be complete: Behavioral Consistency (measuring within-system response stability, replacing Governance Compatibility in the composite) and Domain Caution (absorbing Legal Ambiguity and Medical Caution behaviors currently unmapped). These dimension definitions are in scope for this research program.

New Item Count

300 standard items (25 per IAF dimension × 12 dimensions) + 60 unpublished Shadow Track items (20% reserve for anti-gaming injection). Total: 360 items written, 300 published, 60 held in reserve.

New Dimensions Required

Behavioral Consistency: 25 items measuring response stability across equivalent prompt paraphrases, framing variants, and cross-session repetitions. Domain Caution: 25 items covering legal ambiguity, medical caution, and similar high-stakes domain behaviors. Both require formal dimension definitions in IAF before items can be written.

Item Discrimination Analysis

After the first assessment cycle using the Standard Benchmark: compute item-total correlations for all items. Remove or revise items with r < 0.20. For binary-outcome items, flag items with pass rate p > 0.90 (trivially easy) or p < 0.10 (trivially hard) for replacement. This produces a refined Standard Benchmark with documented item quality.
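The screening rules above can be sketched as follows, assuming every item is binary (1 = pass, 0 = fail) and using the uncorrected item-total correlation; a corrected variant that excludes the item from the total is a common alternative. Treating an undefined correlation (an item with no variance) as low discrimination is also an assumption. The response matrix and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns None when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def review_items(responses, min_r=0.20, easy=0.90, hard=0.10):
    """responses[s][i] is the 0/1 result of system s on item i."""
    totals = [sum(row) for row in responses]
    flags = {}
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        reasons = []
        r = pearson(item, totals)
        if r is None or r < min_r:          # undefined r counted as low
            reasons.append("low discrimination")
        pass_rate = sum(item) / len(item)
        if pass_rate > easy:
            reasons.append("trivially easy")
        elif pass_rate < hard:
            reasons.append("trivially hard")
        if reasons:
            flags[i] = reasons
    return flags

# Five hypothetical systems answering four hypothetical items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
flags = review_items(responses)
print(flags)
```

With only a handful of assessed systems these statistics are very noisy, so flagged items are candidates for review rather than automatic removal.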

Contrast Probe Development

5–10 behavioral contrast probe pairs per dimension (60–120 pairs in total across the 12 dimensions). Contrast probes are not included in the composite score but are administered alongside the standard items to detect construct gaming. A system scoring substantially better on standard items than on structurally similar contrast probes raises a gaming flag for review.
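The gaming check can be illustrated with a simple gap comparison. The roadmap does not quantify "substantially better", so the 10-point threshold below is purely a placeholder, as are the function name and the sample scores.

```python
from statistics import mean

def gaming_check(standard_scores, probe_scores, max_gap=10.0):
    """Flag a dimension when standard items outscore contrast probes by more
    than max_gap points. max_gap is a hypothetical threshold."""
    gap = mean(standard_scores) - mean(probe_scores)
    return gap > max_gap, gap

# Hypothetical per-item scores for one dimension of one assessed system.
flagged, gap = gaming_check([82, 78, 90, 85], [60, 66, 71, 63])
print(flagged, gap)
```

A flag here only opens a review; it does not by itself establish that a system was tuned to the published items.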

Gate Conditions for S2 Confidence Level
  • All 12 IAF dimension definitions published (including Behavioral Consistency and Domain Caution)
  • 300 standard items written, reviewed for bias and ideological balance by multi-party panel, and piloted on at least 3 AI systems before release
  • Shadow Track 60 items complete and secured
  • Assessment-intent registration protocol live
  • Composite CI and weight sensitivity computation tools published at emfoundation.net/iaf-tools

Until this research is complete: The Pilot Benchmark is labeled L1 Provisional on all materials. Assessment reports using the Pilot Benchmark must display: "L1 PROVISIONAL — INTERNAL USE ONLY. Dimensional confidence intervals ±23–37 points. Not valid for external assessment claims or deployment guidance." This is accurate and appropriate. It does not prevent the Foundation from doing valuable internal work with the Pilot Benchmark — including assessor training, methodology calibration, and developing the expertise needed to conduct rigorous assessments once the Standard Benchmark is available.

IV. Research Program 3 — Accuracy / Hallucination Correlation Study

Inter-Dimensional Correlation Measurement · MAJOR-007

Accuracy–Hallucination Resistance Correlation Study

Measuring whether the two highest-weighted dimensions are redundant

Target: First Assessment Cycle

Accuracy (16%) and Hallucination Resistance (16%) together represent 32% of the IAF composite weight. If these two dimensions are highly correlated (ρ > 0.65 across a sample of assessed systems), the composite score effectively over-represents their shared construct (the "not being factually wrong" cluster) relative to other dimensions. Published AI benchmark research suggests high correlation between these constructs is probable, but this is a prediction, not a measurement.

Required Data

Dimensional scores from at least 5 assessed AI systems, produced under the same IAF version. Assessments can be L1 Provisional — the correlation analysis requires only that the assessments use the same methodology, not that they meet S2 sample standards.

Analysis

Compute Pearson correlation between Accuracy and Hallucination Resistance dimensional scores across all assessed systems. If ρ > 0.65: evaluate merging into a single Factual Integrity dimension OR applying a correlation penalty to the combined weight using: effective_weight = (w₁ + w₂) × (2 / (1 + ρ)) / 2. Present findings and recommendation to the Foundation's Scientific Advisory Panel for decision.
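A worked sketch of this analysis follows. The correlation penalty is the formula stated above, which simplifies algebraically to (w₁ + w₂) / (1 + ρ): it leaves the combined weight unchanged at ρ = 0 and halves it at ρ = 1. The five pairs of dimensional scores and the function names are hypothetical, and the roadmap does not say how any weight released by the penalty would be redistributed, so that step is not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def effective_weight(w1, w2, rho):
    """Correlation-penalized combined weight, as given in the roadmap."""
    return (w1 + w2) * (2 / (1 + rho)) / 2

# Hypothetical dimensional scores for five assessed systems.
accuracy = [62, 70, 75, 81, 88]
hallucination_resistance = [58, 72, 70, 85, 90]

rho = pearson(accuracy, hallucination_resistance)
print(round(rho, 3), round(effective_weight(0.16, 0.16, rho), 4))

# At the 0.65 decision threshold the combined 32% shrinks to about 19.4%.
print(round(effective_weight(0.16, 0.16, 0.65), 4))
```

With only five systems the confidence interval around ρ is wide, which is one reason the result goes to the Scientific Advisory Panel for a decision rather than triggering an automatic change.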

Decision Criteria

  • ρ < 0.50: no change needed.
  • ρ from 0.50 to 0.65: monitor and document.
  • ρ > 0.65: weight adjustment or dimension merger required before the next IAF major version.
  • ρ > 0.85: strong case for merging into a single Factual Integrity dimension, with Accuracy and Hallucination Resistance as sub-scores.
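The decision bands can be written as a small lookup, shown here as a sketch. It assumes the boundary value 0.65 falls in the "monitor" band, since the adjustment band is stated as strictly greater than 0.65; the function name and return strings are illustrative.

```python
def decision_band(rho):
    """Map a measured correlation to the roadmap's decision criteria."""
    if rho > 0.85:
        return "strong case for merger into Factual Integrity"
    if rho > 0.65:
        return "weight adjustment or dimension merger required"
    if rho >= 0.50:
        return "monitor and document"
    return "no change needed"

for value in (0.40, 0.50, 0.65, 0.70, 0.90):
    print(value, decision_band(value))
```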

Extended Analysis

Also compute correlations between other plausibly-correlated dimension pairs: Uncertainty Disclosure and Citation Integrity; Manipulation Resistance and Civic Responsibility. Document all inter-dimensional correlations as the Foundation's internal validity dataset. Publish the correlation matrix alongside the IAF v2.0 release.

Gate Condition
  • 5+ assessed systems with IAF scores available
  • Correlation matrix computed and reviewed by Scientific Advisory Panel
  • Decision on weight adjustment or dimension merger documented and published
  • Any resulting weight changes incorporated in IAF v1.1 or v2.0 as appropriate

Until this research is complete: The current Accuracy and Hallucination Resistance weights (16%/16%) should be treated as potentially inflated by correlation. This does not invalidate assessments — it means that a composite score difference driven primarily by these two dimensions may not represent as much genuine behavioral diversity as it appears to. Assessment reports should include a note: "Accuracy and Hallucination Resistance inter-dimensional correlation has not yet been empirically measured. These two dimensions together represent 32% of composite weight. See IAF Validation Roadmap for the measurement plan."

V. Summary Gate Conditions by Use Level

What must be true before the framework can be used for each purpose
Use | Current Status | Gate Conditions
Internal assessor training and calibration | Permitted now | None. Pilot Benchmark at L1 Provisional is appropriate for this purpose.
Methodology development and documentation | Permitted now | None
Pilot assessments for internal Foundation learning | Permitted now | Label all results L1 Provisional. Include CI ranges. Include weight sensitivity ranges. Do not publish externally.
Published research with explicit L1 limitations | Conditional | Assessors must confirm understanding that CI widths are ±23–37 pts per dimension. Results must be labeled L1 Provisional throughout. No comparative ranking claims between systems.
External assessment with published scores (L2) | Not yet permitted | Standard Benchmark complete (300 items, 25/dimension); IRR protocol operational; bootstrap CI tools published; assessment-intent registration live. All RP2 gate conditions met.
Deployment guidance for moderate-stakes applications (L3) | Not yet permitted | L2 gate conditions + Delphi weight study complete (RP1) + item discrimination analysis complete + IRT pilot calibration conducted
Consequential certification or high-stakes deployment guidance (L4–L5) | Not yet permitted | All three research programs complete + independent replication + peer-reviewed validation publication + known-groups validity evidence

VI. Governance Commitment

The Foundation commits to the following on this roadmap:

  1. Research Program 1 (Delphi weight calibration) will be commissioned within 90 days of the Foundation's first external assessment cycle commencing, with results published within 9 months of commissioning.
  2. Research Program 2 (Standard Benchmark) will be completed before any assessment results are published externally or used in deployment guidance, regardless of any interim demand for published scores. The Pilot Benchmark L1 Provisional restriction is not negotiable until RP2 gate conditions are met.
  3. Research Program 3 (correlation study) will be conducted after the first 5 assessments are complete, with results published within 30 days of analysis completion.
  4. This roadmap will be updated publicly when any gate condition is met, and the IAF version will be updated accordingly.
  5. If the Foundation publishes any external assessment result before RP2 is complete, it will be because the Foundation has made a deliberate decision to publish at L1 Provisional with full disclosure of the limitations described in this document — not because those limitations have been resolved or forgotten.

The Foundation's credibility is not built by claiming the framework is complete. It is built by being specific about what is and is not complete — and by doing the work to close those gaps in the open, with documented commitments that can be checked against our actual behavior.

IAF Validation Roadmap v1.0 · Published May 2026 · Updated when gate conditions are met