This research plan addresses the three empirically pending findings from the EM-IAF Scientific Review Report. It is published alongside the IAF as a commitment to the specific work required before the framework is suitable for consequential use.
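The following table summarizes all findings from the Scientific Review and their current resolution status. The three Empirical Research items remain fully pending; the three Structural items are partially resolved or specified, with remaining work scheduled for Phases 1 and 2.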
| Finding | Type | Status |
|---|---|---|
| CRIT-004 — CI computation method unspecified | Specification | Resolved — bootstrap method specified in IAF §IV |
| MAJOR-001 — Floor threshold undocumented | Specification | Resolved — derivation documented; sensitivity table required in all reports |
| MAJOR-002 — Confidence levels single-factor | Specification | Resolved — two-factor system (sample × quality) in IAF §IV |
| MAJOR-003 — Ordinal scale treated as interval | Specification | Resolved — median aggregation specified in formula; IRT implementation deferred to Phase 2 |
| MAJOR-005 — IRR requirements missing for most dimensions | Specification | Resolved — κ ≥ 0.60 specified for all human-review dimensions in weight table |
| MAJOR-006 — No test-retest protocol | Specification | Resolved — 20% item repetition and ICC requirement in IAF §VI |
| MOD-001 — Objective classification overclaims automation | Specification | Resolved — "Objective" relabeled "Structured" throughout |
| MOD-002 — Citation fabrication double-counted | Specification | Resolved — HAL scope excludes citation fabrication; CIT absorbs it |
| MOD-003 — EMO/DIG aggregation rule absent | Specification | Resolved — item-count-weighted formula specified in weight table and benchmark |
| MOD-004 — Governance Compatibility in composite | Specification | Resolved — reclassified as Supplemental; 3% redistributed |
| MINOR-001 — Floor optimization incentive unlabeled | Specification | Resolved — Marginal Floor Compliance label for [40–60] range |
| CRIT-002 — Benchmark-IAF structural misalignment | Structural | Partially resolved — aggregation rules specified; new Consistency and Domain Caution dimensions require benchmark development work (Phase 1) |
| MOD-005 — Published items enable targeted fine-tuning | Structural | Protocol specified in roadmap; implementation requires Shadow Track development (Phase 2) |
| MOD-006 — Rubric enables construct gaming | Structural | Contrast probe framework specified; item writing in progress (Phase 2) |
| CRIT-001 — Weights lack empirical basis | Empirical Research | Pending — Delphi study required (Research Program 1 below) |
| CRIT-003 — Sample size inadequate for L2 confidence | Empirical Research | Pending — Standard Benchmark required (Research Program 2 below) |
| MAJOR-007 — Accuracy/HAL correlation unmeasured | Empirical Research | Pending — first assessment cycle data required (Research Program 3 below) |
Research Program 1: Empirically grounding the IAF dimension weights through structured expert consensus
The IAF's dimension weights (16%, 16%, 13%, 12%, 10%, 8%, 8%, 7%, 6%, 4%) are theoretically derived. They reflect the Foundation's governance judgment but have not been validated against expert consensus. A weighted composite score built on theoretically derived weights is internally consistent but cannot claim empirical support for the relative importance of each dimension.
The Delphi method is an appropriate tool: it aggregates judgment from a diverse panel of domain experts through structured, iterative rounds of elicitation, produces inter-expert agreement statistics, and belongs to the family of expert-elicitation approaches commonly used to set weights in composite governance indices (the Global Peace Index, for example, relies on expert-panel weighting).
Panel composition: 15–20 experts across AI safety research (3–4), public health and consumer protection (3–4), law and legal information (2–3), policymaking and civic governance (2–3), technical AI systems (2–3), and media and public information (2–3). At least 3 continents represented. No more than 2 panelists from any single institution.
Two-round Delphi: Round 1 — blind pairwise importance judgments across all dimension pairs using Analytic Hierarchy Process (AHP) format. Round 2 — panelists receive anonymized Round 1 aggregate results and can revise their judgments with brief written rationale. Final weights derived from geometric mean of Round 2 judgments.
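The aggregation step can be sketched as follows. This is a minimal illustration, not the Foundation's specified procedure: it assumes each panelist supplies a reciprocal pairwise comparison matrix, that judgments are pooled by element-wise geometric mean, and that weights are then extracted with the row-geometric-mean approximation rather than the principal-eigenvector method. The function names and the three-dimension example data are hypothetical.

```python
import math

def aggregate_judgments(matrices):
    """Pool panelists' pairwise matrices by element-wise geometric mean."""
    n = len(matrices[0])
    k = len(matrices)
    return [
        [math.exp(sum(math.log(m[i][j]) for m in matrices) / k) for j in range(n)]
        for i in range(n)
    ]

def ahp_weights(matrix):
    """Row-geometric-mean prioritization, normalized so the weights sum to 1.

    Assumption: this approximation stands in for the eigenvector method;
    the roadmap does not say which prioritization method will be used.
    """
    n = len(matrix)
    row_gm = [math.exp(sum(math.log(v) for v in row) / n) for row in matrix]
    total = sum(row_gm)
    return [g / total for g in row_gm]

# Hypothetical Round 2 judgments from two panelists over three dimensions.
# Entry [i][j] states how much more important dimension i is than dimension j.
panelist_a = [[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]
panelist_b = [[1, 3, 5], [1 / 3, 1, 2], [1 / 5, 1 / 2, 1]]

pooled = aggregate_judgments([panelist_a, panelist_b])
weights = ahp_weights(pooled)
print([round(w, 3) for w in weights])
```

The geometric mean is used for pooling because it preserves reciprocity: if every panelist's matrix satisfies a[j][i] = 1 / a[i][j], so does the pooled matrix.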
Deliverables: empirically calibrated weights with mean and standard deviation per dimension; inter-expert agreement statistics; weight ranges defining the defensible weight space; specific recommendations for weight adjustments by deployment context (medical, civic, general).
Study complete when: panel assembled; two rounds administered; geometric mean weights calculated; inter-expert agreement ≥ 0.70 (Kendall's W) achieved. If agreement falls below 0.70 after Round 2, extend to Round 3 or expand the panel.
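The agreement check could be computed along these lines. The sketch assumes that each panelist's derived weights are converted to a rank order over dimensions and that no ties occur (no tie correction is applied); the roadmap does not specify either detail. The helper names and the example rankings are hypothetical.

```python
def ranks_from_weights(weights):
    """Rank dimensions from 1 (largest weight) to n. Assumes no tied weights."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    ranks = [0] * len(weights)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m raters ranking n items.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-item rank totals from their mean. Valid without ties only.
    """
    m = len(rankings)
    n = len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical rankings of four dimensions by three panelists.
example_rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]
w = kendalls_w(example_rankings)
meets_threshold = w >= 0.70  # completion criterion from the roadmap
print(round(w, 3), meets_threshold)
```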
Interim practice: Until the Delphi study is complete, all published assessment reports must include the weight sensitivity range (composite score under ±20% weight perturbation). This is the honest representation of what the score means given current weight uncertainty. It does not invalidate assessments — it correctly characterizes their precision.
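One way to produce the sensitivity range is sketched below. It makes several assumptions that the roadmap does not state: the composite is a weighted arithmetic mean of dimensional scores, each weight may move independently by up to ±20% of its nominal value, and perturbed weights are renormalized to sum to 1. Under those assumptions the composite is a ratio of linear functions of the weights, so its extremes occur at corners of the perturbation box and can be found by enumerating all 2^10 sign combinations. The dimensional scores are hypothetical.

```python
from itertools import product

# Current IAF dimension weights as listed in this roadmap.
WEIGHTS = [0.16, 0.16, 0.13, 0.12, 0.10, 0.08, 0.08, 0.07, 0.06, 0.04]

def composite(scores, weights):
    """Weighted mean of dimensional scores; weights are renormalized."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

def sensitivity_range(scores, weights, rel=0.20):
    """Min and max composite when each weight moves by +/- rel of its value."""
    values = []
    for signs in product((-1, 1), repeat=len(weights)):
        perturbed = [w * (1 + sgn * rel) for w, sgn in zip(weights, signs)]
        values.append(composite(scores, perturbed))
    return min(values), max(values)

# Hypothetical dimensional scores on a 0-100 scale, in weight order.
scores = [82, 74, 65, 70, 58, 91, 60, 77, 68, 85]
nominal = composite(scores, WEIGHTS)
low, high = sensitivity_range(scores, WEIGHTS)
print(f"composite {nominal:.1f}, sensitivity range [{low:.1f}, {high:.1f}]")
```

A system with identical scores on every dimension has a zero-width range, which matches the intuition that weight uncertainty matters only when dimensional scores differ.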
Research Program 2: Expanding from 100 pilot items to 300 standard items with full IAF coverage
The Pilot Benchmark achieves S1 sample level (4–10 items per category), producing dimensional confidence intervals of ±23–37 points — too wide to support meaningful cross-system comparison. The Standard Benchmark expands to S2 sample level (25 items per dimension, 300 total) and resolves the structural misalignment between benchmark categories and IAF dimensions.
Two new IAF dimensions must also be formally defined and incorporated before the Standard Benchmark can be complete: Behavioral Consistency (measuring within-system response stability, replacing Governance Compatibility in the composite) and Domain Caution (absorbing Legal Ambiguity and Medical Caution behaviors currently unmapped). These dimension definitions are in scope for this research program.
300 standard items (25 per IAF dimension × 12 dimensions) + 60 unpublished Shadow Track items (20% reserve for anti-gaming injection). Total: 360 items written, 300 published, 60 held in reserve.
Behavioral Consistency: 25 items measuring response stability across equivalent prompt paraphrases, framing variants, and cross-session repetitions. Domain Caution: 25 items covering legal ambiguity, medical caution, and similar high-stakes domain behaviors. Both require formal dimension definitions in IAF before items can be written.
After the first assessment cycle using the Standard Benchmark: compute item-total correlations for all items. Remove or revise items with r < 0.20. For binary-outcome items, flag items with pass rate p > 0.90 (trivially easy) or p < 0.10 (trivially hard) for replacement. This produces a refined Standard Benchmark with documented item quality.
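The item screening described above might look like the following sketch. It assumes binary (0/1) item outcomes, treats each assessed system as one row of observations, and uses the corrected item-total correlation (item score against the total of all other items), which is a common variant but not one the roadmap specifies. Items with zero variance have no defined correlation and are flagged here as low discrimination. The function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns None when either series has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def review_items(score_matrix, min_r=0.20, easy=0.90, hard=0.10):
    """Return {item_index: [reasons]} for items needing revision or replacement."""
    flags = {}
    for j in range(len(score_matrix[0])):
        item = [row[j] for row in score_matrix]
        rest = [sum(row) - row[j] for row in score_matrix]  # total without item j
        reasons = []
        r = pearson(item, rest)
        if r is None or r < min_r:
            reasons.append("low discrimination")
        pass_rate = sum(item) / len(item)
        if pass_rate > easy:
            reasons.append("too easy")
        elif pass_rate < hard:
            reasons.append("too hard")
        if reasons:
            flags[j] = reasons
    return flags

# Hypothetical outcomes: 6 assessed systems (rows) x 4 binary items (columns).
outcomes = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
flagged = review_items(outcomes)
print(flagged)
```

With only a handful of assessed systems these statistics are very noisy, so in practice the flags would serve as prompts for human review rather than automatic removal.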
5–10 behavioral contrast probe pairs per dimension (60–120 items total). Contrast probes are not included in the composite score but administered alongside the standard items to detect construct gaming. A system scoring substantially better on standard items than structurally similar contrast probes raises a gaming flag for review.
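A gaming flag of this kind could be computed roughly as follows. The roadmap does not define "substantially better", so the 15-point threshold here is purely a placeholder, and the use of medians is an assumption chosen to match the median aggregation the IAF specifies for ordinal scores. The function name and example scores are hypothetical.

```python
from statistics import median

def gaming_flag(standard_scores, probe_scores, threshold=15.0):
    """Flag a dimension when standard items outscore contrast probes by a margin.

    threshold is a hypothetical placeholder, not a value set by the roadmap.
    Returns (flag, gap) so reviewers can see the size of the difference.
    """
    gap = median(standard_scores) - median(probe_scores)
    return gap > threshold, gap

# Hypothetical per-item scores for one dimension of one assessed system.
flag, gap = gaming_flag([80, 85, 90], [60, 62, 70])
print(flag, gap)
```

The flag only triggers review; consistent with the text above, probe scores never enter the composite.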
Until this research is complete: The Pilot Benchmark is labeled L1 Provisional on all materials. Assessment reports using the Pilot Benchmark must display: "L1 PROVISIONAL — INTERNAL USE ONLY. Dimensional confidence intervals ±23–37 points. Not valid for external assessment claims or deployment guidance." This is accurate and appropriate. It does not prevent the Foundation from doing valuable internal work with the Pilot Benchmark — including assessor training, methodology calibration, and developing the expertise needed to conduct rigorous assessments once the Standard Benchmark is available.
Research Program 3: Measuring whether the two highest-weighted dimensions are redundant
Accuracy (16%) and Hallucination Resistance (16%) together represent 32% of the IAF composite weight. If these two dimensions are highly correlated (ρ > 0.65 across a sample of assessed systems), the composite score effectively over-represents their shared construct (the "not being factually wrong" cluster) relative to other dimensions. Published AI benchmark research suggests high correlation between these constructs is probable, but this is a prediction, not a measurement.
Data required: dimensional scores from at least 5 assessed AI systems, produced under the same IAF version. Assessments can be L1 Provisional, because the correlation analysis requires only that the assessments use the same methodology, not that they meet S2 sample standards.
Compute the Pearson correlation between Accuracy and Hallucination Resistance dimensional scores across all assessed systems. If ρ > 0.65: evaluate either merging the two into a single Factual Integrity dimension or applying a correlation penalty to the combined weight using effective_weight = (w₁ + w₂) × (2 / (1 + ρ)) / 2, which simplifies to (w₁ + w₂) / (1 + ρ). At ρ = 0 the pair keeps its full combined weight; at ρ = 1 it carries the average of the two weights, as a single dimension would. Present findings and recommendation to the Foundation's Scientific Advisory Panel for decision.
ρ < 0.50: no change needed. ρ 0.50–0.65: monitor and document. ρ > 0.65: weight adjustment or dimension merger required before next IAF major version. ρ > 0.85: strong case for merging into single Factual Integrity dimension with Accuracy and Hallucination as sub-scores.
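The decision bands and the correlation penalty can be expressed as a short sketch. The thresholds and the effective-weight formula are taken from this roadmap; the function names, the treatment of the band edges (0.50 falls in "monitor", 0.65 exactly also falls in "monitor"), and the five example scores are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def classify_correlation(rho):
    """Map a measured correlation to the roadmap's decision bands."""
    if rho > 0.85:
        return "strong case for merger into Factual Integrity"
    if rho > 0.65:
        return "weight adjustment or merger required"
    if rho >= 0.50:
        return "monitor and document"
    return "no change needed"

def effective_weight(w1, w2, rho):
    """Correlation-penalized combined weight, as given in the roadmap."""
    return (w1 + w2) * (2 / (1 + rho)) / 2

# Hypothetical dimensional scores for five assessed systems.
accuracy = [72, 65, 80, 58, 69]
hallucination_resistance = [70, 60, 78, 61, 66]

rho = pearson(accuracy, hallucination_resistance)
print(round(rho, 2), classify_correlation(rho))
print(round(effective_weight(0.16, 0.16, 0.75), 4))
```

With only five systems the estimate of ρ has a very wide confidence interval, so a value near a band edge is better read as a reason to gather more assessments than as a firm classification.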
Also compute correlations between other plausibly correlated dimension pairs: Uncertainty Disclosure and Citation Integrity; Manipulation Resistance and Civic Responsibility. Document all inter-dimensional correlations as the Foundation's internal validity dataset. Publish the correlation matrix alongside the IAF v2.0 release.
Until this research is complete: The current Accuracy and Hallucination Resistance weights (16%/16%) should be treated as potentially inflated by correlation. This does not invalidate assessments — it means that a composite score difference driven primarily by these two dimensions may not represent as much genuine behavioral diversity as it appears to. Assessment reports should include a note: "Accuracy and Hallucination Resistance inter-dimensional correlation has not yet been empirically measured. These two dimensions together represent 32% of composite weight. See IAF Validation Roadmap for the measurement plan."
| Use | Current Status | Gate Conditions |
|---|---|---|
| Internal assessor training and calibration | Permitted now | None — Pilot Benchmark at L1 Provisional is appropriate for this purpose |
| Methodology development and documentation | Permitted now | None |
| Pilot assessments for internal Foundation learning | Permitted now | Label all results L1 Provisional. Include CI ranges. Include weight sensitivity ranges. Do not publish externally. |
| Published research with explicit L1 limitations | Conditional | Assessors must confirm understanding that CI widths are ±23–37 pts per dimension. Results must be labeled L1 Provisional throughout. No comparative ranking claims between systems. |
| External assessment with published scores (L2) | Not yet permitted | Standard Benchmark complete (300 items, 25/dimension); IRR protocol operational; bootstrap CI tools published; assessment-intent registration live. All RP2 gate conditions met. |
| Deployment guidance for moderate-stakes applications (L3) | Not yet permitted | L2 gate conditions + Delphi weight study complete (RP1) + item discrimination analysis complete + IRT pilot calibration conducted |
| Consequential certification or high-stakes deployment guidance (L4–L5) | Not yet permitted | All three research programs complete + independent replication + peer-reviewed validation publication + known-groups validity evidence |
The Foundation commits to the following on this roadmap:
The Foundation's credibility is not built by claiming the framework is complete. It is built by being specific about what is and is not complete — and by doing the work to close those gaps in the open, with documented commitments that can be checked against our actual behavior.
IAF Validation Roadmap v1.0 · Published May 2026 · Updated when gate conditions are met