A low-cost wrapper that turns AI uncertainty into visible infrastructure — structured explanations when a system cannot responsibly answer at the requested reliance level
AI systems that cannot answer a question reliably have two choices: produce a confident-sounding answer of uncertain quality, or refuse. Current systems do both — but neither serves the user's actual need. Confident unreliable answers mislead. Opaque refusals give the user nothing to work with. Both options treat uncertainty as something to hide rather than something to communicate.
Failure Receipts Standalone proposes a third option: a structured, human-readable explanation of why a reliable answer is unavailable, what conditions would need to be met to produce one, and what partial assistance is available at a lower reliance level. This is not a refusal. It is a structured account of uncertainty that gives the user actionable information rather than hiding the limitation in either confident prose or a blank wall.
This paper presents the design for a standalone Failure Receipts wrapper — a lightweight system that can be placed in front of any existing AI model — along with worked examples in legal, healthcare, policy, and research domains, a benchmark measuring hallucination reduction, and an open-source contribution invitation.
An ordinary AI refusal looks like this: "I cannot provide legal advice." Or: "I don't have access to current information about that topic." These responses tell the user that something went wrong without telling them what, why, or what they could do instead. They treat uncertainty as a wall.
A Failure Receipt looks like this:
The Failure Receipt tells the user exactly what failed, why it failed, what they need to do, and what they can still get. It treats uncertainty as infrastructure — something to be communicated clearly rather than hidden or deflected.
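As a rough illustration, the four elements named above (what failed, why it failed, what the user needs to do, and what is still available) could be carried in a small data structure. This is a minimal sketch: every field name is hypothetical and is not taken from the OCMS schema or any published Failure Receipt format. The example values are adapted from the legal worked example in this paper.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FailureReceipt:
    # All field names are illustrative assumptions, not a published schema.
    requested_level: str       # reliance level the user asked for, e.g. "RC-4"
    what_failed: str           # which check fell short
    why: str                   # plain-language reason for the shortfall
    required_action: str       # what the user needs to do next
    available_level: str       # lower reliance level still on offer
    available_assistance: str  # what the user can still get at that level

    def to_json(self) -> str:
        """Serialize the receipt as human-readable JSON."""
        return json.dumps(asdict(self), indent=2)


receipt = FailureReceipt(
    requested_level="RC-4",
    what_failed="Aggregate confidence 0.77 is below the RC-4 threshold of 0.85",
    why="Recent enforcement interpretations are incomplete",
    required_action="Review current case law with securities counsel",
    available_level="RC-2",
    available_assistance="Statutory text",
)
print(receipt.to_json())
```

Keeping the partial-assistance fields in the same object as the failure reason is what separates this shape from a bare refusal string.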
The goal is not that AI systems should refuse more often. The goal is that when a system lacks sufficient verified continuity to answer responsibly, the user receives a structured account of why — and what to do about it.
Not all Failure Receipts are equivalent. A system that cannot cite its sources for a casual research query has failed differently from a system that returns contradictory medical dosing information for a clinical query. The Failure Severity Index (FSI) classifies failures by their potential consequence severity, calibrating the urgency and specificity of the required human response.
| Level | Name | Trigger Conditions | Required Response |
|---|---|---|---|
| FR-1 | Informational gap | Coverage or freshness below RC-2 threshold; no contradictions; low-consequence domain | Note limitations inline; recommend verification for important decisions |
| FR-2 | Reliability warning | Multiple dimensions below threshold; moderate contradiction; RC-3 context | Explicit Failure Receipt; human review before professional use; partial output at lower RC available |
| FR-3 | Professional escalation | Aggregate below RC-4 threshold; critical contradiction; legal or regulatory context | Failure Receipt with specific escalation pathway; qualified professional review mandatory; no filing without review |
| FR-4 | Safety-critical halt | Any dimension below RC-5 floor; or domain is medical/public-safety; or contradiction involves dosing, contraindication, or safety procedure | Full halt; licensed expert review mandatory; system logs incident; escalation to supervising clinician or safety officer |
The FSI does not replace the RC classification system — it supplements it. An RC-4 query that fails produces a Failure Receipt; the FSI determines whether that receipt recommends professional consultation (FR-3) or triggers an institutional safety protocol (FR-4). The distinction matters operationally: FR-3 and FR-4 failures should be logged separately, escalated differently, and reviewed at different frequencies in system audits.
Figure 1 — Failure Severity Ladder. FR-1 through FR-4 calibrate response urgency to consequence severity. The reliance-severity matrix shows the default mapping from RC failure level to FSI classification.
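One simplified reading of the severity table can be sketched as a most-severe-first rule check. The class and field names below are hypothetical, and the trigger conditions are collapsed into a few flags. In particular, the domain trigger in the FR-4 row is folded into a single `safety_content` flag, which is a simplification rather than the paper's full rule set.

```python
from dataclasses import dataclass
from enum import Enum


class FSI(Enum):
    FR1 = "Informational gap"
    FR2 = "Reliability warning"
    FR3 = "Professional escalation"
    FR4 = "Safety-critical halt"


@dataclass
class FailedQuery:
    # Hypothetical inputs; a real assessment would carry per-dimension scores.
    rc_level: int                 # requested reliance level, 1 to 5
    contradiction: str            # "none", "moderate", or "critical"
    safety_content: bool = False  # dosing, contraindication, safety procedure
    legal_context: bool = False   # legal or regulatory context


def classify(q: FailedQuery) -> FSI:
    """Return the first (most severe) level whose simplified trigger matches."""
    if q.rc_level >= 5 or q.safety_content:
        return FSI.FR4
    if q.rc_level == 4 or q.contradiction == "critical" or q.legal_context:
        return FSI.FR3
    if q.rc_level == 3 or q.contradiction == "moderate":
        return FSI.FR2
    return FSI.FR1


print(classify(FailedQuery(rc_level=4, contradiction="none", legal_context=True)).value)
```

Under this sketch a failed RC-4 query lands at FR-3 unless the failure involves safety content, in which case it is raised to FR-4, which mirrors the FR-3 versus FR-4 distinction drawn above.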
The Failure Receipts Standalone wrapper sits between the user and any existing AI model and requires no modification to the underlying model. It operates as a pre-processing and post-processing layer: it intercepts each query, runs a lightweight confidence assessment, and either passes a verified query through to the model with a Continuity Receipt (CR) attached or returns a Failure Receipt directly.
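The control flow described above can be sketched in a few lines. This is an assumption-laden outline, not the project's library: the function name, the dictionary keys, and the idea that the assessment returns a single aggregate score are all hypothetical, and `assess` and `call_model` stand in for whatever confidence layer and model client a deployment provides.

```python
from typing import Any, Callable, Dict


def failure_receipt_wrapper(
    query: str,
    threshold: float,
    assess: Callable[[str], float],
    call_model: Callable[[str], str],
) -> Dict[str, Any]:
    """Assess first; call the model only if the query clears the threshold."""
    confidence = assess(query)
    if confidence < threshold:
        # The underlying model is never called on this path.
        return {
            "kind": "failure_receipt",
            "confidence": confidence,
            "threshold": threshold,
        }
    return {
        "kind": "response",
        "text": call_model(query),
        # Hypothetical stand-in for the attached receipt metadata.
        "cr": {"confidence": confidence, "threshold": threshold},
    }


# Usage with stub functions in place of a real assessment layer and model API.
model_calls = []


def stub_model(q: str) -> str:
    model_calls.append(q)
    return "model answer"


blocked = failure_receipt_wrapper("example query", 0.85, lambda q: 0.77, stub_model)
passed = failure_receipt_wrapper("example query", 0.70, lambda q: 0.77, stub_model)
print(blocked["kind"], passed["kind"], len(model_calls))
```

Passing the model call in as a function is what keeps the wrapper independent of any particular provider.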
Query: "What are the current penalties for securities fraud under 18 USC 1348?"
RC-4 Assessment: Source quality 0.88 (statute text verifiable). Retrieval coverage 0.71 (recent enforcement interpretations incomplete). Internal consistency 0.82. Temporal freshness 0.65 (sentencing practice evolving post-2023 enforcement shift). Domain confidence 0.79. Aggregate: 0.77. Threshold for RC-4: 0.85. Result: Failure Receipt.
Failure Receipt action: Statutory text provided at RC-2. Recent enforcement practice requires current case law review with securities counsel. Specific penalty calculation requires qualified attorney review of charging documents.
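The aggregate reported in this example is consistent with an unweighted mean of the five dimension scores. The paper does not state the aggregation rule, so the mean used below is an assumption that happens to reproduce the reported figure; a weighted scheme could give the same number.

```python
# Dimension scores and threshold copied from the legal example above.
scores = {
    "source_quality": 0.88,
    "retrieval_coverage": 0.71,
    "internal_consistency": 0.82,
    "temporal_freshness": 0.65,
    "domain_confidence": 0.79,
}
RC4_THRESHOLD = 0.85

# Assumed aggregation rule: unweighted mean, rounded to two decimals.
aggregate = round(sum(scores.values()) / len(scores), 2)
result = "pass" if aggregate >= RC4_THRESHOLD else "failure_receipt"

print(aggregate, result)  # 0.77 failure_receipt
```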
Query: "What is the recommended dosing protocol for vancomycin in a patient with CKD stage 4?"
RC-5 Assessment: Source quality 0.91 (pharmacology literature comprehensive). Retrieval coverage 0.84 (recent AUC-guided dosing protocols included). Internal consistency 0.78 (minor variation in trough targets across guidelines). Domain confidence 0.88. Aggregate: 0.86. Threshold for RC-5: 0.90. Result: Failure Receipt.
Failure Receipt action: General vancomycin dosing principles and CKD adjustment guidance provided at RC-3. RC-5 clinical application requires prescribing physician review with patient-specific renal function parameters. AUC monitoring recommended per ASHP/IDSA/SIDP guidelines.
Query: "What has been the fiscal impact of the 2022 Inflation Reduction Act's clean energy provisions to date?"
RC-3 Assessment: Source quality 0.74 (mix of CBO, Treasury, advocacy sources). Retrieval coverage 0.68 (2025 data incomplete). Internal consistency 0.52 (significant methodological disagreement on investment attribution). Domain confidence 0.71. Aggregate: 0.67. Threshold for RC-3: 0.70. Result: Failure Receipt.
Failure Receipt note: The low internal consistency reflects genuine methodological disagreement in the literature about how to attribute investment to the IRA specifically. This is domain uncertainty, not retrieval failure. RC-2 analysis with explicit uncertainty framing available.
The measurable hypothesis: a Failure Receipts wrapper reduces the rate of unsupported or hallucinated claims in high-consequence domains by preventing the model from generating responses when confidence conditions are not met.
| Test Condition | Method | Expected Result |
|---|---|---|
| High-risk prompt set without wrapper | 50 legal/medical/policy prompts at RC-4/5 level; LLM-judge scores each response for unsupported claims | Baseline hallucination rate |
| Same prompts with FR wrapper at RC-4 | FR wrapper intercepts queries that fail threshold; model only called for passing queries | Reduced hallucination rate on model-generated responses; Failure Receipts for remainder |
| User outcome comparison | Blind evaluation: which condition better serves the user's actual need? | FR condition preferred for high-consequence queries |
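A minimal sketch of how the first two test conditions might be scored is shown below. The function name and the outcome encoding are hypothetical, and the two input lists are toy placeholders that only exercise the arithmetic; they are not benchmark results.

```python
from typing import Dict, List, Optional


def summarize(outcomes: List[Optional[bool]]) -> Dict[str, float]:
    """Summarize judged outcomes for one test condition.

    Each outcome is True (judge flagged an unsupported claim), False (judge
    found the response supported), or None (a Failure Receipt was returned
    and the model was not called).
    """
    answered = [o for o in outcomes if o is not None]
    receipts = len(outcomes) - len(answered)
    return {
        "hallucination_rate": sum(answered) / len(answered) if answered else 0.0,
        "receipt_rate": receipts / len(outcomes) if outcomes else 0.0,
    }


# Toy placeholder inputs, not measured data.
baseline = summarize([True, False, True, False])
wrapped = summarize([None, False, None, False])
print(baseline, wrapped)
```

Reporting the receipt rate next to the hallucination rate matters because a wrapper that blocked every query would score a hallucination rate of zero while serving no one.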
This section states explicitly what the architecture implies but does not say outright: Failure Receipts are evidentiary support systems, not autonomous decision-making authorities.
A Failure Receipt that identifies insufficient confidence for an RC-5 medical query does not determine that the query cannot be answered. It determines that the query cannot be answered at RC-5 confidence by the current system under current conditions, and that a qualified human reviewer must evaluate whether to proceed, how to supplement the AI output, or what alternative information sources to consult. The decision authority remains with the human reviewer at all times.
This distinction matters for legal and regulatory positioning. In jurisdictions where AI-assisted medical or legal decision-making is regulated, the question of whether an AI system is "making decisions" or "providing decision support" is legally consequential. Failure Receipts are explicitly structured as decision support infrastructure — they provide structured epistemic information to human reviewers; they do not replace human judgment.
Three escalation principles follow from this:
FR-4 receipts require qualified human review, not merely human acknowledgment. A physician who clicks "acknowledge" on an FR-4 Failure Receipt without substantive clinical review has not fulfilled the governance requirement. The review obligation is substantive, not procedural.
Failure Receipts do not transfer liability to the system. An organization that deploys a Failure Receipt system and then acts on AI outputs without the specified human review cannot use the existence of the Failure Receipt as evidence of due diligence. The receipt documents the threshold failure; the subsequent human review is where the governance obligation is fulfilled.
The escalation pathway must be specified before deployment, not during an incident. FR-3 and FR-4 receipts require escalation to qualified reviewers. Who those reviewers are, how they are reached, what turnaround time is required, and what happens when they are unavailable must be specified in governance documentation before the system is deployed in high-consequence contexts.
The Failure Receipts Standalone wrapper is the lightweight predecessor to the full Continuity Receipts system demonstrated at emfoundation.net/cr-lite.html. Where CR-Lite demonstrates the complete CR architecture — provenance chains, five-dimension confidence scoring, nutrition labels, and Failure Receipts — the standalone wrapper provides only the Failure Receipt component as a drop-in addition to any existing AI deployment.
This sequencing matters for adoption. Organizations that cannot yet deploy the full CR infrastructure can adopt the Failure Receipts wrapper immediately — it requires only a thin API wrapper and a confidence assessment layer. The Failure Receipt format is fully compatible with the OCMS schema, so organizations that later adopt the full CR standard can migrate their Failure Receipt logs directly.
Failure Receipts create a new class of potential abuse that the governance framework must address explicitly: performative transparency — the use of Failure Receipt infrastructure to create the appearance of accountability without its substance.
Audit flooding. An organization could configure its Failure Receipt system to generate receipts at high volume, burying meaningful signals in administrative noise. If every query generates a receipt regardless of confidence, the receipts become meaningless. Mitigation: FSI severity classification should be statistically monitored — if FR-3 and FR-4 rates fall to near-zero or rise above 40%, the threshold calibration requires investigation.
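The monitoring rule in this mitigation could be sketched as a simple rate check. The 40% upper bound comes from the text; the 1% lower bound is an assumed stand-in for "near-zero", and measuring the rate against all queries rather than against receipts issued is also an assumption. The function name and return strings are hypothetical.

```python
from collections import Counter
from typing import List


def severe_rate_status(
    fsi_levels: List[str],
    total_queries: int,
    low: float = 0.01,   # assumed value for "near-zero"
    high: float = 0.40,  # upper bound stated in the text
) -> str:
    """Flag FR-3/FR-4 rates that suggest miscalibrated thresholds."""
    if total_queries <= 0:
        return "no_data"
    counts = Counter(fsi_levels)
    severe_rate = (counts["FR-3"] + counts["FR-4"]) / total_queries
    if severe_rate < low:
        return "investigate_low"
    if severe_rate > high:
        return "investigate_high"
    return "ok"


print(severe_rate_status(["FR-3"] * 50 + ["FR-1"] * 10, total_queries=100))
```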
Threshold manipulation. An organization could configure RC thresholds downward — classifying all queries as RC-1 or RC-2 — to ensure that almost no queries fail. This produces high pass rates while providing no meaningful governance. Mitigation: RC levels should be set by use context, not by the deploying organization alone. Audit logs should record declared RC levels alongside actual use context for periodic review.
False receipt generation. A sophisticated adversary could generate Failure Receipts for queries that actually passed, creating a paper trail of apparent due diligence that conceals actual practice. Mitigation: append-only receipt chains with hash verification make retroactive fabrication detectable. Receipt chain integrity should be verified in any governance audit.
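An append-only chain with hash verification can be sketched with the standard library alone. The entry layout and function names below are hypothetical and the paper does not specify a chain format; the sketch only shows why editing an earlier receipt becomes detectable.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()


def append_receipt(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": _digest(prev, payload)})


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain: List[Dict[str, Any]] = []
append_receipt(chain, {"query_id": 1, "fsi": "FR-3"})
append_receipt(chain, {"query_id": 2, "fsi": "FR-4"})
intact = verify_chain(chain)

chain[0]["payload"]["fsi"] = "FR-1"  # simulate a retroactive edit
tampered = verify_chain(chain)
print(intact, tampered)  # True False
```

A chain like this only detects tampering by someone who cannot also rewrite every later hash, so an auditor would likely need the latest hash recorded somewhere outside the deploying organization's control.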
Confidence calibration drift. The confidence assessment layer depends on the wrapper's ability to accurately score source quality, retrieval coverage, and domain confidence. These scores are model-generated estimates, not ground truth. A wrapper whose confidence assessment is systematically overconfident will issue standard responses for queries that should trigger Failure Receipts. Calibration validation against known-reliable and known-unreliable query sets is a prerequisite for production deployment.
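The calibration check described here could be sketched as two error rates over a labelled query set. The function name, the result keys, and the use of a single aggregate score per query are hypothetical; the labelled pairs and stub scores below are toy placeholders, not validation data.

```python
from typing import Callable, Dict, List, Tuple


def calibration_report(
    labelled: List[Tuple[str, bool]],
    assess: Callable[[str], float],
    threshold: float,
) -> Dict[str, float]:
    """Compare assessment outcomes with known labels.

    labelled holds (query, is_reliable) pairs, where is_reliable marks queries
    the system is known to answer reliably.
    """
    unreliable = [q for q, ok in labelled if not ok]
    reliable = [q for q, ok in labelled if ok]
    # Overconfidence: known-unreliable queries that still clear the threshold.
    missed = sum(1 for q in unreliable if assess(q) >= threshold)
    # Over-restriction: known-reliable queries that are blocked anyway.
    unneeded = sum(1 for q in reliable if assess(q) < threshold)
    return {
        "missed_receipt_rate": missed / len(unreliable) if unreliable else 0.0,
        "unneeded_receipt_rate": unneeded / len(reliable) if reliable else 0.0,
    }


# Toy placeholder scores standing in for a real assessment layer.
stub_scores = {"a": 0.90, "b": 0.60, "c": 0.90, "d": 0.50}
labelled_set = [("a", True), ("b", True), ("c", False), ("d", False)]
report = calibration_report(labelled_set, stub_scores.__getitem__, threshold=0.85)
print(report)
```

The missed-receipt rate is the one that tracks the overconfidence risk named above, since it counts queries that should have produced a Failure Receipt and did not.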
Adversarial confidence inflation. A model that "knows" it is being assessed for confidence may inflate its self-reported scores. The wrapper should treat model-generated confidence scores as inputs to independent validation rather than as authoritative assessments. At minimum, temporal freshness should be verified against external timestamps rather than model self-report.
FSI misclassification. Automatically assigning FR-4 severity to every medical query regardless of the specific question is both over-restrictive (a general health question about aspirin dosing does not warrant a safety-critical halt) and under-specific (it fails to distinguish genuinely dangerous queries from benign ones). Domain classification alone is insufficient for FSI assignment: the query's specific nature and the failure's specific dimension must both inform severity.
Latency overhead. The confidence assessment layer adds processing time before every query. For real-time applications, this overhead must be below the perceptible latency threshold. Lightweight confidence assessment (keyword-based domain classification plus cached source quality scores) can reduce overhead to under 50 ms for most queries. Full retrieval-based assessment may add 200 to 500 ms. Contributors should benchmark overhead separately from accuracy.
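Benchmarking overhead separately from accuracy could look like the sketch below, which times only the assessment call and reports the median and a nearest-rank 95th percentile. The function name and the choice of statistics are assumptions, and the lambda is a stub in place of a real assessment layer.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_overhead_ms(
    assess: Callable[[str], float],
    queries: List[str],
    repeats: int = 5,
) -> Dict[str, float]:
    """Time the assessment step alone, in milliseconds per query."""
    samples: List[float] = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            assess(q)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# Stub assessment; a real layer would do classification and score lookup here.
overhead = measure_overhead_ms(lambda q: 0.8, ["query one", "query two"])
print(overhead)
```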
What Would Falsify the Core Claim — that Failure Receipts reduce harmful AI outputs without unacceptably increasing refusal rates:
What we need built:
A Python wrapper library that accepts a query, reliance level, and model API call, runs the lightweight confidence assessment, and returns either a model response with CR metadata or a structured Failure Receipt JSON. Compatible with Claude, OpenAI, and any OpenAI-compatible API.
A test suite of 100+ high-risk prompts across legal, medical, policy, and research domains — specifically designed to trigger Failure Receipts — with expected outcomes for validation.
Visual Failure Receipt card components in React and plain HTML — matching the visual style at emfoundation.net — for integration into any web application.
A benchmark report comparing hallucination rates with and without the wrapper across the test suite.
Repository: github.com/emfoundation/failure-receipts
Contact: research@emfoundation.net
This section follows the Foundation's institutional practice of explicitly stating known weaknesses, failure modes, and scope boundaries for every proposal. Its presence indicates analytical maturity, not weakness in the underlying proposal.
False attribution. The confidence assessment layer estimates source quality and retrieval coverage using proxy signals, not ground truth. A wrapper that systematically overestimates confidence in a particular domain will generate standard responses when Failure Receipts are warranted — and users will not know the assessment was wrong until harm occurs.
Receipt flooding and bureaucratic weaponization. An organization with poor underlying AI quality may generate so many Failure Receipts that reviewers become desensitized. Conversely, organizations may use high Failure Receipt rates to avoid accountability — generating FR-3 receipts for queries that could be handled at RC-2 to create procedural cover.
Threshold calibration dependency. The RC thresholds are proposed defaults that have not been validated against empirical outcomes. Miscalibrated thresholds produce either excessive restriction (operational friction without governance benefit) or insufficient restriction (queries passing that should have been escalated).
Human review quality. The system creates the right checkpoints. It cannot guarantee that human reviewers have the expertise to evaluate what they receive. A mandatory FR-4 review completed by an unqualified reviewer provides procedural compliance without substantive safety.
Without structured failure transparency, AI systems in high-consequence domains default to confident prose regardless of underlying confidence. The absence of Failure Receipt infrastructure produces systematic overconfidence in legal, medical, and policy AI deployments; no institutional record of when AI systems failed to meet confidence thresholds; and no evidentiary basis for learning from AI-assisted errors after the fact. The harms accumulate quietly — visible only when individual cases surface, then attributed to AI error in general rather than to the specific absence of structured uncertainty communication.
What is the correct mapping between query domains and RC levels across different institutional contexts? How should threshold calibration be validated empirically — and who should conduct that validation? Can automated confidence assessment achieve sufficient accuracy for RC-4 and RC-5 contexts, or does meaningful assessment at those levels require human domain expertise at the assessment stage rather than only the review stage?
Failure Receipt systems require governance frameworks specifying: who sets RC thresholds for a given deployment; how threshold calibration is reviewed and updated; what constitutes an acceptable Failure Receipt rate; how FR-3 and FR-4 incidents are escalated and investigated; and what the evidentiary status of a Failure Receipt is in legal or regulatory proceedings. Without these frameworks, Failure Receipt infrastructure produces paperwork rather than accountability.
NIST SP 800-30 (2012). Guide for Conducting Risk Assessments. · ISO 9001:2015 Quality Management Systems. · Hollnagel, E. (2004). Barriers and Accident Prevention. Ashgate. · Kahneman, D. (2011). Thinking, Fast and Slow — anchoring and overconfidence as precedents for structured uncertainty communication. · EM Foundation. Continuity Receipts Standards Proposal v0.1. emfoundation.net/paper-continuity-receipts.html