Failure Receipts Standalone

Abstract

AI systems that cannot answer a question reliably have two choices: produce a confident-sounding answer of uncertain quality, or refuse. Current systems do both — but neither serves the user's actual need. Confident unreliable answers mislead. Opaque refusals give the user nothing to work with. Both options treat uncertainty as something to hide rather than something to communicate.

Failure Receipts Standalone proposes a third option: a structured, human-readable explanation of why a reliable answer is unavailable, what conditions would need to be met to produce one, and what partial assistance is available at a lower reliance level. This is not a refusal. It is a structured account of uncertainty that gives the user actionable information rather than hiding the limitation in either confident prose or a blank wall.

This paper presents the design for a standalone Failure Receipts wrapper (a lightweight system that can be placed in front of any existing AI model), along with worked examples in legal, healthcare, and policy domains, a proposed benchmark for measuring hallucination reduction, and an open-source contribution invitation.

I. The Distinction That Matters — Failure Receipts Are Not Refusals

An ordinary AI refusal looks like this: "I cannot provide legal advice." Or: "I don't have access to current information about that topic." These responses tell the user that something went wrong without telling them what, why, or what they could do instead. They treat uncertainty as a wall.

A Failure Receipt looks like this:

Failure Receipt — RC-4 Continuity Integrity Insufficient

Query domain: Federal sentencing guidelines — tax fraud

Requested level: RC-4 — Legal filing

Failed dimensions: Retrieval coverage (0.42) — November 2024 USSC amendments not in corpus. Internal consistency (0.38) — two sources conflict on first-offender threshold treatment.

Required action: Review current USSC Guidelines Manual (2024 edition) with qualified federal criminal defense counsel before filing.

Partial output: General framework and pre-2024 guidelines available at RC-2 research level on request. Not suitable for filing.

Recommended sources: USSC.gov Guidelines Manual, Westlaw federal sentencing annotations, circuit-specific commentary.

The Failure Receipt tells the user exactly what failed, why it failed, what they need to do, and what they can still get. It treats uncertainty as infrastructure — something to be communicated clearly rather than hidden or deflected.
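As a minimal sketch of how the receipt above could be carried as structured data, the following Python represents it as a dataclass and serializes it to JSON. The class names `FailureReceipt` and `FailedDimension`, the field names, and the JSON layout are illustrative assumptions that mirror the labels in the example; they are not the OCMS schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FailedDimension:
    """One failed assessment dimension (hypothetical record type)."""
    name: str
    score: float
    reason: str


@dataclass
class FailureReceipt:
    """Hypothetical receipt record; fields mirror the labels in the example."""
    query_domain: str
    requested_level: str
    failed_dimensions: list[FailedDimension]
    required_action: str
    partial_output: str
    recommended_sources: list[str]

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)


receipt = FailureReceipt(
    query_domain="Federal sentencing guidelines, tax fraud",
    requested_level="RC-4",
    failed_dimensions=[
        FailedDimension("retrieval_coverage", 0.42,
                        "November 2024 USSC amendments not in corpus"),
        FailedDimension("internal_consistency", 0.38,
                        "Two sources conflict on first-offender threshold treatment"),
    ],
    required_action=("Review current USSC Guidelines Manual (2024 edition) with "
                     "qualified federal criminal defense counsel before filing."),
    partial_output=("General framework and pre-2024 guidelines available at RC-2 "
                    "research level on request. Not suitable for filing."),
    recommended_sources=[
        "USSC.gov Guidelines Manual",
        "Westlaw federal sentencing annotations",
        "circuit-specific commentary",
    ],
)

print(receipt.to_json())
```

Keeping the score and the reason together for each failed dimension lets a rendering layer show both, as the human-readable receipt does.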

The goal is not that AI systems should refuse more often. The goal is that when a system lacks sufficient verified continuity to answer responsibly, the user receives a structured account of why — and what to do about it.

I.5 The Failure Severity Index

Not all Failure Receipts are equivalent. A system that cannot cite its sources for a casual research query has failed differently from a system that returns contradictory medical dosing information for a clinical query. The Failure Severity Index (FSI) classifies failures by their potential consequence severity, calibrating the urgency and specificity of the required human response.

Level	Name	Trigger Conditions	Required Response
FR-1	Informational gap	Coverage or freshness below RC-2 threshold; no contradictions; low-consequence domain	Note limitations inline; recommend verification for important decisions
FR-2	Reliability warning	Multiple dimensions below threshold; moderate contradiction; RC-3 context	Explicit Failure Receipt; human review before professional use; partial output at lower RC available
FR-3	Professional escalation	Aggregate below RC-4 threshold; critical contradiction; legal or regulatory context	Failure Receipt with specific escalation pathway; qualified professional review mandatory; no filing without review
FR-4	Safety-critical halt	Any dimension below RC-5 floor; or domain is medical/public-safety; or contradiction involves dosing, contraindication, or safety procedure	Full halt; licensed expert review mandatory; system logs incident; escalation to supervising clinician or safety officer

The FSI does not replace the RC classification system — it supplements it. An RC-4 query that fails produces a Failure Receipt; the FSI determines whether that receipt recommends professional consultation (FR-3) or triggers an institutional safety protocol (FR-4). The distinction matters operationally: FR-3 and FR-4 failures should be logged separately, escalated differently, and reviewed at different frequencies in system audits.
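The table leaves open how its trigger conditions combine, so the following is only one possible reading, sketched in Python. The `FailureContext` type, its fields, the contradiction labels, and the precedence order are all assumptions made for illustration. The sketch applies the RC-5 floor only to RC-5 queries and deliberately omits the domain-only trigger, which the discussion of FSI misclassification later in this paper cautions against.

```python
from dataclasses import dataclass


@dataclass
class FailureContext:
    """Hypothetical summary of a failed assessment."""
    reliance_level: int          # 1..5, standing for RC-1..RC-5
    failed_dimension_count: int  # dimensions below the level's threshold
    contradiction: str           # "none", "moderate", "critical", or "safety"
    below_rc5_floor: bool        # True if any dimension is under the RC-5 floor


def classify_severity(ctx: FailureContext) -> str:
    """Map a failed assessment to FR-1..FR-4, checking the most severe level first."""
    # FR-4: a safety-related contradiction (dosing, contraindication, safety
    # procedure), or an RC-5 query with a dimension under the RC-5 floor.
    if ctx.contradiction == "safety" or (
        ctx.reliance_level == 5 and ctx.below_rc5_floor
    ):
        return "FR-4"
    # FR-3: failure at RC-4 or above, or a critical contradiction.
    if ctx.reliance_level >= 4 or ctx.contradiction == "critical":
        return "FR-3"
    # FR-2: RC-3 context, multiple failed dimensions, or a moderate contradiction.
    if (
        ctx.reliance_level == 3
        or ctx.failed_dimension_count > 1
        or ctx.contradiction == "moderate"
    ):
        return "FR-2"
    # FR-1: everything else is treated as an informational gap.
    return "FR-1"


print(classify_severity(FailureContext(4, 2, "moderate", False)))  # FR-3
print(classify_severity(FailureContext(5, 1, "none", True)))       # FR-4
```

Checking the most severe level first means a failure that satisfies several rows of the table is always assigned the highest applicable severity.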

Figure 1 — Failure Severity Ladder. FR-1 through FR-4 calibrate response urgency to consequence severity. The reliance-severity matrix shows the default mapping from RC failure level to FSI classification.

II. System Design — The Standalone Wrapper

The Failure Receipts Standalone wrapper sits between the user and any existing AI model. It does not require modification to the underlying model. It operates as a pre-processing and post-processing layer that intercepts queries, runs a lightweight confidence assessment, and either passes a verified query through to the model with a Continuity Receipt (CR) attached, or returns a Failure Receipt directly.

Wrapper Architecture — Pseudocode

function handle_query(query, reliance_level, model):

    # Pre-processing: assess confidence conditions
    assessment = assess_continuity(query, reliance_level)

    if assessment.passes_threshold:
        # Pass through to model with CR generation
        response = model.generate(query)
        receipt = generate_cr(response, assessment)
        return { "status": "PASS", "response": response, "receipt": receipt }

    else:
        # Return Failure Receipt without calling model
        failure_receipt = generate_failure_receipt(query, reliance_level, assessment)
        return { "status": "FAILURE", "receipt": failure_receipt }

function assess_continuity(query, reliance_level):
    threshold = RC_THRESHOLDS[reliance_level]

    scores = {
        "source_quality":       assess_source_quality(query),
        "retrieval_coverage":   assess_coverage(query),
        "internal_consistency": assess_consistency(query),
        "temporal_freshness":   assess_freshness(query),
        "domain_confidence":    assess_domain_calibration(query)
    }

    aggregate = weighted_average(scores, DIMENSION_WEIGHTS)

    return AssessmentResult(
        scores=scores,
        aggregate=aggregate,
        passes_threshold=aggregate >= threshold,
        failed_dimensions=[k for k,v in scores.items() if v < threshold * 0.85]
    )
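The pseudocode can be made concrete with a small runnable Python sketch. Several things here are assumptions rather than part of the design: the per-dimension scores are passed in precomputed (the `assess_*` scorers are not specified), the dimension weights are equal, `generate` is a stand-in callable for the model, and the thresholds for RC-3, RC-4, and RC-5 are the ones used in the worked examples of Section III (values for RC-1 and RC-2 are not stated, so they are left out).

```python
from dataclasses import dataclass

# Thresholds as used in the worked examples; RC-1 and RC-2 are not stated.
RC_THRESHOLDS = {"RC-3": 0.70, "RC-4": 0.85, "RC-5": 0.90}

# Assumption: equal weights. The paper does not specify DIMENSION_WEIGHTS.
DIMENSION_WEIGHTS = {
    "source_quality": 1.0,
    "retrieval_coverage": 1.0,
    "internal_consistency": 1.0,
    "temporal_freshness": 1.0,
    "domain_confidence": 1.0,
}


@dataclass
class AssessmentResult:
    scores: dict[str, float]
    aggregate: float
    passes_threshold: bool
    failed_dimensions: list[str]


def weighted_average(scores, weights):
    """Weighted mean over the dimensions present in `scores`."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def assess_continuity(scores, reliance_level):
    threshold = RC_THRESHOLDS[reliance_level]
    aggregate = weighted_average(scores, DIMENSION_WEIGHTS)
    return AssessmentResult(
        scores=scores,
        aggregate=aggregate,
        passes_threshold=aggregate >= threshold,
        # A dimension is flagged when it falls below 85% of the threshold.
        failed_dimensions=[k for k, v in scores.items() if v < threshold * 0.85],
    )


def handle_query(query, scores, reliance_level, generate):
    assessment = assess_continuity(scores, reliance_level)
    if assessment.passes_threshold:
        # The model is only called when the assessment passes.
        return {"status": "PASS", "response": generate(query), "assessment": assessment}
    return {"status": "FAILURE", "assessment": assessment}


# Scores from the legal worked example (Section III.1), assessed at RC-4.
legal_scores = {
    "source_quality": 0.88,
    "retrieval_coverage": 0.71,
    "internal_consistency": 0.82,
    "temporal_freshness": 0.65,
    "domain_confidence": 0.79,
}
result = handle_query(
    "What are the current penalties for securities fraud under 18 USC 1348?",
    legal_scores,
    "RC-4",
    generate=lambda q: "model output",  # stand-in for a real model call
)
print(result["status"], round(result["assessment"].aggregate, 2),
      result["assessment"].failed_dimensions)
# FAILURE 0.77 ['retrieval_coverage', 'temporal_freshness']
```

With equal weights the aggregate reproduces the 0.77 reported for the legal example, and the 0.85 multiplier flags the two dimensions that fall below 0.7225.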

III. Domain Worked Examples

III.1 Legal Analysis

Query: "What are the current penalties for securities fraud under 18 USC 1348?"

RC-4 Assessment: Source quality 0.88 (statute text verifiable). Retrieval coverage 0.71 (recent enforcement interpretations incomplete). Internal consistency 0.82. Temporal freshness 0.65 (sentencing practice evolving post-2023 enforcement shift). Domain confidence 0.79. Aggregate: 0.77. Threshold for RC-4: 0.85. Result: Failure Receipt.

Failure Receipt action: Statutory text provided at RC-2. Recent enforcement practice requires current case law review with securities counsel. Specific penalty calculation requires qualified attorney review of charging documents.

III.2 Healthcare

Query: "What is the recommended dosing protocol for vancomycin in a patient with CKD stage 4?"

RC-5 Assessment: Source quality 0.91 (pharmacology literature comprehensive). Retrieval coverage 0.84 (recent AUC-guided dosing protocols included). Internal consistency 0.78 (minor variation in trough targets across guidelines). Domain confidence 0.88. Aggregate: 0.86. Threshold for RC-5: 0.90. Result: Failure Receipt.

Failure Receipt action: General vancomycin dosing principles and CKD adjustment guidance provided at RC-3. RC-5 clinical application requires prescribing physician review with patient-specific renal function parameters. AUC monitoring recommended per ASHP/IDSA/SIDP guidelines.

III.3 Policy Analysis

Query: "What has been the fiscal impact of the 2022 Inflation Reduction Act's clean energy provisions to date?"

RC-3 Assessment: Source quality 0.74 (mix of CBO, Treasury, advocacy sources). Retrieval coverage 0.68 (2025 data incomplete). Internal consistency 0.52 (significant methodological disagreement on investment attribution). Domain confidence 0.71. Aggregate: 0.67. Threshold for RC-3: 0.70. Result: Failure Receipt.

Failure Receipt note: The low internal consistency reflects genuine methodological disagreement in the literature about how to attribute investment to the IRA specifically. This is domain uncertainty, not retrieval failure. RC-2 analysis with explicit uncertainty framing available.

IV. Benchmark — Measuring Hallucination Reduction

The measurable hypothesis: a Failure Receipts wrapper reduces the rate of unsupported or hallucinated claims in high-consequence domains by preventing the model from generating responses when confidence conditions are not met.

Test Condition	Method	Expected Result
High-risk prompt set without wrapper	50 legal/medical/policy prompts at RC-4/5 level; LLM-judge scores each response for unsupported claims	Baseline hallucination rate
Same prompts with FR wrapper at RC-4	FR wrapper intercepts queries that fail threshold; model only called for passing queries	Reduced hallucination rate on model-generated responses; Failure Receipts for remainder
User outcome comparison	Blind evaluation: which condition better serves the user's actual need?	FR condition preferred for high-consequence queries
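The first two rows of the table reduce to a comparison of rates, which can be sketched in Python as follows. The `JudgedPrompt` record and its fields are assumptions, as is the simplification that a query passing the wrapper yields the same model response it would have produced without the wrapper. The four records are synthetic placeholders used to exercise the arithmetic, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class JudgedPrompt:
    """Hypothetical per-prompt record for the benchmark."""
    prompt_id: str
    has_unsupported_claim: bool  # LLM-judge verdict on the unwrapped response
    passes_wrapper: bool         # whether the FR wrapper lets the query through


def hallucination_rate(records):
    """Fraction of responses the judge flagged as containing an unsupported claim."""
    records = list(records)
    if not records:
        return 0.0
    return sum(r.has_unsupported_claim for r in records) / len(records)


def compare_conditions(records):
    """Compare the unwrapped baseline with the wrapped condition."""
    records = list(records)
    if not records:
        raise ValueError("at least one judged prompt is required")
    passed = [r for r in records if r.passes_wrapper]
    return {
        # Baseline: every prompt reaches the model.
        "baseline_rate": hallucination_rate(records),
        # Wrapped: only passing prompts reach the model.
        "wrapped_rate": hallucination_rate(passed),
        # Share of prompts answered with a Failure Receipt instead.
        "failure_receipt_rate": 1 - len(passed) / len(records),
    }


# Synthetic placeholder data, for illustration only.
sample = [
    JudgedPrompt("p1", has_unsupported_claim=True, passes_wrapper=False),
    JudgedPrompt("p2", has_unsupported_claim=True, passes_wrapper=True),
    JudgedPrompt("p3", has_unsupported_claim=False, passes_wrapper=True),
    JudgedPrompt("p4", has_unsupported_claim=False, passes_wrapper=True),
]
print(compare_conditions(sample))
```

Reporting the Failure Receipt rate alongside the two hallucination rates matters because a wrapper can lower the wrapped rate trivially by passing almost nothing through.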

IV.7 Human Oversight Escalation — Failure Receipts Are Not Arbiters

This section makes explicit what the architecture implies but does not state directly: Failure Receipts are evidentiary support systems, not autonomous decision-making authorities.

A Failure Receipt that identifies insufficient confidence for an RC-5 medical query does not determine that the query cannot be answered. It determines that the query cannot be answered at RC-5 confidence by the current system under current conditions, and that a qualified human reviewer must evaluate whether to proceed, how to supplement the AI output, or what alternative information sources to consult. The decision authority remains with the human reviewer at all times.

This distinction matters for legal and regulatory positioning. In jurisdictions where AI-assisted medical or legal decision-making is regulated, the question of whether an AI system is "making decisions" or "providing decision support" is legally consequential. Failure Receipts are explicitly structured as decision support infrastructure — they provide structured epistemic information to human reviewers; they do not replace human judgment.

Three escalation principles follow from this:

FR-4 receipts require qualified human review, not merely human acknowledgment. A physician who clicks "acknowledge" on an FR-4 Failure Receipt without substantive clinical review has not fulfilled the governance requirement. The review obligation is substantive, not procedural.

Failure Receipts do not transfer liability to the system. An organization that deploys a Failure Receipt system and then acts on AI outputs without the specified human review cannot use the existence of the Failure Receipt as evidence of due diligence. The receipt documents the threshold failure; the subsequent human review is where the governance obligation is fulfilled.

The escalation pathway must be specified before deployment, not during an incident. FR-3 and FR-4 receipts require escalation to qualified reviewers. Who those reviewers are, how they are reached, what turnaround time is required, and what happens when they are unavailable must be specified in governance documentation before the system is deployed in high-consequence contexts.
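The third principle lends itself to a pre-deployment check. The Python sketch below assumes a hypothetical `EscalationPolicy` record with one field per question raised above (who, how reached, turnaround, fallback) and refuses to report a clean result unless both FR-3 and FR-4 have a complete policy. The field names and example values are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    """Hypothetical escalation record for one severity level."""
    severity: str                   # "FR-3" or "FR-4"
    reviewer_role: str              # who reviews
    contact_channel: str            # how they are reached
    max_turnaround_minutes: int     # required turnaround time
    fallback_when_unavailable: str  # what happens if they cannot be reached


def validate_policies(policies):
    """Return a list of problems; an empty list means the pathway is fully specified."""
    problems = []
    by_severity = {p.severity: p for p in policies}
    for severity in ("FR-3", "FR-4"):
        policy = by_severity.get(severity)
        if policy is None:
            problems.append(f"{severity}: no escalation policy defined")
            continue
        for name in ("reviewer_role", "contact_channel", "fallback_when_unavailable"):
            if not getattr(policy, name).strip():
                problems.append(f"{severity}: {name} is empty")
        if policy.max_turnaround_minutes <= 0:
            problems.append(f"{severity}: max_turnaround_minutes must be positive")
    return problems


# Example values are placeholders, not recommendations.
policies = [
    EscalationPolicy("FR-3", "qualified attorney", "review queue", 1440,
                     "hold output until a reviewer is available"),
    EscalationPolicy("FR-4", "supervising clinician", "on-call pager", 30,
                     "escalate to safety officer"),
]
print(validate_policies(policies))       # []
print(validate_policies(policies[:1]))   # ['FR-4: no escalation policy defined']
```

A check like this covers only completeness of the documentation; whether the named reviewers are actually qualified and reachable still has to be verified by the deploying organization.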

V. Connection to CR-Lite and the CR Standard

The Failure Receipts Standalone wrapper is the lightweight predecessor to the full Continuity Receipts system demonstrated at emfoundation.net/cr-lite.html. Where CR-Lite demonstrates the complete CR architecture — provenance chains, five-dimension confidence scoring, nutrition labels, and Failure Receipts — the standalone wrapper provides only the Failure Receipt component as a drop-in addition to any existing AI deployment.

This sequencing matters for adoption. Organizations that cannot yet deploy the full CR infrastructure can adopt the Failure Receipts wrapper immediately — it requires only a thin API wrapper and a confidence assessment layer. The Failure Receipt format is fully compatible with the OCMS schema, so organizations that later adopt the full CR standard can migrate their Failure Receipt logs directly.

V.5 Failure Receipt Abuse — Performative Transparency Risks

Failure Receipts create a new class of potential abuse that the governance framework must address explicitly: performative transparency — the use of Failure Receipt infrastructure to create the appearance of accountability without its substance.

Audit flooding. An organization could configure its Failure Receipt system to generate receipts at high volume, burying meaningful signals in administrative noise. If every query generates a receipt regardless of confidence, the receipts become meaningless. Mitigation: FSI severity classification should be statistically monitored — if FR-3 and FR-4 rates fall to near-zero or rise above 40%, the threshold calibration requires investigation.

Threshold manipulation. An organization could configure RC thresholds downward — classifying all queries as RC-1 or RC-2 — to ensure that almost no queries fail. This produces high pass rates while providing no meaningful governance. Mitigation: RC levels should be set by use context, not by the deploying organization alone. Audit logs should record declared RC levels alongside actual use context for periodic review.

False receipt generation. A sophisticated adversary could generate Failure Receipts for queries that actually passed, creating a paper trail of apparent due diligence that conceals actual practice. Mitigation: append-only receipt chains with hash verification make retroactive fabrication detectable. Receipt chain integrity should be verified in any governance audit.
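The append-only chain mentioned in the last mitigation can be illustrated with a short Python sketch using SHA-256 from the standard library. The entry layout, the `GENESIS` value, and the function names are assumptions for illustration. An in-memory list like this only detects edits if the most recent hash is also recorded somewhere the editing party cannot rewrite, so a real deployment would need external anchoring that this sketch does not provide.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash, payload):
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_receipt(chain, payload):
    """Append a receipt whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, payload),
    }
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Recompute every hash; any edited, inserted, or removed entry breaks the links."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_receipt(log, {"query_id": "q1", "status": "PASS"})
append_receipt(log, {"query_id": "q2", "status": "FAILURE", "severity": "FR-3"})
print(verify_chain(log))   # True

# Retroactively relabelling a passed query as a failure is detected.
log[0]["payload"]["status"] = "FAILURE"
print(verify_chain(log))   # False
```

Because each hash covers the previous hash, rewriting one early receipt would require recomputing every later entry, which is what makes after-the-fact fabrication visible in an audit.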

VI. Failure Conditions and Scaling Limits

Confidence calibration drift. The confidence assessment layer depends on the wrapper's ability to accurately score source quality, retrieval coverage, and domain confidence. These scores are model-generated estimates, not ground truth. A wrapper whose confidence assessment is systematically overconfident will issue standard responses for queries that should trigger Failure Receipts. Calibration validation against known-reliable and known-unreliable query sets is a prerequisite for production deployment.

Adversarial confidence inflation. A model that "knows" it is being assessed for confidence may inflate its self-reported scores. The wrapper should treat model-generated confidence scores as inputs to independent validation rather than as authoritative assessments. At minimum, temporal freshness should be verified against external timestamps rather than model self-report.

FSI misclassification. Automatically assigning FR-4 severity to all medical queries regardless of the specific question is both over-restrictive (a general health question about aspirin dosing does not warrant safety-critical halt) and under-specific (it fails to distinguish genuinely dangerous queries from benign ones). Domain classification alone is insufficient for FSI assignment — the query's specific nature and the failure's specific dimension must both inform severity.

Latency overhead. The confidence assessment layer adds processing time before every query. For real-time applications, this overhead must be below the perceptible latency threshold. Lightweight confidence assessment (keyword-based domain classification plus cached source quality scores) can reduce overhead to under 50ms for most queries. Full retrieval-based assessment may add 200-500ms. Contributors should benchmark overhead separately from accuracy.
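Overhead can be measured in isolation from accuracy with a small timing harness. The Python sketch below assumes the assessment step is exposed as a single callable, here named `assess`; the stand-in function and the sample queries are placeholders. The 95th percentile uses the nearest-rank method.

```python
import math
import statistics
import time


def measure_overhead_ms(assess, queries, repeats=5):
    """Time `assess` on each query and summarize latency in milliseconds."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            assess(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    if not samples:
        raise ValueError("no timing samples collected")
    samples.sort()
    # Nearest-rank 95th percentile.
    p95_index = math.ceil(0.95 * len(samples)) - 1
    return {
        "n": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def stand_in_assess(query):
    # Placeholder for a real confidence assessment.
    return len(query.split())


report = measure_overhead_ms(
    stand_in_assess,
    ["vancomycin dosing in CKD stage 4", "penalties under 18 USC 1348"],
    repeats=5,
)
print(report)
```

Reporting a tail percentile alongside the median is a reasonable choice here because the latency targets in this paper are stated as upper bounds, and a median alone can hide slow retrieval-based assessments.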

What Would Falsify the Core Claim — that Failure Receipts reduce harmful AI outputs without unacceptably increasing refusal rates:

✗ Empirical demonstration that Failure Receipt rates exceed 40% of RC-3+ queries — indicating the threshold calibration is too conservative for practical deployment.

✗ User studies showing that Failure Receipts are systematically ignored — users override them at the same rate they would have ignored a standard disclaimer — eliminating the behavioral benefit.

✗ Demonstration that confidence assessment overhead exceeds 1 second for RC-4/5 queries at production scale — making the wrapper impractical for the highest-consequence applications it is most designed to protect.

Open Source Contribution Invitation

What we need built:

A Python wrapper library that accepts a query, reliance level, and model API call, runs the lightweight confidence assessment, and returns either a model response with CR metadata or a structured Failure Receipt JSON. Compatible with Claude, OpenAI, and any OpenAI-compatible API.

A test suite of 100+ high-risk prompts across legal, medical, policy, and research domains — specifically designed to trigger Failure Receipts — with expected outcomes for validation.

Visual Failure Receipt card components in React and plain HTML — matching the visual style at emfoundation.net — for integration into any web application.

A benchmark report comparing hallucination rates with and without the wrapper across the test suite.

Repository: github.com/emfoundation/failure-receipts

Contact: research@emfoundation.net

Known Limitations

This section follows the Foundation's institutional practice of explicitly stating known weaknesses, failure modes, and scope boundaries for every proposal. Its presence indicates analytical maturity, not weakness in the underlying proposal.

False attribution. The confidence assessment layer estimates source quality and retrieval coverage using proxy signals, not ground truth. A wrapper that systematically overestimates confidence in a particular domain will generate standard responses when Failure Receipts are warranted — and users will not know the assessment was wrong until harm occurs.

Receipt flooding and bureaucratic weaponization. An organization with poor underlying AI quality may generate so many Failure Receipts that reviewers become desensitized. Conversely, organizations may use high Failure Receipt rates to avoid accountability — generating FR-3 receipts for queries that could be handled at RC-2 to create procedural cover.

Threshold calibration dependency. The RC thresholds are proposed defaults that have not been validated against empirical outcomes. Miscalibrated thresholds produce either excessive restriction (operational friction without governance benefit) or insufficient restriction (queries that pass when they should have been escalated).

Human review quality. The system creates the right checkpoints. It cannot guarantee that human reviewers have the expertise to evaluate what they receive. A mandatory FR-4 review completed by an unqualified reviewer provides procedural compliance without substantive safety.

What This Paper Does Not Claim

That Failure Receipts guarantee accurate assessment — they document uncertainty, they do not resolve it
That the RC thresholds are empirically validated — they are proposed defaults requiring institutional calibration
That generating a Failure Receipt constitutes due diligence — it is the beginning of a review process, not the end
That automated confidence assessment can substitute for domain expertise in high-consequence decisions

Non-Adoption Scenario

Without structured failure transparency, AI systems in high-consequence domains default to confident prose regardless of underlying confidence. The absence of Failure Receipt infrastructure produces systematic overconfidence in legal, medical, and policy AI deployments; no institutional record of when AI systems failed to meet confidence thresholds; and no evidentiary basis for learning from AI-assisted errors after the fact. The harms accumulate quietly — visible only when individual cases surface, then attributed to AI error in general rather than to the specific absence of structured uncertainty communication.

Open Questions

What is the correct mapping between query domains and RC levels across different institutional contexts? How should threshold calibration be validated empirically — and who should conduct that validation? Can automated confidence assessment achieve sufficient accuracy for RC-4 and RC-5 contexts, or does meaningful assessment at those levels require human domain expertise at the assessment stage rather than only the review stage?

Governance Implications

Failure Receipt systems require governance frameworks specifying: who sets RC thresholds for a given deployment; how threshold calibration is reviewed and updated; what constitutes an acceptable Failure Receipt rate; how FR-3 and FR-4 incidents are escalated and investigated; and what the evidentiary status of a Failure Receipt is in legal or regulatory proceedings. Without these frameworks, Failure Receipt infrastructure produces paperwork rather than accountability.

References and Related Work

NIST SP 800-30 (2012). Guide for Conducting Risk Assessments.
ISO 9001:2015. Quality Management Systems.
Hollnagel, E. (2004). Barriers and Accident Prevention. Ashgate.
Kahneman, D. (2011). Thinking, Fast and Slow. (Anchoring and overconfidence as precedents for structured uncertainty communication.)
EM Foundation. Continuity Receipts Standards Proposal v0.1. emfoundation.net/paper-continuity-receipts.html

Falsifiability

✗ Empirical demonstration that Failure Receipt-wrapped deployments show no statistically significant reduction in high-reliance hallucination rates compared to unwrapped deployments across a test suite of 100+ RC-3 and above queries — indicating the confidence assessment layer is not providing meaningful signal.

✗ User studies demonstrating that FR-3 and FR-4 Failure Receipts are overridden at the same rate as ordinary disclaimers — indicating that structured uncertainty communication produces no behavioral difference from unstructured warnings.

✗ Demonstration that confidence assessment overhead at production scale exceeds 500ms per query for RC-4 and RC-5 contexts — making the wrapper impractical for the highest-consequence applications it is specifically designed to protect.