Continuity Compression

Abstract

Large language model applications typically manage context through one of two approaches: brute-force long-context prompting, which preserves everything at high token cost, or semantic caching, which deduplicates similar content but treats all context as equally valuable. Neither approach distinguishes between context that is merely old or redundant and context that is continuity-critical — the prior information whose removal would materially increase the risk of contradiction, repeated error, source loss, or reasoning drift.

Continuity Compression proposes a third approach: a continuity-aware compression layer that classifies memory fragments by type, assigns each a continuity score, and retains the fragments that score above a retention threshold plus the top-k semantically relevant fragments from the remainder. The method does not replace semantic caching; it adds a continuity-priority layer that semantic caching alone cannot provide.

This paper presents the conceptual design, the memory card taxonomy, the Continuity Compression Score formula, the benchmark protocol, and an open-source contribution invitation. The hypothesis is testable by any contributor with access to multi-turn conversation logs and an open-source or API-based language model.

I. The Problem — Not All Context Is Equal

Every multi-turn AI conversation accumulates context. After ten turns, a conversation may contain greetings, clarifications, corrections, contradictions, decisions, source references, stated preferences, and repetitions of earlier content. Current context management approaches treat this accumulation in one of two ways.

Full-context prompting sends everything. It preserves all information but costs tokens proportional to conversation length — costs that compound across many conversations and eventually become prohibitive for long-running sessions or applications with large user bases.

Semantic caching deduplicates similar content, reducing token use when conversations revisit similar ground. It works well for factual repetition but has a structural limitation: it optimizes for similarity, not significance. A user's correction of a factual error is not similar to any prior content — it is new, brief, and critical. Semantic caching has no mechanism to recognize that this fragment, unlike a repeated greeting or redundant explanation, must be preserved at any cost.

The insight driving Continuity Compression is simple: some context is decorative, redundant, or obsolete. Some context is continuity-critical. The difference is not a matter of recency or semantic similarity — it is a matter of what would happen to the quality of future reasoning if the fragment were removed.

A continuity-critical memory is any prior information that, if removed, would materially increase the risk of contradiction, repeated error, source loss, user-preference violation, or reasoning drift. These include corrections the user has made, unresolved contradictions the system has acknowledged, prior decisions that constrain future responses, provenance chains for claims the system has made, commitments the system has entered, ethical boundaries the user has established, reliance assumptions about how the output will be used, and user-specific calibration that distinguishes this user's needs from a generic user's.

Everything else — greetings, redundant explanations, stale details that have been superseded, phrasing variations of already-preserved content — is a candidate for compression or discarding.

I.5 Related Work — How Continuity Compression Differs

Continuity Compression builds on existing context management research but differs in a specific and important way. Understanding the distinction is necessary to evaluate whether the approach adds genuine value.

Semantic caching (GPT Semantic Cache, MeanCache) deduplicates similar queries by caching responses for semantically equivalent inputs. It optimizes for query-level similarity, not context-level continuity. A user correction — "the deadline is March 15, not March 30" — is semantically dissimilar from anything previously cached and therefore will not be deduplicated. But semantic caching also has no mechanism to recognize that this fragment, once present, must be preserved in all future prompts regardless of age or similarity. Continuity Compression adds this mechanism above the caching layer.

RAG (Retrieval-Augmented Generation) retrieves relevant external documents at query time to ground responses in authoritative sources. RAG addresses the source quality dimension of context but does not address conversational continuity. A RAG system that retrieves excellent sources does not thereby preserve user corrections, decisions, or unresolved contradictions from prior turns. Continuity Compression is orthogonal to RAG — they address different layers of the context problem and can be combined.

Long-context models (Claude's 200k token context, Gemini's 1M token context) address the length constraint directly by simply including more context. They reduce the need for compression but do not eliminate it — cost and latency grow with context length, and even long-context models benefit from continuity-weighted retrieval that surfaces the most relevant prior context before the cutoff. Continuity Compression is a complement to long-context models, not a substitute for them.

The key distinction: existing approaches optimize for similarity, recency, or length. Continuity Compression optimizes for continuity significance, that is, for whatever would most degrade future reasoning quality if removed. This is a different optimization target, not an incremental improvement on existing targets.

II. Memory Card Taxonomy

The foundation of Continuity Compression is a seven-type memory card taxonomy. Every segment of a conversation is classified into one of these types before the continuity score is assigned:

Card Type	Definition	Default Priority
Correction	User has corrected a factual error, misunderstanding, or mischaracterization by the system	High — always retain
Contradiction	System has acknowledged or identified an unresolved conflict between claims	High — retain until resolved
Provenance	Source reference, citation, or attribution for a claim made by the system	High for RC-3+; moderate for RC-1/2
Decision	A choice or commitment made during the conversation that constrains future responses	High — retain for session
Preference	User-stated preference about format, depth, style, or content that should persist	Moderate — retain with decay
Task context	Relevant background about the task that is not captured elsewhere	Moderate — subject to recency weighting
Noise	Greetings, filler, redundant restatements, superseded content	Low — candidate for discard
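The taxonomy above could be represented in code roughly as follows. This is a minimal sketch, not a published implementation: the names `CardType`, `MemoryCard`, and `default_priority`, and the fields `active` and `reliance_class`, are illustrative assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class CardType(Enum):
    CORRECTION = "correction"
    CONTRADICTION = "contradiction"
    PROVENANCE = "provenance"
    DECISION = "decision"
    PREFERENCE = "preference"
    TASK_CONTEXT = "task_context"
    NOISE = "noise"


@dataclass
class MemoryCard:
    text: str
    card_type: CardType
    turn: int                # conversation turn where the fragment appeared
    active: bool = True      # False once a contradiction is resolved or a decision is superseded
    reliance_class: int = 1  # assumed RC level (1-5); only used for Provenance cards


def default_priority(card: MemoryCard) -> str:
    """Return the default priority listed in the taxonomy table."""
    if card.card_type in (CardType.CORRECTION, CardType.CONTRADICTION, CardType.DECISION):
        return "high"
    if card.card_type is CardType.PROVENANCE:
        return "high" if card.reliance_class >= 3 else "moderate"
    if card.card_type in (CardType.PREFERENCE, CardType.TASK_CONTEXT):
        return "moderate"
    return "low"  # Noise


card = MemoryCard("The deadline is March 15, not March 30", CardType.CORRECTION, turn=4)
print(default_priority(card))  # high
```

Keeping the priority in a function rather than a static field lets the Provenance rule depend on the reliance class, as the table specifies.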

Figure 1 — Continuity Compression architecture. Each conversation turn is classified into memory card types, scored by CCS, and routed to retain, compress, or discard. Retained cards plus top-k semantic retrieval from the compressed pool form the context for the next prompt.

III. The Continuity Compression Score

Each memory card receives a Continuity Compression Score (CCS) that determines whether it is retained, compressed, or discarded before the next prompt.

Continuity Compression Score: Conceptual Definition

CCS = (Correction + Contradiction + Provenance + Decision + Preference + Recency)
      / Redundancy

Where:

Correction   = 1 if card is type Correction, else 0
Contradiction= 1 if card is type Contradiction (unresolved), else 0
Provenance   = reliance_weight if card is type Provenance, else 0
               (reliance_weight: RC-4/5 = 1.0, RC-3 = 0.7, RC-1/2 = 0.3)
Decision     = 1 if card is type Decision (active), else 0
Preference   = recency_decay(age) if card is type Preference, else 0
               (exponential decay: 1.0 at turn 0, 0.5 at turn 10, roughly 0.1 at turn 30)
Recency      = recency_decay(age) for all card types
               (slower decay than Preference: 0.5 at turn 20)

Redundancy   = semantic_similarity_to_retained_cards
               (cosine similarity against already-retained card embeddings)
               Range: 0.1 (unique) to 1.0 (fully redundant)
               Values below 0.1 are clamped to 0.1 so the quotient stays bounded

Retention threshold:  CCS > 0.4  → retain
Compression zone:     CCS 0.2–0.4 → compress to summary
Discard zone:         CCS < 0.2  → discard

Note: This is a conceptual formula for experimental validation,
not a finalized production specification. Contributors are
encouraged to test alternative weighting schemes.
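The definition above can be sketched in Python as follows. This is one possible reading, under stated assumptions: `recency_decay` is modeled as a pure exponential with a half-life of 10 turns for Preference and 20 turns for Recency, a score exactly on a zone boundary is routed to "compress", and all function names are hypothetical.

```python
RETAIN_THRESHOLD = 0.4
DISCARD_THRESHOLD = 0.2
MIN_REDUNDANCY = 0.1


def recency_decay(age: float, half_life: float) -> float:
    """Exponential decay: 1.0 at age 0, 0.5 at age == half_life."""
    return 0.5 ** (age / half_life)


def reliance_weight(rc: int) -> float:
    """Weights from the definition: RC-4/5 = 1.0, RC-3 = 0.7, RC-1/2 = 0.3."""
    if rc >= 4:
        return 1.0
    return 0.7 if rc == 3 else 0.3


def ccs(card_type: str, age: int, redundancy: float,
        active: bool = True, rc: int = 1) -> float:
    """Sum the type-specific terms and divide by the clamped redundancy."""
    correction = 1.0 if card_type == "correction" else 0.0
    contradiction = 1.0 if card_type == "contradiction" and active else 0.0
    provenance = reliance_weight(rc) if card_type == "provenance" else 0.0
    decision = 1.0 if card_type == "decision" and active else 0.0
    preference = recency_decay(age, 10) if card_type == "preference" else 0.0
    recency = recency_decay(age, 20)  # applies to every card type
    denominator = min(1.0, max(MIN_REDUNDANCY, redundancy))
    total = correction + contradiction + provenance + decision + preference + recency
    return total / denominator


def route(score: float) -> str:
    """Map a score to the retain / compress / discard zones."""
    if score > RETAIN_THRESHOLD:
        return "retain"
    if score >= DISCARD_THRESHOLD:
        return "compress"
    return "discard"


print(route(ccs("correction", age=40, redundancy=1.0)))    # retain   (score 1.25)
print(route(ccs("task_context", age=30, redundancy=1.0)))  # compress (score about 0.354)
print(route(ccs("noise", age=60, redundancy=1.0)))         # discard  (score 0.125)
```

A pure exponential with a 10-turn half-life reaches 0.125 at turn 30, slightly above the 0.1 anchor listed in the definition; matching all three anchors exactly would need a different curve. Because Recency applies to every type and the divisor can be as small as 0.1, a recent, unique card of any type scores well above the retention threshold in this reading.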

IV. Benchmark Protocol

The hypothesis that Continuity Compression reduces token cost while preserving reasoning quality is testable with a straightforward three-pipeline comparison. Any contributor with access to multi-turn conversation logs and an API-based language model can run this benchmark.

Pipeline A — Full Context Baseline

Every prior turn is included in every prompt. Per-prompt token count grows linearly with conversation length, so cumulative token cost over a session grows roughly quadratically. This is the worst-case cost and the ceiling for continuity preservation.
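A rough sense of how Pipeline A's cost scales can be had from a back-of-the-envelope calculation. The sketch below assumes every turn contributes the same number of tokens, which real conversations do not; the function name and the figures are illustrative only.

```python
def full_context_tokens(turns: int, tokens_per_turn: int) -> tuple[int, int]:
    """Return (tokens in the final prompt, cumulative tokens over all prompts).

    Prompt k carries all k turns so far, so the cumulative total is
    tokens_per_turn * (1 + 2 + ... + turns).
    """
    final_prompt = turns * tokens_per_turn
    cumulative = tokens_per_turn * turns * (turns + 1) // 2
    return final_prompt, cumulative


final_prompt, cumulative = full_context_tokens(turns=50, tokens_per_turn=200)
print(final_prompt)  # 10000
print(cumulative)    # 255000
```

Under this simplification the final prompt is 50 times the size of the first, while the session as a whole costs more than 25 times the final prompt.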

Pipeline B — Semantic Cache Baseline

Deduplication using cosine similarity of sentence embeddings. Similar turns collapsed to a single representative. Standard implementation using a vector store.
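One simple way to realize this baseline is greedy deduplication: keep a turn only if it is not too similar to any turn already kept. The sketch below assumes embeddings are supplied by the caller as plain lists of floats; the 0.9 threshold and the toy two-dimensional vectors are placeholders, and a real run would use a sentence-embedding model and a vector store.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(turns: list[str], embeddings: list[list[float]],
           threshold: float = 0.9) -> list[str]:
    """Keep the first turn of each group of near-duplicates."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [turns[i] for i in kept]


turns = ["Hello", "Hi there", "The deadline is March 15, not March 30"]
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]  # made-up vectors
print(dedupe(turns, toy_embeddings))
# ['Hello', 'The deadline is March 15, not March 30']
```

The correction survives here only because it happens to be dissimilar to earlier turns; nothing in this pipeline marks it as something that must be kept.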

Pipeline C — Continuity Compression

Memory cards are classified, scored, and filtered before each prompt. The prompt contains the retained cards plus the top-k semantically relevant cards drawn from the compressed or discarded pool, which stays available for emergency retrieval if needed.
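The assembly step for Pipeline C might look like the sketch below. It assumes each card has already been scored and routed, and that cards are plain dictionaries with hypothetical keys `text`, `route`, and `embedding`; the vectors are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_context(cards: list[dict], query_embedding: list[float], k: int = 2) -> list[str]:
    """Return retained card texts, then the k pool cards most similar to the query."""
    retained = [c for c in cards if c["route"] == "retain"]
    pool = [c for c in cards if c["route"] != "retain"]  # compressed or discarded
    pool.sort(key=lambda c: cosine(c["embedding"], query_embedding), reverse=True)
    return [c["text"] for c in retained] + [c["text"] for c in pool[:k]]


cards = [
    {"text": "Deadline is March 15, not March 30", "route": "retain", "embedding": [1.0, 0.0, 0.0]},
    {"text": "Summary: user described project scope", "route": "compress", "embedding": [0.0, 1.0, 0.0]},
    {"text": "Hello!", "route": "discard", "embedding": [0.0, 0.0, 1.0]},
]
context = build_context(cards, query_embedding=[0.0, 0.9, 0.1], k=1)
print(context)
# ['Deadline is March 15, not March 30', 'Summary: user described project scope']
```

Retained cards are included unconditionally; similarity to the current query only decides which pool cards are brought back.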

Measurement Dimensions

Metric	How to Measure	Expected Direction
Token usage	Count tokens in each prompt across pipelines	C < B < A
Latency	Time from prompt submission to response start	C ≤ B < A
Correction retention	After a correction in turn N, does the system apply it in turn N+5?	A = C > B
Contradiction retention	After an acknowledged contradiction, does the system still flag it 10 turns later?	A = C > B
Provenance retention	Can the system cite the source it used in turn N when asked in turn N+8?	A = C > B
Answer quality	Human or LLM-judge rating of response accuracy and coherence	A ≈ C > B
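The correction-retention row could be scored with a probe like the one sketched below. String matching is a crude proxy chosen here for illustration (a human or LLM judge would be more robust), and the responses are invented toy data, not benchmark results.

```python
def correction_applied(response: str, corrected: str, superseded: str) -> bool:
    """True if the response uses the corrected value and not the superseded one."""
    text = response.lower()
    return corrected.lower() in text and superseded.lower() not in text


def retention_rate(outcomes: list[bool]) -> float:
    """Fraction of probes where the correction was applied (0.0 if no probes)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Invented responses at turn N+5 after the user corrected "March 30" to "March 15".
toy_responses = {
    "A": ["The deadline is March 15.", "Please submit by March 15."],
    "B": ["The deadline is March 30.", "Please submit by March 15."],
    "C": ["The deadline is March 15.", "Please submit by March 15."],
}
rates = {
    pipeline: retention_rate(
        [correction_applied(r, "March 15", "March 30") for r in responses]
    )
    for pipeline, responses in toy_responses.items()
}
print(rates)  # {'A': 1.0, 'B': 0.5, 'C': 1.0}
```

A response that mentions both values, for example while explaining the correction, is counted as a miss by this proxy, which is one reason to treat it as a lower bound.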

V. Failure Modes

Over-compression. Aggressive CCS thresholds may discard context that is genuinely needed but does not fit neatly into high-priority card types. Mitigation: maintain a compressed summary pool that can be retrieved if a query keyword matches a discarded card.

Stale preference preservation. A user preference from turn 2 that is no longer relevant at turn 50 may continue consuming context budget. Mitigation: recency decay on Preference cards, with explicit user override mechanism.

Privacy retention risk. High-priority cards such as Corrections and Decisions may contain sensitive personal information that should not persist beyond the session. Mitigation: session-scoped card stores with explicit expiry; no persistent storage without explicit user consent.

Classification errors. Misclassifying a Correction as Task Context significantly reduces its CCS and risks discard. Mitigation: dual-pass classification with confidence threshold; ambiguous cards default to higher-priority type.
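The dual-pass mitigation could be sketched as follows. The assumptions are loud ones: each pass is taken to return a probability per card type, the 0.8 confidence threshold and 0.2 plausibility floor are placeholders, and the tie-break order among the high-priority types simply follows the taxonomy table.

```python
# Highest priority first; order among the "High" types is an assumption.
PRIORITY_ORDER = [
    "correction", "contradiction", "provenance", "decision",
    "preference", "task_context", "noise",
]


def resolve(pass_a: dict[str, float], pass_b: dict[str, float],
            min_confidence: float = 0.8, floor: float = 0.2) -> str:
    """Combine two classifier passes, escalating ambiguous cards."""
    top_a = max(pass_a, key=pass_a.get)
    top_b = max(pass_b, key=pass_b.get)
    # Confident agreement: accept the label as is.
    if top_a == top_b and min(pass_a[top_a], pass_b[top_b]) >= min_confidence:
        return top_a
    # Ambiguous: take the highest-priority label either pass found plausible.
    candidates = {top_a, top_b}
    for scores in (pass_a, pass_b):
        candidates |= {label for label, p in scores.items() if p >= floor}
    return min(candidates, key=PRIORITY_ORDER.index)


ambiguous = resolve(
    {"task_context": 0.55, "correction": 0.40, "noise": 0.05},
    {"task_context": 0.60, "correction": 0.35, "noise": 0.05},
)
confident = resolve(
    {"noise": 0.95, "task_context": 0.05},
    {"noise": 0.90, "task_context": 0.10},
)
print(ambiguous, confident)  # correction noise
```

Escalating on ambiguity trades some token savings for safety: a few Task context cards will be kept as Corrections, which costs less than discarding a real correction.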

VI. Connection to Foundation Research

Continuity Compression is the infrastructure-layer expression of the Foundation's central thesis. The Continuity Receipts framework attaches provenance metadata to AI outputs. Continuity Compression preserves the provenance context within the reasoning process that produces those outputs. Together they address the full lifecycle of continuity in AI-assisted reasoning: what is preserved during generation, and what is documented about the result.

The memory card taxonomy directly extends the Chronicle architecture from the ARIA Framework. Where the ARIA Chronicle is an append-only developmental record for a persistent AI instance, the Continuity Compression card store is a session-scoped continuity record for any AI conversation. The classification principles — what matters, what can be discarded, what must be preserved at all costs — are the same in both architectures.

VII. Computational Overhead and Tradeoffs

Classification latency. Memory card classification requires a pass over each conversation turn. Using a lightweight classifier (rule-based keyword matching plus a small fine-tuned classifier) this adds approximately 5-20ms per turn. Using an LLM-based classifier adds 100-300ms but achieves higher accuracy on ambiguous cases. The benchmark should test both approaches.
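The rule-based half of a lightweight classifier might look like the sketch below. The keyword patterns are illustrative guesses for English-language chat, not a validated rule set, and unmatched turns default to Task context; a small fine-tuned model would be needed to handle what the rules miss.

```python
import re

# (label, pattern) pairs checked in order; the first match wins.
RULES = [
    ("correction", re.compile(r"\b(actually|i meant|that's (wrong|incorrect)|correction)\b", re.I)),
    ("provenance", re.compile(r"\b(according to|source|cited in)\b|https?://", re.I)),
    ("decision", re.compile(r"\b(let's go with|we('ll| will) use|decided to)\b", re.I)),
    ("preference", re.compile(r"\b(i prefer|from now on|please (always|keep|use))\b", re.I)),
    ("noise", re.compile(r"^\s*(hi|hello|hey|thanks|thank you|ok|okay)\b[\s!.,]*$", re.I)),
]


def classify(turn: str) -> str:
    """Return the first matching card type, or 'task_context' if no rule fires."""
    for label, pattern in RULES:
        if pattern.search(turn):
            return label
    return "task_context"


for turn in [
    "Actually, the deadline is March 15, not March 30.",
    "According to the 2023 report, revenue grew.",
    "Let's go with the PostgreSQL option.",
    "I prefer short answers.",
    "Thanks!",
    "The project has three phases.",
]:
    print(classify(turn), "|", turn)
```

Putting the correction rule first means a turn that both greets and corrects is treated as a Correction, in line with the asymmetric cost of misclassification.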

Embedding computation. The redundancy calculation requires embedding each new card and comparing it against existing retained cards. Using cached embeddings and approximate nearest-neighbor search, this adds approximately 10-50ms per turn at moderate context lengths. The overhead grows with the number of retained cards but remains sublinear when approximate nearest-neighbor search is used.

The continuity-compression tradeoff. Aggressive compression thresholds reduce token costs but increase the risk of discarding context that was continuity-critical. The CCS formula's redundancy penalty is the primary risk factor: a card that is semantically similar to another retained card may be discarded even if it contains a different piece of genuinely critical information. The benchmark must test false-discard rate alongside token reduction.
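The effect of the redundancy penalty can be illustrated with a small calculation, together with one way to compute the false-discard rate. The sketch assumes a pure exponential Recency term with a 20-turn half-life and a hypothetical `critical` flag supplied by human annotation; the card data is invented.

```python
def task_context_ccs(age: int, redundancy: float) -> float:
    """CCS for a Task context card, where only the Recency term is non-zero."""
    recency = 0.5 ** (age / 20)  # assumed 20-turn half-life
    return recency / min(1.0, max(0.1, redundancy))


def false_discard_rate(cards: list[dict]) -> float:
    """Share of continuity-critical cards that were routed to discard."""
    critical = [c for c in cards if c["critical"]]
    if not critical:
        return 0.0
    return sum(c["route"] == "discard" for c in critical) / len(critical)


# Same 50-turn-old card, scored against a near-duplicate versus as a unique card.
similar = task_context_ccs(age=50, redundancy=0.95)  # below 0.2, so discarded
unique = task_context_ccs(age=50, redundancy=0.30)   # above 0.4, so retained
print(round(similar, 3), round(unique, 3))

annotated = [
    {"critical": True, "route": "discard"},
    {"critical": True, "route": "retain"},
    {"critical": False, "route": "discard"},
]
print(false_discard_rate(annotated))  # 0.5
```

The two scores differ only in the divisor, which is why surface similarity to a retained card can decide the fate of an older card regardless of what it actually says.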

Token savings projection. Preliminary analysis of typical multi-turn conversations suggests that 40-60% of context is classifiable as Noise or low-priority Task context. A threshold of 0.40 is projected to reduce token usage by 35-50% while retaining all Correction, Contradiction, and Decision cards. This projection requires empirical validation across diverse conversation types.

VII.5 Theoretical Benchmarks

While empirical validation is pending, the following theoretical benchmarks provide measurable performance criteria for the open-source implementation to test against.

Metric	Projected Value	Measurement Method
Token reduction vs full-context	35–50% for typical multi-turn conversations	Token count comparison across 50+ test conversations
Correction retention rate	>95% at threshold 0.40	Inject 10 corrections per conversation; verify persistence at turn N+5
Contradiction retention rate	>90% at threshold 0.40	Inject 5 acknowledged contradictions; verify flagging at turn N+10
Answer quality delta vs full-context	<10% degradation	LLM-judge scoring on identical queries across three pipelines
Classification latency (lightweight)	5–20ms per turn	Benchmark keyword-based classifier on 1000 turns
Classification latency (LLM-based)	100–300ms per turn	Benchmark API-based classifier on 1000 turns
False-discard rate	<5% of Correction cards	Manual review of discarded cards across 20 test conversations

These projections are based on structural analysis of the CCS formula rather than measured results. The benchmark harness requested in the contribution invitation is the mechanism for validating or revising these projections.

VII.6 Compression Corruption: When Compression Harms Continuity

Compression systems are not neutral. The choices embedded in the CCS formula determine whose context is preserved and whose is discarded, with consequences that extend beyond technical performance.

Minority context erasure. In a conversation involving multiple perspectives, the CCS formula's redundancy penalty may systematically disadvantage minority viewpoints. If one perspective is expressed by many turns and another by few, the minority perspective accumulates lower CCS scores through both lower recency and higher apparent redundancy with the dominant narrative. This is not a design intention — it is a structural consequence of optimizing for continuity of the most-expressed context.

Causal chain disruption. A sequence of low-CCS turns may collectively constitute a causal chain — each step individually appearing as task context or noise while together establishing the reasoning that led to a decision. The CCS formula scores individual cards; it does not score sequences. Compression may discard the chain while retaining the decision, making the decision appear unmotivated.

Dominant narrative amplification. Repeated framing of a problem in particular terms accumulates in retained context as high-recency, low-redundancy task context. Alternative framings introduced once may score as noise relative to the dominant frame. The compression system may thereby amplify the first framing of a problem at the expense of later corrections to that framing.

These risks do not invalidate the Continuity Compression proposal — they constrain it. The CCS formula should be tested specifically for these failure modes, and the threshold calibration should incorporate sensitivity to minority context preservation alongside the primary efficiency metrics.

Known Limitations

This section follows the Foundation's institutional practice of explicitly stating known weaknesses, failure modes, and scope boundaries for every proposal. Its presence indicates analytical maturity, not weakness in the underlying proposal.

Classification accuracy ceiling. The memory card classifier must correctly identify Correction cards to protect them from compression. Misclassified corrections — discarded rather than retained — are the most dangerous failure mode. No classifier achieves perfect accuracy on ambiguous input, and the cost of Correction misclassification is asymmetric with the cost of Noise misclassification.

Compression is not semantically neutral. The CCS formula embeds value judgments about what matters. Conversations in which meaning is primarily carried by implicit context, relationship history, or cumulative nuance may be damaged by compression even when individual card scores appear appropriate.

Stale preference preservation. Preference cards with exponential decay may retain outdated preferences users have implicitly abandoned. A preference stated at turn 3 and never revisited may still influence turn 50 through a card that scores above threshold due to initial high weight.

Session scope only. The proposal is session-scoped. The same correction made in session 1 is invisible to session 2 unless a separate cross-session continuity layer exists. This limitation is by design but must be stated explicitly to avoid misapplication.

What This Paper Does Not Claim

That CCS scores are objective measures of continuity importance — they are heuristic estimates subject to calibration error
That the projected 35–50% token reduction will hold across all conversation types — this is a structural estimate requiring empirical validation
That Continuity Compression replaces the need for long-context models — it is a complement for sessions where cost and latency matter
That compressed context produces equivalent answer quality to full context — degradation under specific conditions is expected and requires measurement

Non-Adoption Scenario

Without continuity-aware context management, LLM applications face a binary choice between full-context prompting (prohibitive cost at scale) and semantic caching (cost reduction that systematically discards continuity-critical context). The absence of a middle path means that corrections, acknowledged contradictions, and prior decisions are routinely lost from AI context — producing re-explanation burden, repeated error, and reasoning drift that accumulates across millions of interactions without surfacing as attributable failures.

Open Questions

What is the correct CCS threshold for different use contexts, and should the threshold be adaptive rather than fixed?
How should the classifier handle multi-type cards that are simultaneously a Correction and a Provenance reference?
Can CCS scoring be made transparent to users, showing them which context is being retained and why?
What is the minimum benchmark dataset size required for meaningful empirical validation?

Governance Implications

Continuity Compression systems that retain Preference cards raise data retention questions requiring governance decisions: how long should preferences be retained, under what conditions should they be discarded, and who controls the retention policy. Any extension to cross-session continuity requires explicit governance frameworks for what is stored, for how long, and with what user visibility.

References and Related Work

Ge, T. et al. (2024). In-Context Autoencoder for Context Compression in a Large Language Model. ICLR.
Shi, F. et al. (2024). MeanCache: User-Centric Semantic Caching for LLM Web Services. arXiv.
Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
EM Foundation Technical Lexicon v1.0. emfoundation.net/technical-lexicon.html

VIII. Falsifiability

Any of the following results would count against the hypothesis:

✗ Correction retention below 95% at any threshold tested, indicating that the card classifier is failing to identify and protect the most critical continuity fragments.

✗ Token reduction below 20% across all tested conversation types, indicating that the compression is not providing meaningful cost savings over semantic cache alone.

✗ Answer quality degradation exceeding 10% versus the full-context baseline, indicating that discarded context was more continuity-relevant than the CCS score predicted.

✗ Classification latency exceeding 100ms per turn with the lightweight classifier, making the overhead unjustifiable for real-time applications.
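These criteria translate directly into a check that a benchmark harness could run over its results. The sketch below is one reading of the four conditions; the dictionary keys are hypothetical names, and the sample numbers are invented to exercise the function.

```python
def falsified_by(results: dict) -> list[str]:
    """Return the names of the falsification criteria that the results trigger."""
    failed = []
    # Fails if retention drops below 95% at ANY tested threshold.
    if min(results["correction_retention_by_threshold"]) < 0.95:
        failed.append("correction_retention")
    # Fails only if reduction is below 20% for ALL conversation types.
    if max(results["token_reduction_by_conversation_type"]) < 0.20:
        failed.append("token_reduction")
    if results["quality_degradation_vs_full_context"] > 0.10:
        failed.append("answer_quality")
    if results["lightweight_latency_ms_per_turn"] > 100:
        failed.append("classification_latency")
    return failed


sample = {  # invented numbers, for illustration only
    "correction_retention_by_threshold": [0.99, 0.97, 0.93],
    "token_reduction_by_conversation_type": [0.15, 0.42],
    "quality_degradation_vs_full_context": 0.04,
    "lightweight_latency_ms_per_turn": 12,
}
print(falsified_by(sample))  # ['correction_retention']
```

Note the asymmetry between the first two checks: a single weak threshold triggers the retention criterion, whereas the token criterion requires every conversation type to fall short.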

Open Source Contribution Invitation

What we need built:

A Python benchmark harness that ingests multi-turn conversation logs, classifies turns into memory card types, assigns CCS scores, runs all three pipelines, and outputs comparison metrics for token usage, latency, correction retention, contradiction retention, and provenance retention.

A simple Streamlit or Next.js visualization interface showing which cards were retained, compressed, or discarded for a given conversation — with the CCS score displayed for each.

A synthetic benchmark dataset of 50+ multi-turn conversations designed specifically to test continuity-sensitive tasks: conversations with deliberate corrections, acknowledged contradictions, stated preferences, and explicit provenance requirements.

Repository: github.com/emfoundation/continuity-compression

Contact: research@emfoundation.net