A continuity-aware context management method for reducing LLM cost and latency while preserving the context that makes reasoning coherent across time
Large language model applications typically manage context through one of two approaches: brute-force long-context prompting, which preserves everything at high token cost, or semantic caching, which deduplicates similar content but treats all context as equally valuable. Neither approach distinguishes between context that is merely old or redundant and context that is continuity-critical — the prior information whose removal would materially increase the risk of contradiction, repeated error, source loss, or reasoning drift.
Continuity Compression proposes a third approach: a continuity-aware compression layer that classifies memory fragments by type, assigns each a continuity score, and retains the fragments that score above a threshold plus the top-k semantically relevant fragments from the remainder. The method does not replace semantic caching; it adds a continuity-priority layer that semantic caching alone cannot provide.
This paper presents the conceptual design, the memory card taxonomy, the Continuity Compression Score formula, the benchmark protocol, and an open-source contribution invitation. The hypothesis is testable by any contributor with access to multi-turn conversation logs and an open-source or API-based language model.
Every multi-turn AI conversation accumulates context. After ten turns, a conversation may contain greetings, clarifications, corrections, contradictions, decisions, source references, stated preferences, and repetitions of earlier content. Current context management approaches treat this accumulation in one of two ways.
Full-context prompting sends everything. It preserves all information but costs tokens proportional to conversation length — costs that compound across many conversations and eventually become prohibitive for long-running sessions or applications with large user bases.
Semantic caching deduplicates similar content, reducing token use when conversations revisit similar ground. It works well for factual repetition but has a structural limitation: it optimizes for similarity, not significance. A user's correction of a factual error is not similar to any prior content — it is new, brief, and critical. Semantic caching has no mechanism to recognize that this fragment, unlike a repeated greeting or redundant explanation, must be preserved at any cost.
The insight driving Continuity Compression is simple: some context is decorative, redundant, or obsolete. Some context is continuity-critical. The difference is not a matter of recency or semantic similarity — it is a matter of what would happen to the quality of future reasoning if the fragment were removed.
A continuity-critical memory is any prior information that, if removed, would materially increase the risk of contradiction, repeated error, source loss, user-preference violation, or reasoning drift. These include corrections the user has made, unresolved contradictions the system has acknowledged, prior decisions that constrain future responses, provenance chains for claims the system has made, commitments the system has entered, ethical boundaries the user has established, reliance assumptions about how the output will be used, and user-specific calibration that distinguishes this user's needs from a generic user's.
Everything else — greetings, redundant explanations, stale details that have been superseded, phrasing variations of already-preserved content — is a candidate for compression or discarding.
Continuity Compression builds on existing context management research but differs in a specific and important way. Understanding the distinction is necessary to evaluate whether the approach adds genuine value.
Semantic caching (GPT Semantic Cache, MeanCache) deduplicates similar queries by caching responses for semantically equivalent inputs. It optimizes for query-level similarity, not context-level continuity. A user correction — "the deadline is March 15, not March 30" — is semantically dissimilar from anything previously cached and therefore will not be deduplicated. But semantic caching also has no mechanism to recognize that this fragment, once present, must be preserved in all future prompts regardless of age or similarity. Continuity Compression adds this mechanism above the caching layer.
RAG (Retrieval-Augmented Generation) retrieves relevant external documents at query time to ground responses in authoritative sources. RAG addresses the source quality dimension of context but does not address conversational continuity. A RAG system that retrieves excellent sources does not thereby preserve user corrections, decisions, or unresolved contradictions from prior turns. Continuity Compression is orthogonal to RAG — they address different layers of the context problem and can be combined.
Long-context models (Claude's 200k token context, Gemini's 1M token context) address the length constraint directly by simply including more context. They reduce the need for compression but do not eliminate it — cost and latency grow with context length, and even long-context models benefit from continuity-weighted retrieval that surfaces the most relevant prior context before the cutoff. Continuity Compression is a complement to long-context models, not a substitute for them.
The key distinction: existing approaches optimize for similarity, recency, or length. Continuity Compression optimizes for continuity significance: what, if removed, would most degrade future reasoning quality. This is a different optimization target, not an incremental improvement on existing targets.
The foundation of Continuity Compression is a seven-type memory card taxonomy. Every segment of a conversation is classified into one of these types before the continuity score is assigned:
| Card Type | Definition | Default Priority |
|---|---|---|
| Correction | User has corrected a factual error, misunderstanding, or mischaracterization by the system | High — always retain |
| Contradiction | System has acknowledged or identified an unresolved conflict between claims | High — retain until resolved |
| Provenance | Source reference, citation, or attribution for a claim made by the system | High for RC-3+; moderate for RC-1/2 |
| Decision | A choice or commitment made during the conversation that constrains future responses | High — retain for session |
| Preference | User-stated preference about format, depth, style, or content that should persist | Moderate — retain with decay |
| Task context | Relevant background about the task that is not captured elsewhere | Moderate — subject to recency weighting |
| Noise | Greetings, filler, redundant restatements, superseded content | Low — candidate for discard |
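The taxonomy above can be expressed as a small data model. The following is a minimal sketch, not a reference implementation: the names `CardType`, `MemoryCard`, and `default_priority` are hypothetical, and `rc_level` is an assumed integer stand-in for the RC levels named in the Provenance row, which this section does not define.

```python
from dataclasses import dataclass
from enum import Enum


class CardType(Enum):
    CORRECTION = "correction"
    CONTRADICTION = "contradiction"
    PROVENANCE = "provenance"
    DECISION = "decision"
    PREFERENCE = "preference"
    TASK_CONTEXT = "task_context"
    NOISE = "noise"


@dataclass(frozen=True)
class MemoryCard:
    turn: int            # conversation turn the fragment came from
    text: str            # the fragment itself
    card_type: CardType  # one of the seven taxonomy types


def default_priority(card: MemoryCard, rc_level: int = 1) -> str:
    """Return the default priority tier listed in the taxonomy table.

    rc_level is a hypothetical stand-in for the RC-1 to RC-3+ levels;
    their definition is assumed, not taken from the paper.
    """
    if card.card_type in (CardType.CORRECTION, CardType.CONTRADICTION, CardType.DECISION):
        return "high"
    if card.card_type is CardType.PROVENANCE:
        return "high" if rc_level >= 3 else "moderate"
    if card.card_type in (CardType.PREFERENCE, CardType.TASK_CONTEXT):
        return "moderate"
    return "low"  # Noise: candidate for discard
```

Keeping the tier lookup separate from the card itself leaves room for a deployment to override defaults without touching stored cards.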
Figure 1 — Continuity Compression architecture. Each conversation turn is classified into memory card types, scored by CCS, and routed to retain, compress, or discard. Retained cards plus top-k semantic retrieval from the compressed pool form the context for the next prompt.
Each memory card receives a Continuity Compression Score (CCS) that determines whether it is retained, compressed, or discarded before the next prompt.
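The CCS formula itself is not reproduced in this section, so the sketch below assumes a hypothetical form purely to illustrate scoring and routing: a type weight multiplied by a recency factor, minus a redundancy penalty. Every weight, the half-life, and the lower `compress_at` bound are assumptions; only the 0.40 retention threshold appears elsewhere in this paper.

```python
# Illustrative only: these weights, the half-life, the redundancy weight, and
# the lower "compress" bound are assumptions, not values from the paper.
TYPE_WEIGHT = {
    "correction": 1.0, "contradiction": 1.0, "decision": 0.9, "provenance": 0.8,
    "preference": 0.6, "task_context": 0.5, "noise": 0.1,
}
# Types the taxonomy marks for decay or recency weighting.
DECAYING_TYPES = {"preference", "task_context"}


def ccs(card_type: str, age_turns: int, redundancy: float,
        half_life: float = 20.0, redundancy_weight: float = 0.3) -> float:
    """Hypothetical score in [0, 1]: type weight x recency, minus a redundancy penalty."""
    recency = 0.5 ** (age_turns / half_life) if card_type in DECAYING_TYPES else 1.0
    raw = TYPE_WEIGHT[card_type] * recency - redundancy_weight * redundancy
    return max(0.0, min(1.0, raw))


def route(score: float, retain_at: float = 0.40, compress_at: float = 0.20) -> str:
    """Map a score to one of the three outcomes: retain, compress, or discard."""
    if score >= retain_at:
        return "retain"
    if score >= compress_at:
        return "compress"
    return "discard"
```

Under these assumed weights, `route(ccs("correction", 30, 0.0))` retains an old correction, while a 20-turn-old task-context card scores 0.25 and is compressed. Whether the real formula behaves this way is exactly what the benchmark is meant to establish.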
The hypothesis that Continuity Compression reduces token cost while preserving reasoning quality is testable with a straightforward three-pipeline comparison. Any contributor with access to multi-turn conversation logs and an API-based language model can run this benchmark.
Pipeline A (full-context baseline). Every prior turn is included in every prompt, so per-prompt token count grows linearly with conversation length. This is the worst case for cost and the ceiling for continuity preservation.
Pipeline B (semantic caching). Turns are deduplicated using cosine similarity of sentence embeddings, and similar turns are collapsed to a single representative. This is a standard implementation using a vector store.
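A minimal sketch of similarity-based deduplication follows, assuming embeddings are already computed and passed in as plain lists of floats. The greedy keep-first strategy and the 0.9 threshold are assumptions chosen for illustration; a real baseline would use a sentence-embedding model and a vector store.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(turns, embeddings, threshold=0.9):
    """Keep a turn only if it is not too similar to any turn already kept."""
    kept = []  # indices of representative turns, in conversation order
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [turns[i] for i in kept]


# Toy 2-D embeddings, hand-picked for the example.
turns = ["hello", "hi there", "the deadline is March 15"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
representatives = dedupe(turns, embeddings)
```

Here the second greeting collapses into the first, and the deadline statement survives only because its vector happens to be dissimilar. Nothing in this filter knows that it is a correction.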
Pipeline C (Continuity Compression). Memory cards are classified, scored, and filtered before each prompt. The prompt contains the retained cards plus the top-k semantically relevant cards from the compressed or discarded pool, which stays available for emergency retrieval if needed.
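The context assembly step for the continuity-aware pipeline might look like the following sketch. It assumes each card already carries a precomputed score under a hypothetical `"ccs"` key and that similarity to the current query is supplied as a dictionary; both structures are illustrative, not a prescribed interface.

```python
def build_context(cards, query_similarity, threshold=0.40, k=2):
    """Select retained cards plus the top-k pool cards most similar to the query.

    cards: list of dicts with "id", "turn", "text", and a precomputed "ccs".
    query_similarity: dict mapping card id -> similarity to the current query.
    """
    retained = [c for c in cards if c["ccs"] >= threshold]
    pool = [c for c in cards if c["ccs"] < threshold]
    top_k = sorted(pool, key=lambda c: query_similarity.get(c["id"], 0.0), reverse=True)[:k]
    # Restore conversation order so the prompt reads chronologically.
    return sorted(retained + top_k, key=lambda c: c["turn"])


cards = [
    {"id": "c1", "turn": 1, "text": "Hello!", "ccs": 0.05},
    {"id": "c2", "turn": 2, "text": "Draft the launch plan.", "ccs": 0.30},
    {"id": "c3", "turn": 3, "text": "The deadline is March 15, not March 30.", "ccs": 0.95},
    {"id": "c4", "turn": 4, "text": "Budget details from earlier.", "ccs": 0.25},
]
similarity = {"c1": 0.02, "c2": 0.40, "c4": 0.81}
context = build_context(cards, similarity, k=1)
print([c["id"] for c in context])  # ['c3', 'c4']
```

The correction is included regardless of its similarity to the query, and the pool contributes only its single best match.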
| Metric | How to Measure | Expected Direction |
|---|---|---|
| Token usage | Count tokens in each prompt across pipelines | C < B < A |
| Latency | Time from prompt submission to response start | C ≤ B < A |
| Correction retention | After a correction in turn N, does the system apply it in turn N+5? | A = C > B |
| Contradiction retention | After an acknowledged contradiction, does the system still flag it 10 turns later? | A = C > B |
| Provenance retention | Can the system cite the source it used in turn N when asked in turn N+8? | A = C > B |
| Answer quality | Human or LLM-judge rating of response accuracy and coherence | A ≈ C > B |
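As one way to score the correction-retention row, the sketch below checks whether a response at the probe turn uses the corrected value and avoids the stale one. String matching is a deliberately crude stand-in for a human or LLM-judge check, and the function names are hypothetical. A response that mentions the stale value while explaining the correction would be scored as a miss.

```python
def correction_applied(response: str, corrected: str, stale: str) -> bool:
    """True if the response mentions the corrected value and not the stale one."""
    text = response.lower()
    return corrected.lower() in text and stale.lower() not in text


def retention_rate(outcomes) -> float:
    """Fraction of probes in which the correction was applied."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no probes to score")
    return sum(outcomes) / len(outcomes)


# Responses collected at turn N+5 from one pipeline (made-up example data).
probe_responses = [
    "The deadline is March 15.",
    "You have until March 30 to submit.",
    "Submission closes on March 15, so plan backwards from there.",
    "Remember the March 15 deadline.",
]
rate = retention_rate(correction_applied(r, "March 15", "March 30") for r in probe_responses)
```

Running the same probes through all three pipelines gives the per-pipeline rates that the expected-direction column compares.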
Over-compression. Aggressive CCS thresholds may discard context that is genuinely needed but does not fit neatly into high-priority card types. Mitigation: maintain a compressed summary pool that can be retrieved if a query keyword matches a discarded card.
Stale preference preservation. A user preference from turn 2 that is no longer relevant at turn 50 may continue consuming context budget. Mitigation: recency decay on Preference cards, with explicit user override mechanism.
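A recency decay with a user override could be sketched as follows. The exponential form, the 15-turn half-life, and the `"pin"` and `"drop"` override values are assumptions made for illustration.

```python
from typing import Optional


def preference_weight(initial: float, age_turns: int, half_life: float = 15.0,
                      override: Optional[str] = None) -> float:
    """Decayed weight of a Preference card, unless the user has overridden it.

    override: None (decay normally), "pin" (keep at full weight),
    or "drop" (remove immediately). These values are hypothetical.
    """
    if override == "pin":
        return initial
    if override == "drop":
        return 0.0
    if override is not None:
        raise ValueError(f"unknown override: {override!r}")
    return initial * 0.5 ** (age_turns / half_life)
```

With these assumed parameters, a preference that starts at 0.8 halves to 0.4 after 15 turns and falls well below a 0.40 threshold by turn 50 unless the user pins it.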
Privacy retention risk. High-priority cards such as Corrections and Decisions may contain sensitive personal information that should not persist beyond the session. Mitigation: session-scoped card stores with explicit expiry; no persistent storage without explicit user consent.
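A session-scoped store with explicit expiry can be kept entirely in memory. The sketch below is one possible shape, with a hypothetical `SessionCardStore` class and a fixed time-to-live measured from the first card in a session; the injectable clock exists only to make expiry testable.

```python
import time


class SessionCardStore:
    """In-memory, session-scoped card store; nothing is written to disk."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._sessions = {}  # session_id -> (expires_at, [cards])

    def _purge_expired(self) -> None:
        now = self._clock()
        expired = [s for s, (expires_at, _) in self._sessions.items() if expires_at <= now]
        for session_id in expired:
            del self._sessions[session_id]

    def add(self, session_id: str, card: str) -> None:
        self._purge_expired()
        if session_id not in self._sessions:
            # Expiry is fixed when the session's first card is stored.
            self._sessions[session_id] = (self._clock() + self._ttl, [])
        self._sessions[session_id][1].append(card)

    def get(self, session_id: str) -> list:
        self._purge_expired()
        entry = self._sessions.get(session_id)
        return list(entry[1]) if entry else []

    def end_session(self, session_id: str) -> None:
        """Explicitly drop all cards for a session."""
        self._sessions.pop(session_id, None)
```

Any persistent backend would need a separate consent check before writing; this sketch deliberately offers no such path.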
Classification errors. Misclassifying a Correction as Task Context significantly reduces its CCS and risks discard. Mitigation: dual-pass classification with confidence threshold; ambiguous cards default to higher-priority type.
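The dual-pass mitigation can be sketched as a resolution step over two classifier outputs. The sketch assumes each pass returns a mapping from label to probability. The 0.8 agreement threshold, the 0.2 candidate floor, and the priority ordering within the same tier are all assumptions.

```python
# Ascending priority; the relative order inside a tier is an assumption.
PRIORITY_ORDER = ["noise", "task_context", "preference", "provenance",
                  "decision", "contradiction", "correction"]


def resolve(pass_a: dict, pass_b: dict, min_confidence: float = 0.8,
            floor: float = 0.2) -> str:
    """Combine two classification passes, escalating ambiguous cards.

    If both passes agree on the top label with enough confidence, use it.
    Otherwise pick the highest-priority label that either pass found plausible.
    """
    top_a = max(pass_a, key=pass_a.get)
    top_b = max(pass_b, key=pass_b.get)
    if top_a == top_b and min(pass_a[top_a], pass_b[top_b]) >= min_confidence:
        return top_a
    candidates = {top_a, top_b}
    for scores in (pass_a, pass_b):
        candidates |= {label for label, prob in scores.items() if prob >= floor}
    return max(candidates, key=PRIORITY_ORDER.index)
```

For example, two passes that both lean toward task context at around 0.55 while giving Correction 0.40 or more resolve to `"correction"`. The cost is some over-retention, which is the cheaper error given the asymmetry described above.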
Continuity Compression is the infrastructure-layer expression of the Foundation's central thesis. The Continuity Receipts framework attaches provenance metadata to AI outputs. Continuity Compression preserves the provenance context within the reasoning process that produces those outputs. Together they address the full lifecycle of continuity in AI-assisted reasoning: what is preserved during generation, and what is documented about the result.
The memory card taxonomy directly extends the Chronicle architecture from the ARIA Framework. Where the ARIA Chronicle is an append-only developmental record for a persistent AI instance, the Continuity Compression card store is a session-scoped continuity record for any AI conversation. The classification principles — what matters, what can be discarded, what must be preserved at all costs — are the same in both architectures.
Classification latency. Memory card classification requires a pass over each conversation turn. With a lightweight classifier (rule-based keyword matching plus a small fine-tuned classifier), this adds approximately 5-20 ms per turn. An LLM-based classifier adds 100-300 ms per turn but achieves higher accuracy on ambiguous cases. The benchmark should test both approaches.
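The rule-based half of a lightweight classifier might resemble the sketch below, together with a helper for measuring mean per-turn latency. The cue phrases are illustrative assumptions rather than a vetted rule set, and the sketch omits the small fine-tuned classifier, so its timings are not comparable to the figures quoted above.

```python
import re
import time

# First matching rule wins; cue phrases are illustrative assumptions.
RULES = [
    ("correction", re.compile(r"\b(?:actually|i meant|that's wrong|that is wrong|correction)\b|,\s*not\b", re.I)),
    ("provenance", re.compile(r"\b(?:according to|cited in)\b|\bsource:|https?://", re.I)),
    ("decision", re.compile(r"\b(?:let's go with|we decided|we will use|i'll go with)\b", re.I)),
    ("preference", re.compile(r"\b(?:i prefer|please always|from now on)\b", re.I)),
    ("noise", re.compile(r"^\s*(?:hi|hello|hey|thanks|thank you|ok|okay)\b[\s!.]*$", re.I)),
]


def classify(turn: str) -> str:
    """Return the first matching card type, defaulting to task_context."""
    for label, pattern in RULES:
        if pattern.search(turn):
            return label
    return "task_context"


def mean_latency_ms(turns) -> float:
    """Mean wall-clock classification time per turn, in milliseconds."""
    start = time.perf_counter()
    for turn in turns:
        classify(turn)
    return (time.perf_counter() - start) * 1000 / len(turns)
```

Rules this simple will miss implicit corrections and cannot flag unresolved contradictions, which is where the second-stage classifier or an LLM pass would earn its latency.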
Embedding computation. The redundancy calculation requires embedding each new card against existing retained cards. Using cached embeddings and approximate nearest-neighbor search, this adds approximately 10-50ms per turn at moderate context lengths. The overhead grows with context size but remains sublinear.
The continuity-compression tradeoff. Aggressive compression thresholds reduce token costs but increase the risk of discarding context that was continuity-critical. The CCS formula's redundancy penalty is the primary risk factor: a card that is semantically similar to another retained card may be discarded even if it contains a different piece of genuinely critical information. The benchmark must test false-discard rate alongside token reduction.
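Both sides of this tradeoff reduce to simple ratios once cards carry a gold label and a routing outcome. The sketch below assumes hypothetical `"gold_type"` and `"route"` keys on each card and treats Correction, Contradiction, and Decision as the critical set; a benchmark could narrow that set to Corrections alone.

```python
CRITICAL_TYPES = frozenset({"correction", "contradiction", "decision"})


def false_discard_rate(cards, critical_types=CRITICAL_TYPES) -> float:
    """Share of continuity-critical cards (by gold label) that were discarded."""
    critical = [c for c in cards if c["gold_type"] in critical_types]
    if not critical:
        return 0.0
    discarded = sum(1 for c in critical if c["route"] == "discard")
    return discarded / len(critical)


def token_reduction(full_tokens: int, compressed_tokens: int) -> float:
    """Fractional token saving of a compressed prompt versus the full-context prompt."""
    if full_tokens <= 0:
        raise ValueError("full_tokens must be positive")
    return 1 - compressed_tokens / full_tokens
```

Reporting the two numbers together for each threshold setting makes it visible when a token saving was bought by discarding critical cards.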
Token savings projection. Preliminary analysis of typical multi-turn conversations suggests that 40-60% of context is classifiable as Noise or low-priority Task context. A threshold of 0.40 is projected to reduce token usage by 35-50% while retaining all Correction, Contradiction, and Decision cards. This projection requires empirical validation across diverse conversation types.
While empirical validation is pending, the following theoretical benchmarks provide measurable performance criteria for the open-source implementation to test against.
| Metric | Projected Value | Measurement Method |
|---|---|---|
| Token reduction vs full-context | 35–50% for typical multi-turn conversations | Token count comparison across 50+ test conversations |
| Correction retention rate | >95% at threshold 0.40 | Inject 10 corrections per conversation; verify persistence at turn N+5 |
| Contradiction retention rate | >90% at threshold 0.40 | Inject 5 acknowledged contradictions; verify flagging at turn N+10 |
| Answer quality delta vs full-context | <10% degradation | LLM-judge scoring on identical queries across three pipelines |
| Classification latency (lightweight) | 5–20ms per turn | Benchmark keyword-based classifier on 1000 turns |
| Classification latency (LLM-based) | 100–300ms per turn | Benchmark API-based classifier on 1000 turns |
| False-discard rate | <5% of Correction cards | Manual review of discarded cards across 20 test conversations |
These projections are based on structural analysis of the CCS formula rather than measured results. The benchmark harness requested in the contribution invitation is the mechanism for validating or revising these projections.
Compression systems are not neutral. The choices embedded in the CCS formula determine whose context is preserved and whose is discarded, with consequences that extend beyond technical performance.
Minority context erasure. In a conversation involving multiple perspectives, the CCS formula's redundancy penalty may systematically disadvantage minority viewpoints. If one perspective is expressed by many turns and another by few, the minority perspective accumulates lower CCS scores through both lower recency and higher apparent redundancy with the dominant narrative. This is not a design intention — it is a structural consequence of optimizing for continuity of the most-expressed context.
Causal chain disruption. A sequence of low-CCS turns may collectively constitute a causal chain — each step individually appearing as task context or noise while together establishing the reasoning that led to a decision. The CCS formula scores individual cards; it does not score sequences. Compression may discard the chain while retaining the decision, making the decision appear unmotivated.
Dominant narrative amplification. Repeated framing of a problem in particular terms accumulates in retained context as high-recency, low-redundancy task context. Alternative framings introduced once may score as noise relative to the dominant frame. The compression system may thereby amplify the first framing of a problem at the expense of later corrections to that framing.
These risks do not invalidate the Continuity Compression proposal — they constrain it. The CCS formula should be tested specifically for these failure modes, and the threshold calibration should incorporate sensitivity to minority context preservation alongside the primary efficiency metrics.
This section follows the Foundation's institutional practice of explicitly stating known weaknesses, failure modes, and scope boundaries for every proposal. Its presence indicates analytical maturity, not weakness in the underlying proposal.
Classification accuracy ceiling. The memory card classifier must correctly identify Correction cards to protect them from compression. Misclassified corrections — discarded rather than retained — are the most dangerous failure mode. No classifier achieves perfect accuracy on ambiguous input, and the cost of Correction misclassification is asymmetric with the cost of Noise misclassification.
Compression is not semantically neutral. The CCS formula embeds value judgments about what matters. Conversations in which meaning is primarily carried by implicit context, relationship history, or cumulative nuance may be damaged by compression even when individual card scores appear appropriate.
Stale preference preservation. Preference cards with exponential decay may retain outdated preferences users have implicitly abandoned. A preference stated at turn 3 and never revisited may still influence turn 50 through a card that scores above threshold due to initial high weight.
Session scope only. The proposal is session-scoped. The same correction made in session 1 is invisible to session 2 unless a separate cross-session continuity layer exists. This limitation is by design but must be stated explicitly to avoid misapplication.
Without continuity-aware context management, LLM applications face a binary choice between full-context prompting (prohibitive cost at scale) and semantic caching (cost reduction that systematically discards continuity-critical context). The absence of a middle path means that corrections, acknowledged contradictions, and prior decisions are routinely lost from AI context — producing re-explanation burden, repeated error, and reasoning drift that accumulates across millions of interactions without surfacing as attributable failures.
What is the correct CCS threshold for different use contexts — and should the threshold be adaptive rather than fixed? How should the classifier handle multi-type cards that are simultaneously a Correction and a Provenance reference? Can CCS scoring be made transparent to users — showing them which context is being retained and why? What is the minimum benchmark dataset size required for meaningful empirical validation?
Continuity Compression systems that retain Preference cards raise data retention questions requiring governance decisions: how long should preferences be retained, under what conditions should they be discarded, and who controls the retention policy. Any extension to cross-session continuity requires explicit governance frameworks for what is stored, for how long, and with what user visibility.
Ge, T. et al. (2024). In-Context Autoencoder for Context Compression in a Large Language Model. ICLR.
Gill, W. et al. (2024). MeanCache: User-Centric Semantic Caching for LLM Web Services. arXiv.
Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
EM Foundation Technical Lexicon v1.0. emfoundation.net/technical-lexicon.html
What we need built:
A Python benchmark harness that ingests multi-turn conversation logs, classifies turns into memory card types, assigns CCS scores, runs all three pipelines, and outputs comparison metrics for token usage, latency, correction retention, contradiction retention, and provenance retention.
A simple Streamlit or Next.js visualization interface showing which cards were retained, compressed, or discarded for a given conversation — with the CCS score displayed for each.
A synthetic benchmark dataset of 50+ multi-turn conversations designed specifically to test continuity-sensitive tasks: conversations with deliberate corrections, acknowledged contradictions, stated preferences, and explicit provenance requirements.
Repository: github.com/emfoundation/continuity-compression
Contact: research@emfoundation.net