Context as Environment: From Injection to Programmatic Navigation

An agent that receives too much context gives a worse answer. A persona that receives too much context becomes a different person.


1. The Problem

Large language models are stateless. Every API call arrives without memory of the last. The quality of any response is determined not by model capability alone but by the context assembled before the call: the system prompt, persona definition, retrieved documents, conversation history, user profile, and environmental signals that together tell the model what to be and what to know. The model is a fixed function; the prompt is the variable. Getting context right matters more than choosing the right model.

The industry treats context assembly as an agent problem: how to get the right information to the model so it completes a task. An agent needs knowledge and tools. A persona needs knowledge, tools, identity, voice, relationship history, and emotional awareness. Context assembly for agents optimizes for task relevance. Context assembly for personas optimizes for task relevance under identity constraints. Every piece of injected context competes with the persona definition for attention weight. An agent that receives too much context gives a worse answer. A persona that receives too much context becomes a different person. The failure mode is identity loss, not degraded accuracy.

Zhang, Kraska & Khattab's Recursive Language Models paper (MIT CSAIL, January 2026) surveys prior approaches to long-context processing and proposes a framework that treats the prompt as part of an external environment the model navigates programmatically.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. The paper evaluates against context condensation, retrieval agents, coding agents, and sub-agent delegation, finding each insufficient. Their RLM framework formalizes three design choices: give the model a symbolic handle to the data, construct output programmatically rather than autoregressively, and enable symbolic recursion where code invokes the model on subsets of the input.

Prior Approaches

The RLM paper evaluates against three prior approaches: direct context loading, context condensation, and sub-agent delegation.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. Each fails personas in specific ways.

Direct context loading dumps everything into the context window: all tool descriptions, all retrieval results, complete conversation history. The RLM paper frames the core problem: “arbitrarily long user prompts should not be fed into the neural network directly but should instead be treated as part of the environment.”1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026.

Failure ModeMechanismPersona-Specific Impact
Attention decaySoftmax distributes weight across all tokens; early instructions lose influence at position 1000+Persona instructions at position 0 have diminishing effect on generation at position 5000+. The model forgets who it is.
Token competitionPersona, tools, context, and history compete for the same attention budget30%+ consistency degradation after 8 to 12 turns2Li, K., Liu, T., Bashkansky, N., et al. (2024). Measuring and Controlling Instruction (In)Stability in Language Model Dialogs. COLM 2024.
Cost scalingEvery injected token is billed on input; redundant context multiplies costLinear cost growth per injected source
Injection surfaceEvery injected document is an attack vector for prompt injectionUntrusted content can override persona instructions, causing identity hijacking
Recency biasModels favor recent tokens over early onesPersona definition at the start of the window fades; retrieved content at the end dominates

An agent that suffers attention decay produces less accurate answers. A persona that suffers attention decay loses its identity. The persona instructions that define who this system is sit at position 0, where attention decay hits hardest. Our own Persona Architecture documents this: LLaMA2-70B shows significant drift within 8 turns, and larger models experience greater identity drift.2Li, K., Liu, T., Bashkansky, N., et al. (2024). Measuring and Controlling Instruction (In)Stability in Language Model Dialogs. COLM 2024.3Choi, J., Hong, Y., Kim, M., & Kim, B. (2024). Examining Identity Drift in Conversations of LLM Agents.

Context condensation (the RLM paper's term, also called compaction) compresses context when it exceeds a length threshold.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. The compression is lossy, and for personas, what gets lost matters. Compaction algorithms discard what looks redundant: repeated phrases, stylistic patterns, relationship details, emotional markers. These are the signals a persona needs to maintain identity and voice.

Sub-agent delegation (also the RLM paper's term) allows LLMs to invoke themselves as sub-agents, routing work to specialists.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. The RLM paper identifies the remaining bottleneck: prior self-delegation approaches “are handicapped by the underlying LLM's limited output lengths because they are designed to verbalize sub-calls autoregressively rather than producing them programmatically.”1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. For personas, even when delegation compresses results effectively, the orchestrator still assembles those results into a single prompt where the persona definition loses attention weight against the retrieved content.

Our Position

We adopt and extend the RLM framework for personas. The additional constraint is that every context decision (what to fetch, what to filter) must preserve identity coherence across turns, sessions, and model swaps. The filtering layers described in this document improve task quality by removing irrelevant context (the agent benefit) and protect persona integrity by preventing context from diluting identity (the persona-specific benefit). The measurement layer (ICS, VCS, MCS, CFS, DR) validates that the second purpose held. No agent framework measures these because agents do not have identity to protect.

We are shipping a hybrid implementation today. The persona layer, cognitive pipeline, and decision architecture already treat context as something the system navigates rather than dumps. An LLM call decides which providers to activate and what queries to run, all before any context enters the response generation window.


2. Adopting the RLM Framework for Persona Context

The RLM paper identifies three design choices that separate effective context scaffolds from ineffective ones: a symbolic handle to the prompt, programmatic output construction, and symbolic recursion.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. Our architecture draws on the first directly, extends the third in sequential form, and builds its central mechanism around a filtering pattern the paper observes as emergent behavior in effective scaffolds. The additional constraint at every step: filtering must preserve identity coherence, not just task relevance.

2.1 Symbolic Handle to Context

RLM design choice 1. The model receives a symbolic handle to the data rather than the data itself, manipulating it without copying it into the root context window.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026.

Our implementation. Two structures give the model a handle to context without dumping raw data into the window.

UserContextEnvelope assembles all available context into a typed structure. Four fetches run concurrently:

// user-context.service.ts:76-82
const [userProfile, capsuleContext, personaData, memoryContext] =
  await Promise.all([
    this.fetchUserProfile(params.userId),
    this.fetchCapsuleContext(params.userId, params.avatar.capsuleId),
    this.fetchPersona(params.avatar),
    this.fetchMemoryContext(avatarId, params.userId),
  ]);

The envelope contains user profile, persona data, capsule context, session summary, interest graph, privacy rules, relevance signals, and pre-reasoning hints. The cognitive pipeline receives an envelope summary, not the raw envelope. The model sees what context is available and decides what to pull in.

WorkingMemory stores context in typed, immutable regions:

WorkingMemory
├── core             character identity
├── persona          RICE definition, traits, voice
├── context          user profile, locale, timezone, session
├── history          conversation turns
├── reasoning        internal monologue (not sent to user)
├── web_context      search results
├── capsule_context  knowledge base data
└── summary          compressed conversation history
// working-memory.ts:84-86
getRegion(region: string): readonly MemoryEntry[] {
  return this._memories.filter((m) => m.region === region);
}

Each mutation returns a new WorkingMemory instance. The pipeline can filter regions, keep only specific ones, remove reasoning traces, and compress history without modifying the underlying data:

withoutReasoning(): WorkingMemory         // Strip internal monologue
keepRegions(...regions): WorkingMemory    // Keep only selected regions
compress(summary, keepRecent): WorkingMemory  // Compress old history, keep N recent

The cognitive pipeline navigates these regions, reading and writing to specific areas rather than processing a flat string. This is the symbolic handle: structured, typed, navigable context that the model accesses through methods, not through raw token consumption.

2.2 Programmatic Filtering

RLM observation. Effective scaffolds filter input using code execution based on model priors.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. The RLM paper does not formalize this as one of its three design choices, but the pattern is central to its results: RLM(GPT-5) achieves 91.3% on BrowseComp-Plus at $0.99 average cost, outperforming a summary agent (70.5%) that costs more because it ingests context rather than filtering it. Spend compute on filtering, get better quality and lower cost.

Our implementation. Three filtering layers sit between available context and the response prompt. Each is driven by model decisions or model-scored heuristics.

Filtering LayerMechanismReduction
Provider activationplanContextRetrieval selects which providers fire based on intent and envelope metadata. A greeting does not trigger web search. A factual question does not pull device context. Providers that would contribute irrelevant context never execute.Providers reduced from all available to only those the LLM deems necessary
Remote relevance scoringCapsuleFilterService sends raw context to a Cloudflare Worker that runs a lightweight LLM to score and filter for relevance against the user's question. Only content the worker deems relevant survives.Variable; a 50K-character knowledge base can reduce to the relevant 3K
Focus compressionPreReasoningService.summarize() collapses all available context into a single targeting directive. The response-generating LLM sees a priority string, not the full source material.N context sources reduced to one focusing sentence

The prompt does not grow linearly with available context volume. It grows with query complexity, capped by relevance scoring. A user with 50 capsules and 3 web search providers does not produce a prompt 53x larger than a user with 1 capsule.

Layer 1: planContextRetrieval. The cognitive pipeline includes an LLM call that plans context retrieval before any context is fetched for the response. The model sees metadata about available providers and the user's intent, then decides which providers to activate and what queries to run:

// plan-context-retrieval.step.ts (buildPrompt)
`You are planning context retrieval for a response.
Only request providers that are allowed by packs/capabilities/consent; if unsure, request fewer.

ALLOWED PROVIDERS: ${allowedProviders}

USER MESSAGE: "${input.userMessage}"
${intentLine}

ENVELOPE SUMMARY:
${envelopeSummary}

Return JSON only:
{
  "providersNeeded": ["static" | "device" | "legacy" | "legacy_capsule" | "parallel"],
  "searchQueries": ["query 1", "query 2"],
  "retrievalGoal": "one sentence",
  "riskFlags": ["consent", "sensitive", "cost"]
}`
FieldTypePurpose
providersNeededstring[]Which providers to activate (max: allowed set)
searchQueriesstring[]Up to 3 queries, 160 chars each
retrievalGoalstringOne-sentence retrieval objective
riskFlagsstring[]Consent, sensitivity, cost warnings

Layer 2: CapsuleFilterService. Capsule context is sent to a Cloudflare Worker for lightweight LLM relevance scoring:

// capsule-filter.service.ts:71-81
const response = await fetch(workerUrl, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    context: rawContext,
    question: userQuestion.trim(),
    filterOnly: true,
    useFastFilter,
    maxFilterTokens: maxTokens,
  }),
});

Layer 3: ContextSanitizerService. Every piece of external context passes through a sanitizer that combines prompt injection defense with structural noise reduction:

// context-sanitizer.service.ts:6-8
private readonly instructionLine = /ignore previous|system:|developer:|instruction:|you are/i;
private readonly headingCommand = /^#{1,6}\s+.*(ignore|system|developer|instruction|...)/i;
private readonly fenceStart = /^(```|~~~)/;

Lines matching injection patterns are stripped. Code blocks containing prompt or system references are dropped entirely. Blocks exceeding 2,000 characters are clamped. The result is fenced with an explicit untrusted-data header. Injection defense and context compression are the same operation: both reduce what enters the window.

Layer 4: PreReasoningService.summarize(). All available context is reduced to a single focusing string:

// pre-reasoning.service.ts:54-57
const focus =
  highlights.length > 0
    ? `Prioritize: ${highlights.join(' | ')}`
    : 'Use user prompt and standard persona tone.';

This collapses N context sources into one targeting directive. The model sees capsule hints, relevance signals, and a prioritized focus string rather than all the raw context.

2.3 Sequential Processing

RLM design choice 3. The RLM framework requires symbolic recursion: code running inside the environment invokes the model on programmatically constructed subsets of the input, storing intermediate results in variables.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. Our implementation is sequential rather than recursive, but draws on the same principle of decomposing a problem into multiple model calls with shared state.

Our implementation. The cognitive pipeline runs six sequential LLM calls. Each call builds on the results of the previous, with WorkingMemory storing intermediate outputs across steps:

CognitivePipelineService.deepProcess()

  Step 1: detectIntent()         → intent, topics, confidence
  Step 2: internalMonologue()    → assumptions, questions, risks
  Step 3: planContextRetrieval() → providers, queries, goal
  Step 4: planResponse()         → tone, structure, key points
  Step 5: externalDialog()       → final response generation
  Step 6: summarizeConversation()→ compress if history > N

Between Steps 3 and 5, context is fetched and filtered according to the retrieval plan (Section 2.2). The response-generating LLM in Step 5 never sees the raw data. It sees the output of a multi-stage filtering pipeline driven by its own earlier planning step.

Each step operates on a different slice of the problem: Step 1 processes the user's message, Step 2 processes the intent against the persona, Step 3 processes the envelope metadata, Steps 4-5 process the filtered context. WorkingMemory is the shared state across steps, storing intermediate results between calls the way the RLM's REPL environment stores intermediate results in variables. The gap between our sequential pipeline and full symbolic recursion is discussed in Section 5.2.


3. End-to-End: How a Message Traverses the Pipeline

Sections 2.1 through 2.3 describe design choices individually. In practice, a single user message traverses all three in sequence within the decision architecture's closed loop. The state machine determines which executable scripts are available, scripts define what context the pipeline needs, and the success state (the composite condition where the user is understood, trust has formed, and the outcome is achieved) is what the pipeline drives toward. Context is fuel for reaching that state, not an end in itself.

flowchart TB
    UM["User Message"]
    UM --> SS["State · Scripts · Goal<br/>current state gates behavior, scripts define context needs"]
    SS --> SH["Symbolic Handle<br/>context as navigable structure, not flat input"]
    SH --> SP["Sequential Planning<br/>intent → monologue → retrieval goal g → plan π"]
    SP --> PF["Programmatic Filtering<br/>reduce volume, preserve identity signals"]
    PF -.->|"unfiltered"| SUB["Sub-LLM (depth=1)<br/>score relevance against g"]
    SUB -.->|"scored subset"| PF
    PF --> PC["Prompt Construction<br/>identity + RICE overlay + filtered context + script outputs"]
    PC --> LLM["LLM Backend<br/>model-agnostic, persona-constrained"]
    LLM --> MS["Measurement<br/>ICS · VCS · MCS · CFS · DR"]
    MS -.->|"scores adjust scripts, state transitions toward success"| SS
    MS --> VR["Validated Response"]

The pipeline runs inside the decision loop. Each node has a function and a role toward the success state.

NodeFunctionWhat It Serves
State · Scripts · GoalCurrent state gates available scripts (ENGAGED: extraction + coherence; CONTEXT_RICH: inference over extraction)Executable scripts (pre-reasoning, validation, post-reasoning) define what context the pipeline needs
Symbolic HandleAssembles context as typed, navigable structureModel navigates regions rather than ingesting flat input (RLM design choice 1)
Sequential PlanningIntent → monologue → retrieval goal g → provider plan πg is a one-sentence objective for what context should achieve this turn
Programmatic FilteringProvider activation, sub-LLM relevance scoring against g, focus compressionReduces volume while preserving identity signals
Prompt ConstructionIdentity + RICE overlay + filtered context + script outputsAssembles final prompt from all pipeline outputs
LLM BackendModel-agnostic generation, persona-constrainedState gates what kind of response is appropriate
MeasurementICS · VCS · MCS · CFS · DR scored per turnScores adjust scripts; state transitions toward success state

4. Context Packs and Vertical Specialization

Context packs are the unit of composability in the system. Each pack declares its providers, access rules, capability requirements, and cache policies. A persona activates the packs assigned to it. The packs determine which providers fire, which context sources are available, and what access constraints apply. Packs compound: adding a pack to a persona does not duplicate the filtering pipeline; it extends the breadth of context available to the retrieval planner while the filtering layers (Section 2.2) constrain the volume that enters the prompt.

flowchart TB
    P["Persona Definition<br/>role, identity, communication, engagement"]
    P -->|"activates"| DP["Domain Packs<br/>one per vertical: provider, access rules, cache"]
    P -->|"activates"| SP["Shared Packs<br/>brand constraints, user context"]
    DP & SP --> RP["Retrieval Planner<br/>select relevant subset from all active packs"]
    RP --> FF["Filtering Pipeline<br/>score, compress, sanitize"]

Vertical specialization follows from this composability. A healthcare persona is a RICE definition plus a stack of healthcare context packs: patient demographics, medication lists, lab results, clinical guidelines, each with its own provider and access rules (tier gating, age restrictions, consent requirements). A financial advisory persona stacks market data packs, portfolio context, and regulatory constraint packs. The persona shapes how context is presented; the packs determine what context exists to present.

VerticalContext Pack StackWhat the Persona Layer Adds
HealthcarePatient chart, medications, labs, clinical guidelines, care planClinical voice, evidence-based framing, empathetic tone for patient-facing interactions
Financial servicesPortfolio, market data, transaction history, regulatory constraintsFiduciary voice, risk-aware recommendations, compliance language
Customer serviceProduct catalog, support history, brand guidelines, known issuesBrand voice, solution-oriented framing, identity consistency across every interaction

Brand context enters here as a context pack: approved terminology, forbidden phrases, tone parameters, escalation language, competitive positioning rules. When a customer service persona activates the brand pack, the voice constraints become part of the navigable context, scored by VCS (Voice Consistency Score) on every turn. The decision architecture gates which brand packs are available to which personas. The measurement layer validates that the brand voice held.

The deployment model follows from this composability. A vertical deployment is a persona definition (RICE) plus a stack of context packs, wrapped in the decision architecture (state machine, measurement loop, context pipeline). Ship the stack. Swap the model underneath. Identity and domain context hold across model swaps because neither lives inside the model. The persona with 5 packs and the persona with 15 packs pass through the same filtering pipeline. What changes is not the prompt size but the system's ability to find relevant context for a wider range of queries.


5. What We Measure

5.1 Persona Metrics (Existing)

Five persona metrics run inside the decision loop, scored every turn: ICS (identity coherence), VCS (voice consistency), MCS (memory continuity), CFS (context fidelity), and DR (drift rate). For the context pipeline, the two most relevant are CFS, which measures the gap between stored context and applied context (did the model use what the pipeline injected?), and DR, which tracks systematic degradation over conversation length (is the pipeline losing ground turn over turn?).

These metrics separate persona context pipelines from agent context pipelines. Agent frameworks measure task accuracy and cost. They do not measure whether identity held, whether voice was consistent, or whether the system remembered what the user shared last session. These metrics exist because the context pipeline preserves identity, not just delivers information.

5.2 Context Pipeline Metrics (Planned)

The persona metrics validate output quality. They do not measure the efficiency of the context pipeline itself. The RLM paper demonstrates that context-aware systems should also measure token efficiency, cost scaling, and filtering effectiveness.1Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. We need equivalent metrics for the persona context pipeline.

MetricWhat It MeasuresStatus
Token EfficiencyTotal input tokens consumed per response, including all filtering and planning LLM calls. Lower is better at equal quality.Planned. Token counts are logged per step but not yet aggregated per response.
Filtering Reduction RatioRatio of available context tokens (all providers, all packs) to tokens actually injected into the response prompt.Planned. CapsuleFilterService logs reduction percentages. Needs full-pipeline aggregation.
Provider Activation RatePer-provider: how often activated, how often the activated context was actually used in the response (measured via CFS).Planned. Provider activation is logged. Usage correlation not yet implemented.
Cost per Enriched ResponseTotal API cost for the full pipeline: planning calls + filtering calls (Craig worker) + response generation.Planned. Individual call costs tracked via Prometheus. Needs per-response aggregation.
Context Scaling BehaviorHow response quality (ICS, CFS) and cost change as available context volume grows (number of active packs, total provider output).Not yet measured. Requires controlled evaluation across varying pack counts.

Token efficiency and filtering reduction validate that the pipeline spends compute wisely. Provider activation validates that the LLM-driven planning step makes good decisions. Cost per enriched response validates that the multi-step pipeline justifies its overhead compared to a single LLM call with dumped context. Context scaling behavior validates the core claim of Section 2.2: that prompt size does not grow linearly with available context volume.


6. Where This Goes

6.1 Today: LLM-in-the-Loop Context Planning

The current system uses an LLM call (planContextRetrieval) to decide what context to fetch. This is sub-agent delegation where one of the “sub-agents” is an LLM that plans the retrieval rather than a static routing rule. The model sees metadata (envelope summary, intent, monologue) and emits a structured plan (providers, queries, retrieval goal). Symbolic handles (Section 2.1) and programmatic filtering (Section 2.2) are implemented. Sequential processing (Section 2.3) is in place but not yet recursive. The pipeline does not yet invoke itself on programmatically constructed subsets of the input.

6.2 Near-Term: Full Programmatic Context Navigation

The next step is the model writing retrieval queries against structured context stores, with the persona layer operating as a navigable environment rather than a static prompt block. Three directions follow from the RLM pattern and the context pack composability described in Section 4.

Persona and pack definitions as queryable objects. Today the full RICE definition and all activated context packs are injected into every prompt. In a full RLM architecture, the model receives metadata about the persona and each active pack (section count, entry count, date range, labels), then queries specific sections as needed. A factual question might pull only the Role layer and the relevant capsule section. An emotionally loaded message might pull the full Engagement overlay. The persona and its pack stack become a structured store the model navigates, not blocks it reads end to end.

Search results as a navigable variable. Today, ParallelContextProvider returns search results that enter the prompt after filtering. In a full RLM architecture, search results live as a structured variable: the model sees result metadata (title, source, confidence score, timestamp) and writes queries to pull specific excerpts. The persona already shapes what gets retrieved (POV-aware query rephrasing). The next step is the persona shaping what gets read from those results.

Context packs as vertical REPL environments. A 500-page patient chart lives as a structured object: demographics, medications, lab results, visit notes, imaging reports, each as a context pack section, typed, indexed, and queryable. The model receives metadata:

Available context: Patient chart (523 pages)
  Sections: demographics, medications (47 entries), labs (312 results),
            visit_notes (89 entries), imaging (23 reports)
  Date range: 2019-01-15 to 2026-02-18

Instead of injecting the full chart, the model writes a query:

{
  "retrievalQueries": [
    { "section": "medications", "filter": "active=true", "limit": 20 },
    { "section": "labs", "filter": "date > 2025-06-01 AND category=metabolic", "limit": 10 },
    { "section": "visit_notes", "filter": "date > 2025-09-01", "limit": 5 }
  ]
}

Only relevant sections enter the window. The same pattern applies to every vertical: the financial advisor queries portfolio positions and recent transactions, not the full account history. The customer service agent queries open tickets and known issues, not the full product catalog. The WorkingMemory region structure already supports this. Context packs define what sections exist. The retrieval planner decides which sections to query. The filtering pipeline constrains how much enters the prompt.

6.3 Connection to Measurement

When the pipeline is fully automated (the model decides what to retrieve and what to filter), you can measure what the model chose to attend to versus what a human expert would have used. CFS already measures whether the model used injected context. The next step is measuring whether the model selected the right context in the first place, comparing the retrieval plan against expert retrieval for the same query.

The pipeline makes this auditable. Every step logs its inputs and outputs. The retrieval plan is a JSON object. The filtered context has measured reduction ratios. The final prompt has named, sized blocks. The full path from user message to response is observable and measurable, which is the prerequisite for systematic improvement.

endnotes

  1. Zhang, A. L., Kraska, T., & Khattab, O. (2026). Recursive Language Models. MIT CSAIL. arXiv:2512.24601v2. January 2026. github.com/alexzhang13/rlm
  2. Li, K., Liu, T., Bashkansky, N., Bau, D., Viegas, F., Pfister, H., & Wattenberg, M. (2024). Measuring and Controlling Instruction (In)Stability in Language Model Dialogs. COLM 2024. arXiv:2402.10962
  3. Choi, J., Hong, Y., Kim, M., & Kim, B. (2024). Examining Identity Drift in Conversations of LLM Agents. arXiv:2412.00804