Context Engineering for AI Agents: Structuring Context Windows, Managing Budgets, and Designing Deterministic Decision Pipelines

Every AI agent decision begins with a context window. Before the model reasons, before rules evaluate, before any action is taken, there is a finite block of text that represents everything the system knows at the moment of decision. The quality, structure, and completeness of that block determine whether the decision is reliable or arbitrary. A model that receives the right context in the right structure is far more likely to produce consistent, policy-aligned outputs. A model that receives degraded, poorly structured, or incomplete context will produce outputs that look plausible but deviate from policy in ways that are difficult to detect and hard to reproduce.

Context engineering is the discipline of designing that block deliberately. It is not prompt engineering, which focuses on how to phrase instructions. It is not retrieval-augmented generation, which focuses on how to find relevant documents. Context engineering encompasses both of those activities and adds a layer that neither addresses on its own: how to architect the full context window as a structured, budgeted, and reproducible input to the decision layer.

This article covers three aspects of context engineering that are prerequisite to deterministic AI decision-making: context window architecture, which defines the structural layers of context and why conflating them produces non-deterministic behavior; context budgeting, which addresses the finite allocation problem that silently degrades decision quality; and context pipeline design, which engineers the retrieval-compression-injection sequence that populates context before the decision layer evaluates.

Context Window Architecture: Three Layers That Must Not Be Conflated

Most AI agent implementations treat the context window as a single, undifferentiated block: system instructions at the top, conversation history in the middle, and the latest user message at the bottom. This flat structure works for simple conversational assistants. It fails for agents that must make governed, repeatable decisions, because it conflates three fundamentally different types of context that have different lifetimes, different stability requirements, and different roles in the decision process.

Layer 1: Persistent Context

Persistent context is the information that remains constant across all decisions for a given agent configuration. It includes system instructions, policy rules, authority boundary definitions, and organizational constraints. Persistent context defines what the agent is, what it is allowed to do, and what rules it must follow. It changes only when the agent's configuration is deliberately updated through a governed change process.

The defining characteristic of persistent context is that it must be identical for every evaluation within the same policy version. If a policy rule states "never approve transactions exceeding $10,000 without manager authorization," that rule must be present in every context window where a transaction approval decision is evaluated. If it is sometimes present and sometimes absent due to context window overflow, the agent's behavior becomes non-deterministic at precisely the boundary where determinism matters most: the policy enforcement boundary.

Layer 2: Session Context

Session context is information that persists across multiple interactions within a bounded workflow but does not survive beyond that workflow. It includes accumulated conversation history, intermediate reasoning results, state gathered from previous tool calls in the same session, and any context built up through multi-step agent interactions. Session context represents the agent's working memory for a specific task.

Session context is inherently variable. Each session produces different history, different intermediate results, and different accumulated state. This variability is expected and necessary. The critical design requirement is that session context variability must not interfere with persistent context stability. When a session accumulates enough history to push persistent context out of the effective window, the agent loses access to its governing rules precisely when it has accumulated enough session complexity to need those rules most.

Layer 3: Ephemeral Context

Ephemeral context is information that exists only for a single evaluation cycle. It includes the current tool call results, the current user input, freshly retrieved documents, and any just-in-time data fetched for the current decision. Ephemeral context is the "what is happening right now" layer. It is assembled immediately before evaluation and discarded after the decision is committed.

Ephemeral context is the most variable layer and the most common source of context engineering failures, because it depends on external systems (retrieval services, tool APIs, data stores) that may return different content depending on timing, availability, and data freshness. An agent that retrieves different documents for the same query on two sequential evaluations can produce different decisions, even if its persistent and session context are identical. This is not a model problem. It is a context pipeline stability problem.

Why Conflation Produces Non-Deterministic Behavior

When these three layers are not architecturally separated, two failure modes emerge. First, persistent context gets evicted by session growth. As conversation history accumulates, the fixed-size context window fills, and content is dropped. In a flat context structure, what gets dropped depends on the truncation strategy. If the system truncates from the beginning, system instructions and policy rules are the first content evicted. If the system truncates from the middle, whatever sits there is removed, and placing policy rules in the middle is risky even before any truncation occurs: the Stanford/UC Berkeley "Lost in the Middle" research found that models underweight information in the middle of the context window, so rules placed there may be effectively ignored.

Second, ephemeral context variability contaminates decision consistency. When retrieved documents, tool results, and policy rules occupy the same undifferentiated block, the model cannot distinguish between authoritative constraints and informational context. A retrieved document that describes a policy exception may be weighted equally with the policy rule itself, producing an outcome that appears reasoned but is not policy-aligned.

Context Layer	Lifetime	Variability	Eviction Tolerance	Role in Decision
Persistent	Agent configuration lifetime	None (version-controlled)	Zero (must never be evicted)	Policy enforcement, authority boundaries
Session	Workflow/task lifetime	Expected, bounded	Low (compress, do not evict)	Working memory, accumulated state
Ephemeral	Single evaluation cycle	High (depends on external systems)	Moderate (can re-retrieve if needed)	Current facts, fresh data, tool results
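
One way to make the separation concrete is to tag every block of context with its layer and derive eviction behavior from the tag rather than from position. The following is a minimal sketch, not a reference implementation; the names Layer, OnPressure, ContextBlock, and eviction_candidates are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    PERSISTENT = "persistent"   # system instructions, policy rules
    SESSION = "session"         # history, accumulated state
    EPHEMERAL = "ephemeral"     # current input, tool results, fresh documents


class OnPressure(Enum):
    NEVER_EVICT = "never_evict"   # persistent: zero eviction tolerance
    COMPRESS = "compress"         # session: compress, do not evict
    DROP_OR_REFETCH = "drop"      # ephemeral: can be re-retrieved if needed


# Mirrors the "Eviction Tolerance" column of the table above.
PRESSURE_POLICY = {
    Layer.PERSISTENT: OnPressure.NEVER_EVICT,
    Layer.SESSION: OnPressure.COMPRESS,
    Layer.EPHEMERAL: OnPressure.DROP_OR_REFETCH,
}


@dataclass(frozen=True)
class ContextBlock:
    layer: Layer
    label: str
    text: str

    @property
    def evictable(self) -> bool:
        return PRESSURE_POLICY[self.layer] is not OnPressure.NEVER_EVICT


def eviction_candidates(blocks):
    """Return the blocks that may be shrunk, most eviction-tolerant layer first."""
    order = {Layer.EPHEMERAL: 0, Layer.SESSION: 1}
    movable = [b for b in blocks if b.evictable]
    return sorted(movable, key=lambda b: order[b.layer])
```

With this structure, persistent blocks can never appear in the candidate list, so no truncation strategy built on eviction_candidates can remove a policy rule by accident of position.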

Context Budgeting: The Finite Allocation Problem

Every context window has a token limit. Whether that limit is 8,000 tokens, 128,000 tokens, or 1,000,000 tokens, it is finite. Context budgeting is the practice of allocating that finite capacity across competing content categories so that policy-critical content is guaranteed to be present at every evaluation.

The central insight of context budgeting is that budget overruns do not produce errors. They produce silent degradation. When a context window overflows, the system does not crash. It truncates. And truncation is not random; it follows whatever strategy the implementation uses (first-in-first-out, middle truncation, summarization), which means it systematically removes specific types of content. If the budget is not explicitly managed, the content that gets removed is determined by accident of position rather than by design.

The Five Budget Categories

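A well-engineered context budget allocates tokens across five categories, each with a minimum guarantee and a maximum cap. The specific percentages depend on the application, but the categories are universal. In the illustrative ranges below, the minimums sum to 60% of the window and the caps sum to 120%, so every minimum can be honored simultaneously but not every category can reach its cap at once.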

System Instructions and Identity (5-10% of budget). This is the foundational instruction set: what the agent is, its role, its behavioral constraints, its output format requirements. This category has the smallest token footprint but the highest priority. It must never be truncated. If the system instructions are incomplete, every subsequent evaluation is ungoverned.

Policy Constraints and Rules (10-20% of budget). This is the normative layer: the explicit rules that govern what the agent can and cannot do, the boundary conditions for its authority, the escalation triggers, and the compliance requirements. For agents operating in governed environments, this category is the single most important budget allocation. The NIST AI Risk Management Framework identifies the ability to enforce documented constraints as a core requirement for trustworthy AI systems. If policy constraints are evicted from context, that capability is lost.

Retrieved Knowledge (15-30% of budget). This is the informational layer: documents, data records, knowledge base entries, and other retrieved content that provides the factual grounding for the current decision. This category is the most elastic, meaning it can be compressed or reduced without losing the agent's governing rules. The quality of retrieval directly affects decision quality, but unlike policy constraints, retrieved knowledge can be re-fetched or summarized if budget pressure requires it.

Tool Results and Current State (10-20% of budget). This is the operational layer: the outputs of tool calls, API responses, database query results, and other dynamic content that represents the current state of the world. Tool results are ephemeral and often large. A single API response can consume thousands of tokens. Effective budgeting requires either constraining tool output size at the tool interface or applying post-retrieval compression before injection into the context window.

Conversation History (20-40% of budget). This is the continuity layer: the record of previous interactions that gives the agent context about the ongoing workflow. History is the most compressible category because older interactions can be summarized without losing the thread of the conversation. It is also the category most likely to cause budget overruns in long-running sessions, which is why session context management requires explicit compression strategies.
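
The arithmetic behind these ranges can be sketched as a small allocator. This is an illustration under stated assumptions, not a prescribed algorithm: the allocate function, the demand dictionary, and the choice to hand out spare tokens in priority order are all hypothetical. The percentages are the ranges given above, whose minimums sum to 60% and whose caps sum to 120%.

```python
# (min %, max %) per category, taken from the ranges in the text.
# Dict order doubles as priority order (highest priority first).
BUDGET_RANGES = {
    "system": (5, 10),
    "policy": (10, 20),
    "knowledge": (15, 30),
    "tools": (10, 20),
    "history": (20, 40),
}


def allocate(window_tokens, demand, ranges=BUDGET_RANGES):
    """Give every category its minimum, then hand out the remaining tokens
    in priority order, never exceeding a category's cap or its actual demand."""
    floor = {k: window_tokens * lo // 100 for k, (lo, hi) in ranges.items()}
    cap = {k: window_tokens * hi // 100 for k, (lo, hi) in ranges.items()}
    alloc = dict(floor)
    spare = window_tokens - sum(floor.values())
    for name in ranges:
        want = min(demand.get(name, 0), cap[name])
        extra = max(0, min(want - alloc[name], spare))
        alloc[name] += extra
        spare -= extra
    return alloc


demand = {"system": 500, "policy": 1500, "knowledge": 4000, "tools": 600, "history": 6000}
allocation = allocate(8000, demand)
# allocation == {"system": 500, "policy": 1500, "knowledge": 2400,
#                "tools": 800, "history": 2800}
```

In this 8,000-token example, policy receives everything it asks for, knowledge is held at its 30% cap, and history absorbs the shortfall: it asks for 6,000 tokens and receives 2,800, which is the signal that it must be compressed.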

What Happens When Budgets Are Not Managed

Without explicit budgeting, context windows follow a predictable degradation path. In the first few interactions, everything fits. The system instructions, policy rules, retrieved documents, tool results, and history all fit within the token limit. The agent behaves correctly. Over subsequent interactions, history accumulates. Tool results from previous steps may be retained. Retrieved documents from earlier queries linger. The context fills.

Then, at an interaction that looks no different from any previous one, the context window overflows. Content is truncated. If the truncation removes a policy rule that was governing a specific decision type, the agent's behavior for that decision type changes. It does not fail. It does not produce an error. It produces a decision that looks reasonable but is not policy-aligned. And because the truncation is invisible to the calling system, no one detects the deviation until the decision's consequences surface downstream.
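
The degradation path can be reproduced in a few lines. The sketch below assumes a flat window that drops the oldest content first and uses a whitespace word count as a stand-in for a real tokenizer; count_tokens and naive_window are hypothetical names, and the 120-token limit is chosen only to keep the example small.

```python
def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def naive_window(messages, limit):
    """Flat context: drop the oldest messages until the remainder fits."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > limit:
        kept.pop(0)   # oldest first, which is where the rules live
    return kept


POLICY = ("POLICY: never approve transactions exceeding $10,000 "
          "without manager authorization")
messages = [POLICY]
first_turn_without_policy = None

for turn in range(1, 21):
    messages.append(f"turn {turn}: user asks about an order and the agent replies")
    window = naive_window(messages, limit=120)
    if POLICY not in window and first_turn_without_policy is None:
        first_turn_without_policy = turn

print(first_turn_without_policy)  # 11
```

Turns 1 through 10 fit (9 policy tokens plus 110 history tokens). Turn 11 looks like every other turn, raises no exception, and is the first evaluation that runs without the policy rule.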

The DAIR.AI Prompt Engineering Guide documents this class of failure across multiple model architectures and notes that longer context windows do not eliminate the problem; they merely delay it. A 128K-token window that is not budgeted will eventually exhibit the same degradation as an 8K-token window that is not budgeted. The window is larger, so the failure takes longer to manifest, but the failure mode is identical.

Context Pipeline Design: Engineering the Retrieval-Compression-Injection Sequence

A context pipeline is the engineered sequence of operations that assembles the context window before each decision evaluation. It is the mechanical process that transforms raw content sources into a structured, budgeted context block. The pipeline's job is to ensure that every evaluation receives context that is complete (all required content is present), correct (content reflects the current state of the world and the current policy version), structured (content is organized by layer and priority), and within budget (total token count does not exceed the window limit).

Stage 1: Retrieval

The retrieval stage gathers raw content from all sources: policy rules from the rule store, documents from the knowledge base, tool results from external APIs, and history from the session store. At this stage, no compression or filtering has been applied. The retrieval stage operates on the principle of over-fetching: gather more content than will fit in the context window, so that the subsequent stages can select and compress.

The critical design decision in the retrieval stage is ordering and prioritization. Policy rules and system instructions are retrieved first and marked as non-evictable. Retrieved knowledge and tool results are fetched next and ranked by relevance to the current decision. History is retrieved last and is the first candidate for compression if the budget is tight.
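
The ordering described above can be sketched as follows. This is a minimal illustration: the Candidate type and the retrieve function are hypothetical, and the relevance scores are assumed to come from whatever retriever or scorer the system already uses.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    category: str            # "system", "policy", "knowledge", "tools", "history"
    text: str
    evictable: bool = True
    relevance: float = 0.0   # only meaningful for ranked categories


def retrieve(system, policy, knowledge_hits, tool_results, history):
    """Over-fetch from every source and return candidates in priority order.

    knowledge_hits and tool_results are lists of (text, relevance) pairs.
    """
    # Persistent context first, marked non-evictable.
    out = [Candidate("system", t, evictable=False) for t in system]
    out += [Candidate("policy", t, evictable=False) for t in policy]

    # Knowledge and tool results next, ranked by relevance to the decision.
    ranked = [Candidate("knowledge", t, relevance=r) for t, r in knowledge_hits]
    ranked += [Candidate("tools", t, relevance=r) for t, r in tool_results]
    out += sorted(ranked, key=lambda c: c.relevance, reverse=True)

    # History last: it is the first candidate for compression.
    out += [Candidate("history", t) for t in history]
    return out
```

Nothing is filtered at this point. The function returns everything it was given, so that selection and compression decisions are made in one place, in the next stage.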

Stage 2: Compression

The compression stage reduces the retrieved content to fit within the budget allocation for each category. Compression strategies differ by content type. History compression typically uses summarization: replace the full transcript of earlier interactions with a structured summary that preserves key facts and decisions while reducing token count. Tool result compression uses extraction: pull the specific data values needed for the current decision from large API responses, discarding the structural overhead. Knowledge compression uses relevance filtering: retain the passages most relevant to the current query and discard the rest.

The critical constraint is that compression must never be applied to persistent context. Policy rules, system instructions, and authority boundaries must pass through the compression stage unmodified. If a policy rule is "too long" for the budget, the response is not to compress the rule but to increase the budget allocation for policy constraints. A compressed policy rule is a modified policy rule, and a modified policy rule is a governance failure.
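
A sketch of per-type compression, with persistent context passing through untouched, might look like this. All function names are hypothetical. The history "summary" is a placeholder string standing in for a real summarizer, and the token counter is a whitespace word count rather than a model tokenizer.

```python
import json


def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def compress_history(turns, keep_recent=2):
    """Summarization: replace older turns with one summary line, keep recent turns.
    The summary here is a placeholder; a real system would call a summarizer."""
    if len(turns) <= keep_recent:
        return list(turns)
    older = turns[:-keep_recent]
    return [f"[summary of {len(older)} earlier turns]"] + turns[-keep_recent:]


def compress_tool_result(raw_json, fields):
    """Extraction: keep only the fields the current decision needs."""
    data = json.loads(raw_json)
    return json.dumps({k: data[k] for k in fields if k in data}, sort_keys=True)


def compress_knowledge(passages, budget_tokens):
    """Relevance filtering: take (text, score) passages best-first within budget."""
    kept, used = [], 0
    for text, _score in sorted(passages, key=lambda p: p[1], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            kept.append(text)
            used += cost
    return kept


def compress_persistent(rules):
    """Persistent context passes through unmodified."""
    return list(rules)
```

Making compress_persistent an explicit identity function is a deliberate design choice in this sketch: the pass-through is visible in the pipeline code and can be asserted in tests, rather than being an unstated convention.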

Stage 3: Injection

The injection stage assembles the final context window from the compressed content blocks, placing each block at the correct position within the window. Position matters. Anthropic's research on long-context utilization demonstrates that content placement within the context window significantly affects how models weight that content during generation. Policy-critical content should be placed at positions where model attention is strongest, which current research consistently identifies as the beginning and end of the context window.

A well-designed injection stage follows a consistent template:

[SYSTEM INSTRUCTIONS]         <- Fixed position, beginning of window
  Agent identity and role
  Output format requirements
  Behavioral constraints

[POLICY RULES]                 <- Fixed position, immediately after system
  Authority boundaries
  Decision rules (versioned)
  Escalation triggers
  Compliance constraints

[RETRIEVED KNOWLEDGE]          <- Variable position, middle of window
  Relevant documents (ranked)
  Reference data

[TOOL RESULTS]                 <- Variable position, middle of window
  Current API responses
  Database query results
  External system state

[SESSION HISTORY]              <- Variable position, compressed
  Summarized earlier interactions
  Key decisions and state

[CURRENT INPUT]                <- Fixed position, end of window
  Current user message or trigger
  Current decision request

This template ensures that the two highest-priority content categories (system instructions and policy rules) occupy the high-attention region at the beginning of the window, while the current decision input occupies the other high-attention position at the end. Variable content fills the middle, where it contributes context but does not compete with policy-critical content for the model's attention.
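
The template can be assembled mechanically. The sketch below is one possible shape, assuming each stage hands over a dictionary of section key to lines of text; SECTION_ORDER, REQUIRED, and inject are hypothetical names, and refusing to build a window without system, policy, or input sections is a design choice of this example rather than something the template itself mandates.

```python
# Section keys and headings, in the fixed order used by the template above.
SECTION_ORDER = [
    ("system", "SYSTEM INSTRUCTIONS"),
    ("policy", "POLICY RULES"),
    ("knowledge", "RETRIEVED KNOWLEDGE"),
    ("tools", "TOOL RESULTS"),
    ("history", "SESSION HISTORY"),
    ("input", "CURRENT INPUT"),
]
REQUIRED = {"system", "policy", "input"}


def inject(blocks):
    """Assemble the window in template order, whatever order blocks arrive in."""
    missing = sorted(k for k in REQUIRED if not blocks.get(k))
    if missing:
        raise ValueError(f"missing required sections: {missing}")
    parts = []
    for key, title in SECTION_ORDER:
        lines = blocks.get(key, [])
        if lines:   # optional sections are omitted when empty
            parts.append(f"[{title}]\n" + "\n".join(lines))
    return "\n\n".join(parts)


window = inject({
    "input": ["Approve a $12,500 transfer?"],
    "tools": ["balance=120"],
    "policy": ["Escalate above $10,000."],
    "system": ["You are a payments agent."],
})
```

Because position is determined by SECTION_ORDER and not by the order in which content was retrieved, system instructions and policy rules always open the window and the current input always closes it.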

Pipeline Stability and Deterministic Evaluation

A context pipeline is stable when the same decision scenario, evaluated at different times, produces context windows that are identical in their policy-critical content and structurally consistent in their variable content. Pipeline stability is what enables deterministic policy evaluation: the guarantee that a decision evaluated against specific rules with specific inputs produces the same outcome regardless of when or how many times it is evaluated.
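
This definition of stability is checkable. One possible check, sketched below under the assumption that an assembled window is available as a dictionary of section key to lines, is to fingerprint the policy-critical sections and compare fingerprints across evaluations. The function names and the window shape are hypothetical; hashlib and json are from the Python standard library.

```python
import hashlib
import json


def policy_fingerprint(window):
    """SHA-256 over the system and policy sections only."""
    critical = [list(window["system"]), list(window["policy"])]
    payload = json.dumps(critical, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_stable(window_a, window_b):
    """Identical policy-critical content, and the same sections in the same order."""
    same_policy = policy_fingerprint(window_a) == policy_fingerprint(window_b)
    same_shape = list(window_a) == list(window_b)
    return same_policy and same_shape


monday = {
    "system": ["You are a payments agent."],
    "policy": ["Escalate above $10,000."],
    "knowledge": ["limits doc, revision A"],
    "input": ["Approve a $12,500 transfer?"],
}
# Same scenario a day later: retrieval returned a different document.
tuesday = dict(monday, knowledge=["limits doc, revision B"])
# A window whose policy text was altered somewhere in the pipeline.
drifted = dict(monday, policy=["Escalate above $10,000 (summarized)."])
```

Here is_stable(monday, tuesday) holds even though the retrieved document changed, while is_stable(monday, drifted) does not. Storing the fingerprint alongside each decision gives a cheap way to detect when policy-critical content has changed between evaluations.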

Three design principles support pipeline stability:

Version-pin all persistent context. Policy rules should be retrieved by version identifier, not by "latest." When a decision is evaluated, the pipeline should inject the specific rule version that is currently active, and the decision trace should record which version was used. This enables replay fidelity: the ability to reconstruct the exact context window that was present at any past decision point.

Isolate retrieval variance. Retrieved knowledge and tool results will vary between evaluations. This variance should be confined to the variable-content positions in the context template. If the retrieval system returns different documents for the same query on different days, that variance affects the informational context but not the governing context. The decision rules remain stable even as the facts they evaluate change.

Enforce budget guarantees mechanically. Budget allocation should be enforced by the pipeline code, not by convention. The pipeline should measure the token count of each content block after compression and before injection. If a block exceeds its budget, the pipeline should apply additional compression or raise an error, never silently overflow into another category's allocation. The system instructions and policy rules category should have a hard minimum guarantee that the pipeline enforces programmatically.
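
The first and third principles can be sketched together. Everything named here is hypothetical: RULE_STORE is an in-memory stand-in for a real rule store with illustrative contents, count_tokens is a whitespace word count rather than a model tokenizer, and the trace is a plain dictionary.

```python
class BudgetExceeded(Exception):
    """Raised instead of letting one category spill into another's allocation."""


def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


# Hypothetical rule store keyed by (rule set, version).
RULE_STORE = {
    ("payments", "v1"): [
        "Never approve transactions exceeding $10,000 without manager authorization."
    ],
}


def load_rules(rule_set, version):
    """Fetch rules by explicit version. There is deliberately no 'latest' alias."""
    try:
        return RULE_STORE[(rule_set, version)]
    except KeyError:
        raise LookupError(f"unknown rule version: {rule_set}@{version}") from None


def enforce(blocks, allocation):
    """Measure each block after compression and before injection."""
    counts = {name: sum(count_tokens(line) for line in lines)
              for name, lines in blocks.items()}
    for name, used in counts.items():
        if used > allocation[name]:
            raise BudgetExceeded(
                f"{name}: {used} tokens exceeds budget of {allocation[name]}")
    return counts


def prepare(rule_set, version, blocks, allocation):
    """Pin the policy version, enforce budgets, and record both in a trace."""
    blocks = dict(blocks, policy=load_rules(rule_set, version))
    counts = enforce(blocks, allocation)
    trace = {"rule_set": rule_set, "rule_version": version, "token_counts": counts}
    return blocks, trace
```

Calling prepare("payments", "v1", {"history": ["short summary"]}, {"policy": 20, "history": 5}) succeeds and returns a trace naming version v1. Passing a six-word history entry against the same 5-token allocation raises BudgetExceeded, and asking for version "latest" raises LookupError, so neither failure can pass silently.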

Context Engineering as Governance Infrastructure

The argument of this article reduces to a single claim: for AI agents that must make governed, auditable, repeatable decisions, context engineering is not an optimization technique. It is governance infrastructure. The context window is the agent's entire world at the moment of decision. If that world is structured, budgeted, and assembled through a stable pipeline, the agent's decisions are traceable to specific rules applied to specific facts. If that world is a flat, unbounded, unstructured accumulation of text, the agent's decisions are traceable to nothing more specific than "whatever happened to be in the context window at the time."

The three disciplines described here (layered context architecture, explicit budget allocation, and engineered pipeline assembly) are not theoretical. They are the mechanical prerequisites for any system that claims to evaluate policy deterministically. A decision rule that exists in a policy store but does not reliably appear in the context window at evaluation time is not an enforced rule. It is a documented aspiration. The distance between the two is the distance between a governed AI system and an ungoverned one that has governance documentation.

For teams building context pipelines that feed deterministic decision layers, the architectural patterns described in the decision plane architecture provide the evaluation framework, while decision traces provide the mechanism for recording what context was present at each evaluation. Together, these components form the foundation of AI systems that are not merely capable, but reliably governed.