Reachability Analysis for AI Agents: Proving Your Policy Cannot Authorize Unsafe Actions

Before you deploy an AI agent that can approve expenses, modify customer records, execute trades, or send communications on behalf of your organization, there is a question that deserves an answer: is there any sequence of inputs that causes this agent to perform an action it should not be authorized to perform?

This is not a theoretical question. It is the concrete version of "is our policy complete?" -- and the answer, for most production AI agents, is unknown. Teams define policies that specify what the agent should do. They test the agent against expected inputs and verify that it produces expected outputs. But they rarely verify the complement: that the agent cannot produce unauthorized outputs regardless of input. The distinction matters because adversarial inputs, edge cases, and novel input combinations do not respect the boundaries of your test suite.

Stanford HAI's research on formal methods in AI safety identifies this as a foundational gap in production AI systems: teams validate what the system should do but do not verify what the system cannot do. Reachability analysis is the discipline that addresses this gap. It asks not "does the agent behave correctly on these inputs" but "is there any input for which the agent behaves incorrectly."

This article explains reachability analysis at a conceptual level for senior engineers who build and operate AI agent systems but do not have formal methods backgrounds. It covers what it means for a policy state to be reachable or unreachable, why policy boundary completeness is distinct from policy correctness, and what practical approaches exist for analyzing agent action spaces without requiring a PhD in formal verification.

What Reachability Means in an Agent Context

In formal verification, reachability analysis determines whether a specific system state can be reached from a given starting state through any sequence of valid transitions. Applied to AI agents, the question becomes: starting from the agent's initial state (awaiting input), is there any sequence of user inputs, tool call results, and intermediate reasoning steps that leads the agent to execute an unauthorized action?

An unauthorized action is one that violates the agent's policy constraints. The nature of the violation depends on the domain: an expense approval agent authorizing a transaction that exceeds its authority, a customer service agent accessing records outside its permitted scope, a data processing agent writing to a system it should only read from, or a communication agent sending a message without required approval.

The action space of an AI agent is the set of all actions the agent can potentially take. For a tool-using agent, this is defined by the tools available to the agent and the parameter ranges those tools accept. For an agent with delegation capability, the action space also includes every action that delegated agents can take. For a multi-step agent, the action space includes sequences of actions, where the output of one action influences the input to the next.

# Simplified agent action space definition
agent: expense-approval-agent
available_tools:
  - name: approve_expense
    parameters:
      amount: { type: float, range: [0, .inf] }
      category: { type: string, enum: [travel, consulting, equipment, software] }
      approver_level: { type: string, enum: [auto, manager, director, vp] }
  - name: deny_expense
    parameters:
      reason: { type: string }
  - name: escalate_to_human
    parameters:
      queue: { type: string, enum: [standard, urgent, compliance] }
  - name: query_budget
    parameters:
      department: { type: string }
      fiscal_quarter: { type: string }

# The reachability question: is there any input sequence that
# causes approve_expense to execute with amount > 10000 and
# approver_level = auto?

The reachability question is whether there exists any path through the agent's decision logic that reaches a specific action-parameter combination that the policy prohibits. If the answer is "no such path exists," the unauthorized state is provably unreachable. If the answer is "yes" or "unknown," the policy has a gap.

Reachable vs. Unreachable: Why the Distinction Matters

A policy state is reachable if there exists at least one input sequence that leads the system to that state. A state is unreachable if no input sequence, regardless of how adversarial or unexpected, can lead the system there.

For AI agent governance, the states you care about making unreachable are the unauthorized actions. Consider an expense approval agent with a policy rule that says "auto-approve expenses under $5,000." The desirable property is that the state "agent auto-approves a $50,000 expense" is unreachable. But is it?

If the policy is enforced by the agent's prompt instructions ("You should not approve expenses over $5,000 without manager approval"), the state is reachable. Prompt instructions are suggestions to a language model, not hard constraints on a deterministic system. A sufficiently adversarial input -- or a sufficiently unusual input that the model interprets differently from the policy author's intent -- can reach the prohibited state. The model may approve the $50,000 expense if the user frames the request in a way that the model interprets as falling outside the prompt instruction's scope.

If the policy is enforced by a deterministic rule in an external decision plane ("IF amount >= 5000 AND approver_level = auto THEN deny"), the state is unreachable regardless of what the model proposes. The comparison is >= rather than > so that an expense of exactly $5,000, which is not "under $5,000," is also governed. The model can propose the action; the decision plane will reject it before execution. The unauthorized state is unreachable not because the model cannot propose it, but because the execution path is blocked by a constraint that the model cannot bypass.

This is the core insight of reachability analysis applied to AI agents: the reachability of an unauthorized state depends on where the policy is enforced, not just what the policy says. A policy that exists only in the model's context window is a suggestion with reachable violations. A policy enforced by a deterministic gate external to the model is a constraint with unreachable violations -- assuming the gate itself is correctly implemented.
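As an illustration of enforcement outside the model, the following is a minimal sketch of such a gate. The ProposedAction structure, the gate and execute functions, and the reading of "under $5,000" as excluding exactly $5,000 are assumptions made for the example, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Callable

AUTO_APPROVE_LIMIT = 5_000.0  # assumed threshold from the running example


@dataclass(frozen=True)
class ProposedAction:
    """A tool call proposed by the model (hypothetical structure)."""
    tool: str
    amount: float
    approver_level: str


def gate(action: ProposedAction) -> bool:
    """Return True only if the action may execute.

    Written as an allow-condition: an auto approval passes only when the
    amount is inside [0, limit). NaN fails both comparisons, so it is
    rejected rather than slipping past a `>= limit` deny check.
    """
    if action.tool == "approve_expense" and action.approver_level == "auto":
        return 0 <= action.amount < AUTO_APPROVE_LIMIT
    return True  # other tools and approver levels need their own rules (omitted)


def execute(action: ProposedAction, run_tool: Callable[[ProposedAction], object]) -> dict:
    """The only path to the tool runs through the gate."""
    if not gate(action):
        return {"executed": False, "reason": "blocked by policy gate"}
    return {"executed": True, "result": run_tool(action)}
```

The property that matters is structural: run_tool is called from exactly one place, and that place sits behind gate. If the agent runtime has any other way to invoke the tool, the prohibited state is reachable again no matter how correct the gate is.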

Policy Boundary Completeness vs. Correctness

Two distinct properties of a policy matter for reachability: completeness and correctness.

Correctness means the rules that exist produce the right outcomes. A rule that says "deny expenses over $5,000" is correct if $5,000 is the right threshold for the business. Correctness is a business judgment -- it can only be evaluated by people who understand the domain.

Completeness means the rules cover every possible input. A policy is complete if every input the agent can receive maps to at least one rule that governs the agent's response. A policy is incomplete if there exist inputs for which no rule applies -- leaving the agent to make an ungoverned decision.

Completeness failures are more dangerous than correctness failures because they are invisible. An incorrect rule produces a wrong but traceable outcome -- you can see that the rule fired and evaluate whether its threshold was right. An incomplete policy produces an ungoverned outcome -- the system made a decision, but no rule evaluated it, and the decision may not even appear in the governance log because there was no rule to generate the log entry.

Property	Question	Failure Mode	Detection Difficulty
Correctness	Do the rules that exist produce the right outcomes?	Wrong but traceable decisions	Moderate -- incorrect outcomes are visible in decision logs
Completeness	Does a rule exist for every possible input?	Ungoverned decisions with no trace	High -- absence of a rule means absence of evidence

Reachability analysis primarily addresses completeness. It asks: for every possible input to the agent, does the policy produce a governed outcome? If there exist inputs that reach no rule, those inputs represent ungoverned paths, and the actions the agent takes on those paths are uncontrolled. Adversarial exploration, meaning a systematic search for inputs that produce undesirable outcomes, is a practical way to discover these ungoverned paths; OpenAI's research on AI safety via debate is one example of adversarial pressure being used to surface failures.
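One way to keep completeness failures from being invisible is to give the "no rule matched" case an explicit, logged outcome. The sketch below assumes a hypothetical first-match rule engine; the Rule type, the rule names, and the governance_log list are illustrative only. The rule set deliberately leaves a hole at exactly 5000 to show how an ungoverned input surfaces.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    applies: Callable[[dict], bool]
    decision: str


# Hypothetical rule set with a deliberate hole at exactly 5000.
RULES = [
    Rule("auto-approve-small", lambda r: r["level"] == "auto" and r["amount"] < 5_000, "allow"),
    Rule("deny-auto-large", lambda r: r["level"] == "auto" and r["amount"] > 5_000, "deny"),
    Rule("manager-approve", lambda r: r["level"] == "manager" and r["amount"] <= 25_000, "allow"),
]

governance_log: list[dict] = []


def evaluate(request: dict) -> str:
    """First matching rule wins; no match is denied and logged as UNGOVERNED."""
    for rule in RULES:
        if rule.applies(request):
            governance_log.append(
                {"request": request, "rule": rule.name, "decision": rule.decision}
            )
            return rule.decision
    governance_log.append({"request": request, "rule": "UNGOVERNED", "decision": "deny"})
    return "deny"
```

With this shape, evaluate({"level": "auto", "amount": 5_000}) is denied and leaves an UNGOVERNED entry in the log, so the gap shows up as evidence instead of as silence.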

The Agent Action Space: What You Are Actually Analyzing

To perform reachability analysis on an AI agent, you need to define the boundaries of what you are analyzing. The agent action space is the complete set of actions the agent can take, parameterized by the inputs those actions accept.

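For a tool-using agent, the action space is the union, over the agent's tools, of the Cartesian product of each tool's parameter ranges. An agent with 4 tools whose parameters all have bounded, discrete ranges has a finite action space that can be enumerated. Continuous or unbounded parameters (such as an expense amount) cannot be enumerated value by value; they are analyzed by partitioning the range into intervals at the policy's thresholds and treating each interval as one region.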
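A small sketch of that enumeration, under stated assumptions: the continuous amount is replaced by three hypothetical bands, and the tools with free-form string parameters (deny_expense, query_budget) are left out because their ranges are not bounded. The TOOLS dictionary and enumerate_action_space are names invented for the example.

```python
from itertools import product

# Assumed discretization: amount is split into bands at policy thresholds.
TOOLS = {
    "approve_expense": {
        "amount_band": ["[0,5000)", "[5000,10000)", "[10000,inf)"],
        "category": ["travel", "consulting", "equipment", "software"],
        "approver_level": ["auto", "manager", "director", "vp"],
    },
    "escalate_to_human": {
        "queue": ["standard", "urgent", "compliance"],
    },
}


def enumerate_action_space(tools):
    """Yield (tool, params) pairs: a union over tools of per-tool products."""
    for tool, params in tools.items():
        names = list(params)
        for values in product(*(params[name] for name in names)):
            yield tool, dict(zip(names, values))


space = list(enumerate_action_space(TOOLS))
print(len(space))  # 3 * 4 * 4 + 3 = 51
```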

Three properties of agent action spaces make reachability analysis both important and challenging:

Composition. Multi-step agents execute sequences of actions where the output of step N influences the input to step N+1. The reachable states after two steps are not simply the union of reachable states at each step -- they are the composition, which can be significantly larger. An agent that can read a budget and then approve an expense has a composed action space where the budget read influences the expense approval parameters. Analyzing the tools independently misses reachable states that emerge from composition.

Delegation. An agent that can delegate to other agents has an action space that includes the action spaces of its delegates. If Agent A can delegate to Agent B, and Agent B can approve transactions up to $100,000, then Agent A's effective action space includes $100,000 transaction approvals -- even if Agent A's direct tools have a $5,000 limit. Reachability analysis must follow delegation chains, or it will undercount the reachable states.

Context sensitivity. The same tool call with the same parameters can be authorized or unauthorized depending on context: the user's identity, the time of day, the cumulative spend in the current session, or the state of an external system. Reachability analysis must account for contextual variation, or it will either miss reachable unauthorized states (by assuming favorable context) or report false positives (by assuming worst-case context for every evaluation).
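To make the context-sensitivity point concrete, the sketch below evaluates one fixed tool call across a grid of contexts instead of a single assumed context. The authorized rule, the role names, the per-request limits, and the session cap are all hypothetical values chosen for illustration.

```python
from itertools import product

SESSION_CAP = 20_000  # hypothetical cumulative cap per session


def authorized(amount: int, role: str, session_spend: int) -> bool:
    """Hypothetical context-dependent rule for one fixed approve_expense call."""
    if role not in {"employee", "manager"}:
        return False
    per_request_limit = 5_000 if role == "employee" else 25_000
    return amount < per_request_limit and session_spend + amount <= SESSION_CAP


def sweep(amount: int) -> dict:
    """Evaluate the same call under every combination of context values."""
    roles = ["employee", "manager", "contractor"]
    session_spends = [0, 10_000, 19_000]
    return {
        (role, spend): authorized(amount, role, spend)
        for role, spend in product(roles, session_spends)
    }
```

For a $4,000 request, this sweep reports the call as authorized for employees and managers early in a session and unauthorized once prior spend reaches $19,000, which is the kind of variation that a single-context analysis would miss.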

Practical Approaches to Reachability Analysis

Full formal reachability analysis -- mathematically proving that a state is unreachable -- is practical for deterministic systems with bounded state spaces. For AI agents backed by language models, full formal proof is not currently feasible because the model's behavior is not formally specified. However, several practical approaches provide meaningful reachability assurance without requiring formal proofs.

Approach 1: Boundary Enumeration

Enumerate the boundaries of the agent's action space and verify that a policy rule exists for every boundary region. This is the simplest form of completeness checking: list every tool, list every parameter range, and confirm that the policy has a rule covering every combination. Boundary enumeration catches the most common completeness failure -- a tool or parameter combination that the policy author forgot to cover because it seemed unlikely.

# Boundary enumeration for expense approval agent
action_space:
  tool: approve_expense
  parameters:
    amount: [0, 1000, 5000, 10000, 50000, 100000, .inf]
    category: [travel, consulting, equipment, software]
    approver_level: [auto, manager, director, vp]

# For each combination, verify policy coverage:
# amount=50000, category=consulting, approver_level=auto
#   -> Rule exists? YES: consulting-spend-limit-v2.1 (deny)
# amount=100000, category=software, approver_level=manager
#   -> Rule exists? NO: no rule covers software > 50000 at manager level
#   -> GAP IDENTIFIED: ungoverned path for high-value software purchases
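
The coverage check above can be automated. The following sketch treats consecutive amount boundaries as half-open bands and reports every (band, category, level) region that no rule covers. The rule set is hypothetical and was chosen so that it reproduces a gap like the one in the comments; real rules would be loaded from the policy store.

```python
import math
from itertools import product

AMOUNTS = [0, 1_000, 5_000, 10_000, 50_000, 100_000, math.inf]
CATEGORIES = ["travel", "consulting", "equipment", "software"]
LEVELS = ["auto", "manager", "director", "vp"]
ALL = frozenset(CATEGORIES)

# Hypothetical rules: (name, categories, level, low inclusive, high exclusive)
RULES = [
    ("auto-small", ALL, "auto", 0, 5_000),
    ("auto-large-deny", ALL, "auto", 5_000, math.inf),
    ("manager-standard", ALL, "manager", 0, 50_000),
    ("manager-large-deny", ALL - {"software"}, "manager", 50_000, math.inf),
    ("director-any", ALL, "director", 0, math.inf),
    ("vp-any", ALL, "vp", 0, math.inf),
]


def covered(low, high, category, level):
    """A region is covered if a single rule's interval contains the whole band."""
    return any(
        category in cats and level == lvl and lo <= low and high <= hi
        for _name, cats, lvl, lo, hi in RULES
    )


def find_gaps():
    """Return every (low, high, category, level) region with no covering rule."""
    bands = list(zip(AMOUNTS, AMOUNTS[1:]))
    return [
        (low, high, category, level)
        for (low, high), category, level in product(bands, CATEGORIES, LEVELS)
        if not covered(low, high, category, level)
    ]
```

With these assumed rules, find_gaps() returns the two software regions at manager level from 50000 upward. The check only works when the band boundaries include every threshold the rules use; a band that straddles a rule boundary would be reported as a gap even if two rules jointly cover it.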

Approach 2: Adversarial Input Testing

Systematically generate inputs designed to reach unauthorized states. This is the testing equivalent of reachability analysis: instead of proving that a state is unreachable, you actively try to reach it and use failure-to-reach as evidence (not proof) of unreachability. Adversarial input testing is more practical than formal proof for systems involving language models, and it catches policy gaps that boundary enumeration misses because it accounts for the model's interpretation of inputs, not just the parameter space.

Effective adversarial testing for agent policies uses several input generation strategies: boundary inputs (values at the exact edge of policy thresholds), composite inputs (sequences of individually authorized actions that compose into an unauthorized outcome), context manipulation (inputs that attempt to alter the context used for policy evaluation), and ambiguous inputs (requests that could be interpreted as falling under multiple policy rules).
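Two of these strategies, boundary inputs and composite inputs, are simple enough to sketch. The example below probes a hypothetical per-request gate that is assumed to enforce only the $5,000 auto-approval limit. Amounts are in integer cents to avoid floating-point noise at the boundary; all function names are illustrative.

```python
LIMIT_CENTS = 500_000  # assumed $5,000 auto-approval limit, in cents


def per_request_gate(amount_cents: int) -> bool:
    """Hypothetical gate under test: checks each request in isolation."""
    return 0 <= amount_cents < LIMIT_CENTS


def boundary_probes(threshold_cents: int) -> list[int]:
    """Values just below, at, and just above a policy threshold."""
    return [threshold_cents - 1, threshold_cents, threshold_cents + 1]


def split_probe(total_cents: int, limit_cents: int) -> list[int]:
    """Composite probe: split a large total into chunks each under the limit."""
    chunk = limit_cents - 1
    full, rest = divmod(total_cents, chunk)
    return [chunk] * full + ([rest] if rest else [])


def run_probes() -> list[tuple[str, int]]:
    """Return (strategy, amount) for every probe that reached a prohibited outcome."""
    findings = []
    for amount in boundary_probes(LIMIT_CENTS):
        if amount >= LIMIT_CENTS and per_request_gate(amount):
            findings.append(("boundary", amount))
    chunks = split_probe(5_000_000, LIMIT_CENTS)
    if all(per_request_gate(chunk) for chunk in chunks):
        findings.append(("composite", sum(chunks)))
    return findings
```

Against this gate the boundary probes find nothing, but the composite probe gets $50,000 approved as eleven individually acceptable requests. That result is a finding about the policy (no cumulative constraint), not about the gate's implementation.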

Approach 3: Policy Graph Analysis

Model the agent's decision logic as a directed graph where nodes are states and edges are transitions governed by policy rules. Analyze the graph for paths from the initial state to any unauthorized state. This approach sits between boundary enumeration (which checks coverage statically) and adversarial testing (which checks coverage dynamically): it examines the structure of the policy to identify paths that could lead to unauthorized states, without requiring a model to be in the loop.

Policy graph analysis is particularly effective for identifying delegation-chain reachability issues. If the graph shows a path from Agent A through Agent B to an action that exceeds Agent A's authority, that path is reachable -- and the policy needs a rule that blocks it at the delegation boundary.
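A minimal sketch of this search, assuming the policy has already been reduced to an adjacency list. The node names and the GRAPH contents are hypothetical; the search itself is an ordinary breadth-first search that returns the shortest path to any unauthorized node.

```python
from collections import deque

# Hypothetical policy graph: Agent A may approve up to 5000 or delegate to B.
GRAPH = {
    "start": ["agent_a:receive_request"],
    "agent_a:receive_request": ["agent_a:approve<=5000", "agent_a:delegate_to_b"],
    "agent_a:delegate_to_b": ["agent_b:approve<=100000"],
    "agent_a:approve<=5000": [],
    "agent_b:approve<=100000": [],
}

# States that exceed Agent A's own authority.
UNAUTHORIZED_FOR_A = {"agent_b:approve<=100000"}


def find_path(graph, start, targets):
    """Breadth-first search; returns a shortest path to a target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node in targets:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

Here find_path(GRAPH, "start", UNAUTHORIZED_FOR_A) returns the path through agent_a:delegate_to_b. Removing or guarding that edge and re-running the search is how a fix at the delegation boundary would be confirmed; a result of None means no path exists in the modeled graph, which is only as trustworthy as the model of the graph.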

MIT CSAIL's work on reachability analysis for neural networks demonstrates that even for systems with non-deterministic components, structured reachability analysis can bound the set of reachable states and identify regions of the state space that require additional policy coverage.

Approach 4: Shadow Reachability Monitoring

In production, monitor every decision the agent makes and flag any decision that reaches a state not covered by the reachability analysis performed before deployment. This is runtime reachability monitoring: the system observes what states are actually being reached and alerts when a state is reached that was previously classified as unreachable or was not analyzed.

Shadow reachability monitoring serves as a continuous validation of the pre-deployment analysis. If the analysis concluded that a specific state is unreachable and the production system reaches that state, the analysis was wrong -- and the policy has a gap that was not detected statically. This runtime feedback loop closes the gap between what reachability analysis predicted and what the system actually does under real input distributions.
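The comparison at the heart of this monitoring can be sketched as a lookup against the pre-deployment classification. The ANALYZED table, the banding function, and the Verdict names below are assumptions for illustration; a real deployment would load the classification from the analysis output and route non-OK verdicts to alerting.

```python
from enum import Enum


class Verdict(Enum):
    OK = "state was analyzed and is authorized"
    VIOLATION = "reached a state classified as unreachable"
    UNANALYZED = "reached a state the analysis never covered"


# Hypothetical output of pre-deployment analysis: state -> classification.
ANALYZED = {
    ("approve_expense", "auto", "under_5000"): "reachable_authorized",
    ("approve_expense", "auto", "5000_or_more"): "unreachable",
    ("approve_expense", "manager", "under_5000"): "reachable_authorized",
}


def band(amount: float) -> str:
    """Map an amount onto the bands used by the analysis."""
    return "under_5000" if 0 <= amount < 5_000 else "5000_or_more"


def check(tool: str, level: str, amount: float) -> Verdict:
    """Classify one observed production decision against the analysis."""
    status = ANALYZED.get((tool, level, band(amount)))
    if status is None:
        return Verdict.UNANALYZED
    if status == "unreachable":
        return Verdict.VIOLATION
    return Verdict.OK
```

The two alert cases mean different things. VIOLATION says the analysis or the enforcement is wrong; UNANALYZED says the action space has grown beyond what was analyzed.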

Reachability in Multi-Agent Systems

Multi-agent systems introduce a reachability challenge that does not exist in single-agent deployments: emergent reachability. A state that is unreachable for any individual agent may be reachable through the composition of multiple agents' actions.

Consider three agents in a procurement workflow: Agent A can request purchases up to $10,000, Agent B can approve purchases up to $25,000, and Agent C can execute approved purchases. Individually, no agent can execute a $50,000 purchase. But if Agent A can make multiple requests that Agent B approves independently and Agent C executes as a batch, the composed system can execute $50,000 in purchases (for example, five separate $10,000 requests), a state that is unreachable for any individual agent but reachable through composition.

Reachability analysis for multi-agent systems must analyze the composed action space, not just the individual action spaces. This requires understanding delegation relationships (which agents can invoke which other agents), state sharing (what information flows between agents), and accumulation effects (how sequential agent actions can accumulate to reach a state that no single action could reach).

The practical implication is that policy rules must include cross-agent constraints: rules that govern the composed behavior of the system, not just the individual behavior of each agent. A cumulative spend limit that applies across all agents in a workflow, a delegation authority check that prevents authority escalation through handoffs, or a session-scoped constraint that limits the total impact of a multi-agent interaction -- these are the rules that make composed unauthorized states unreachable. Verification requirements such as those outlined in NIST SP 800-204D for software supply chain security apply analogously to multi-agent policy verification: the security properties of the composed system must be verified, not just the properties of individual components.
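A cumulative spend limit of this kind can be sketched as a shared budget object that every agent in the workflow must consult. The WorkflowBudget class and the $25,000 cap are hypothetical; the cap is set to Agent B's approval limit from the example above purely for illustration.

```python
class WorkflowBudget:
    """Cumulative spend cap shared by every agent in one workflow (hypothetical)."""

    def __init__(self, cap: int) -> None:
        self.cap = cap
        self.spent = 0
        self.ledger: list[tuple[str, int]] = []

    def authorize(self, agent: str, amount: int) -> bool:
        """Approve only if the running total stays within the cap."""
        if amount <= 0 or self.spent + amount > self.cap:
            return False
        self.spent += amount
        self.ledger.append((agent, amount))
        return True


budget = WorkflowBudget(cap=25_000)
results = [budget.authorize("agent_c", 10_000) for _ in range(5)]
print(results)  # [True, True, False, False, False]
```

Each $10,000 purchase is individually within every agent's authority, but the third one would push the workflow total past the cap and is refused. In a real system the check and the update would need to be atomic across agents, which this single-process sketch does not address.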

From Analysis to Assurance: Making Reachability Operational

Reachability analysis is not a one-time exercise. The agent's action space changes when tools are added or modified. The policy's reachability properties change when rules are added, modified, or removed. The input distribution changes as the system encounters new users, new request types, and new edge cases. Each of these changes can make a previously unreachable state reachable.

Operational reachability assurance requires three ongoing practices:

Pre-deployment analysis. Before any agent or policy change is deployed, re-run reachability analysis against the updated configuration. Boundary enumeration and policy graph analysis can be automated and integrated into the deployment pipeline. If the analysis identifies a new reachable unauthorized state, the deployment is blocked until the gap is addressed.

Continuous monitoring. In production, shadow reachability monitoring observes the states the agent actually reaches and compares them to the pre-deployment analysis. Any discrepancy triggers an alert for review.

Periodic adversarial review. On a regular cadence (quarterly for high-consequence systems, semi-annually for lower-risk ones), conduct adversarial input testing against the live system to discover reachable states that static analysis and monitoring have not caught. This is the human-in-the-loop component of reachability assurance -- the recognition that automated analysis has blind spots and that adversarial creativity can find paths that algorithms miss.

The goal is not to achieve mathematical certainty that no unauthorized state is reachable -- for systems involving language models, that certainty is not currently attainable. The goal is to systematically reduce the set of reachable unauthorized states, detect when new ones emerge, and close gaps before they are exploited. This is the engineering approach to a formal methods problem: not proof, but evidence-based assurance with continuous improvement.

The Question Every Agent Deployment Should Answer

Before deploying a production AI agent, the team should be able to answer, with supporting evidence: "We have analyzed the agent's action space. We have identified the unauthorized states. We have verified that our policy covers every boundary region of the action space. We have tested adversarially for paths to unauthorized states. We have deployed monitoring to detect if an unauthorized state is reached in production. Here is the evidence for each claim."

Most teams cannot answer this today. The agent works on the test cases. The policy looks right. The team believes the agent will not do anything unauthorized because it has not done so in testing. But the absence of observed failures is not proof of unreachability. At best it is weak evidence, and it may only mean that the test distribution has not yet found the path.

Reachability analysis is the practice that converts "we believe the policy is safe" into "we have evidence that unauthorized states are unreachable, and we are monitoring for the possibility that our evidence is incomplete." That conversion is what separates a prototype from a production system. It is also what separates a team that can explain their agent's safety properties to an auditor from a team that can only say "it worked in testing."