A decision trace is the most important artifact most AI teams are not producing. When an AI system approves a loan, escalates a support ticket, suspends an account, or recommends a treatment plan, the trace is the authoritative record of what the system knew, what rules it evaluated, and what outcome those rules produced. Without it, every post-hoc explanation is reconstruction -- plausible narrative built from fragments rather than evidence drawn from the record.
This article explains decision traces from the ground up. It defines what they must contain, distinguishes them from the four other log types teams commonly mistake for traces, walks through a realistic example in detail, identifies the gaps that make traces incomplete even when they look thorough, and provides a practical audit checklist aligned with EU AI Act Article 13 requirements. If you are building AI systems that make consequential decisions -- decisions that affect people, money, access, or compliance -- this is the format your audit evidence needs to take.
What a Decision Trace Must Contain
A decision trace is not a log entry. It is a structured record with six required components, each serving a specific purpose in making the decision reconstructable and defensible. Omitting any one of these components degrades the trace from "audit evidence" to "partial documentation."
1. Identity: Who or What Made the Decision
Every trace must identify the decision-making entity. In AI systems, this is not a single value -- it is a chain: the agent identity (which model or service), the delegating principal (the user or system that authorized the agent to act), and the session or workflow context (which execution run this decision belongs to). A trace that says "the system decided" without specifying which agent, operating under whose authority, in which session, fails the most basic accountability question.
2. Context: What the System Knew at Decision Time
Context is the complete set of inputs the decision system evaluated. Not "all inputs available in the environment" -- specifically the inputs that were presented to the decision logic at the moment of evaluation. This distinction matters because a common trace failure is recording what was available but not what was used. If the system had access to 47 data points but evaluated three of them to produce the decision, the trace must record those three -- with their exact values at evaluation time.
The NIST AI Risk Management Framework calls this "data provenance for decision inputs" -- the ability to trace each input back to its source, verify its integrity, and confirm it was the value used at the moment of evaluation, not a later-updated version.
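One way to capture "what was used" rather than "what was available" is to wrap the inputs in an object that records every read. The following is a minimal sketch, not a prescribed implementation; the `RecordingContext` class and the `loyalty_tier` and `marketing_opt_in` fields are hypothetical names introduced for illustration.

```python
class RecordingContext:
    """Wraps the available inputs and records only the ones the logic reads."""

    def __init__(self, available: dict):
        self._available = dict(available)  # snapshot, so later updates cannot leak in
        self.used = {}                     # name -> exact value at evaluation time

    def __getitem__(self, name: str):
        value = self._available[name]
        self.used[name] = value
        return value


def refund_within_threshold(ctx) -> bool:
    # Hypothetical decision logic that reads a single input.
    return ctx["refund_amount_usd"] <= 1000.00


available = {
    "refund_amount_usd": 847.50,
    "customer_tenure_months": 38,
    "loyalty_tier": "gold",      # available, never read
    "marketing_opt_in": True,    # available, never read
}
ctx = RecordingContext(available)
passed = refund_within_threshold(ctx)
print(ctx.used)  # only the inputs the logic actually evaluated
```

Only `ctx.used` would go into the trace's context section; the unread fields stay out of the record.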
3. Evaluated Rules: Which Logic Applied
The trace must record which rules or policies were evaluated, at which version, and what each rule returned. This is the component that makes traces fundamentally different from logs. A log records that an event happened. A trace records which deterministic logic governed that event. If the decision was "deny this refund," the trace must show which rule produced that denial, what version of the rule was active, and what inputs triggered it.
Version pinning is critical. A rule that was version 3.2 when the decision was made may now be version 4.1. The trace must reference the version that was active at decision time, not the current version. Without version pinning, replay fidelity is impossible.
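Version pinning can be sketched as a registry keyed by rule name and version, so that a replay resolves the exact logic recorded in the trace. This is an illustrative sketch under stated assumptions: `RULE_REGISTRY` and `replay_rule` are hypothetical names, and the stricter 3.0.0 rule is invented to show the effect of a later change.

```python
from typing import Callable, Dict, Tuple

RuleFn = Callable[[dict], bool]

# Hypothetical registry: every published (rule_name, rule_version) pair stays
# resolvable, so old traces can still be replayed after a rule changes.
RULE_REGISTRY: Dict[Tuple[str, str], RuleFn] = {
    ("refund_amount_threshold", "2.3.0"): lambda ctx: ctx["refund_amount_usd"] <= 1000.00,
    # A later, stricter version (illustrative only).
    ("refund_amount_threshold", "3.0.0"): lambda ctx: ctx["refund_amount_usd"] <= 500.00,
}


def replay_rule(rule_name: str, rule_version: str, context: dict) -> str:
    """Re-evaluate a rule at the exact version recorded in the trace."""
    rule = RULE_REGISTRY[(rule_name, rule_version)]
    return "PASS" if rule(context) else "FAIL"


context = {"refund_amount_usd": 847.50}
pinned = replay_rule("refund_amount_threshold", "2.3.0", context)   # version in the trace
current = replay_rule("refund_amount_threshold", "3.0.0", context)  # version active today
print(pinned, current)
```

The same inputs pass under the pinned version and fail under the newer one, which is why a replay against "whatever is current" cannot reproduce the original decision.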
4. Data Provenance: Where the Inputs Came From
Context tells you what the inputs were. Data provenance tells you where they came from -- which database, which API, which retrieval system, at what timestamp. Provenance matters because input integrity is a precondition for decision integrity. If the system evaluated a credit score of 720, the trace should record that this value was retrieved from credit bureau X at timestamp T, not simply that the value was 720.
In agentic AI systems where context is assembled from multiple retrieval sources, RAG pipelines, and real-time API calls, provenance tracking is the difference between a defensible decision and one that can be challenged on the basis of "how do you know that input was accurate?"
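A provenance-carrying input might look like the sketch below, where each value is stored together with its source and retrieval time at the moment it is fetched. The `Atom` dataclass and `capture_atom` helper are hypothetical names, and `credit_bureau_x` stands in for the "credit bureau X" of the example above.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)  # frozen: a captured input cannot be mutated afterwards
class Atom:
    name: str
    type: str
    value: Any
    source: str
    retrieved_at: str


def capture_atom(name: str, type_: str, value: Any, source: str,
                 now: Optional[datetime] = None) -> Atom:
    """Stamp a value with its source and a UTC retrieval time."""
    moment = now or datetime.now(timezone.utc)
    stamp = moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return Atom(name, type_, value, source, stamp)


retrieved = datetime(2026, 5, 6, 14, 32, 7, 312000, tzinfo=timezone.utc)
atom = capture_atom("credit_score", "integer", 720, "credit_bureau_x", now=retrieved)
print(asdict(atom))
```

Stamping at retrieval time, rather than when the trace is written, keeps the recorded timestamp tied to the moment the value was actually read.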
5. Outcome: What the System Decided
The outcome is the committed result of rule evaluation: approve, deny, escalate, modify, or defer. This must be an explicit, system-defined value -- not free-text model output. A trace outcome of "the model said it seemed like a good idea to approve" is not a decision record. A trace outcome of "decision_outcome": "APPROVE" from a deterministic evaluation is.
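A closed set of outcomes can be enforced with an enumeration, so that free-text model output is rejected before it reaches the trace. This is a small sketch; the `DecisionOutcome` enum and `parse_outcome` helper are hypothetical names, and the five members mirror the outcomes listed above.

```python
from enum import Enum


class DecisionOutcome(Enum):
    APPROVE = "APPROVE"
    DENY = "DENY"
    ESCALATE = "ESCALATE"
    MODIFY = "MODIFY"
    DEFER = "DEFER"


def parse_outcome(raw: str) -> DecisionOutcome:
    """Accept only system-defined outcome values."""
    try:
        return DecisionOutcome(raw)
    except ValueError:
        raise ValueError(f"not a system-defined outcome: {raw!r}") from None


outcome = parse_outcome("APPROVE")
print(outcome.value)

try:
    parse_outcome("it seemed like a good idea to approve")
    rejected = False
except ValueError:
    rejected = True  # free text never becomes a recorded outcome
```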
6. Enforcement Record: What Actually Happened
The enforcement record closes the loop between decision and action. It documents what the system actually did after the decision was rendered. Did the approved refund actually process? Was the denied access request actually blocked? Was the escalated ticket actually routed to the specified queue? The enforcement record turns a decision trace from "what the system intended to do" into "what the system actually did," which is what auditors and incident investigators need.
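One possible shape for this is a wrapper that runs the action and records the result whether it succeeds or fails. The sketch below is illustrative only: the `enforce` function and the `status` and `error` fields are assumptions, not part of the trace format described in this article.

```python
from datetime import datetime, timezone
from typing import Callable


def enforce(decision_outcome: str, execute: Callable[[], str]) -> dict:
    """Run the action and record what actually happened, including failures."""
    record = {"decision_outcome": decision_outcome}
    try:
        record["action_taken"] = execute()
        record["status"] = "EXECUTED"
    except Exception as exc:
        # A failed action is still recorded; silence here would hide the gap
        # between what was decided and what was done.
        record["action_taken"] = None
        record["status"] = "FAILED"
        record["error"] = str(exc)
    record["executed_at"] = datetime.now(timezone.utc).isoformat()
    return record


def failing_gateway() -> str:
    raise RuntimeError("payment gateway timeout")


ok = enforce("APPROVE", lambda: "REFUND_INITIATED")
failed = enforce("APPROVE", failing_gateway)
print(ok["status"], failed["status"])
```

Both records carry the same decision, but only one shows the decision was carried out, which is the distinction the enforcement record exists to preserve.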
Decision Traces vs. Other Log Types
The most common governance failure that McKinsey's 2025 State of AI survey identifies is organizations believing they have adequate audit evidence because they have comprehensive logging. They do have comprehensive logging. They do not have decision traces. These are different artifacts.
| Log Type | What It Records | What It Cannot Answer |
|---|---|---|
| Application logs | Errors, warnings, HTTP requests, service calls | Why the system decided to take this action |
| LLM observability logs | Prompt/response pairs, token usage, latency, model version | Which governance rules approved or denied the action |
| Event logs | State transitions: "status changed from active to suspended" | What logic determined this transition should happen |
| Audit event logs | Who accessed what resource at what time | What the decision evaluation inputs and rule outcomes were |
| Decision traces | Identity, context, rules evaluated, provenance, outcome, enforcement | (Complete for decision-level accountability) |
Each of these log types is valuable. None of the first four substitutes for a decision trace. A well-instrumented production system has all five: application logs for debugging, LLM observability for model performance, event logs for state tracking, audit event logs for access control, and decision traces for governance accountability. The problem is not having too many log types; it is not recognizing that decision traces are a distinct artifact with distinct requirements.
A Realistic Trace Example: Financial Approval Workflow
To make the abstract concrete, here is a decision trace for a realistic scenario: an AI agent evaluating whether to approve a customer's request for an expedited refund. This is a consequential decision -- it affects money, customer relationships, and potentially regulatory obligations.
{
  "trace_id": "tr_8f2a4b91-c7e3-4d1f-b6a2-91e4f3c8d7b5",
  "schema_version": "2.1.0",
  "timestamp": "2026-05-06T14:32:07.892Z",
  "identity": {
    "agent_id": "refund-agent-prod-east-1",
    "agent_version": "3.4.2",
    "delegating_principal": "user:cs-rep-jmartinez",
    "delegation_scope": "refund_processing",
    "session_id": "sess_a91b2c3d"
  },
  "context": {
    "atoms_evaluated": [
      {
        "name": "refund_amount_usd",
        "type": "decimal",
        "value": 847.50,
        "source": "order_service",
        "retrieved_at": "2026-05-06T14:32:07.201Z"
      },
      {
        "name": "customer_tenure_months",
        "type": "integer",
        "value": 38,
        "source": "customer_db",
        "retrieved_at": "2026-05-06T14:32:07.312Z"
      },
      {
        "name": "refund_count_90d",
        "type": "integer",
        "value": 1,
        "source": "refund_history_service",
        "retrieved_at": "2026-05-06T14:32:07.445Z"
      },
      {
        "name": "order_status",
        "type": "enum",
        "value": "DELIVERED",
        "source": "order_service",
        "retrieved_at": "2026-05-06T14:32:07.201Z"
      },
      {
        "name": "customer_region",
        "type": "string",
        "value": "EU-DE",
        "source": "customer_db",
        "retrieved_at": "2026-05-06T14:32:07.312Z"
      }
    ]
  },
  "rules_evaluated": [
    {
      "rule_name": "refund_amount_threshold",
      "rule_version": "2.3.0",
      "condition": "refund_amount_usd <= 1000.00",
      "result": "PASS",
      "note": "Within auto-approval threshold"
    },
    {
      "rule_name": "customer_tenure_minimum",
      "rule_version": "1.1.0",
      "condition": "customer_tenure_months >= 6",
      "result": "PASS",
      "note": "Long-tenured customer"
    },
    {
      "rule_name": "refund_frequency_cap",
      "rule_version": "1.4.0",
      "condition": "refund_count_90d < 3",
      "result": "PASS",
      "note": "Below frequency cap"
    },
    {
      "rule_name": "eu_consumer_rights_expedited",
      "rule_version": "3.0.1",
      "condition": "customer_region STARTS_WITH 'EU' AND order_status = 'DELIVERED'",
      "result": "PASS",
      "note": "EU consumer expedited refund right applies"
    }
  ],
  "decision_outcome": "APPROVE",
  "decision_confidence": "DETERMINISTIC",
  "enforcement": {
    "action_taken": "REFUND_INITIATED",
    "refund_id": "ref_7d3e2f1a",
    "amount_usd": 847.50,
    "method": "original_payment_method",
    "executed_at": "2026-05-06T14:32:08.103Z",
    "execution_latency_ms": 211
  }
}
Read this trace from top to bottom and notice what it gives you:
- Who: The refund agent (version 3.4.2), operating under delegation from customer service representative J. Martinez, scoped to refund processing.
- What it knew: Five specific facts, each typed, each sourced, each timestamped to the millisecond of retrieval. Not "all available customer data" -- the five atoms the decision logic actually evaluated.
- What rules applied: Four rules, each version-pinned, each with a clear condition and result. Any one of these rules returning "FAIL" would have changed the outcome.
- What it decided: Approve. Deterministically -- not "probably" or "based on model confidence."
- What happened: Refund initiated, with a specific refund ID, amount, method, and execution timestamp.
Six months from now, if a regulator asks "why was this refund approved?" you can hand them this trace. No reconstruction. No inference. No "I think what happened was." The record is complete.
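To see why the outcome in this trace is deterministic, the four conditions can be re-evaluated against the five recorded atoms. The sketch below hand-translates the trace's condition strings into Python; the `evaluate` function is hypothetical, and mapping any failed rule to "DENY" is an assumption, since the trace only tells us that a failure would change the outcome.

```python
# The five atoms recorded in the trace, keyed by name.
atoms = {
    "refund_amount_usd": 847.50,
    "customer_tenure_months": 38,
    "refund_count_90d": 1,
    "order_status": "DELIVERED",
    "customer_region": "EU-DE",
}

# (rule_name, rule_version, condition) with conditions translated by hand.
RULES = [
    ("refund_amount_threshold", "2.3.0", lambda a: a["refund_amount_usd"] <= 1000.00),
    ("customer_tenure_minimum", "1.1.0", lambda a: a["customer_tenure_months"] >= 6),
    ("refund_frequency_cap", "1.4.0", lambda a: a["refund_count_90d"] < 3),
    ("eu_consumer_rights_expedited", "3.0.1",
     lambda a: a["customer_region"].startswith("EU") and a["order_status"] == "DELIVERED"),
]


def evaluate(atoms: dict):
    """Evaluate every rule and derive the outcome from the results alone."""
    results = [
        {"rule_name": name, "rule_version": version,
         "result": "PASS" if condition(atoms) else "FAIL"}
        for name, version, condition in RULES
    ]
    # Assumption: any FAIL yields DENY; a real policy might escalate instead.
    outcome = "APPROVE" if all(r["result"] == "PASS" for r in results) else "DENY"
    return results, outcome


results, outcome = evaluate(atoms)
print(outcome)

# The same rules with one changed input produce a different outcome.
_, over_cap = evaluate({**atoms, "refund_count_90d": 3})
```

Because the outcome is a pure function of the recorded atoms and pinned rules, running it again on the same trace yields the same answer.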
Common Gaps: When Traces Look Complete But Are Not
Teams that implement decision traces often produce traces that appear comprehensive but fail under audit scrutiny. The following five gaps are the most common.
Gap 1: Missing Data Provenance
The trace records that refund_amount_usd = 847.50 but not where that value came from. When an auditor asks "how do you know the refund amount was $847.50?" the answer must be "it was retrieved from the order service at timestamp T" -- not "it was in the request." Without provenance, input integrity cannot be verified.
Gap 2: Unversioned Rules
The trace records that refund_amount_threshold returned PASS, but does not record which version of the rule was evaluated. If the threshold was $1,000 then but is now $500, a replay attempt will produce a different result. Without version pinning, replay fidelity fails, and replay fidelity is what lets you demonstrate the transparency that Article 13 of the EU AI Act requires of high-risk systems.
Gap 3: Missing Enforcement Record
The trace records a decision of "APPROVE" but not whether the approval was actually executed. Did the refund process? Was it intercepted by a downstream system? Did it fail due to a payment gateway error? Without the enforcement record, the trace documents intention but not action. An auditor needs to know what the system did, not just what it decided.
Gap 4: Silent Decisions
This is the most dangerous gap because it is invisible. Silent decisions are consequential actions that execute outside the traced decision path. If the agent has a fallback code path that handles edge cases without routing through the decision layer -- a hardcoded exception, a legacy API call, a manual override that bypasses governance -- those decisions generate no trace. The trace store shows 100% of traced decisions were governed. The reality may be that a meaningful share of decisions, say 15%, were never traced at all.
Detection: compare the count of consequential actions (refunds processed, accounts suspended, access grants issued) against the count of decision traces for the corresponding decision type in the same time window. Persistent discrepancies indicate silent decision paths. This is the "trace reachability" problem discussed in depth in the infrastructure guide.
Gap 5: Conflated Identity
The trace records "agent-prod" as the identity but does not distinguish between the agent identity, the delegating principal, and the delegation scope. When a human customer service representative delegates refund authority to an agent, the trace must record all three: who the agent is, who delegated, and what scope was granted. A trace with only an agent ID cannot answer "who authorized this agent to act?"
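Gaps 1, 2, 3, and 5 are properties of a single record, so they can be checked mechanically. The following sketch assumes the field names used in the refund trace earlier in this article; `find_gaps` is a hypothetical helper. Gap 4 cannot be detected this way, because a silent decision leaves no record to inspect.

```python
from typing import List


def find_gaps(trace: dict) -> List[str]:
    """Return a description of each per-record gap found in a trace."""
    gaps = []
    # Gap 1: every evaluated input needs a source and a retrieval time.
    for atom in trace.get("context", {}).get("atoms_evaluated", []):
        if not atom.get("source") or not atom.get("retrieved_at"):
            gaps.append(f"gap 1: atom {atom.get('name')!r} lacks provenance")
    # Gap 2: every evaluated rule needs a pinned version.
    for rule in trace.get("rules_evaluated", []):
        if not rule.get("rule_version"):
            gaps.append(f"gap 2: rule {rule.get('rule_name')!r} is unversioned")
    # Gap 3: the decision must be paired with what actually happened.
    if not (trace.get("enforcement") or {}).get("action_taken"):
        gaps.append("gap 3: no enforcement record")
    # Gap 5: agent, delegating principal, and scope must all be present.
    identity = trace.get("identity") or {}
    for field in ("agent_id", "delegating_principal", "delegation_scope"):
        if not identity.get(field):
            gaps.append(f"gap 5: identity lacks {field}")
    return gaps


thin_trace = {
    "identity": {"agent_id": "agent-prod"},
    "context": {"atoms_evaluated": [{"name": "refund_amount_usd", "value": 847.50}]},
    "rules_evaluated": [{"rule_name": "refund_amount_threshold", "result": "PASS"}],
    "decision_outcome": "APPROVE",
}
for gap in find_gaps(thin_trace):
    print(gap)
```

A check like this could run when a trace is written, so incomplete records are rejected before an audit discovers them.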
Trace Reachability: Are All Your Decisions Traced?
Trace reachability is the property that every consequential decision in your system produces a decision trace. It is the completeness guarantee. Unlike the six components of a decision trace, which are properties of individual records, reachability is a property of the system as a whole.
Testing reachability requires three steps:
1. Enumerate all decision types. List every consequential action your system can take: approve, deny, escalate, suspend, grant, revoke, transfer, delete. Each is a decision type.
2. Count actions vs. traces. For each decision type, over a given time window, count the number of actions executed and the number of corresponding decision traces generated. These counts should match.
3. Investigate discrepancies. Any discrepancy indicates an untraced decision path. These paths are architectural gaps, not logging bugs. Fixing them requires routing the untraced path through the decision layer, not adding a log statement.
Reachability analysis should run as an automated check, not a periodic manual audit. The write-up on the reachability problem discusses this discipline in detail, including how to detect rules that never fire (which may indicate untraced decision paths that bypass the rule set entirely).
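The count comparison at the heart of a reachability check can be sketched in a few lines. The `reachability_gaps` function and the `decision_type` field are hypothetical; the sketch assumes both action records and traces for the same time window are already loaded as lists of dictionaries.

```python
from collections import Counter
from typing import Dict, List


def reachability_gaps(actions: List[dict], traces: List[dict]) -> Dict[str, int]:
    """Return decision types whose action and trace counts differ.

    A positive value means actions executed without a trace; a negative
    value means traces exist with no matching executed action.
    """
    action_counts = Counter(a["decision_type"] for a in actions)
    trace_counts = Counter(t["decision_type"] for t in traces)
    all_types = set(action_counts) | set(trace_counts)
    return {
        decision_type: action_counts[decision_type] - trace_counts[decision_type]
        for decision_type in sorted(all_types)
        if action_counts[decision_type] != trace_counts[decision_type]
    }


actions = [
    {"decision_type": "refund"}, {"decision_type": "refund"},
    {"decision_type": "refund"}, {"decision_type": "suspend"},
]
traces = [
    {"decision_type": "refund"}, {"decision_type": "refund"},
    {"decision_type": "suspend"},
]
print(reachability_gaps(actions, traces))
```

Matching counts are necessary but not sufficient: two errors can cancel out. A stricter variant would join actions to traces on a shared identifier, such as the refund ID, rather than comparing totals.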
EU AI Act Article 13 Audit Checklist
Article 13 requires that high-risk AI systems be designed so that their operation is "sufficiently transparent to enable deployers to interpret a system's output and use it appropriately." For teams preparing for compliance, the following checklist maps decision trace capabilities to Article 13 obligations.
| Article 13 Requirement | Decision Trace Component | Audit Evidence |
|---|---|---|
| System capabilities and limitations must be documented | Rule definitions with version history | Versioned rule set export showing all active policies |
| Intended purpose and conditions of use must be specified | Delegation scope and agent identity | Trace records showing scope constraints on each agent |
| Level of accuracy and relevant metrics must be provided | Decision confidence field (deterministic vs. probabilistic) | Aggregate accuracy reporting from trace outcome analysis |
| Known foreseeable risks must be documented | Reachability analysis reports | Regular reachability gap reports showing untraced paths |
| Human oversight measures must be enabled | Interrupt/approval records in trace | Traces showing human approval events with identity and scope |
| Interpretation of outputs must be possible | Rules evaluated with conditions and results | Per-decision trace showing exact logic path to outcome |
The practical test: for any decision your system made in the past 90 days, can you produce a trace record that satisfies every row in this table within five minutes? If the answer is yes for sampled decisions across decision types, your trace infrastructure is Article 13-ready. If the answer requires engineering investigation for any row, you have a gap.
How to Read a Decision Trace
Decision traces are designed to be read by three audiences: engineers debugging a decision, compliance officers conducting audits, and incident investigators reconstructing what happened. Each audience reads the same trace differently.
Engineers read bottom-up: start with the enforcement record (what actually happened), check the decision outcome (was it correct), then examine the rules evaluated (which rule drove the outcome) and the context (were the inputs accurate). The engineering question is "did the system behave correctly given its inputs and rules?"
Compliance officers read top-down: start with identity (who authorized this agent), check delegation scope (was the agent operating within bounds), review rules evaluated (do these rules match documented controls), and verify the enforcement record (was the decision actually enforced). The compliance question is "does this decision satisfy our stated governance controls?"
Incident investigators read outward from the anomaly: start with the decision that caused the incident, trace backward to the context (were the inputs poisoned or stale), the rules evaluated (did the right rules fire, or were some unreachable), and the identity chain (was there an unauthorized delegation). The investigation question is "what went wrong and where in the decision chain did it go wrong?"
A well-structured trace serves all three readings without modification. This is why the six components -- identity, context, rules, provenance, outcome, enforcement -- are all required: each component answers questions that at least one audience will ask.
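The three reading orders can be expressed as views over the same record, which makes the "without modification" point concrete. This is an illustrative sketch: `READING_ORDER` and `read_trace` are hypothetical names, and the key lists are one interpretation of the orders described above, using the top-level keys of the refund trace.

```python
# Each audience walks the same top-level keys in a different order.
READING_ORDER = {
    "engineer": ["enforcement", "decision_outcome", "rules_evaluated", "context"],
    "compliance": ["identity", "rules_evaluated", "enforcement"],
    "investigator": ["decision_outcome", "context", "rules_evaluated", "identity"],
}


def read_trace(trace: dict, audience: str) -> list:
    """Return (section, content) pairs in the order this audience reads them."""
    return [(key, trace[key]) for key in READING_ORDER[audience] if key in trace]


trace = {
    "identity": {"agent_id": "refund-agent-prod-east-1"},
    "context": {"atoms_evaluated": []},
    "rules_evaluated": [],
    "decision_outcome": "APPROVE",
    "enforcement": {"action_taken": "REFUND_INITIATED"},
}
for section, _ in read_trace(trace, "engineer"):
    print(section)
```

The trace itself is never copied or reshaped; only the traversal order changes per audience.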
From Traces to Trust
Decision traces are not a compliance checkbox. They are the mechanism by which AI systems earn and maintain trust -- with regulators, with enterprise customers, with internal stakeholders, and with the people affected by automated decisions. A system that can produce a complete, immutable, replayable record of every consequential decision is a system that can be held accountable. A system that cannot is a system that must be trusted blindly, which is exactly the kind of trust that regulators, customers, and affected individuals are no longer willing to extend.
For the infrastructure that generates these traces automatically, see Infrastructure for Deterministic AI Decisions. For the audit log pattern that makes traces legally defensible, see Decision Traces: The Audit Log Pattern That Makes AI Systems Defensible.
