When a production AI system makes a wrong decision, two teams respond. The infrastructure team opens the application logs. The governance team asks for the decision record. In most organizations, both requests lead to the same artifact, the application log -- and that is the problem.
Application logs tell you what happened at the system level: which requests came in, which responses went out, how long things took, and what errors occurred. Decision traces tell you what was decided and why: what context the system evaluated, which rules it applied, what those rules were configured to do at the time, and what outcome they produced. The two serve fundamentally different purposes, answer fundamentally different questions, and fail in fundamentally different ways.
Most AI teams have invested heavily in application logging and observability. Few have invested in decision traces. The gap is invisible until an incident occurs that application logs cannot explain -- which, in production AI systems making consequential decisions, is not a matter of "if" but "when." This article examines five production failure modes through both lenses, introduces the trace replay concept, walks through three side-by-side debugging scenarios, and concludes with ten diagnostic signs that your logging infrastructure will fail you during an incident.
The Structural Difference: What Each Captures
Before analyzing failure modes, it is worth being precise about what each artifact contains. The distinction is not about log verbosity or storage volume -- it is about what category of information is recorded.
| Property | Application Logs | Decision Traces |
|---|---|---|
| Primary question answered | "What happened in the system?" | "What was decided, and why?" |
| Unit of record | System event (request, response, error, metric) | Decision evaluation (inputs, rules, outcome) |
| Captures inputs? | Request parameters, headers, payloads | Typed facts (ATOMs) evaluated at decision time, with provenance |
| Captures logic? | No. The code path is implicit in the application behavior. | Yes. Rule name, rule version, and evaluation outcome are explicit. |
| Captures outcome? | Response status, response body | Decision outcome, enforcement action, downstream effect |
| Reproducibility | Low. Replaying a request may produce different results if system state has changed. | High. Given the same inputs and rule versions, the same outcome is produced. |
| Primary consumer | Engineers debugging system behavior | Governance teams, auditors, incident investigators analyzing decision behavior |
LangSmith and Langfuse represent the current state of the art in LLM observability -- they capture model inputs, outputs, latency, token usage, and prompt versions. These are application-level artifacts. They answer "what did the model produce?" They do not answer "what rule evaluated that output, at what version, against what facts, and with what governance outcome?" That second question is what decision traces answer, and it is the question that matters during a governance incident.
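To make the contrast concrete, here is a minimal sketch of what a decision trace record might hold. The class names, the `source` provenance field, the example source systems, and the timestamp are illustrative assumptions, not a schema defined by any particular product.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Tuple
import json


@dataclass(frozen=True)
class Atom:
    """A typed fact evaluated at decision time, with its provenance."""
    name: str
    value: object
    source: str  # hypothetical provenance field: which system asserted the fact


@dataclass(frozen=True)
class DecisionTrace:
    """One decision evaluation: inputs, the rule applied, and what followed."""
    decision_id: str
    atoms: Tuple[Atom, ...]
    rule_name: str
    rule_version: str
    outcome: str        # what the rule decided
    action_taken: str   # what the downstream system actually did
    evaluated_at: str


# Example values are made up for illustration.
trace = DecisionTrace(
    decision_id="txn-001",
    atoms=(
        Atom("transaction.amount", 8500, "payments-api"),
        Atom("customer.account_age_days", 14, "crm"),
    ),
    rule_name="new_account_high_value_block",
    rule_version="2.3.0",
    outcome="flagged",
    action_taken="account_hold",
    evaluated_at=datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat(),
)

print(json.dumps(asdict(trace), indent=2))
```

The record is frozen on purpose: a trace that can be edited after the fact is weaker evidence than one that cannot.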
Five Production Failure Modes: Logs vs. Traces
The following five failure modes are common in production AI systems making consequential decisions. Each one is analyzed through both lenses -- application logs and decision traces -- to show what each reveals and what each misses.
Failure Mode 1: Wrong Rule Fired
Scenario: A fraud detection agent flags a legitimate high-value transaction as suspicious, triggering an account freeze. The customer complains, and the team has to investigate.
Application logs show: An API call to the fraud detection service, a response code indicating "flagged," and a subsequent call to the account management service to apply a hold. Latency was normal. No errors were logged. The system behaved as designed from an infrastructure perspective.
Decision trace shows: The fraud detection evaluation received input ATOMs including transaction.amount = 8500, customer.account_age_days = 14, and transaction.merchant_category = "electronics". Rule new_account_high_value_block v2.3.0 fired because account age was below the 30-day threshold. However, the intended rule was new_account_high_value_review v1.1.0, which routes to human review rather than automatic blocking. The wrong rule fired because rule priority ordering was misconfigured after a recent deployment.
What logs missed: The root cause -- rule priority misconfiguration. The system functioned correctly from an infrastructure perspective; the governance logic was wrong.
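The priority misconfiguration can be sketched as follows. This assumes a first-match engine in which a lower priority number is evaluated first and both rules share the same account-age condition. The `Rule` class, the priority values, and the outcome labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    version: str
    priority: int  # assumed convention: lower number is evaluated first
    matches: Callable[[Dict[str, int]], bool]
    outcome: str


def first_match(rules: List[Rule], atoms: Dict[str, int]) -> Optional[Rule]:
    """Return the highest-priority rule whose condition matches the atoms."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(atoms):
            return rule
    return None


def new_account(atoms: Dict[str, int]) -> bool:
    # Both rules trigger when the account is younger than 30 days.
    return atoms["customer.account_age_days"] < 30


intended = [
    Rule("new_account_high_value_review", "1.1.0", 10, new_account, "human_review"),
    Rule("new_account_high_value_block", "2.3.0", 20, new_account, "block"),
]
# After the deployment, the priorities are swapped (hypothetical values).
misconfigured = [
    Rule("new_account_high_value_review", "1.1.0", 20, new_account, "human_review"),
    Rule("new_account_high_value_block", "2.3.0", 10, new_account, "block"),
]

atoms = {"transaction.amount": 8500, "customer.account_age_days": 14}
print(first_match(intended, atoms).name)
print(first_match(misconfigured, atoms).name)
```

Both configurations return a valid response with no error, which is why the application log looks identical in either case. Only a record of the rule name and version that fired distinguishes them.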
Failure Mode 2: Correct Rule, Wrong Input
Scenario: A loan pre-approval agent rejects an application that should have been approved: the applicant's verified income meets the threshold.
Application logs show: A call to the underwriting service with a request payload, a response indicating "denied," and the downstream notification sent to the applicant. The request/response pair looks normal.
Decision trace shows: The underwriting rule income_threshold_check v4.0.1 evaluated applicant.annual_income = 42000 against a threshold of 45000. The rule correctly denied based on the input. However, the applicant's actual income from the verified source was 72000. The ATOM applicant.annual_income was asserted by an OCR pipeline that misread the income figure from a scanned document. The provenance record on the ATOM shows the assertion source, enabling the team to trace the error to the OCR system.
What logs missed: The data quality issue. The application log shows a denial with no error. The decision trace shows the specific input value that caused the denial and its provenance, pinpointing the OCR pipeline as the root cause.
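A minimal sketch of how provenance on an ATOM supports this kind of investigation: compare the values recorded in the trace against a verified source and report which upstream system asserted each mismatching fact. The `Atom` class, the `source` field, and `find_discrepancies` are illustrative names, not an existing API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Atom:
    name: str
    value: float
    source: str  # hypothetical provenance field


def find_discrepancies(
    atoms: List[Atom], verified: Dict[str, float]
) -> List[Tuple[Atom, float]]:
    """Pair each atom whose recorded value differs from the verified value."""
    return [
        (atom, verified[atom.name])
        for atom in atoms
        if atom.name in verified and verified[atom.name] != atom.value
    ]


trace_atoms = [Atom("applicant.annual_income", 42000, "ocr_pipeline")]
verified_values = {"applicant.annual_income": 72000}

for atom, expected in find_discrepancies(trace_atoms, verified_values):
    print(f"{atom.name}: trace has {atom.value}, verified {expected}, "
          f"asserted by {atom.source}")
```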
Failure Mode 3: Policy Correctly Blocked a Legitimate Action
Scenario: A compliance agent blocks a routine wire transfer to a sanctions-adjacent country. The transfer is legitimate, but the compliance rule is overly broad.
Application logs show: A blocked transaction, a compliance flag, and a case created in the review queue. The system functioned as designed.
Decision trace shows: Rule sanctions_country_screen v3.2.0 evaluated transaction.destination_country = "TR" against a country list that includes Turkey due to a blanket classification. The rule version history shows that Turkey was added to the restricted list in v3.2.0 (three weeks ago) as part of a broader sanctions update, without a carve-out for commercial transactions below a threshold. The trace enables the governance team to evaluate whether the rule is too broad -- not whether the system malfunctioned (it did not), but whether the policy itself needs refinement.
What logs missed: The policy design question. Logs show a system functioning correctly. The trace exposes the specific rule version and configuration that caused the block, enabling policy review rather than incident investigation.
Failure Mode 4: Trace Gap (Silent Decision)
Scenario: A customer's account settings were changed -- their notification preferences were modified -- but no one can determine what triggered the change.
Application logs show: An API call to the account settings service that modified notification preferences. The call originated from the agent orchestration layer. Upstream, there is a model inference call, but the connection between "model produced this output" and "settings were changed" passes through an untraced code path.
Decision trace shows: Nothing. There is no trace for this action because the code path that modifies account settings based on agent output bypasses the decision plane entirely. The change was made by direct API call from orchestration code, not through a governed decision evaluation. The absence of a trace is itself the finding: a consequential action (modifying customer settings) is occurring outside the governed decision path.
What logs missed: Nothing -- the logs captured the event. But they cannot answer "was this change authorized by a governed decision?" because no decision trace exists. The gap is architectural, not observational.
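One way to surface such gaps before an incident is to reconcile consequential actions seen in the application logs against the trace store. The sketch below assumes both sides share a correlation ID. The action names, the `correlation_id` field, and the set of "consequential" actions are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical list of actions that must always pass through the decision plane.
CONSEQUENTIAL_ACTIONS = {"account_settings.update", "account.freeze"}


def find_untraced_actions(
    log_events: List[Dict[str, str]], traced_ids: Set[str]
) -> List[Dict[str, str]]:
    """Return consequential log events that have no matching decision trace."""
    return [
        event
        for event in log_events
        if event["action"] in CONSEQUENTIAL_ACTIONS
        and event["correlation_id"] not in traced_ids
    ]


log_events = [
    {"correlation_id": "req-1", "action": "account_settings.update"},
    {"correlation_id": "req-2", "action": "account.freeze"},
    {"correlation_id": "req-3", "action": "healthcheck"},
]
traced_ids = {"req-2"}  # IDs for which the trace store holds a record

for event in find_untraced_actions(log_events, traced_ids):
    print("untraced consequential action:", event)
```

Run periodically, a check like this turns "the absence of a trace" from something discovered during an incident into a routine report.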
Failure Mode 5: Enforcement Mismatch
Scenario: A decision trace shows that a rule evaluated to "deny," but the downstream system executed "approve." The decision was made correctly but not enforced correctly.
Application logs show: An API call to the approval service that returned a success status. No error. From the infrastructure perspective, the system processed a request and returned a response.
Decision trace shows: Rule expense_limit_check v2.0.0 evaluated to outcome: "deny" because expense.amount = 12500 exceeded the 10000 threshold. The trace records the deny outcome. But the downstream action recorded in the trace shows action_taken: "approved". The mismatch between decision outcome and enforcement action indicates a bug in the enforcement layer -- the code that translates decision outcomes into API calls is not correctly mapping "deny" to the rejection path.
What logs missed: The discrepancy between decision and enforcement. The application log shows a successful approval. Only the decision trace reveals that the approval contradicted the decision.
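If each trace records both the decision outcome and the action actually taken, a mismatch check is a few lines. This is a sketch under that assumption. The mapping from outcomes to expected actions and the `"rejected"` and `"queued_for_review"` labels are hypothetical.

```python
from typing import Dict, List

# Hypothetical mapping from decision outcome to the action that should follow.
EXPECTED_ACTION = {
    "deny": "rejected",
    "approve": "approved",
    "hold_for_review": "queued_for_review",
}


def enforcement_mismatches(traces: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return traces whose recorded action contradicts the decision outcome.

    Outcomes missing from EXPECTED_ACTION are flagged as well, since an
    unknown outcome cannot be verified.
    """
    return [
        t for t in traces
        if EXPECTED_ACTION.get(t["outcome"]) != t["action_taken"]
    ]


traces = [
    {"decision_id": "exp-1", "rule": "expense_limit_check", "version": "2.0.0",
     "outcome": "deny", "action_taken": "approved"},
    {"decision_id": "exp-2", "rule": "expense_limit_check", "version": "2.0.0",
     "outcome": "approve", "action_taken": "approved"},
]

for t in enforcement_mismatches(traces):
    print(f"{t['decision_id']}: decided {t['outcome']}, "
          f"executed {t['action_taken']}")
```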
The Trace Replay Concept
One of the most powerful properties of decision traces is replay. Because a decision trace captures the exact input facts and the exact rule versions evaluated, the decision can be re-executed deterministically: feed the same ATOMs into the same rule versions and verify that the same outcome is produced. This is the trace replay concept, and it serves three purposes.
Incident verification. When investigating a suspicious outcome, replay confirms whether the outcome was correct given the inputs and rules at the time. If replay produces the same outcome, the problem is in the inputs or the rule definition, not in the evaluation engine. If replay produces a different outcome, either the trace did not capture every input the rule depended on or the evaluation engine has a determinism bug -- a critical finding in both cases.
Rule change impact analysis. Before deploying a new rule version, replay historical traces against the new version to measure how many past decisions would change. This is shadow evaluation applied retroactively: instead of waiting for live traffic, use historical traces to project the impact of a rule change before it goes live.
Regulatory demonstration. When a regulator asks "how was this decision made?" trace replay provides a live demonstration, not a narrative explanation. The regulator can observe the evaluation executing with the recorded inputs and rules, producing the recorded outcome. This is a qualitatively different level of evidence than a written explanation or a screenshot of a log entry.
Application logs do not support replay in any meaningful sense. Replaying a request against the current system may produce a different result because the model has been updated, the system state has changed, or the orchestration code has been modified. Trace replay works because it is scoped to the decision layer: when rules are versioned, inputs are typed and captured in full, and rule evaluation has no hidden dependencies (clock, network calls, mutable state), the evaluation is deterministic. Temporal's workflow observability provides replay for workflow execution, which is analogous -- but it replays orchestration steps, not decision evaluations. Decision trace replay is specifically about the rules and facts, not the infrastructure.
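A minimal sketch of replay and retroactive impact analysis, assuming rules are pure functions stored in a registry keyed by name and version. The registry, the `"candidate"` version label, and its 15000 threshold are hypothetical. Only the `expense_limit_check` v2.0.0 threshold of 10000 comes from the example above.

```python
from typing import Callable, Dict, List, Tuple

Atoms = Dict[str, float]
RuleFn = Callable[[Atoms], str]


def make_limit_rule(threshold: float) -> RuleFn:
    """Build a pure rule: deny when the expense exceeds the threshold."""
    def rule(atoms: Atoms) -> str:
        return "deny" if atoms["expense.amount"] > threshold else "approve"
    return rule


# Hypothetical registry of rule versions; "candidate" is not yet deployed.
RULES: Dict[Tuple[str, str], RuleFn] = {
    ("expense_limit_check", "2.0.0"): make_limit_rule(10000),
    ("expense_limit_check", "candidate"): make_limit_rule(15000),
}


def replay(trace: dict) -> bool:
    """Re-run the recorded rule version on the recorded atoms and compare."""
    rule = RULES[(trace["rule_name"], trace["rule_version"])]
    return rule(trace["atoms"]) == trace["outcome"]


def impact(traces: List[dict], rule_name: str, new_version: str) -> List[dict]:
    """Return historical traces whose outcome would change under new_version."""
    new_rule = RULES[(rule_name, new_version)]
    return [
        t for t in traces
        if t["rule_name"] == rule_name and new_rule(t["atoms"]) != t["outcome"]
    ]


def make_trace(trace_id: str, amount: float, outcome: str) -> dict:
    return {"id": trace_id, "rule_name": "expense_limit_check",
            "rule_version": "2.0.0", "atoms": {"expense.amount": amount},
            "outcome": outcome}


history = [
    make_trace("e1", 12500, "deny"),
    make_trace("e2", 8000, "approve"),
    make_trace("e3", 20000, "deny"),
]

print("all replays match:", all(replay(t) for t in history))
print("would change:", [t["id"] for t in impact(history, "expense_limit_check", "candidate")])
```

The same two functions cover incident verification (`replay`) and rule change impact analysis (`impact`). Both depend on the rule being a pure function of its recorded inputs.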
Three Debugging Scenarios: Side-by-Side Analysis
Scenario 1: Financial Approval Denied Incorrectly
Application log investigation: The engineer queries logs for the transaction ID. They find a request to the approval service, a 200 response with a "denied" body, and a notification sent to the applicant. Time spent: 10 minutes. Finding: the system processed the request and returned a denial. No errors. The engineer escalates to the business team and asks "is this the expected behavior for this input?" The business team cannot answer without examining the rule logic, which is embedded in application code that was deployed three versions ago.
Decision trace investigation: The investigator queries the trace store for the transaction ID. They find the trace record showing: applicant.debt_to_income_ratio = 0.43 (asserted by credit bureau integration), rule dti_threshold v5.1.0 evaluated against a threshold of 0.42. The denial was correct per the rule. The investigator checks the rule change history and finds that v5.1.0 tightened the threshold from 0.45 to 0.42 two weeks ago. The applicant would have been approved under the previous version. Time spent: 5 minutes. Finding: the denial was caused by a recent threshold change, correctly applied but potentially too aggressive.
Scenario 2: Agent Tool Call Allowed When It Should Not Have Been
Application log investigation: The engineer finds the tool call in the orchestration logs: the agent called the account-modification API with parameters to upgrade a customer's service tier. The API returned success. The engineer checks the agent's system prompt, which includes a line about requiring manager approval for tier changes. The model ignored the instruction. Time spent: 30 minutes of prompt archaeology. Finding: the model did not follow the prompt instruction. Recommended fix: strengthen the prompt language.
Decision trace investigation: The investigator queries the trace store and finds no trace for this action. The account modification was executed directly by the agent through a tool call, bypassing the decision plane entirely. The agent's authorization to call the account-modification API was not gated by a decision rule -- it was gated by a prompt instruction, which is non-deterministic. Finding: the authorization gap is architectural, not behavioral. The fix is to route account modifications through a governed decision evaluation, not to adjust prompt wording. Time spent: 3 minutes -- the absence of a trace is itself the diagnostic.
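The architectural fix can be sketched as a wrapper that evaluates a rule before any tool call and writes a trace either way. Everything here is illustrative: `governed_call`, the in-memory `TRACE_STORE`, the `tier_change_approval` rule and its version, and the `request.manager_approved` fact are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

TRACE_STORE: List[dict] = []  # stand-in for a real, append-only trace store


def tier_change_rule(atoms: Dict[str, object]) -> str:
    """Hypothetical rule: tier changes require recorded manager approval."""
    return "allow" if atoms.get("request.manager_approved") is True else "deny"


def governed_call(
    tool: Callable[[Dict[str, object]], str],
    atoms: Dict[str, object],
    rule: Callable[[Dict[str, object]], str],
    rule_name: str,
    rule_version: str,
) -> Optional[str]:
    """Run the tool only if the rule allows it, and record a trace either way."""
    outcome = rule(atoms)
    result = tool(atoms) if outcome == "allow" else None
    TRACE_STORE.append({
        "rule_name": rule_name,
        "rule_version": rule_version,
        "atoms": dict(atoms),
        "outcome": outcome,
        "action_taken": "executed" if outcome == "allow" else "blocked",
    })
    return result


def upgrade_tier(atoms: Dict[str, object]) -> str:
    # Stand-in for the real account-modification API call.
    return f"upgraded {atoms['customer.id']}"


blocked = governed_call(
    upgrade_tier,
    {"customer.id": "c-42", "request.manager_approved": False},
    tier_change_rule, "tier_change_approval", "1.0.0",
)
print(blocked, TRACE_STORE[-1]["outcome"])
```

The agent can still request the tier change. Whether it happens is decided by the rule rather than by the model's adherence to the prompt, and the decision is recorded in both cases.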
Scenario 3: Workflow Stalled at Checkpoint
Application log investigation: The engineer finds that a multi-step workflow stopped progressing after step 3 of 5. The orchestration logs show that step 3 completed, step 4 was never initiated. The queue shows no pending messages for step 4. The engineer checks for infrastructure issues: queue health is normal, no service outages, no error logs. Dead end. Time spent: 45 minutes. Finding: "something prevented step 4 from starting, but we cannot determine what."
Decision trace investigation: The investigator queries traces for the workflow ID. Step 3's trace shows that the decision evaluation returned outcome: "hold_for_review" because rule escalation_threshold v1.4.0 detected that workflow.cumulative_value = 52000 exceeded the 50000 automatic-processing limit. The workflow correctly paused to await human review. The "stall" is not a bug -- it is a governance checkpoint operating as designed. Time spent: 4 minutes. Finding: the workflow is waiting for human review as required by the escalation rule.
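The distinction between "decided not to proceed" and "failed to proceed" can be sketched as a small classifier over the trace of the last completed step. The function name and the returned messages are hypothetical. The rule name, version, and `hold_for_review` outcome follow the scenario above.

```python
from typing import Optional


def classify_stall(last_step_trace: Optional[dict]) -> str:
    """Explain a stalled workflow from the trace of its last completed step."""
    if last_step_trace is None:
        # No decision record at all: the step ran outside the decision plane.
        return "untraced: no decision record for the last completed step"
    if last_step_trace["outcome"] == "hold_for_review":
        return (f"governance hold: {last_step_trace['rule_name']} "
                f"v{last_step_trace['rule_version']} is awaiting human review")
    # The decision said to continue, so the stall is an execution failure.
    return "possible failure: decision allowed progress but next step never started"


step3_trace = {
    "rule_name": "escalation_threshold",
    "rule_version": "1.4.0",
    "atoms": {"workflow.cumulative_value": 52000},
    "outcome": "hold_for_review",
}

print(classify_stall(step3_trace))
print(classify_stall(None))
```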
Ten Signs Your Logging Will Fail During an Incident
The following diagnostic checklist identifies structural gaps in logging infrastructure that will surface during a production incident involving AI decision-making. If more than three of these apply to your system, your debugging toolkit is incomplete for governed AI decisions.
- You log model inputs and outputs but not which rules evaluated those outputs. You can reconstruct what the model said but not what the system decided to do about it.
- Your logs do not include rule versions. You know a rule fired, but you cannot determine which version of the rule was active at the time of the decision.
- Reconstructing a past decision requires reading application code from a historical deployment. If the code has been updated, you may be looking at the wrong logic.
- Your audit response to "why was this decision made?" begins with "I think what happened was" rather than "here is the trace record." Narrative reconstruction is not evidence.
- Different decisions made by different parts of your system produce structurally different log entries. Inconsistent logging means inconsistent investigation capability.
- You cannot distinguish between "the system decided not to act" and "the system failed to act." Both produce the same observable result (no action) but have completely different root causes.
- Your logs are mutable. Log entries can be updated, deleted, or overwritten. This disqualifies them as audit evidence.
- Some consequential actions in your system bypass the decision-recording path entirely. Direct API calls from orchestration code, model tool calls that skip the decision layer, or batch processes that run without trace generation create silent decision paths.
- You have no mechanism to detect when a decision outcome and its enforcement action disagree. A rule says "deny" but the downstream system executes "approve," and nobody knows.
- Replaying a past decision against the current system produces a different result, and you have no way to determine whether the difference is due to changed rules, changed inputs, or changed infrastructure. Without versioned rules and captured inputs, you cannot isolate the variable that changed.
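The last item on the checklist, isolating the variable that changed, becomes straightforward once both the original decision and the re-run are captured as traces. The sketch below is an assumption-laden illustration: `diff_traces` and the dictionary layout are hypothetical, and rule version `5.2.0` is invented for the example.

```python
from typing import Dict


def diff_traces(original: dict, rerun: dict) -> Dict[str, object]:
    """Report which recorded variables differ between two traces."""
    names = sorted(set(original["atoms"]) | set(rerun["atoms"]))
    changed_atoms = {
        name: (original["atoms"].get(name), rerun["atoms"].get(name))
        for name in names
        if original["atoms"].get(name) != rerun["atoms"].get(name)
    }
    rule_changed = (
        (original["rule_name"], original["rule_version"])
        != (rerun["rule_name"], rerun["rule_version"])
    )
    return {
        "changed_atoms": changed_atoms,
        "rule_version_changed": rule_changed,
        "outcome_changed": original["outcome"] != rerun["outcome"],
    }


original = {
    "rule_name": "dti_threshold", "rule_version": "5.1.0",
    "atoms": {"applicant.debt_to_income_ratio": 0.43}, "outcome": "deny",
}
rerun = {
    "rule_name": "dti_threshold", "rule_version": "5.2.0",  # hypothetical version
    "atoms": {"applicant.debt_to_income_ratio": 0.43}, "outcome": "approve",
}

print(diff_traces(original, rerun))
```

If the outcome changed while neither the atoms nor the rule version did, the remaining suspect is the evaluation engine or the infrastructure around it.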
Closing the Gap
Application logs and decision traces are not competing approaches. They are complementary layers of production observability that serve different consumers and answer different questions. The error is treating one as a substitute for the other.
Application logs are necessary for infrastructure monitoring, performance analysis, error detection, and system health assessment. They are the foundation of operational observability. Decision traces are necessary for decision accountability, incident investigation at the governance layer, regulatory evidence, and rule change impact analysis. They are the foundation of decision observability.
Teams that have both can answer two distinct questions during any incident: "What happened in the system?" (application logs) and "What was decided and why?" (decision traces). Teams that have only one can answer only one question -- and during a governance incident, the question they cannot answer is the one that matters.
For deeper analysis of decision trace architecture and implementation patterns, see Decision Traces: The Audit Log Pattern That Makes AI Systems Defensible. For the foundational concepts that decision traces capture, see AI Decision Traces Explained.
