The EU AI Act's Article 13 on transparency gets most of the attention. Article 12 gets almost none. This is a problem, because Article 12 is where the Act specifies what high-risk AI systems must actually record -- the automatic logging of events that makes transparency, traceability, and human oversight technically possible. Article 13 tells you that your system must be transparent. Article 12 tells you what your system must write down.
Most engineering teams building AI-powered products have logging. They log requests, responses, errors, latencies, token counts, and model identifiers. What they do not log -- in the specific form the Act requires -- is the decision-level information that connects inputs to logic to outcomes in a way that a regulator, auditor, or oversight authority can trace after the fact.
This article provides a detailed breakdown of what Article 12 of Regulation (EU) 2024/1689 actually requires, translates those requirements into concrete engineering specifications, and identifies the five specific gaps where most current logging implementations fall short.
Article 12 in Plain Language: What the Regulation Says
Article 12 is titled "Record-keeping" in the final Act; earlier drafts used the heading "Automatic recording of events." Its core requirement is that high-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events (logs) while the system is operating. These logging capabilities must be appropriate to the intended purpose of the high-risk system and in compliance with recognized standards or common specifications.
The Article specifies that the logging capabilities shall ensure a level of traceability appropriate to the intended purpose of the system throughout its lifecycle. In particular, Article 12(2) mandates that logging must enable the monitoring of the system's operation with respect to the occurrence of situations that may result in the system presenting a risk, and must facilitate the post-market monitoring referred to in Article 72.
Article 12(3) adds specificity: for the remote biometric identification systems referred to in point 1(a) of Annex III, the logging capabilities shall provide, at a minimum, the capacity to record:
- The period of each use of the system (start and end dates and times)
- The reference database against which input data has been checked
- The input data for which the search has led to a match
- The identification of the natural persons involved in the verification of the results
These specific requirements are scoped to biometric identification systems -- the Annex III point 1(a) use case. But the general obligations of Article 12(1) and 12(2) apply to all high-risk AI systems, including those in employment, credit, insurance, education, and essential services. The general requirements are where most teams have a compliance gap, because they demand logging capabilities that go well beyond what standard application monitoring provides.
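For concreteness, the four minimum fields can be modelled as a structured record. This is a minimal sketch assuming a Python service; the class and field names are illustrative, not mandated by the Act:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class Article12_3Record:
    """Minimum fields Art. 12(3) requires for Annex III point 1(a) systems.
    Field names are illustrative, not taken from the regulation."""
    use_start: str            # start of the period of use (ISO 8601, UTC)
    use_end: str              # end of the period of use
    reference_database: str   # database the input data was checked against
    matched_input_ref: str    # reference to input data that produced a match
    verifier_ids: tuple       # natural persons who verified the results

def make_record(start: datetime, end: datetime, db: str,
                matched_ref: str, verifiers: list) -> dict:
    # Serialize as a flat dict, ready for an append-only log sink.
    rec = Article12_3Record(
        use_start=start.astimezone(timezone.utc).isoformat(),
        use_end=end.astimezone(timezone.utc).isoformat(),
        reference_database=db,
        matched_input_ref=matched_ref,
        verifier_ids=tuple(verifiers),
    )
    return asdict(rec)
```

The frozen dataclass is a small design choice: the record object cannot be mutated after creation, which foreshadows the immutability requirement discussed later.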
Reading Article 12 Together with Articles 9 and 13
Article 12 does not stand alone. It is one leg of a three-legged compliance obligation that also includes Article 9 (risk management) and Article 13 (transparency). Understanding the logging requirements means reading all three together.
Article 9 requires providers of high-risk AI systems to establish, implement, document, and maintain a risk management system. This system must identify and analyze known and reasonably foreseeable risks, and must adopt suitable risk management measures. The connection to Article 12 is direct: you cannot identify risks from your system's operation if your system does not record the events that would reveal those risks. Article 9 creates the demand; Article 12 creates the supply.
Article 13 requires that high-risk AI systems be designed and developed in such a way that their operation is sufficiently transparent to enable deployers to interpret the system's output and use it appropriately. This includes the obligation to provide deployers with information about the system's performance characteristics, known limitations, and the logic underlying its automated decisions. Article 12's logging is what makes Article 13's transparency mechanically possible. A system that does not record the inputs it evaluated, the logic it applied, and the output it produced cannot retroactively provide the transparency Article 13 demands.
Stanford HAI's policy brief on governing AI makes this interdependency explicit: logging is not a standalone compliance obligation but the technical foundation that makes every other governance requirement satisfiable. Without adequate logging, risk management is speculative, transparency is aspirational, and human oversight is performative.
The Four Logging Dimensions the Act Requires
Translating Articles 9, 12, and 13 into engineering requirements produces four distinct logging dimensions. Each dimension serves a different compliance purpose, and each requires different data to be captured at different points in the system's execution path.
Dimension 1: Event Recording for Traceability
The core Article 12 requirement: the system must automatically record events that enable tracing how a specific output was produced from specific inputs. This is not application logging -- it is decision-path logging. For a system that evaluates a loan application, traceability means capturing the specific data fields the system considered, the specific model version or rule version that processed them, and the specific output that was produced. The record must be sufficient to reconstruct the path from input to output without relying on general descriptions of how the system works.
The engineering implication is that every consequential evaluation must produce a structured record at the point of decision, not at the point of API response. If your system makes three internal decisions before returning a final response to the user -- a risk score calculation, a threshold evaluation, and an action determination -- each of those three decisions is an event that Article 12 requires to be recorded.
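A minimal sketch of what that instrumentation might look like, assuming a Python service evaluating a loan application; the scoring formula, version strings, and threshold are illustrative placeholders:

```python
import uuid
from datetime import datetime, timezone

DECISION_LOG = []  # stand-in for an append-only log sink

def log_decision(trace_id, step, inputs, logic_version, outcome):
    """Record one consequential evaluation as a structured event."""
    DECISION_LOG.append({
        "trace_id": trace_id,
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,              # inputs as evaluated at this step
        "logic_version": logic_version,
        "outcome": outcome,
    })

def evaluate_application(application: dict) -> str:
    trace_id = str(uuid.uuid4())  # links the three events into one trace
    # Decision 1: risk score calculation
    score = round(min(1.0, application["debt"] / max(application["income"], 1)), 2)
    log_decision(trace_id, "risk_score", application, "score-model-v4", score)
    # Decision 2: threshold evaluation
    exceeds = score >= 0.70
    log_decision(trace_id, "threshold_check",
                 {"score": score, "threshold": 0.70}, "policy-v12", exceeds)
    # Decision 3: action determination
    action = "refer_to_review" if exceeds else "approve"
    log_decision(trace_id, "action", {"exceeds": exceeds}, "policy-v12", action)
    return action
```

The point of the sketch is the shape, not the logic: three internal decisions produce three structured events sharing one trace identifier, so the path from input to outcome can be reconstructed later.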
Dimension 2: Human Oversight Indicators
Article 14 of the Act requires that high-risk AI systems be designed to support effective human oversight. Article 12's logging obligations support this by requiring that the system record information about human involvement in the decision process. When a human reviews, overrides, confirms, or modifies an AI-generated output, that interaction must be logged as a distinct event.
This is where most implementations have a structural gap. Systems that route AI outputs through a human review queue typically log the human's final decision but not the AI's original recommendation. A compliance-adequate log captures both: what the AI recommended, what the human decided, and whether those diverged. When they diverge, the log should capture enough context to understand why -- not through natural language explanation, but through structured metadata about the review.
Dimension 3: Operational Monitoring Data
Article 12(2)'s requirement to facilitate monitoring for situations that may present a risk creates an obligation to log operational data that enables anomaly detection. This includes performance metrics that would reveal model drift (changes in output distributions over time), input data that falls outside the system's validated operating conditions, and error rates or confidence scores that fall below acceptable thresholds.
The NIST AI Risk Management Framework, particularly Manage 4.1, provides complementary guidance on what constitutes adequate post-deployment monitoring. NIST recommends logging sufficient data to detect changes in system behavior that indicate the system is no longer operating within its validated performance envelope. The EU AI Act does not prescribe specific metrics, but the "appropriate to the intended purpose" standard in Article 12(1) means that a system without drift detection logging may be found inadequate if its performance degrades in a way that the logging should have caught.
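As an illustration of the kind of drift logging this implies, the sketch below flags when a rolling mean of model confidence departs from a validated baseline. The window size, tolerance, and baseline are assumptions for illustration; production drift detection would likely use distribution-level tests rather than a mean comparison:

```python
from collections import deque
from statistics import mean

class ConfidenceDriftMonitor:
    """Logs an event when the rolling mean of model confidence drifts
    beyond a tolerance from the validated baseline."""
    def __init__(self, baseline_mean, tolerance=0.10, window=100):
        self.baseline = baseline_mean
        self.tolerance = tolerance
        self.window = deque(maxlen=window)
        self.events = []  # stand-in for the operational log sink

    def observe(self, confidence: float) -> bool:
        self.window.append(confidence)
        # Require a minimum sample before judging drift.
        drifted = (len(self.window) >= 10 and
                   abs(mean(self.window) - self.baseline) > self.tolerance)
        if drifted:
            self.events.append({
                "event": "confidence_drift",
                "window_mean": round(mean(self.window), 3),
                "baseline": self.baseline,
            })
        return drifted
```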
Dimension 4: Data Retention and Access Controls
The retention obligation sits just outside Article 12 itself: Article 19(1) requires providers to keep the automatically generated logs for a period appropriate to the intended purpose of the high-risk AI system, of at least six months, unless provided otherwise in applicable Union or national law; Article 26(6) imposes the parallel obligation on deployers. This is a minimum -- specific sectors may require longer retention. The logs must also be made available to competent authorities on reasoned request (Article 21).
The engineering requirements here are retention policy enforcement, access control mechanisms that distinguish between operator access (the deployer's team) and authority access (regulators and auditors), and immutability guarantees that prevent log tampering after the fact. A log that can be modified by the system operator is not a compliance-adequate log.
Where Most Current Logging Implementations Fall Short
Most AI systems in production today have logging. The question is not whether they log, but whether what they log satisfies Article 12's requirements. Five specific gaps appear consistently across organizations that have evaluated their logging against the Act's requirements.
Gap 1: Logging Model Calls Instead of Decisions
The most common gap. Teams log the LLM API call -- the prompt sent, the response received, the token count, the latency -- and treat this as their decision record. But an API call is not a decision. The decision is the outcome that the system committed to based on the model's output, after any post-processing, threshold evaluation, or business logic application. If the model returns a risk score of 0.73 and the system's threshold for action is 0.70, the model call log shows "0.73" but does not capture the threshold evaluation, the threshold value, or the decision to act. That decision is the regulated event; the model call is an implementation detail.
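A minimal sketch of logging the decision rather than the model call, using the 0.73-score example above; the threshold, policy version string, and field names are assumptions for illustration:

```python
decision_log = []  # stand-in for an append-only decision log

RISK_THRESHOLD = 0.70  # illustrative policy threshold

def commit_decision(model_score: float, policy_version: str) -> dict:
    """Log the regulated event: the threshold evaluation and committed
    action -- not the model call that produced the score."""
    record = {
        "event": "risk_action_decision",
        "model_score": model_score,        # implementation detail, kept for context
        "threshold": RISK_THRESHOLD,       # the policy value actually applied
        "policy_version": policy_version,  # which version of the logic ran
        "action": "flag" if model_score >= RISK_THRESHOLD else "pass",
    }
    decision_log.append(record)
    return record
```

The model-call log (prompt, tokens, latency) can live in ordinary application logging; this record is the one a regulator would ask for.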
Gap 2: Missing Input Capture at Decision Time
Traceability requires knowing what the system evaluated when making a decision. Many systems log inputs at the API boundary -- the HTTP request payload or the function call parameters -- but not the specific data that was actually evaluated at the decision point. If the system enriches, transforms, or selects from the input data before evaluation, the API-level log does not reflect what the model or rule actually saw. A compliance-adequate log captures the inputs as they existed at the moment of evaluation, after all preprocessing.
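The sketch below illustrates the distinction: the logged inputs are the post-preprocessing values the logic actually evaluated, not the raw request payload. The field names and eligibility rule are invented for illustration:

```python
def decide_with_input_capture(raw_request: dict) -> dict:
    """Capture inputs as they exist at the moment of evaluation,
    after preprocessing -- not the raw API payload."""
    # Preprocessing: select and normalize only the fields the logic uses.
    evaluated = {
        "income_monthly": round(raw_request["income_yearly"] / 12, 2),
        "region": raw_request["address"]["country"].upper(),
    }
    outcome = "eligible" if evaluated["income_monthly"] >= 2000 else "ineligible"
    # The record holds what the logic actually saw, plus the outcome.
    return {"inputs_as_evaluated": evaluated, "outcome": outcome}
```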
Gap 3: No Version Pinning of Decision Logic
When a regulator asks "what logic did your system apply to this decision?", the answer must be specific to the point in time when the decision was made. If your decision logic lives in a prompt that was updated three times in the last month, and your logs do not record which version of the prompt was active when a specific decision was made, you cannot answer the question. The same applies to model versions, rule versions, feature flag states, and configuration parameters that influence decision outcomes. Article 12's traceability requirement implicitly demands version pinning: every logged decision must reference the specific version of the logic that produced it.
ISO/IEC 42001:2023 makes this requirement explicit in its AI management system framework: the relationship between system configuration (including model version, decision logic version, and operational parameters) and system outputs must be traceable through documented records.
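One way to make version pinning concrete is to attach a logic fingerprint to every decision record: explicit version identifiers plus a hash of the live configuration. A sketch, with illustrative version strings:

```python
import hashlib
import json

def logic_fingerprint(model_version: str, prompt_version: str,
                      config: dict) -> dict:
    """Pin the exact logic that produced a decision: explicit versions
    plus a hash of the active configuration state."""
    # sort_keys makes the hash independent of dict insertion order.
    config_hash = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()
    return {
        "model_version": model_version,
        "prompt_version": prompt_version,
        "config_sha256": config_hash,
    }
```

Embedding this fingerprint in each decision record answers the regulator's question directly: the logic version is part of the evidence, not something reconstructed from deployment timelines.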
Gap 4: Human Oversight Events Not Logged as Distinct Records
Systems that include human review in their decision pipeline often log only the final outcome -- the decision after human review -- without logging the AI's pre-review recommendation as a separate, distinct record. This makes it impossible to measure the rate of human override, identify patterns in which decisions humans change, or demonstrate to a regulator that human oversight is actually functioning (as opposed to humans rubber-stamping AI outputs).
A compliance-adequate implementation logs three distinct events for a human-in-the-loop decision: the AI's recommendation (with full input capture and logic version reference), the human's review action (confirm, modify, or override), and the committed decision (with a reference linking it to both the AI recommendation and the human action). This three-event pattern is what makes human oversight auditable rather than merely claimed.
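The three-event pattern might be sketched as follows, assuming an in-memory stand-in for the immutable log; the field names and divergence flag are illustrative:

```python
import uuid

hitl_log = []  # stand-in for the immutable decision log

def record_hitl_decision(inputs, ai_outcome, logic_version,
                         review_action, reviewer_id, final_outcome):
    """Emit the three linked events: AI recommendation, human review
    action, and committed decision."""
    rec_id, rev_id = str(uuid.uuid4()), str(uuid.uuid4())
    hitl_log.append({"id": rec_id, "event": "ai_recommendation",
                     "inputs": inputs, "logic_version": logic_version,
                     "outcome": ai_outcome})
    hitl_log.append({"id": rev_id, "event": "human_review",
                     "recommendation_id": rec_id, "reviewer": reviewer_id,
                     "action": review_action})  # confirm | modify | override
    committed = {"id": str(uuid.uuid4()), "event": "committed_decision",
                 "recommendation_id": rec_id, "review_id": rev_id,
                 "outcome": final_outcome,
                 "diverged": final_outcome != ai_outcome}
    hitl_log.append(committed)
    return committed
```

Because every committed decision references both the recommendation and the review, override rates and divergence patterns become simple queries over the log rather than forensic reconstruction.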
Gap 5: Mutable Logs Without Integrity Guarantees
Logs stored in a standard database, a mutable file system, or a cloud object store without write protection are not compliance-adequate under Article 12. A log that can be altered by the system operator after the fact -- whether through database updates, file overwrites, or object versioning that allows deletion of previous versions -- does not satisfy the traceability requirement because the record's integrity cannot be independently verified.
Append-only storage, cryptographic hash chaining, write-once-read-many (WORM) storage, or third-party custody of log records are the implementation patterns that satisfy this requirement. The specific technical choice matters less than the property it guarantees: once a log record is written, no actor in the system -- including the system operator -- can alter it without detection.
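A sketch of one of the patterns named above, cryptographic hash chaining: each record embeds the hash of its predecessor, so altering any earlier record breaks verification. This is illustrative only; a production system would also need durable storage and external anchoring of the chain head:

```python
import hashlib
import json

class HashChainedLog:
    """Append-only log where each record embeds the hash of its
    predecessor, so after-the-fact alteration is detectable."""
    def __init__(self):
        self._records = []
        self._prev_hash = "0" * 64  # genesis value

    def append(self, payload: dict) -> dict:
        body = json.dumps(payload, sort_keys=True)
        entry_hash = hashlib.sha256(
            (self._prev_hash + body).encode()
        ).hexdigest()
        record = {"payload": payload, "prev_hash": self._prev_hash,
                  "hash": entry_hash}
        self._records.append(record)
        self._prev_hash = entry_hash
        return record

    def verify(self) -> bool:
        # Recompute the chain; any mutation of an earlier payload
        # invalidates every subsequent hash.
        prev = "0" * 64
        for r in self._records:
            body = json.dumps(r["payload"], sort_keys=True)
            if r["prev_hash"] != prev or r["hash"] != hashlib.sha256(
                    (prev + body).encode()).hexdigest():
                return False
            prev = r["hash"]
        return True
```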
Translating to Engineering Requirements: A Minimal Specification
The following table translates Article 12's obligations into a minimal engineering specification. Each row maps a regulatory requirement to the specific technical capability needed to satisfy it.
| Regulatory Requirement | Engineering Capability | What Must Be Captured |
|---|---|---|
| Automatic recording of events (Art. 12(1)) | Decision-point instrumentation | Structured record at every consequential evaluation, not just API calls |
| Traceability throughout lifecycle (Art. 12(1)) | Input capture + logic version pinning | Inputs as evaluated, logic version active at evaluation time, outcome produced |
| Monitoring for risk situations (Art. 12(2)) | Operational anomaly detection logging | Drift indicators, confidence distributions, out-of-distribution input flags |
| Post-market monitoring support (Art. 12(2)) | Longitudinal performance tracking | Outcome distributions over time, disaggregated by relevant segments |
| Human oversight records (Art. 12(3), Art. 14) | Three-event HITL logging pattern | AI recommendation, human review action, committed decision as separate records |
| Minimum 6-month retention (Art. 19(1), Art. 26(6)) | Retention policy enforcement | Automated retention with sector-specific extensions; no manual deletion |
| Authority access (Art. 21) | Role-based access control for log data | Separate access tiers for operators, auditors, and regulatory authorities |
| Log integrity (implied by traceability) | Immutable storage with integrity verification | Append-only records with tamper-detection mechanisms |
This specification is minimal -- it covers what is required, not what is ideal. Teams building systems in high-sensitivity domains (financial services, employment, healthcare-adjacent) should treat this as a floor, not a ceiling. The NIST AI RMF's Govern 1.2 subcategory provides additional guidance on accountability structures that extend beyond the EU AI Act's minimum requirements into operational best practice.
The Architecture Gap: Why Application Logging Is Not Decision Logging
The fundamental problem most teams face is architectural, not technical. Standard application logging infrastructure -- whether Datadog, Splunk, CloudWatch, or ELK -- is designed to capture system events: HTTP requests, errors, performance metrics, service calls. It is optimized for operational monitoring: "is the system healthy?" and "where is the latency?"
Article 12 requires a different kind of logging that most application logging frameworks were not designed to support. Decision logging captures evaluation events: "what did the system decide, based on what inputs, using what logic, and with what outcome?" These are not system events -- they are business events with specific regulatory semantics.
The practical consequence is that teams cannot satisfy Article 12 by configuring their existing logging infrastructure differently. They need a separate logging layer -- a decision log -- that is instrumented at the decision point rather than at the infrastructure level. This decision log captures structured records with the specific fields the regulation requires (inputs, logic version, outcome, human oversight indicators), stores them with immutability guarantees, and provides access controls appropriate for regulatory inquiry.
For teams that have already implemented a decision plane architecture, this logging layer is a natural extension: every evaluation through the decision plane generates a structured record that satisfies Article 12's requirements by construction. For teams that have not, the Article 12 compliance project and the decision plane architecture project converge -- the logging requirements essentially mandate the architectural separation of decision logic from application logic that the decision plane concept describes.
Implementation Priority: What to Build First
Teams with an August 2026 deadline face a sequencing question: which of these logging capabilities should they build first? The priority depends on current maturity, but a defensible ordering for most teams is:
1. Decision-point instrumentation. Before you can log decisions correctly, you need to identify where in your system decisions are actually made. Map every consequential evaluation -- every point where the system commits to an action that affects a user. This is discovery work, not engineering work, but it is the prerequisite for everything else.
2. Input capture and logic version pinning. At each identified decision point, capture the inputs as evaluated and the version of the logic (model version, prompt version, rule version, configuration state) active at evaluation time. This is the minimum viable traceability record.
3. Immutable storage. Move decision logs to append-only storage with integrity guarantees. This can be a dedicated audit database, a WORM-compliant object store, or a managed audit log service. The key property is that records cannot be altered after creation.
4. Human oversight event logging. Implement the three-event pattern for human-in-the-loop decisions. If your system includes human review, this is a compliance requirement, not an optimization.
5. Operational monitoring data. Add drift detection, confidence distribution tracking, and out-of-distribution input flagging. This supports the Article 12(2) requirement for monitoring risk situations and enables the Article 9 risk management system to operate on actual data rather than assumptions.
Steps one and two can often be completed in weeks. Step three is an infrastructure project that may take longer depending on existing storage architecture. Steps four and five are ongoing operational capabilities that mature over time.
The Relationship Between Article 12 Logging and Decision Traces
Teams familiar with decision trace architecture will recognize that Article 12's requirements are substantially satisfied by a well-implemented decision trace system. A decision trace that captures input atoms, evaluated rule versions, committed outcomes, and human oversight events in an immutable record provides the specific technical capabilities that Article 12 demands.
The alignment is not coincidental. Decision traces were designed as the minimal audit record that makes AI decisions defensible under regulatory scrutiny. Article 12 formalizes the same intuition from the regulatory side: systems that make consequential decisions must produce records that enable those decisions to be traced, reviewed, and evaluated after the fact.
The practical implication for teams is that Article 12 compliance and decision trace implementation are the same project. Building decision traces to satisfy operational governance needs also satisfies the regulatory logging requirement. Building logging to satisfy Article 12 also produces the decision trace records needed for operational governance, audit readiness, and incident investigation.
Beyond Compliance: Why This Logging Matters Operationally
It is tempting to treat Article 12 as a compliance checkbox -- something you build because the regulation requires it, not because it serves the business. This framing understates the operational value of the logging capabilities the Act demands.
Decision-level logging with input capture, logic version pinning, and immutable storage is the foundation for three capabilities that most AI teams need regardless of regulatory context:
- Incident investigation. When a customer reports that the system made an incorrect decision, decision logs tell you exactly what happened: what the system saw, what logic it applied, and what it decided. Without decision logs, incident investigation involves reconstructing probable inputs and guessing at which version of the logic was active -- a process that is slow, unreliable, and produces "best guesses" rather than definitive answers.
- Rule change validation. When you change decision logic, decision logs provide the baseline for comparison: how does the new logic behave compared to the old logic on the same population of inputs? This is the data that makes safe rollout practices possible -- without historical decision records, shadow mode evaluation has nothing to compare against.
- Performance attribution. When business metrics change -- conversion rates shift, churn increases, denial rates spike -- decision logs allow you to attribute the change to specific logic changes, input distribution shifts, or model behavior changes. Without decision-level logging, attribution requires correlating business metrics with deployment timelines and hoping the correlation reveals causation.
The teams that treat Article 12 as an opportunity to build decision logging infrastructure rather than as a compliance burden will find that the infrastructure serves them well beyond the regulatory context. The regulation is demanding capabilities that well-governed AI systems should have had all along.
