
Effective AI Oversight Through Proof Drills


Jared Clark

April 1, 2026

A new framework for AI governance asks organizations to stop describing their oversight and start proving it.

Most organizations with an AI governance program can tell you what their oversight looks like. They have a policy. They have a responsible use framework. Some have a governance charter, a review committee, maybe a set of approved use cases with sign-off documentation. Ask them to describe their AI oversight, and they can walk you through it.

Ask them to prove it for a specific decision that happened last Tuesday, and things get quiet.

This is the gap that a governance practitioner named Kostakis Bouzoukas put his finger on in a March 31, 2026 piece in The Regulatory Review. Bouzoukas, who works on governance for large-scale software and device ecosystems, proposes what he calls a "proof drill" — a recurring exercise where an organization takes one recent AI-influenced decision and tries to reconstruct it from actual records. Not from memory. Not from the policy document. From the timestamped trail that was or wasn't created at the time.

The idea is deceptively simple. But what it surfaces is not simple at all.


What the Drill Actually Tests

A proof drill is not an audit. That distinction matters. An audit sweeps across a system — it looks at policies, logs, access controls, documentation practices — and tries to form a general picture of how well the governance program is working. A proof drill does something narrower and, in some ways, harder: it picks one case and asks whether you can reconstruct it.

Here is what that looks like in practice. Say your organization uses an AI system to help route customer complaints — the model reads incoming tickets and assigns priority scores, and those scores influence which complaints get escalated to a human reviewer within the hour versus which ones wait in a queue. Last Wednesday, a complaint was routed. Can you, right now, produce a compact case file that shows: which version of the model was running, what the inputs were, what the model returned, whether a human reviewed or changed that output, who was accountable for the outcome, and what records were created in real time?

If you can do that in 72 hours — Bouzoukas's suggested starting window — your governance program has operational weight. If you cannot, the program exists largely on paper, regardless of how thorough the written policies are.

I think that framing is exactly right, and it's worth sitting with the discomfort of it. Documentation describes a system. Proof drills test whether the system is real.


The Four Things You Need to Reconstruct

Bouzoukas lays out four requirements for a reconstructable case file. Each one sounds reasonable on its own. Together, they sketch a picture of what genuine AI accountability looks like at the level of individual decisions — not governance frameworks in the abstract.

The first is an audit trail: timestamped inputs, outputs, and any human edits made to what the AI produced. This is the most basic requirement, and it's where many organizations first realize they have a problem. Logs often exist somewhere, but "somewhere" can mean a vendor's servers, a developer's workstation, a database that nobody has queried, or a system that retains records for only a few days before overwriting them. Having a logging policy is not the same as having a retrievable log.
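
To make that distinction concrete, here is a minimal sketch of what decision-time logging could look like. The function and field names are my own illustration, not a reference to any particular tool; the point is simply that the record is written, with a timestamp, at the moment the decision happens rather than reconstructed afterward.

```python
import json
from datetime import datetime, timezone

def log_decision_event(log_path, decision_id, model_version, inputs, output,
                       human_edit=None, reviewer=None):
    """Append one timestamped record of an AI-influenced decision.

    Illustrative only: a real deployment would write to durable,
    access-controlled storage rather than a local file.
    """
    record = {
        "decision_id": decision_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,   # which system was in force
        "inputs": inputs,                 # what the model saw
        "output": output,                 # what the model returned
        "human_edit": human_edit,         # any change a reviewer made
        "reviewer": reviewer,             # named owner of the review, if any
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Whether the entries land in a file, a database, or a vendor's platform matters less than whether they can be retrieved later, which is exactly what the drill tests.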

The second is what Bouzoukas calls a system-in-force record: documentation of which model was running at the time of the decision, what instructions or prompts it was given, and who approved that version for use. This one is easy to underestimate. AI systems get updated frequently — model weights change, prompts get revised, configuration parameters shift. Without a record of what was actually deployed at a specific moment, you cannot accurately reconstruct the decision. You can only reconstruct what you think was happening, which is a different thing.

The third is a chain of reliance: traceable documentation of the key inputs the AI drew on. In a complaint routing system, this might be the text of the complaint itself and any customer history the model accessed. In a medical context, it might be the data points the model reviewed before generating a recommendation. The question is whether you can actually follow the thread from the decision back to the information that shaped it.

The fourth is an accountability record: named human owners, documentation of what checks were performed, and written rationale for any decisions to change or override what the AI produced. This last piece is the one most visibly missing from real AI deployments. Human review often happens informally — someone glances at the output, accepts it, and moves on. No record is created. No name is attached. The decision gets made, but the oversight exists only in the moment, not in any file.
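
Taken together, the four requirements describe a single artifact. As a rough sketch (the type names below are mine, not Bouzoukas's), a reconstructable case file might be modeled as one structure with a slot for each record, so that any empty slot is immediately visible.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AuditTrail:
    """Timestamped inputs, outputs, and any human edits to the AI's output."""
    entries: List[dict]

@dataclass
class SystemInForce:
    """Which model version was running, with what instructions, approved by whom."""
    model_version: str
    instructions_or_prompt: str
    approved_by: str

@dataclass
class ChainOfReliance:
    """The key inputs the model drew on, traceable back to their sources."""
    sources: List[str]

@dataclass
class AccountabilityRecord:
    """Named owner, checks performed, and rationale for any override."""
    owner: str
    checks_performed: List[str]
    override_rationale: Optional[str] = None

@dataclass
class CaseFile:
    """A compact, reconstructable record of one AI-influenced decision."""
    decision_id: str
    audit_trail: Optional[AuditTrail] = None
    system_in_force: Optional[SystemInForce] = None
    chain_of_reliance: Optional[ChainOfReliance] = None
    accountability: Optional[AccountabilityRecord] = None
```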

What strikes me about these four requirements is that none of them are exotic. All four would be expected in any regulated decision process in finance, medicine, or legal work. The strange thing is that they're absent in so many AI deployments that were never thought of as "regulated" — even when the decisions being made are consequential.


Why 2026 Is the Turning Point

Bouzoukas situates his argument in the current regulatory moment, and the timing is not incidental. The EU AI Act's obligations began phasing in during 2025, and its most substantive requirements for high-risk systems take effect through 2026 and 2027. One of those requirements is that high-risk AI systems maintain logging sufficient to allow post-hoc reconstruction of decisions. That's not a new idea in regulatory terms — it's essentially the same requirement that financial services regulators imposed on trading systems decades ago. But applied to AI, it's new territory for most organizations.

What proof drills add to the EU AI Act's logging requirement is something the regulation doesn't specify: a practice for testing whether your logging actually works. A regulation can require logs. It cannot require that you verify those logs are complete, retrievable, and meaningful before an examiner asks for them. That's the gap proof drills close.

The broader 2026 regulatory climate matters here too. In the United States, the current administration has favored a lighter regulatory touch on AI, but that doesn't mean scrutiny disappears. Federal agencies deploying AI systems remain subject to Office of Management and Budget guidance requiring documentation of AI-influenced decisions. Sector regulators — the FDA for medical devices, banking regulators for credit decisions, the SEC for trading systems — have all begun asking harder questions about how AI outputs get reviewed and recorded. The pressure is sectoral even where federal law is thin.

And in Europe, the first enforcement actions under the AI Act are expected to begin in earnest in late 2026. Organizations that have been treating compliance as a document exercise are about to find out whether their documents correspond to anything real.


The Fire Drill Comparison

Bouzoukas draws a parallel to cybersecurity, where NIST guidance has long emphasized that tabletop exercises and drills surface gaps before incidents occur. That parallel is worth unpacking a bit further, because it says something important about why AI governance has failed to mature as quickly as cybersecurity governance did.

Fire drills work because the thing they test is visible and felt. You pull the alarm, people move, you see who gets stuck at the wrong exit, who didn't know the meeting point, which door was wedged open when it shouldn't have been. The feedback is immediate and concrete.

AI governance failures are the opposite. They're invisible until they're not. The AI made a consequential decision. Nobody noticed it was consequential at the time. Records weren't kept. The human who was nominally responsible doesn't remember approving it. Three months later, when something goes wrong downstream, the trail is cold. There's no way to reconstruct what happened, which means there's no way to learn from it, defend against it, or demonstrate oversight.

What proof drills do is create the feedback loop that the invisible nature of AI decisions normally prevents. You pick a recent decision and try to reconstruct it now, while everything is still fresh. You find out while the gap is fixable rather than while a regulator is waiting for an answer.

In my view, this is actually a significant shift in how organizations should think about AI governance maturity. The question is not "do we have the right policies?" It's "can we pass our own drill?" Those are different standards, and the second one is harder to fake.


What Your Organization Is Probably Getting Wrong

I want to say something direct here, because I think the current state of AI governance in most organizations is worse than the people inside those organizations realize.

Most AI governance programs were built to satisfy internal stakeholders and, secondarily, to have something to show to external audiences — boards, auditors, potential partners. They were built around documentation: policies, checklists, risk registers, responsible use principles. This documentation serves a real purpose. It's not nothing. But it was designed to describe a governance posture, not to prove one.

The problem is that the gap between described governance and operational governance is usually invisible until someone applies a test. Nobody notices that the logging setup doesn't actually capture the model version. Nobody notices that the "human review" step leaves no record. Nobody notices that the person listed as accountable doesn't actually remember being involved. These gaps are quiet right up until they aren't.

A few specific failure modes tend to show up repeatedly. First, accountability without records: organizations have review steps but no record-keeping at the review step, so the oversight existed in the moment but nowhere in writing. Second, version drift without documentation: the AI system was updated between the decision in question and the current day, and nobody kept a system-in-force record, so it's impossible to know what version was running. Third, log fragmentation: records exist but are scattered across five different systems, none of which are indexed in a way that makes retrieval feasible in 72 hours without heroic effort. Fourth, vendor dependency: the organization relies on an AI vendor whose logs are accessible only through a support ticket process that takes weeks.

None of these are exotic failure modes. All of them are common. And all of them are the kind of thing that shows up only when someone actually tries to reconstruct a specific decision from actual records.


The Question of Who Controls the Audit Frame

There's something deeper going on here that I think Bouzoukas gestures at without fully naming, and it connects to a broader question about AI and power.

When an organization performs its own proof drill — picks a case, tries to reconstruct it, finds the gaps, fixes them before anyone external asks — it is exercising control over its own audit frame. It is deciding what gets tested, when, at what pace, and how the results are used. That's a form of organizational sovereignty over the accountability process.

When an external examiner asks for the same thing — when a regulator requests a case file for a specific decision on a specific date — the organization no longer controls the frame. The examiner chooses the case. The examiner sets the timeline. The examiner decides whether the records are sufficient. If the organization hasn't already practiced this drill, it is experiencing the accountability process for the first time under adversarial conditions.

This is where I think the proof drill concept connects to a larger theme about AI governance: the question of who actually holds authority over AI systems is not settled by who deploys them. It's settled by who can compel an account of what they did. An organization that cannot reconstruct its own AI-influenced decisions on demand has, in a real sense, lost a portion of its own institutional authority. It cannot tell the story of what happened. Someone else will have to tell it for them — or conclude that no story can be told at all.

In my view, this is one of the underappreciated dimensions of AI's power shift. AI is not just changing what organizations can do. It's changing who can ask questions about what organizations have done, and on what terms. Proof drills are a way of getting ahead of that shift rather than being overtaken by it.


Is This a Technical Problem or an Institutional One?

There's a tempting framing here — that proof drills are essentially a logging and observability problem, something your IT team can fix by deploying better monitoring tools and setting appropriate retention policies. I think that framing is partially right and mostly wrong.

The technical pieces are real. You do need logging. You do need version control for deployed AI systems. You do need a way to retrieve records quickly. Those are genuine technical requirements and they take genuine technical effort to implement.

But the deeper failure is almost always institutional. The accountability record — the named owner, the documented review, the rationale — is not a technical gap. It's a cultural one. Organizations that lack this kind of documentation lack it because nobody decided it was required, not because no tool was available. The human review step doesn't leave a record because nobody asked for one. The accountability is assumed rather than captured, because capturing it would slow things down, and nobody was watching.

This is why proof drills need to come from leadership rather than from IT. The technical team can build the logging. But the decision that AI-influenced decisions require documented human review, that version changes require approval records, that accountability is named rather than diffuse — these are institutional decisions. They require someone with authority to say: we are going to govern this seriously enough to be able to prove it.

That's not a small ask. It adds friction to workflows that were often specifically adopted to reduce friction. But it's the only way the governance is real rather than performed.


How to Start

If you're a business leader reading this and you want to take the proof drill concept seriously without turning it into a massive initiative, here is a modest way to begin.

Pick one AI-influenced workflow — something where your organization uses an AI system to prioritize, route, score, classify, or recommend, and where a human acts on that output. It doesn't have to be your most consequential workflow. Just one.

Then try to reconstruct one recent decision from that workflow. Not from memory — from records. Give yourself 72 hours. See what you find.

What you're likely to discover is that you can reconstruct some of it but not all of it. You'll find the gap. And the gap — wherever it is — is exactly where your governance program needs to do more work. Maybe it's logging. Maybe it's version documentation. Maybe it's a review step that happens informally and leaves no trace. The drill will tell you.
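
If you want to make the exercise repeatable, a small script can do the bookkeeping. The sketch below assumes nothing about where your records live: each lookup is a function you supply that tries to fetch one of the four pieces for a given decision, and the output is simply a list of what could and couldn't be found.

```python
from datetime import timedelta

# The four records a case file needs, per Bouzoukas's framework.
REQUIRED_RECORDS = (
    "audit trail",
    "system-in-force record",
    "chain of reliance",
    "accountability record",
)

def run_proof_drill(decision_id, lookups, window=timedelta(hours=72)):
    """Attempt to reconstruct one decision and report the gaps.

    `lookups` maps each record name to a callable that fetches it from
    wherever that record actually lives (a log store, a deployment
    registry, a ticketing system, a vendor API). This is a sketch of
    the exercise, not a compliance tool.
    """
    reconstructed, gaps = [], []
    for name in REQUIRED_RECORDS:
        fetch = lookups.get(name)
        record = fetch(decision_id) if fetch is not None else None
        (reconstructed if record is not None else gaps).append(name)
    return {
        "decision_id": decision_id,
        "target_window": str(window),
        "reconstructed": reconstructed,
        "gaps": gaps,
    }
```

A first run might wire in only the lookups you actually have, which is itself informative: every record type without a working lookup is, by definition, a gap.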

After you've done it once, you'll know roughly how long it takes and where the hard parts are. Then you can decide how often to repeat it, which workflows to include, and how to close the gaps you found. The first drill doesn't need to be comprehensive. It just needs to be honest.

Bouzoukas also offers some useful guardrails worth noting. Scope should be limited: apply the drill only when the AI model's output materially changed what would have happened by default — routing a different complaint, changing a priority score, flagging something for review that otherwise wouldn't have been. If the AI didn't actually change anything, the drill adds limited value. Confidentiality can be handled through redaction or controlled access while maintaining the integrity of the record. And the 72-hour window is a starting point, not a ceiling — some organizations will need longer at first, and that's fine. What matters is that there's a target, and that it's tested.


The Question This Opens

I want to end with something I find genuinely uncertain, because I think honest uncertainty is more useful here than a tidy conclusion.

The proof drill framework is compelling partly because it is practical. It doesn't require new regulation. It doesn't require waiting for the AI Act's enforcement regime to mature or for the US federal government to pass comprehensive AI legislation. It's something an organization can do right now, with the people and systems it already has. That's a real virtue.

But the harder question is whether organizations will do it without being compelled. Every gap that a proof drill surfaces is a gap that someone would have to close, and closing it costs time, changes workflows, and sometimes reveals that decisions that were supposed to have human oversight essentially didn't. That's uncomfortable in ways that go beyond the technical.

There's a version of this where proof drills become genuine practice — where organizations regularly test their own governance posture, find and fix gaps, and develop the kind of operational readiness that makes external scrutiny less frightening because it's already been internally rehearsed. That would represent real maturity in AI governance.

And there's another version where the concept gets absorbed into the documentation layer — where organizations create a "proof drill policy," run a single drill, write up the results, file them somewhere, and move on. Which would be, in its own way, a performance of governance rather than governance itself.

What determines which version happens is not regulation. It's whether the people inside organizations decide that being able to prove their oversight is worth the work. That's an institutional question, and I don't think the answer is inevitable either way.

But at least now we have a clearer name for what we're asking for.


This piece draws on Kostakis Bouzoukas, "Effective AI Oversight Through Proof Drills," The Regulatory Review, March 31, 2026. For more on how AI reshapes institutional accountability and authority, explore AI Is Not a Tool Shift and Transparency Theater in AI Governance on Prepare for AI.


Last updated: 2026-04-01


Jared Clark

Founder, Prepare for AI

Jared Clark is the founder of Prepare for AI, a thought leadership platform exploring how AI transforms institutions, work, and society.