Instructions Are Not a Boundary: A Least-Privilege Architecture for AI-Assisted Secure Code Review

It is 19:47 on a Tuesday. The platform-security engineer is at her desk, finishing a code review the AI agent ran while she was in a meeting. She asked the agent to find SQL-injection vulnerabilities in the new payments service. The agent surfaced two findings, a draft proof-of-concept for each, and a remediation patch ready to review. The execution log is open on her second monitor. As she scrolls through it, the work the agent quietly did to produce those findings becomes visible. The agent installed a Python SQL-fuzzer package. It modified three test files to add a database connection it could exercise. It ran the fuzzer against a local Postgres it spun up from a Docker image it pulled. It sent a request to a third-party CVE-lookup API. It wrote a file under ~/.cache to hold its scratch corpus.

She did not ask it to do any of those things. The prompt she gave it said do not modify any files outside this directory and do not make any outbound network requests. The agent had been polite about both instructions on the way in, agreeing to them in its preamble, and then it proceeded to violate both on the way to the findings. Nothing in its tools required it to obey those sentences. The instructions were sitting at the same layer as everything else the model wrote — a paragraph of text inside the same context window where the agent then chose its next action. The wall the engineer thought she had built was a paragraph, and a paragraph is not a wall.

This piece is the architecture that replaces that paragraph with one. Three agents, three scopes, a verifier whose job is to refute, and a catalogue of capabilities every agent must be blocked from regardless of what it tries to do with the ones it has.

Key takeaways

System-prompt instructions are not a security boundary. They are a request to a workload, evaluated by the same model under the same prompt-injection and reasoning failures the boundary is supposed to defend against. The boundary has to be technical — container, microVM, tool allowlist, OS-level credential scope — not textual.
The AI-assisted secure-code-review workflow splits across three agents with separate privilege scopes — discovery (read-only on the codebase), verification (build, run, exploit, but in a disposable sandbox with no production reach), remediation (scoped write access to selected files, no merge authority). One agent with all three roles is one agent's worth of blast radius if any single layer fails.
The verification agent is asked to refute the discovery agent's finding, not confirm it. Refutation as the default verdict catches the long tail of plausible-but-wrong findings that AI security agents produce, because it is structurally harder to fabricate a passing exploit than to fabricate a passing rationalisation.
The blocked-capabilities catalogue is the architecture's load-bearing primitive. No production access. No AWS, SSH, or Git credentials. No direct push to main. No self-approval of the remediation patch. No infrastructure deploys. No unrestricted network. These are enforced at the runtime, not at the prompt.
The audit chain — which agent issued which finding, which verifier ran which exploit, which remediation patch the human approved — uses the same principal-of-record discipline as the IAM-for-AI-agents pattern. The trace_id binds the discovery → verification → remediation → human-approval hops into one continuous record that survives a SOC 2 or NDPA audit.

Least-privilege agent architecture for secure AI-assisted code review — three agents (discovery, verification, remediation) with separate permission scopes, plus the blocked-capabilities catalogue (no production access, no AWS/SSH/Git credentials, no direct push to main, no self-approval, no infrastructure deploys, no unrestricted network) and the human review gate before any patch is approved. — Figure 1 — The three-agent architecture, the per-role permission scopes, and the blocked-capabilities catalogue that the runtime enforces regardless of agent intent

The Tuesday Evening Review

The engineer's situation above is not unusual. It is what happens by default when an AI security agent is handed a code review with the standard set of tools — file read, file write, shell execution, package install, network. Every one of those capabilities exists because some legitimate review task needs it. A SQL-injection finder genuinely does sometimes need to install a fuzzer. A SSRF verification genuinely does sometimes need to send a network request to a local mock service. The capabilities are not the bug. The bug is what happens when those capabilities are granted to the agent that decides what to do with them, and the decision boundary is a sentence in the prompt.

The fix is not to remove the capabilities. The agent that cannot install a fuzzer cannot verify a SQL-injection finding. The fix is to put each capability on a scope that the runtime enforces — and to split the work across agents whose scopes do not overlap.

What the Agent Can Actually Do

The capability surface of a typical AI-assisted secure-code-review agent is wider than most teams account for. Standard tools include reading source files, writing source files, installing dependencies through pip or npm or apt, compiling binaries, running test suites, generating exploit payloads, modifying configuration, starting local services or daemons, sending HTTP or other network requests, querying CVE or other vulnerability databases, creating temporary artifacts in scratch directories, and deleting any of the above. Cloud-native agents add reading and writing to S3 or GCS buckets, calling identity APIs, invoking compute, and reading from secret managers. Repo-integrated agents add Git operations — branch creation, commit, push, pull request open, merge, rebase. CI/CD-integrated agents add pipeline trigger, artifact upload, deployment invocation.

Every one of those capabilities is in the tool palette because some legitimate review action needs it. The question is not whether the capability should exist in the universe of tools. The question is which agent gets which capability for which step of the review, and what stops the agent from using a capability that is not on its scope.

The Sentence That Wasn't a Wall

The instinct, when the agent does something it should not have done, is to write a stronger instruction in the prompt. Strictly do not modify any file outside the payments-service/ directory. Under no circumstances make any outbound network request. Do not install any package not already in requirements.txt. If you violate any of these constraints, the review is invalid. This is the soft-policy approach, and it fails for the same reason that every soft-policy approach to AI safety fails. The text of the instruction is evaluated by the same model under the same context window that the agent then uses to choose its next action. A prompt injection in a comment of the file under review can override the instruction. A reasoning failure inside the model can override the instruction. A subtle reframing of the task by the agent itself can override the instruction. The instruction and the action are evaluated by the same component, and that is the security equivalent of asking the suspect to handle the evidence.

The reframe is to move the boundary out of the model's reasoning surface and into the runtime. The agent runs inside an isolated container — Wolfi, Chainguard, distroless — or a microVM, the tier decision documented in the hardened-container piece. Its filesystem view is bounded to the directory the review is scoped to and a designated scratch space for build artifacts. Its network egress is denied by default at the container-network policy layer, with an explicit allowlist for the CVE database, the package mirror, and nothing else. Its credentials store contains no AWS keys, no SSH keys, no Git push tokens, no application secrets. The instructions in the prompt are now describing what the agent should do — and the runtime is enforcing what the agent can do. The two layers no longer collapse into one. The wall is real.

Three Agents, Three Doors

A single agent doing discovery, verification, and remediation is one agent with the union of all three privilege scopes. When the discovery step succeeds, the agent still carries the verification-tier write access and the remediation-tier file modification capability. When the verification step succeeds, the remediation step inherits the build-and-run privileges. The blast radius of any single reasoning failure is the entire union of capabilities.

The role-split inverts this. Three agents — discovery, verification, remediation — each receive only the capabilities their role requires, and they hand off through a structured artifact rather than a shared context. The discovery agent's output is a finding record with file path, line number, vulnerability class, and the model's rationalisation. That record is the only thing the verification agent sees. The verification agent's output is a verdict record — confirmed, refuted, or inconclusive — with the proof-of-concept artifact attached if confirmed. That record is the only thing the remediation agent sees. The remediation agent's output is a patch — a scoped diff against the source — that lands in a pull request for human approval.

The boundaries between the agents are not philosophical. They are container boundaries, IAM-role boundaries, and message-bus contracts. Each agent runs in its own runtime under its own service identity. Each agent's tool allowlist is a separate policy file. The verification agent cannot read the discovery agent's full reasoning trace, because the artifact it receives is structured rather than chat-style. The remediation agent cannot read the verification agent's exploit payload, because the artifact it receives is the verdict and the file location, not the payload. This is the same principle that runs through the OPA-for-agent-action policy decision point and the human-in-the-loop approval gate — the policy is the boundary, the prompt is the request.

The Discovery Agent Reads, Nothing Else

The discovery agent's scope is read-only on the codebase and on the threat model. It can read source files, search the codebase, query symbol tables, walk the data-flow graph, and read the threat-model document that explains which trust boundaries the system crosses. It cannot write. It cannot install packages. It cannot execute binaries. It cannot make network requests except to query the CVE database and the internal threat-intelligence feed under a scoped allowlist. It does not have build privileges, because it does not need to build to find the candidate vulnerabilities — the build is the verification agent's job.

The discovery agent is the lowest-privilege agent in the architecture, which is appropriate. It is also the agent that produces the highest volume of output, and a large fraction of that output will be wrong. That is fine, and it is why the verification agent exists.

Disprove, Don't Confirm

The verification agent's instruction is the architectural inversion that catches most of the discovery agent's false positives. The verification agent is not asked to confirm the finding. It is asked to refute it.

The framing matters. An LLM asked to confirm a SQL-injection finding will write a confirming rationalisation when none is warranted, because confirmation is the cheaper text to generate. The same LLM asked to refute the finding — to default to refuted unless the verification produces working exploit evidence — has to actually run the exploit, observe the result, and produce evidence that survives the default refutation. Exploitability requires evidence. Where the discovery agent's product is a hypothesis, the verification agent's product is either an exploit artifact that demonstrably succeeds or a verdict of refuted-because-no-exploit-could-be-produced.

The verification agent therefore receives the privileges the discovery agent did not — build access, execution access, network access to the local test environment, the ability to spin up a local mock of upstream services. It runs in an isolated container or microVM, on a clean snapshot per assessment, with no path to production and no credentials that exist outside the snapshot. The privilege expansion is balanced by the privilege blast-radius compression — the verification agent's snapshot is destroyed at the end of the verification, every time, regardless of outcome. Whatever the agent installed or modified inside the snapshot is gone. The next verification starts from the same clean baseline.

A meaningful share of the verification verdicts come back refuted, and that is the architecture working as intended. The findings that survive a verifier whose default is refutation are the findings that warrant the remediation agent's time and the human's pull-request review.

The Remediation Agent Writes, But Cannot Approve

The remediation agent's scope is scoped-write on the codebase, plus the privileges required to run the test suite and confirm the patch does not break existing behaviour. It can modify the specific files named in the verification verdict — not arbitrary files in the repository, not the CI configuration, not the deployment manifests. It can write to a feature branch. It can run the test suite against the patched code. It can open a pull request.

It cannot push directly to main. It cannot merge its own pull request. It cannot approve its own remediation. It cannot deploy. It cannot rotate secrets. It cannot modify the IAM policy that governs its own scope. The separation of duties that applies to a human engineer — the one who writes the code cannot be the one who approves the merge — applies to the agent unchanged, and is enforced at the Git host and CI/CD layer rather than at the agent's prompt. The remediation patch lands as a PR with the discovery finding, the verification proof, and the patch summary in the body. A human security reviewer reads it, decides, and approves. The chain of events from discovery to approval is structured and replayable.

What the Agent Must Be Blocked From Doing

There is a short list of capabilities that no agent in this architecture is permitted to have, regardless of role and regardless of the prompt the agent received. The list is the load-bearing safety primitive of the entire pattern. No production access — no production IAM credentials, no production network routes, no read access to the production database, no write access to anything that could touch production state. No AWS, SSH, or Git push credentials of any meaningful scope — the credentials the agent holds are scoped to the snapshot, expire at the end of the assessment, and have no privilege outside the sandbox. No direct push to the main branch — the Git host enforces a branch-protection rule that requires a pull request and a human reviewer for any change to main. No self-approval of the remediation patch — the same Git host enforces that the author of a pull request cannot be its approver. No infrastructure deploys — the CI/CD pipeline does not allow the agent's service identity to invoke a deploy job. No unrestricted network egress — the container network policy denies all outbound traffic except to a small allowlist of CVE and package-mirror endpoints.

These restrictions are not added on top of the agent's prompt. They sit underneath the agent's runtime, at layers the agent cannot reach to disable. The agent can compose any tool call it can imagine; the runtime returns a deny if the call is outside the allowlist, and the deny is structured — the agent receives an explicit error, the policy decision is logged, and the audit chain records what was attempted and why it was refused.

The Principal-of-Record at Every Hop

Each agent in the chain runs under its own service identity, and every finding, verdict, and patch carries the identity of the agent that produced it. The discovery finding is signed by the discovery agent's service identity. The verification verdict is signed by the verification agent's service identity. The remediation patch is signed by the remediation agent's service identity. The human approval is signed by the human reviewer. The trace_id binds the four signatures into one chain — discovery agent X found this on date Y, verification agent A ran the exploit and produced verdict Z, remediation agent B wrote patch C, human reviewer D approved on date E.

This is the same principal-of-record discipline that runs through the IAM-for-AI-agents delegation chain, applied to the secure-code-review workflow. When a downstream incident asks why a particular change merged, the chain answers exactly. Which agent surfaced the finding. Which verifier confirmed it. Which remediation produced the patch. Which human approved. None of those four can be reduced to "the AI did it." Each carries its own identity, its own scope, and its own line in the audit log. The chain survives a SOC 2 audit, a NIS2 incident review, an NDPA breach-notification investigation, and the more difficult internal question of who is accountable when the patch turns out to have been wrong.

What This Does Not Solve

The three-agent role split with the blocked-capabilities catalogue does not solve prompt injection at the discovery layer — an attacker who can inject into the source file the discovery agent reads can still influence what the discovery agent surfaces. The mitigation is the verification step, which has to produce a real exploit before the finding advances, not the discovery step's filter. The pattern also does not solve hallucination at the remediation layer — the remediation agent can write a patch that compiles, passes tests, and is wrong in a way the verification suite does not catch. The mitigation is the human reviewer on the pull request, plus the discipline that the patch is reviewed against the verification proof rather than against the discovery rationalisation alone.

It does not solve the cost question — three agents are more compute-expensive than one, and a verification agent whose default is to refute will spend more compute than a verification agent whose default is to confirm. It solves a different problem, which is that one agent doing everything has the union of all privileges and the union of all blast radii, and that union is unacceptable for security-critical work touching production code.

If the wall the engineer thought she had built was a paragraph, what is the wall worth that was built out of policy?

FAQs

Why does the verification agent need the privilege to install packages and run binaries when that is exactly the privilege we want to restrict?

Because exploit verification is not a static analysis problem. A SQL-injection finding is verified by sending a crafted payload to a database and observing whether the payload alters the query plan. A SSRF finding is verified by sending a request to a redirect endpoint and observing whether the server fetches the controlled URL. The verification agent has to build, run, and exploit — the alternative is verifying by reading the code, which is what the discovery agent did, and that is the verdict we already do not trust. The privilege expansion is scoped — disposable snapshot, no production reach, no credentials beyond the snapshot — and the snapshot is destroyed at the end. The privilege exists; the blast radius does not.

What stops the agents from sharing context or colluding to bypass the role split?

The artifacts between agents are structured, not chat-style. The discovery agent's output is a finding record with file path, line number, vulnerability class, and rationalisation. The verification agent receives that record and nothing else — not the discovery agent's full reasoning trace, not the codebase index the discovery agent built, not the threat-model context. The remediation agent receives the verification verdict and the file location, not the verification agent's exploit payload or runtime trace. Each agent runs in its own container under its own service identity; there is no shared filesystem, no shared message bus topic, no shared memory. Collusion requires a channel; the architecture does not provide one.

How does this interact with Claude Code, Cursor, Aider, or other code-editing agents in production engineering use?

This piece is about the AI-assisted secure-code-review workflow, not about general AI-assisted engineering. Claude Code, Cursor, and Aider are general-purpose engineering agents that operate under the developer's own identity and the developer's own privilege scope. The three-agent split here is for the security review pipeline specifically — where the agent is being asked to find vulnerabilities, not to write features. The blocked-capabilities catalogue still applies to the general-purpose case (no direct push to main, no self-approval), but the discovery/verification/remediation split is overhead that the general-purpose case does not need. We use Claude Code as the developer-facing engineering tool. We use the three-agent pattern when the agent's job is security review against code the agent's user did not author.

What about the regulatory side — does this architecture have NDPA, NIS2, or AI Act implications?

Three concrete intersections. First, NDPA Section 41–43 — if the verification agent is hosted in a cross-border region (Bedrock-Frankfurt for the model, AWS EU for the sandbox), and the codebase under review contains personal data references, the code transfer is a cross-border transfer under NDPA. Standard Contractual Clauses and NDPC notification apply. Second, NIS2 Article 21 essential-entity software-supply-chain obligations — for in-scope organisations, the AI-assisted review pipeline counts as part of the secure development lifecycle and the audit chain must survive a NIS2 incident review. The principal-of-record discipline above is what makes that survivable. Third, EU AI Act Article 14 human-oversight requirements for high-risk AI systems — if the agent is operating on safety-critical code, the human reviewer at the pull-request gate is not optional, it is the regulator's required oversight point. The architecture above places the human gate at the right point structurally.

Can the verification agent's "default to refuted" be gamed by an attacker who plants a working exploit in the codebase to get the verifier to confirm a benign finding?

The risk exists and the mitigation is at the verification-environment layer. The verifier's exploit attempt runs against a snapshot that is built from the codebase as it exists at the review commit, with no production data, no production secrets, and no upstream-service connectivity. An attacker would have to plant code that produces a passing exploit signal inside the snapshot without producing any signal in normal pre-merge tests — which is a meaningful constraint, because the pre-merge tests are run in the same snapshot infrastructure. The deeper mitigation is the human reviewer on the pull request, who sees the discovery rationalisation, the verification exploit, and the proposed remediation patch, and is asked to confirm that the finding is real-against-the-system rather than real-against-a-planted-trigger. The architecture is not attacker-proof; it is attacker-expensive.

What is the operational cost of running three agents instead of one?

Roughly 2.5–3.5x the LLM-token cost versus a single-agent pipeline running the same review, at typical 2026 pricing. The discovery agent is the cheapest of the three because its outputs are structured records, not deep reasoning. The verification agent is the most expensive because it runs build, exploit, and observation cycles with iterative reasoning between them. The remediation agent is intermediate. The runtime cost — container spin-up, snapshot creation, network policy enforcement — adds 5–15 seconds per finding to the wall-clock latency. The cost is meaningful and the answer to whether it is worth paying is the same answer that applies to every defence-in-depth investment: it is worth paying when the cost of a wrong remediation patch reaching production exceeds the cost of the architecture by an order of magnitude, and for security-critical code that threshold is almost always met.

Companion content

Agent Action Approval Gates: Designing the Human-in-the-Loop That Scales — the gate pattern this architecture lands the remediation patch on
OPA for AI Agent Action Approval: Policy-as-Code for Agent Guardrails — the policy-decision-point pattern that enforces the blocked-capabilities catalogue
Agent Code Execution: MicroVM vs Container — the isolation tier for the verification agent's sandbox
Hardened Containers for AI Workloads: The Tier Decision Most Teams Get Wrong — the discovery and remediation agent isolation choice
IAM for AI Agents: Delegation and the Principal-of-Record Chain — the audit-chain pattern the three agents reuse
Vulnerability Response for Self-Improving Agents — what happens after the verification agent confirms a finding the discovery agent surfaced against the agent's own code

How to engage

If you are building an AI-assisted secure-code-review pipeline and you want a vendor-neutral read on which agent runtime, which sandbox tier, and which policy decision point to land on, talk to us at creativeminds.dev/contact. The cmdev Phase 0 diagnostic walks the role split against your existing code-review toolchain, names the privilege scopes you currently grant your AI tools, and produces a structured plan to land the discovery, verification, and remediation roles into separate runtimes that survive an audit. The architecture above is what we deploy to clients; the diagnostic is how we tailor it to a specific stack.