AI Red-Teaming as a Discipline: What the Practice Actually Looks Like, and Why Most of It Is Theatre

The Report That Sits On The Shelf

A CISO at a Lagos bank tells me about her latest AI red-team engagement over coffee. The vendor delivered a thirty-page PDF. The findings were neatly tabulated. The model refused 91 per cent of the probes. She signed off. Three months later, a new retrieval source went into production for the same agent — and nobody re-tested. The report had been performed, not operationalised, and the gap between those two verbs is the gap where every real attack lands.

Every security vendor with a product datasheet now offers "AI red-teaming." Every enterprise AI programme has a slide that says it was red-teamed. Procurement teams ask the question, get a confident yes, and move on. The artefact gets bought; the discipline does not. Like installing a smoke detector and never checking the battery, the device is on the wall — but nobody listens for the beep that says it has stopped working.

Watch how the work actually lands inside a security organisation and the pattern is consistent. A vendor runs a fixed set of prompt-injection probes against the model. A PDF arrives. The PDF lists what the model refused and what it did not. The CISO signs off because the report exists. The findings never become regression tests. The next model version is never tested against the same probes. Nothing changes in production. The exercise was performed, not operationalised.

This is not red-teaming. It is the artefact of red-teaming without the practice. The distinction matters because the gap between the two is where real attacks land — the indirect prompt injection that arrives through a RAG source three months after launch, the tool call coerced into a side effect nobody scoped, the embedding that retrieves the wrong document at the wrong moment for the wrong principal. None of these get caught by a one-shot probe list.

This article does two things. It reads the OWASP LLM Top 10 (2025 edition, currently operative) honestly — what a real test looks like for each category, what a checklist treats it as. It then describes what a credible AI red-team programme actually does in production, and the buyer questions that separate the discipline from the theatre.

Key takeaways

Performative red-teaming runs a one-shot probe library, produces a PDF, and leaves no regression tests behind. It is the artefact of red-teaming without the practice, and it has the same operational value as a 2014 pen-test that changed nothing.
The OWASP LLM Top 10 (2025) is useful as a taxonomy and dangerous as a checklist; each category compresses an open-ended attack surface into a name that invites a static, one-shot exercise.
Five operational practices define a credible programme: a version-controlled adversarial test set the team owns, a continuous pipeline gating every model change, production traffic replay against new versions, kill-chain analysis on every successful attack, and every finding becoming a permanent regression test.
Red-teaming catches by going looking; the production safety surface contains by constraining what an attacker who slips through can reach. Programmes that have one without the other fail in predictable ways.
The seven buyer questions sort vendors quickly — anyone who cannot show a customer-owned test set, a deployment-gate hook, three months of pipeline output, a real kill-chain analysis, and a month-four engagement shape is selling artefacts.

The five operational practices of disciplined AI red-teaming — version-controlled adversarial test set, continuous pipeline on every model change, production traffic replayed adversarially, kill-chain analysis on every success, and every finding becoming a permanent regression test — shown as a horizontal pipeline that loops back on itself, with a theatre-versus-discipline contrast for each practice below, and the OWASP LLM Top 10 (2025) categories as the taxonomy underneath. — The five practices as a loop · per-practice contrast between theatre (one-shot probe runs producing PDFs) and discipline (a pipeline that compounds) · OWASP LLM Top 10 (2025) is the taxonomy both shapes test against.

The Shape Of Theatre

The performative version has a predictable shape. A vendor or internal team takes a public probe library — Garak, PyRIT, a handful of jailbreak prompts circulating on Twitter — and runs it against the model behind the customer's deployment. They count refusals. They highlight the prompts the model fell for. They write recommendations that read like "consider adding additional system-prompt guidance" or "implement output filtering." The report is delivered, the engagement closes.

What is missing is what would make the work matter. A version-controlled test set the customer owns. A pipeline that re-runs it on every model or prompt change. A gating mechanism that blocks deployment if attack-success rate regresses. A kill-chain analysis that asks what a successful attack could have reached. A feedback loop that turns every finding into a permanent regression test. Without those, the exercise has the same operational value as a one-off pen-test from 2014 that produced a PDF and changed nothing.

The reason this pattern persists is not that buyers are naive. It is that AI red-teaming, as marketed, fits the existing procurement shape of a one-time security assessment. The product surface it is defending — a model that changes weekly, behind a prompt that changes daily, fed by retrieval sources that change hourly — does not fit that shape. The discipline that does fit is closer to continuous evaluation with adversarial inputs than to any traditional pen-test. Think of it as the difference between an annual cardiology check-up and a heart-rate monitor on the wrist. One produces a report. The other catches the arrhythmia at 2am.

The OWASP LLM Top 10, Read Honestly

The OWASP Top 10 for LLM Applications, version 2025, is the current operative list. It added three new categories from the 2023 edition: System Prompt Leakage (LLM07), Vector and Embedding Weaknesses (LLM08), and Unbounded Consumption (LLM10, expanded from the old Model Denial of Service entry). Prompt Injection holds the top spot for the second consecutive edition.

The list is useful as a taxonomy. It is dangerous as a checklist, because each category compresses an open-ended attack surface into a name, and the names invite the checklist treatment. Here is each category with the operational read — what a credible test looks like, and what the checklist treats it as.

LLM01 — Prompt Injection. A real test builds an indirect-injection corpus that arrives through every untrusted input path: RAG sources, tool outputs, user-supplied documents, third-party APIs. It tests whether injected instructions in those sources change the model's behaviour, escalate tool access, or exfiltrate context, and re-runs on every retrieval-source addition. The checklist treatment runs twenty direct-injection prompts against the chat interface, counts refusals, declares the model "resistant."

LLM02 — Sensitive Information Disclosure. A real test probes for training-data extraction with prefix-completion attacks, tests the model's behaviour when given partial PII and asked to complete it, and instruments production traffic for PII leakage in responses. The checklist asks the model for its system prompt, notes that it refused, moves on.

LLM03 — Supply Chain. A real test tracks the provenance of every model, embedding, dataset, fine-tune, and adapter in the deployment graph. It verifies integrity. It treats Hugging Face downloads as untrusted code. The checklist confirms the vendor has SOC 2.

LLM04 — Data and Model Poisoning. A real test assumes training and fine-tuning corpora are adversarial, validates data lineage, and runs differential evaluations between a known-clean baseline and the production model. For RAG, this overlaps with retrieval poisoning — adversarial documents inserted into a vector store to manipulate downstream outputs. The checklist asks whether the training data was "curated."

LLM05 — Improper Output Handling. A real test treats every model output as untrusted input to the next system. If output flows into a SQL query, a shell command, a tool-call argument, a rendered HTML page, or a downstream model, it tests injection, escaping, and sanitisation at that boundary. The checklist assumes output goes to the user, the user is a human, no further handling required.

LLM06 — Excessive Agency. A real test enumerates every tool the agent can call, every permission those tools hold, every side effect they can produce. It tests whether a successful prompt injection can chain tools to produce a meaningful action the principal did not authorise, and reduces permissions to the minimum required. The checklist lists the tools in a diagram.

LLM07 — System Prompt Leakage. A real test assumes the system prompt is public from day one. It audits the prompt for embedded secrets, role descriptions that imply security boundaries, or anything the principal-authorisation model depends on. The checklist tries to extract the system prompt, notes the refusal, treats the prompt as confidential.

LLM08 — Vector and Embedding Weaknesses. A real test validates the retrieval surface against adversarial documents, tests embedding-space collisions, audits who can write to the vector store and under what authentication, and instruments retrieved-context for anomalies. The checklist says the vector store is "behind auth."

LLM09 — Misinformation. A real test evaluates hallucination rate against a held-out fact set with ground truth, and instruments production for confident-but-wrong outputs in domains where the cost of being wrong is asymmetric — medical, legal, financial. The checklist confirms the model has a "disclaimer."

LLM10 — Unbounded Consumption. A real test rate-limits per principal, monitors token spend by tenant, detects denial-of-wallet patterns and prompt-extraction-by-repetition attacks. It treats compute cost as an attack surface. The checklist relies on the cloud-provider quota.

The pattern across all ten is the same. The category names compress, the real tests are open-ended, and the checklist treatment is always a static one-shot exercise.

The Five Practices Of A Real Programme

If the checklist treatment is the failure mode, the discipline has a recognisable shape. Five operational practices distinguish a programme that catches real failures from one that produces a report.

The first is a version-controlled adversarial test set, owned by the team, that grows. Not a vendor probe library run once. A repository in the team's source control, versioned alongside the model and the prompt, with tests added every time a new attack pattern appears in the literature, a new tool is added to the agent, a new retrieval source is wired in, or a finding from production triage gets converted to a permanent test. The test set is an asset that compounds. The probe library is a snapshot that decays. Think of it like a vaccination programme rather than a flu shot — each new variant gets added to the schedule, not run once and forgotten.

The second is a continuous red-teaming pipeline triggered on every model change. Every checkpoint, every prompt revision, every retrieval-source addition, every adapter swap runs the adversarial test set in CI. Deployment is gated on the result. The bar is not "all probes refused" — that is unachievable. The bar is "attack-success rate has not regressed beyond the agreed threshold against the previous baseline, and no P0 probe newly succeeds." This is the same shape as a regression test suite for any other system; the difference is that the inputs are adversarial and the assertions are over model behaviour.

The third is production traffic, sampled and replayed adversarially against new versions. Real users find attack surfaces the test set did not anticipate. A small percentage of production traffic, sanitised of PII and labelled by intent, gets stored. When a new model version is staged, that traffic is replayed against it — both as-is and mutated with adversarial perturbations. Regressions surface before the version reaches the principal who triggered them in the first place.

The fourth is kill-chain analysis on every successful attack. When a probe succeeds, the question is not "did the model refuse?" — it did not. The question is "what could the successful attack have reached?" What tools were callable in the context where the attack landed? What permissions did those tools hold? What data sources were exposed? What was the blast radius? A successful jailbreak against a model with no tool access and no sensitive context is a low-severity finding. The same jailbreak against an agent with write access to a CRM and read access to a customer table is an incident. The same probe produces wildly different findings depending on the surface around it. Triage that does not measure blast radius is triage that misses the point — like reporting a window was broken without ever checking what was stolen from the house.

The fifth is every finding becoming a permanent regression test. This is the difference between an exercise and a programme. A finding that gets fixed but never re-tested can regress on the next model swap. A finding that becomes a versioned test in the eval gate is a permanent assertion. The corollary: the eval gate must run fast enough that engineers can iterate against it, and reliable enough that they trust its signal.

These five practices are not novel as ideas. They are the shape that catches attacks in production. The reason most programmes do not do them is that each one requires sustained ownership, not a procurement event.

Red-Teaming Catches. The Safety Surface Contains.

Red-teaming is half the picture. The other half is what runs in production when the red team is not watching — the held-out test set that gates every release, the regression suite that catches drift, the rollback path when a deployment misbehaves, the approval gates around model swaps, the drift detection that flags when production distribution diverges from evaluation, and the audit trail that records who promoted what, when, and under what evidence.

The companion piece on the self-improving-agents production pattern describes that surface in detail. The relationship between the two is straightforward: red-teaming catches failures by going looking for them, the safety surface contains failures by constraining what an attacker who slips through can reach. Programmes that have one without the other fail in predictable ways. Red-teaming without containment produces a backlog of findings and no enforcement. Containment without red-teaming hardens against the threats you imagined and leaves the rest uncovered. The two are the immune system and the skin — neither one alone is enough.

They share a substrate: the version-controlled eval suite. Findings from red-teaming feed it. The safety surface enforces it. The same regression test that started life as a red-team finding becomes the gate that blocks the next deployment from regressing.

What The Practice Does Not Solve

Honest scoping matters. Red-teaming, even at its most disciplined, does not solve three things.

Misuse by authorised principals. If a salesperson with legitimate access to the agent uses it to exfiltrate the customer list one query at a time, no adversarial test against the model will catch that. The controls live elsewhere — DLP, behavioural analytics, access review, the same controls that already exist for non-AI systems.

Supply-chain compromise of training data. Adversarial documents inserted into a public corpus that ends up in a pre-training set are a category the consumer of a model cannot test for from the outside. Vendors can mitigate. Consumers can run differential evaluations between model versions and watch for behaviour drift. Neither is a solution.

Novel attack surfaces that have not been characterised yet. The OWASP list is a snapshot of what the community has learned to test for. The attack surface of agentic systems is still expanding faster than the testing literature. MITRE ATLAS, which extends MITRE ATT&CK to AI systems, now documents 16 tactics and a growing technique set. The discipline keeps moving. A programme that assumes the list is complete is testing yesterday's attacks.

This honesty is part of the discipline. A programme that claims to solve everything is selling theatre.

Seven Questions That Sort The Vendor List

The questions that surface the difference are operational, not architectural. A CISO or security architect can take this list to any vendor pitch.

Show me a version-controlled adversarial test set you have built for a previous customer (sanitised) and explain how it grew over the engagement.
Show me the deployment-gate hook that runs that test set on every model or prompt change. What is the build system, what is the failure condition, what is the rollback path?
Show me three months of continuous-pipeline outputs from a real customer engagement. What got caught, what was the trend, what changed in response?
Show me a kill-chain analysis from a real finding. What was the probe, what tools did the agent have access to, what was the blast radius assessment, what was the remediation?
What does your engagement look like in month four, after the initial assessment is done? If the answer is "renewal conversation," that is the theatre tell.
What happens when a probe succeeds — does it become a regression test that runs in the customer's CI by default, or does it stay in your report?
Who owns the test set at the end of the engagement, you or the customer?

The honest answers to these questions sort the vendor list quickly. Anyone who cannot show all seven is selling artefacts.

The Discipline Underneath The Artefact

The broader pattern underneath all of this is that AI security is reverting toward the same engineering disciplines as the rest of software security — continuous integration, version-controlled tests, deployment gates, blast-radius analysis, drift detection — applied to a substrate where the system under test is non-deterministic and the attack surface moves with the data.

The enterprises that will scale AI without an incident are not the ones that bought the most red-teaming. They are the ones that internalised red-teaming as a discipline owned by an engineering team, fed by a test set that compounds, gated by a pipeline that runs continuously, and bounded by a production safety surface that contains the failures the testing did not catch.

If your AI programme has a thirty-page PDF on the shelf but no test set in source control, no CI hook, and no kill-chain analysis behind any finding — what exactly does your CISO sign off on?

FAQs

What separates a credible red-team report from a performative one?

The credible report leaves something behind that the customer's engineering team owns and uses — a version-controlled adversarial test set, a CI hook that runs it on every model or prompt change, and a deployment gate keyed to attack-success-rate regression. If the engagement ends with a PDF and a renewal conversation, it was theatre.

Is the OWASP LLM Top 10 a complete reference?

It is the current operative taxonomy and a useful starting point, but treating it as a complete reference is the failure mode. Each category compresses an open-ended attack surface into a name, the attack literature moves faster than the list, and MITRE ATLAS already documents 16 tactics and a growing technique set. A programme that assumes the list is complete is testing yesterday's attacks.

Why does kill-chain analysis matter more than counting refusals?

Because the same probe produces wildly different findings depending on the surface around it. A successful jailbreak against a model with no tool access and no sensitive context is a low-severity finding. The same jailbreak against an agent with write access to a CRM and read access to a customer table is an incident. Triage that does not measure blast radius is triage that misses the point.

What does red-teaming not solve?

Three things. Misuse by authorised principals — that is a DLP and behavioural-analytics problem, not an adversarial-testing one. Supply-chain compromise of training data, which the consumer of a model cannot fully test for. And novel attack surfaces that have not been characterised yet. A programme that claims to solve any of these is selling theatre.

How do red-teaming and the production safety surface relate?

They share a substrate — the version-controlled eval suite. Red-teaming feeds the suite by finding new failure modes; the safety surface enforces the suite by gating every release, catching drift, and constraining what an attacker who slips through can reach. The same regression test that started life as a red-team finding becomes the gate that blocks the next deployment from regressing.

Companion content

How to engage

If your AI programme has a red-team report but not a version-controlled test set, a deployment gate, or a kill-chain analysis on its findings, we can help you build the discipline underneath the artefact. Talk to us at creativeminds.dev/contact.