Engineering

AI Red-Teaming as a Discipline: What the Practice Actually Looks Like, and Why Most of It Is Theatre

Samuel A.14 min read
AI Red-Teaming as a Discipline: What the Practice Actually Looks Like, and Why Most of It Is Theatre
Share
~21 min

The marketing line has eaten the discipline

Every security vendor with a product datasheet now offers "AI red-teaming." Every enterprise AI programme has a slide that says it was red-teamed. Procurement teams ask whether a vendor red-teams its models, get a confident yes, and move on.

The result, if you watch how the work actually lands inside a security organisation, is a pattern that has nothing to do with adversarial testing as a discipline. A vendor runs a fixed set of prompt-injection probes against the model. They produce a PDF. The PDF lists the probes the model refused and the probes it didn't. The customer's CISO signs off because the report exists. The findings don't become regression tests. The next model version isn't tested against the same probes. Nothing changes in production. The exercise was performed, not operationalised.

This is not red-teaming. It is the artefact of red-teaming without the practice. The distinction matters because the gap between the two is where real attacks land — the indirect prompt injection that arrives through a RAG source three months after launch, the tool-call that gets coerced into a side effect nobody scoped, the embedding that retrieves the wrong document at the wrong moment for the wrong principal. None of these get caught by a one-shot probe list.

This article does two things. First, it reads the OWASP LLM Top 10 (2025 edition, currently the operative version) honestly — what a real test looks like for each category, what a checklist treats it as. Second, it describes what a credible AI red-team programme actually does in production, and the buyer questions that separate the discipline from the theatre.

Key takeaways

  • Performative red-teaming runs a one-shot probe library, produces a PDF, and leaves no regression tests behind. It is the artefact of red-teaming without the practice, and it has the same operational value as a 2014 pen-test that changed nothing.
  • The OWASP LLM Top 10 (2025) is useful as a taxonomy and dangerous as a checklist; each category compresses an open-ended attack surface into a name that invites a static, one-shot exercise.
  • Five operational practices define a credible programme: a version-controlled adversarial test set the team owns, a continuous pipeline gating every model change, production traffic replay against new versions, kill-chain analysis on every successful attack, and every finding becoming a permanent regression test.
  • Red-teaming catches by going looking; the production safety surface contains by constraining what an attacker who slips through can reach. Programmes that have one without the other fail in predictable ways.
  • The seven buyer questions sort vendors quickly — anyone who cannot show a customer-owned test set, a deployment-gate hook, three months of pipeline output, a real kill-chain analysis, and a month-four engagement shape is selling artefacts.

What performative red-teaming looks like

The performative version has a predictable shape. A vendor or internal team takes a public probe library (Garak, PyRIT, a handful of jailbreak prompts circulating on Twitter) and runs it against the model behind the customer's deployment. They count refusals. They highlight the prompts the model fell for. They write recommendations that read like "consider adding additional system-prompt guidance" or "implement output filtering." The report is delivered, the engagement closes.

What's missing is what would make the work matter: a version-controlled test set the customer owns, a pipeline that re-runs it on every model or prompt change, a gating mechanism that blocks deployment if attack-success rate regresses, a kill-chain analysis that asks what a successful attack could have reached, and a feedback loop that turns every finding into a permanent regression test. Without those, the exercise has the same operational value as a one-off pen-test from 2014 that produced a PDF and changed nothing.

The reason this pattern persists isn't that buyers are naive. It's that AI red-teaming, as marketed, fits the existing procurement shape of a one-time security assessment. The product surface it's defending — a model that changes weekly, behind a prompt that changes daily, fed by retrieval sources that change hourly — does not fit that shape. The discipline that does fit is closer to continuous evaluation with adversarial inputs than to any traditional pen-test.

The OWASP LLM Top 10, read honestly

The OWASP Top 10 for LLM Applications, version 2025, is the current operative list. It added three new categories from the original 2023 edition: System Prompt Leakage (LLM07), Vector and Embedding Weaknesses (LLM08), and Unbounded Consumption (LLM10, expanded from the old Model Denial of Service entry). Prompt Injection holds the top spot for the second consecutive edition.

The list is useful as a taxonomy. It is dangerous as a checklist, because each category compresses an open-ended attack surface into a name, and the names invite the checklist treatment. Here is each category with the operational read: what a credible test looks like, and what the checklist treats it as.

LLM01 — Prompt Injection. Real test: build an indirect-injection corpus that arrives through every untrusted input path — RAG sources, tool outputs, user-supplied documents, third-party APIs. Test whether injected instructions in those sources change the model's behaviour, escalate tool access, or exfiltrate context. Re-run on every retrieval source addition. Checklist treatment: run twenty direct-injection prompts against the chat interface, count refusals, declare the model "resistant."

LLM02 — Sensitive Information Disclosure. Real test: probe for training-data extraction with prefix-completion attacks, test the model's behaviour when given partial PII and asked to complete it, instrument production traffic for PII leakage in responses. Checklist treatment: ask the model for its system prompt, note that it refused, move on.

LLM03 — Supply Chain. Real test: track the provenance of every model, embedding, dataset, fine-tune, and adapter in the deployment graph. Verify integrity. Treat Hugging Face downloads as untrusted code. Checklist treatment: confirm the vendor has SOC 2.

LLM04 — Data and Model Poisoning. Real test: assume training and fine-tuning corpora are adversarial; validate data lineage; run differential evaluations between a known-clean baseline and the production model. For RAG, this overlaps with retrieval poisoning — adversarial documents inserted into a vector store to manipulate downstream outputs. Checklist treatment: ask whether the training data was "curated."

LLM05 — Improper Output Handling. Real test: treat every model output as untrusted input to the next system. If output flows into a SQL query, a shell command, a tool-call argument, a rendered HTML page, or a downstream model, test injection, escaping, and sanitisation at that boundary. Checklist treatment: output goes to the user, the user is a human, no further handling required.

LLM06 — Excessive Agency. Real test: enumerate every tool the agent can call, every permission those tools hold, every side effect they can produce. Test whether a successful prompt injection can chain tools to produce a meaningful action the principal didn't authorise. Reduce permissions to the minimum required. Checklist treatment: list the tools in a diagram.

LLM07 — System Prompt Leakage. Real test: assume the system prompt is public from day one. Audit it for embedded secrets, role descriptions that imply security boundaries, or anything the principal-authorisation model depends on. Checklist treatment: try to extract the system prompt, note the refusal, treat the prompt as confidential.

LLM08 — Vector and Embedding Weaknesses. Real test: validate the retrieval surface against adversarial documents; test embedding-space collisions; audit who can write to the vector store and under what authentication; instrument retrieved-context for anomalies. Checklist treatment: the vector store is "behind auth."

LLM09 — Misinformation. Real test: evaluate hallucination rate against a held-out fact set with ground truth, instrument production for confident-but-wrong outputs in domains where the cost of being wrong is asymmetric (medical, legal, financial). Checklist treatment: confirm the model has a "disclaimer."

LLM10 — Unbounded Consumption. Real test: rate-limit per principal, monitor token spend by tenant, detect denial-of-wallet patterns and prompt-extraction-by-repetition attacks. Treat compute cost as an attack surface. Checklist treatment: rely on the cloud-provider quota.

The pattern across all ten is the same. The category names compress, the real tests are open-ended, and the checklist treatment is always a static one-shot exercise.

What a credible programme actually does

If the checklist treatment is the failure mode, the discipline has a recognisable shape. Five operational practices distinguish a programme that catches real failures from one that produces a report.

One — a version-controlled adversarial test set, owned by the team, that grows. Not a vendor probe library run once. A repository in the team's source control, versioned alongside the model and the prompt, with tests added every time a new attack pattern appears in the literature, a new tool is added to the agent, a new retrieval source is wired in, or a finding from production triage gets converted to a permanent test. The test set is an asset that compounds. The probe library is a snapshot that decays.

Two — a continuous red-teaming pipeline triggered on every model change. Every checkpoint, every prompt revision, every retrieval-source addition, every adapter swap runs the adversarial test set in CI. Deployment is gated on the result. The bar is not "all probes refused" — that's unachievable. The bar is "attack-success rate has not regressed beyond the agreed threshold against the previous baseline, and no P0 probe newly succeeds." This is the same shape as a regression test suite for any other system; the difference is that the inputs are adversarial and the assertions are over model behaviour.

Three — production traffic sampled and replayed adversarially against new versions. Real users find attack surfaces the test set didn't anticipate. A small percentage of production traffic, sanitised of PII and labelled by intent, gets stored. When a new model version is staged, that traffic is replayed against it — both as-is, and mutated with adversarial perturbations. Regressions surface before the version reaches the principal who triggered them in the first place.

Four — kill-chain analysis on every successful attack. When a probe succeeds, the question is not "did the model refuse?" — it didn't. The question is "what could the successful attack have reached?" What tools were callable in the context where the attack landed? What permissions did those tools hold? What data sources were exposed? What was the blast radius? A successful jailbreak against a model with no tool access and no sensitive context is a low-severity finding. The same jailbreak against an agent with write access to a CRM and read access to a customer table is an incident. The same probe produces wildly different findings depending on the surface around it. Triage that doesn't measure blast radius is triage that misses the point.

Five — every finding becomes a permanent regression test. This is the difference between an exercise and a programme. A finding that gets fixed but never re-tested can regress on the next model swap. A finding that becomes a versioned test in the eval gate is a permanent assertion. The corollary: the eval gate must run fast enough that engineers can iterate against it, and reliable enough that they trust its signal.

These five practices are not novel as ideas. They are the shape that catches attacks in production. The reason most programmes don't do them is that each one requires sustained ownership, not a procurement event.

Red-teaming catches, the safety surface contains

Red-teaming is half the picture. The other half is what runs in production when the red team isn't watching: the held-out test set that gates every release, the regression suite that catches drift, the rollback path when a deployment misbehaves, the approval gates around model swaps, the drift detection that flags when production distribution diverges from evaluation, and the audit trail that records who promoted what, when, and under what evidence.

The companion piece on the self-improving-agents production pattern describes that surface in detail. The relationship between the two is straightforward: red-teaming catches failures by going looking for them, the safety surface contains failures by constraining what an attacker who slips through can reach. Programmes that have one without the other fail in predictable ways. Red-teaming without containment produces a backlog of findings and no enforcement. Containment without red-teaming hardens against the threats you imagined and leaves the rest uncovered.

The two practices share a substrate: the version-controlled eval suite. Findings from red-teaming feed it. The safety surface enforces it. The same regression test that started life as a red-team finding becomes the gate that blocks the next deployment from regressing.

What red-teaming does not solve

Honest scoping matters. Red-teaming, even at its most disciplined, does not solve:

Misuse by authorised principals. If a salesperson with legitimate access to the agent uses it to exfiltrate the customer list one query at a time, no adversarial test against the model will catch that. The controls live elsewhere — DLP, behavioural analytics, access review, the same controls that already exist for non-AI systems.

Supply-chain compromise of training data. Adversarial documents inserted into a public corpus that ends up in a pre-training set are a category the consumer of a model cannot test for from the outside. Vendors can mitigate. Consumers can run differential evaluations between model versions and watch for behaviour drift. Neither is a solution.

Novel attack surfaces that haven't been characterised yet. The OWASP list is a snapshot of what the community has learned to test for. The attack surface of agentic systems is still expanding faster than the testing literature. MITRE ATLAS, which extends MITRE ATT&CK to AI systems, now documents 16 tactics and a growing technique set. The discipline keeps moving. A programme that assumes the list is complete is testing yesterday's attacks.

This honesty is part of the discipline. A programme that claims to solve everything is selling theatre.

Buyer questions for a vendor selection

The questions that surface the difference are operational, not architectural. A CISO or security architect can take this list to any vendor pitch:

  1. Show me a version-controlled adversarial test set you've built for a previous customer (sanitised) and explain how it grew over the engagement.
  2. Show me the deployment-gate hook that runs that test set on every model or prompt change. What's the build system, what's the failure condition, what's the rollback path?
  3. Show me three months of continuous-pipeline outputs from a real customer engagement. What got caught, what was the trend, what changed in response?
  4. Show me a kill-chain analysis from a real finding. What was the probe, what tools did the agent have access to, what was the blast radius assessment, what was the remediation?
  5. What does your engagement look like in month four, after the initial assessment is done? If the answer is "renewal conversation," that's the theatre tell.
  6. What happens when a probe succeeds — does it become a regression test that runs in the customer's CI by default, or does it stay in your report?
  7. Who owns the test set at the end of the engagement, you or the customer?

The honest answers to these questions sort the vendor list quickly. Anyone who can't show all seven is selling artefacts.

What this teaches us about enterprise scaling

The broader pattern underneath all of this is that AI security is reverting toward the same engineering disciplines as the rest of software security — continuous integration, version-controlled tests, deployment gates, blast-radius analysis, drift detection — applied to a substrate where the system under test is non-deterministic and the attack surface moves with the data.

The enterprises that will scale AI without an incident are not the ones that bought the most red-teaming. They are the ones that internalised red-teaming as a discipline owned by an engineering team, fed by a test set that compounds, gated by a pipeline that runs continuously, and bounded by a production safety surface that contains the failures the testing didn't catch.

The ones that won't scale will produce a lot of reports.

FAQs

What separates a credible red-team report from a performative one?

The credible report leaves something behind that the customer's engineering team owns and uses — a version-controlled adversarial test set, a CI hook that runs it on every model or prompt change, and a deployment gate keyed to attack-success-rate regression. If the engagement ends with a PDF and a renewal conversation, it was theatre.

Is the OWASP LLM Top 10 a complete reference?

It is the current operative taxonomy and a useful starting point, but treating it as a complete reference is the failure mode. Each category compresses an open-ended attack surface into a name, the attack literature moves faster than the list, and MITRE ATLAS already documents 16 tactics and a growing technique set. A programme that assumes the list is complete is testing yesterday's attacks.

Why does kill-chain analysis matter more than counting refusals?

Because the same probe produces wildly different findings depending on the surface around it. A successful jailbreak against a model with no tool access and no sensitive context is a low-severity finding. The same jailbreak against an agent with write access to a CRM and read access to a customer table is an incident. Triage that does not measure blast radius is triage that misses the point.

What does red-teaming not solve?

Three things. Misuse by authorised principals — that is a DLP and behavioural-analytics problem, not an adversarial-testing one. Supply-chain compromise of training data, which the consumer of a model cannot fully test for. And novel attack surfaces that have not been characterised yet. A programme that claims to solve any of these is selling theatre.

How do red-teaming and the production safety surface relate?

They share a substrate — the version-controlled eval suite. Red-teaming feeds the suite by finding new failure modes; the safety surface enforces the suite by gating every release, catching drift, and constraining what an attacker who slips through can reach. The same regression test that started life as a red-team finding becomes the gate that blocks the next deployment from regressing.

Companion content

How to engage

If your AI programme has a red-team report but not a version-controlled test set, a deployment gate, or a kill-chain analysis on its findings, we can help you build the discipline underneath the artefact. Talk to us at creativeminds.dev/contact.

ai-red-teamingai-securityowasp-llm-top-10adversarial-testingprompt-injectionproduction-aiai-governanceperspective

Ready to strengthen your security posture?

We help organizations across Africa build resilient infrastructure, deploy AI at scale, and navigate complex regulatory environments.

Start a conversation