AI Bill of Materials (AIBOM): SBOM for the Model, the Prompt, the RAG Corpus, and the Fine-Tune Adapters

Samuel A., 16 min read

The Log4Shell lesson, three years later

In December 2021 the entire industry spent a fortnight discovering that nobody knew what software they were running. Log4j was buried four dependency layers deep in tens of thousands of production stacks, and the only honest answer most security teams could give to "are we exposed?" was "we are reading the source of every container image to find out." Executive Order 14028 and NTIA's Minimum Elements for an SBOM, both published earlier in 2021, and the EU Cyber Resilience Act in 2024 did not invent the Software Bill of Materials. They made it the cost of doing business with any serious customer.

The same arc is now playing out for AI. The forcing incident has not yet hit at Log4Shell scale, but the regulatory clock has started. EU AI Act Article 11, applicable to providers of high-risk systems from 2 August 2026, requires technical documentation that includes the provenance of training and validation data. NIS2 imposes comparable supply-chain obligations on essential and important entities that deploy AI in scope. CISA's evolving AI-SBOM guidance is expected to land within 12 months. The question every regulated team has to answer in 2027 is not "do we have an AI inventory" but "show me, for this specific output on this specific date, which model, which adapter, which corpus snapshot, which prompt version, and which inference-time configuration produced it."

That artefact is the AI Bill of Materials. The vendors making the loudest noise are extending existing SBOM tooling and stopping at the model file. The model file is one of seven elements. This piece is the cmdev engineering reference for what the other six are, where each gets generated, and the architecture that holds them together.

Key takeaways

  • SBOM became regulatory after Log4Shell, and AIBOM is on the same arc — except the substrate is more fluid: model versions, fine-tune adapters, RAG corpus snapshots, prompt revisions, and inference-time configuration all change faster than any release cycle.
  • An AIBOM has seven elements — base model, fine-tune adapters, prompt versions, RAG corpus snapshot, embeddings model, evaluation evidence, and inference-time configuration — and stopping at the model file is the most common mistake.
  • EU AI Act Article 11 and the emerging CISA AI-SBOM guidance make the AIBOM the artefact that resolves the regulator's "show me what produced this output" question.
  • Generation is multi-stage: at training, at fine-tune promotion, at deployment, and — critically — at every inference, joined on a trace ID, with corpus snapshots referenced by content hash rather than copied.
  • CycloneDX has AI/ML BOM support that is still maturing and SLSA-for-ML is emerging, but in 2026 the AIBOM is something the engineering team builds, not buys. Vendor catch-up is 12 to 18 months out.

[Image: the seven elements of an AI Bill of Materials arranged around a central inference event: base model with version and checkpoint hash, fine-tune adapters with provenance, prompt versions with approval record, RAG corpus snapshot with chunk-level versioning, embeddings model identifier, evaluation evidence at promotion gate, and inference-time configuration including temperature, top-p, tool list, and MCP server set. Each element is joined to the inference event by a trace ID and stored in an append-only audit store with cryptographic integrity.]
Figure 1 — Seven elements, one trace ID, one append-only audit store. The artefact a regulator will ask for in 2027.

The regulatory forcing function

EU AI Act Article 11 is the strongest instrument. From 2 August 2026, providers of high-risk AI systems must maintain technical documentation that demonstrates conformity, and Annex IV specifies the contents. Section 2(d) requires datasheets describing the training data sets, including their provenance and the labelling and cleaning methodologies; Section 2(g) requires a description of the validation and testing procedures, the data used, and the metrics. Neither can be satisfied for a RAG-backed system or an agent platform without an artefact that records which corpus, which evaluation suite, and which configuration were in force at the moment of operation. The Article 11 documentation is, in substance, an AIBOM with prose around it.

NIS2 Article 21 requires essential and important entities to implement supply-chain security measures covering their ICT suppliers and service providers. Where AI is in scope (anomaly detection for SCADA, automated triage for SOC, decision support in healthcare) the same provenance question applies. Article 23's 24-hour early-warning clock makes the AIBOM the artefact that tells the response team what the system was at the time of the incident, not at the time they query.

CISA's evolving AI-SBOM work is the third instrument. NTIA's 2021 Minimum Elements pulled SBOM into procurement contracts, and the AI equivalent, being drafted by the cross-agency working group, tracks closely to the seven elements below. The fourth, still hypothetical but visible, is an AI Resilience Act or Article 11 expansion mandating AIBOM for AI components in product form. The pattern across all four is the same: the regulator does not ask for the AIBOM as a checkbox; the regulator asks the question the AIBOM exists to answer, namely what the system was at the moment it produced this output and who approved each component.

The seven elements

An AIBOM is not a single document. It is the joined evidence of seven elements, each with its own generation point.

One: the base model. An identifier — Claude 4.7, Llama 3.3 70B Instruct, Mistral Large 2 — is not enough. The AIBOM records the model name, the provider's published version, the checkpoint hash where available, the license, and a reference to the training-data documentation. For hosted models the reference is the provider's model card and AUP; for open-weight models it is the dataset manifest.

Two: fine-tune adapters. Every adapter is a separate AIBOM entry — LoRA weights, full fine-tunes, PEFT artefacts, RLHF iterations — each with its own hash, the base it was trained from, the dataset, the training hyperparameters, the evaluation results that justified promotion, and the approver. Where adapters are stacked at inference time, the stack composition is recorded as data; a model that looks like one black box may be a base plus three adapters in the AIBOM.

Three: prompt versions. The system prompt is configuration in the same sense that an NGINX config or an IAM policy is configuration — it changes behaviour materially and is the most under-managed component in most AI deployments, edited inline or pasted into a config map without change control. The AIBOM records each prompt by hash, version, last-modified timestamp, approver, and change-management record. Where prompts are dynamic, the AIBOM records the template version and assembly rules separately from the assembled prompt, and stores the assembled prompt against the trace ID at inference time.

Four: RAG corpus snapshot. The hardest element. The corpus changes every day; a naive AIBOM that records "the corpus as of yesterday" is useless because yesterday's corpus is not the one that produced last Tuesday's answer. Every chunk is identified by its content hash, the corpus at any moment is the set of chunk hashes the vector store would return, and set membership is stored as a Merkle tree rooted at a corpus-state hash. An inference event records the corpus-state hash in force at retrieval; the audit query reconstructs which chunks were in the corpus at that moment without copying the corpus itself.
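
The corpus-state hash described above can be sketched in a few lines. This is a minimal illustration, assuming SHA-256 and a binary tree over the sorted chunk hashes; the names `chunk_hash` and `merkle_root` and the rule for odd-sized levels are illustrative choices, not part of any standard, and a production tree would also domain-separate leaf and interior nodes.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Content hash that identifies a chunk (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def merkle_root(chunk_hashes: set[str]) -> str:
    """Corpus-state hash: Merkle root over the sorted set of chunk hashes."""
    level = [bytes.fromhex(h) for h in sorted(chunk_hashes)]
    if not level:
        return hashlib.sha256(b"").hexdigest()  # fixed root for an empty corpus
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


corpus = {chunk_hash("refund policy v1"), chunk_hash("onboarding faq v3")}
root_tuesday = merkle_root(corpus)

corpus.add(chunk_hash("refund policy v2"))  # a new chunk is ingested
root_wednesday = merkle_root(corpus)

# Each inference event stores only the root in force at retrieval time.
inference_event = {"trace_id": "trace-123", "corpus_state_hash": root_tuesday}
```

Sorting the hashes before building the tree makes the root depend only on set membership, so two ingest orders that end in the same corpus yield the same corpus-state hash.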

Five: the embeddings model. The embedder is a separate model with its own version. Re-embedding with a new version invalidates similarity comparisons against the old. The AIBOM records the embedder identifier, the version, and — for every chunk in the corpus — which embedder version produced its embedding. Mixed embeddings across versions in the same index are a finding.
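
A mixed-version check is simple once each chunk carries its embedder version. The sketch below assumes the index metadata is available as a mapping from chunk hash to embedder version; the function name, the finding format, and the sample values are hypothetical.

```python
from collections import Counter
from typing import Optional


def mixed_embedder_finding(chunk_embedders: dict[str, str]) -> Optional[dict]:
    """Return a finding if chunks in one index were embedded by more than one version."""
    counts = Counter(chunk_embedders.values())
    if len(counts) <= 1:
        return None
    return {"finding": "mixed_embedder_versions", "chunks_per_version": dict(counts)}


# chunk content hash -> embedder version that produced its vector (made-up values)
index_metadata = {
    "c1a9": "embedder-v2",
    "77fe": "embedder-v2",
    "0b3d": "embedder-v3",
}
finding = mixed_embedder_finding(index_metadata)
```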

Six: evaluation evidence. The eval suite that passed gate at promotion is the artefact that justifies the deployment. The AIBOM records the suite version, the test set hash, the per-metric results, the pass thresholds in force at promotion, and the approver. This maps directly onto SOC 2 Processing Integrity and AI Act Article 15, and is where the custom evaluation frameworks for enterprise LLMs work feeds into the AIBOM directly.
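
One way to hold this evidence as data rather than as a report is a small immutable record. The sketch assumes every metric is higher-is-better and that a metric missing from the results counts as a failure; the class name, field names, and sample values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalEvidence:
    suite_version: str
    test_set_hash: str
    results: dict      # metric name -> measured value
    thresholds: dict   # metric name -> minimum value required at promotion
    approver: str

    def failed_metrics(self) -> list:
        """Metrics that are missing from the results or below their threshold."""
        return sorted(
            name
            for name, minimum in self.thresholds.items()
            if self.results.get(name, float("-inf")) < minimum
        )

    def passed_gate(self) -> bool:
        return not self.failed_metrics()


evidence = EvalEvidence(
    suite_version="suite-1.4.0",
    test_set_hash="sha256:placeholder",
    results={"groundedness": 0.93, "refusal_accuracy": 0.88},
    thresholds={"groundedness": 0.90, "refusal_accuracy": 0.90},
    approver="jane.doe",
)
```

Storing the thresholds next to the results matters: a later change to the gate cannot retroactively alter what "passed" meant on the day of promotion.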

Seven: inference-time configuration. Temperature, top-p, max tokens, the tool list available to the model, the MCP server set, the guardrail policy, the safety filter version. These change per-inference and must be captured at the inference event itself, as attributes on the record joined to the other six elements by trace ID. Most teams either skip this entirely or capture it so coarsely that configuration drift between adjacent inferences becomes invisible.

The seven are independent in generation and cumulative in evidentiary weight. An AIBOM with elements one through three but skipping four, five, and seven is the AI equivalent of an SBOM that lists the binary but not the runtime dependencies — common, and not what the regulator is asking for.

What each element lets the auditor do

The base model and adapter entries together answer which model was running and whether the output came from the base or from an adaptation — for regulated workloads the fine-tune dataset is often the highest-sensitivity artefact in the entire AI estate. The prompt version entry answers what instructions the model was operating under at the moment of inference, where prompt injection, guardrail bypass, and unauthorised tool use investigations all converge. The corpus snapshot and embeddings model entries together resolve the dispute between "the model fabricated this" and "the model retrieved this from a corpus document that should not have been in scope," and let the security team assess whether a known embedder vulnerability exposes any chunk in the index. The evaluation evidence answers what was tested, what passed, and who signed off. The inference-time configuration distinguishes "the guardrail was misconfigured" from "the guardrail was disabled at the moment of incident," which is the question incident response on agent platforms always reaches.

The standards landscape, honestly

The standards picture in mid-2026 is genuinely immature.

CycloneDX is furthest along. Version 1.5 introduced the machine-learning-model component type and the modelCard object, which capture model name, version, hash, and dataset references; the draft 1.7 specification adds fine-tune lineage and evaluation evidence. It is the safest format to standardise on now: tooling integrates into CI pipelines, and the OWASP working group is active. It covers elements one, two, and six well, three and seven adequately with custom properties, four and five thinly.

SPDX 3.0 defines AI and Dataset profiles, with a more rigorous schema but lagging tooling. SPDX 3 is what AI Act technical documentation is most likely to converge on for the European market, on the trajectory SPDX took for the CRA, but two years out. In-toto attestations cover AIBOM loosely; in-toto envelopes carrying CycloneDX payloads, signed with Sigstore, are the emerging supply-chain pattern. SLSA-for-ML is the youngest and most promising framework: the ML extension addresses training-pipeline integrity, dataset provenance, and model promotion gates.

The honest reading: standardise on CycloneDX, design the data model around the seven elements regardless of what the format natively supports, and be ready to remap onto SPDX or in-toto when the procurement requirement lands. The data model is what matters; the format is interchange.
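
To make "custom properties" concrete, here is a rough sketch of a CycloneDX document with one machine-learning-model component, built as a plain Python dictionary. The field names follow one reading of the CycloneDX ML-BOM schema and should be validated against the published JSON schema before use; the `aibom:` property namespace, the component name, and every value are invented for illustration.

```python
import json

aibom = {
    "bomFormat": "CycloneDX",
    "specVersion": "1.6",
    "version": 1,
    "components": [
        {
            "type": "machine-learning-model",
            "name": "support-assistant-adapter",  # hypothetical adapter
            "version": "2026.03.1",
            "hashes": [{"alg": "SHA-256", "content": "0" * 64}],  # placeholder digest
            "modelCard": {
                "modelParameters": {
                    "task": "text-generation",
                    "datasets": [{"type": "dataset", "name": "support-tickets-2025q4"}],
                },
                "quantitativeAnalysis": {
                    "performanceMetrics": [{"type": "groundedness", "value": "0.93"}]
                },
            },
            # Elements the format does not model natively ride along as
            # name/value properties; the "aibom:" prefix is an assumption.
            "properties": [
                {"name": "aibom:prompt_template_hash", "value": "sha256:placeholder"},
                {"name": "aibom:embedder_version", "value": "embedder-v3"},
                {"name": "aibom:corpus_state_hash", "value": "sha256:placeholder"},
            ],
        }
    ],
}

document = json.dumps(aibom, indent=2)
```

Property values are strings by convention here, so numeric settings such as temperature would need to be serialised before they are attached.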

Where the AIBOM gets generated

Generation is multi-stage; a single-stage AIBOM produced only at deployment misses most of the evidence the regulator will ask for.

At training. The training pipeline emits an AIBOM entry for elements two and six. The easiest stage to instrument — pipelines already have lineage tooling (MLflow, Weights & Biases, SageMaker Model Registry, Vertex AI), and the emitter is a hook on the existing artefact tracking.

At fine-tune promotion. Promotion from staging to production signs the AIBOM entry and adds it to the registry. This is where the eval-gate decision is captured as data: which suite version ran, which thresholds were configured, which results passed, which human approved.

At deployment. The deployment event composes elements one, two, three, five, and six into a deployment AIBOM — the artefact a procurement team will ask for in a vendor questionnaire and an internal change-advisory board will review before approving the deployment.

At inference, on every request. The stage most teams skip and the one the regulator's most concrete question requires. Every inference event emits a record carrying the deployment AIBOM reference (resolving elements one through three, five, and six by hash), the corpus-state hash at retrieval (element four), and the inference-time configuration as attributes (element seven). The record is joined to the principal-of-record chain — see our piece on IAM for AI agents — by the trace ID.
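
A per-inference record of this shape might look like the sketch below. It assumes the deployment AIBOM is addressable by a single reference string and that the configuration fits in a flat dictionary; the class name, field names, and values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class InferenceRecord:
    trace_id: str               # join key shared with the identity chain
    deployment_aibom_ref: str   # resolves elements one to three, five, and six
    corpus_state_hash: str      # element four, as of retrieval
    assembled_prompt_hash: str
    config: dict = field(default_factory=dict)  # element seven, as received

    def to_json(self) -> str:
        """Serialise with sorted keys so identical records produce identical bytes."""
        return json.dumps(asdict(self), sort_keys=True)


record = InferenceRecord(
    trace_id="trace-123",
    deployment_aibom_ref="deploy-2026-03-14-a",
    corpus_state_hash="sha256:placeholder-root",
    assembled_prompt_hash="sha256:placeholder-prompt",
    config={"temperature": 0.2, "top_p": 0.9, "tools": ["search", "ticket_lookup"]},
)
line = record.to_json()
```

Heavy artefacts never appear in the record; it carries hashes and references only, which is what keeps per-request emission cheap.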

The four-stage discipline lets a single query against the audit store reconstruct the system as it was at the moment of any specific output. Three stages and you are reconstructing from correlation; two and you are guessing.

Storage, retention, and the hard parts

The store is append-only with cryptographic integrity: each entry is hashed, the hashes are chained, and the chain root is published to a tamper-evident log (Azure Confidential Ledger or a self-hosted Sigstore-backed transparency log; Amazon QLDB, formerly the obvious AWS option, reached end of support in 2025). EU AI Act Article 12 requires high-risk systems to generate logs automatically, and Article 19 requires providers to keep them for at least six months; practical retention lands at five years for high-risk systems in regulated sectors. Every inference record carries the trace ID that originated at the user's authentication event, so the audit query that reconstructs the system at the moment of inference is the same query that reconstructs the human who caused it.
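
The hash-chaining idea can be shown with an in-memory list. This is a sketch only, assuming SHA-256 over a canonical JSON encoding and an all-zero genesis value; the function names are illustrative, and a real store would persist entries and publish the latest hash to an external log.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash of the previous entry's hash plus a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    """Recompute every link; any edited payload or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


chain: list = []
append(chain, {"trace_id": "trace-123", "temperature": 0.2})
append(chain, {"trace_id": "trace-124", "temperature": 0.2})
chain_root = chain[-1]["hash"]   # the value to publish to the tamper-evident log
intact = verify(chain)

chain[0]["payload"]["temperature"] = 0.9  # simulate an after-the-fact edit
tamper_detected = not verify(chain)
```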

Three problems break naive implementations.

Corpus versioning at scale. A 10-million-chunk corpus that changes 0.1 per cent per day churns 10,000 chunks daily; naive snapshotting copies the corpus and does not scale. Content-hash addressing with a Merkle root means corpus state is identified by a single hash, deltas are stored as chunk-hash diffs from the previous root, and storage grows with churn rather than corpus size. The pattern Git uses, rarely applied to vector stores.
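
The storage argument is easy to check: at 32 bytes per SHA-256 digest, 10,000 changed chunk hashes come to roughly 320 KB of delta per day, against roughly 320 MB for a full manifest of 10 million hashes, ignoring index overhead. Reconstruction from a checkpoint plus deltas can be sketched as follows, assuming integer timestamps and a checkpoint that predates every stored delta; the `Delta` type and `corpus_at` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    timestamp: int        # ingest time, for example epoch seconds
    added: frozenset      # chunk hashes added at this ingest
    removed: frozenset    # chunk hashes removed at this ingest


def corpus_at(checkpoint: frozenset, deltas: list, t: int) -> frozenset:
    """Replay deltas with timestamp <= t on top of a checkpointed chunk-hash set."""
    state = set(checkpoint)
    for delta in sorted(deltas, key=lambda d: d.timestamp):
        if delta.timestamp > t:
            break
        state -= delta.removed
        state |= delta.added
    return frozenset(state)


checkpoint = frozenset({"h1", "h2", "h3"})
deltas = [
    Delta(timestamp=100, added=frozenset({"h4"}), removed=frozenset({"h1"})),
    Delta(timestamp=200, added=frozenset({"h5"}), removed=frozenset()),
]
corpus_at_150 = corpus_at(checkpoint, deltas, 150)
```

Periodic checkpoints bound the replay cost, in the same way that a version-control system avoids replaying every change from the first commit.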

Prompt-version explosion. A team running A/B tests, dynamic prompts from user context, and per-tenant adaptations produces thousands of effective prompts per day. The discipline: record template versions as first-class AIBOM entries (a manageable number), assembly rules as code, and the assembled prompt as data on the inference event itself, joined by trace ID.
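
The split between template versions and assembled prompts can be sketched with the standard library. The example assumes templates use `string.Template` placeholders and that the template registry is a simple dictionary; the template ID, its text, and the helper names are made up for illustration.

```python
import hashlib
from string import Template


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# First-class AIBOM entries: a small, reviewable set of template versions.
TEMPLATES = {
    "support-v7": Template("You are a support assistant for $tenant. Answer in $language."),
}


def assemble(template_id: str, variables: dict) -> dict:
    """Assemble a prompt and return the fields to store on the inference event."""
    template = TEMPLATES[template_id]
    prompt = template.substitute(variables)
    return {
        "template_id": template_id,
        "template_hash": sha256_hex(template.template),
        "assembled_prompt": prompt,
        "assembled_prompt_hash": sha256_hex(prompt),
    }


acme = assemble("support-v7", {"tenant": "Acme", "language": "English"})
globex = assemble("support-v7", {"tenant": "Globex", "language": "English"})
```

Both tenants share one template hash while each assembled prompt gets its own hash, so the registry stays small even when the number of effective prompts does not.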

Inference-time configuration drift. Drift between adjacent inferences happens because temperature, tool lists, and guardrail policies are read from config services that may have rolled out a change mid-batch. The AIBOM must capture the configuration as the model actually received it, not as the config service claimed. The record is emitted from the inference path, not the config layer.
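
Drift between adjacent inferences becomes visible once each record carries a fingerprint of the configuration the model actually received. A minimal sketch, assuming the configuration is JSON-serialisable and records are already ordered by time; the function names and the sample batch are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Stable hash of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drift_points(records: list) -> list:
    """Trace IDs whose received config differs from the previous inference."""
    drifted = []
    for prev, curr in zip(records, records[1:]):
        if config_fingerprint(prev["config"]) != config_fingerprint(curr["config"]):
            drifted.append(curr["trace_id"])
    return drifted


batch = [
    {"trace_id": "t1", "config": {"temperature": 0.2, "tools": ["search"]}},
    {"trace_id": "t2", "config": {"tools": ["search"], "temperature": 0.2}},
    {"trace_id": "t3", "config": {"temperature": 0.2, "tools": ["search", "email"]}},
]
drifted = drift_points(batch)
```

The second record lists its keys in a different order but fingerprints identically, so only the third, where the tool list changed mid-batch, is flagged.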

The reference implementation pattern

Strip away the format debates and the AIBOM reference architecture is five components and one principle. The principle: every component is queryable by trace ID, and the audit query is a join, not a reconstruction.

A model registry holds base models, fine-tune adapters, and the lineage between them: MLflow, SageMaker Model Registry, or Vertex AI Model Registry. A corpus snapshot store is built on object storage with content-hash addressing: S3, GCS, or Azure Blob with object lock and versioning, plus a Merkle root computed and stored alongside each ingest event. A prompt registry holds templates and assembly rules; there is no widely adopted off-the-shelf product in 2026, so teams are building on top of Git, with LangSmith, PromptLayer, and Braintrust offering parts of this but none yet meeting a procurement-grade audit profile.

An evaluation registry for suite versions, test set hashes, and per-metric results — LangSmith, Braintrust, or custom on top of MLflow Tracking. An inference-time logger built on OpenTelemetry, where every inference emits a span with the deployment AIBOM reference, the corpus-state hash, the assembled prompt hash, the embedder version, and the inference-time configuration as span attributes. The trace ID is the join key across all five components.
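
"A join, not a reconstruction" can be illustrated with an in-memory SQLite database standing in for the audit store. The two-table schema, column names, and sample rows are assumptions made for the sketch; a real deployment would spread these across the five components and join them on the same key.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE deployments (
        deployment_ref TEXT PRIMARY KEY,
        base_model TEXT,
        adapter_hash TEXT,
        embedder_version TEXT,
        eval_suite_version TEXT
    );
    CREATE TABLE inference_records (
        trace_id TEXT PRIMARY KEY,
        deployment_ref TEXT REFERENCES deployments(deployment_ref),
        corpus_state_hash TEXT,
        assembled_prompt_hash TEXT,
        temperature REAL
    );
""")
db.execute(
    "INSERT INTO deployments VALUES (?, ?, ?, ?, ?)",
    ("deploy-a", "example-base-model", "sha256:adapter", "embedder-v3", "suite-1.4.0"),
)
db.execute(
    "INSERT INTO inference_records VALUES (?, ?, ?, ?, ?)",
    ("trace-123", "deploy-a", "sha256:root", "sha256:prompt", 0.2),
)


def audit(trace_id: str):
    """Return the system state behind one output, or None if the trace is unknown."""
    return db.execute(
        """
        SELECT i.trace_id, d.base_model, d.adapter_hash, d.embedder_version,
               d.eval_suite_version, i.corpus_state_hash,
               i.assembled_prompt_hash, i.temperature
        FROM inference_records AS i
        JOIN deployments AS d ON d.deployment_ref = i.deployment_ref
        WHERE i.trace_id = ?
        """,
        (trace_id,),
    ).fetchone()


row = audit("trace-123")
```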

In 2026 these are systems to build, not products to buy. Vendors will close this gap over the next 18 months; teams that built the data model around the seven elements will plug into vendor tooling at the interchange layer. The teams that did not will rebuild.

The procurement reality

There is no off-the-shelf AIBOM platform in 2026 that covers the seven elements end-to-end. The vendors closest — Anchore, Snyk, Wiz — extend existing SBOM tooling into element one and partly into element two; they do not address elements three through seven at any procurement-grade depth. The MLOps platforms — MLflow, SageMaker, Vertex AI, Azure ML — address elements one, two, and six within their walled gardens, producing AIBOM-shaped artefacts that are not yet portable across stacks.

Any team that needs an AIBOM in 2026 builds it. The build is weeks of engineering, not quarters, but the design choices around content-hash addressing, trace ID propagation, and append-only storage are the differentiators. Vendors will catch up on the easy parts within 12 months and on the harder parts within 18 to 24. Until then, the engineering reference is the differentiator.

What this teaches us about regulated AI

The shape of every AI compliance regime landing in 2026 and 2027 is the same. The regulator does not ask whether you bought the right platform; the regulator asks a specific question about a specific output on a specific date, and the answer has to come from a single audit query against a single audit store. SOC 2 evidence, AI Act Article 11 documentation, NIS2 incident reports, and the procurement questionnaires that flow from all three converge on the same artefact.

That artefact is the AIBOM. The teams that have it on day one of an audit in 2027 win the audit in days. The teams that do not either delay deployment until they retrofit it — under regulatory pressure, the most expensive engineering work there is — or fail the inspection.

The deeper lesson applies to every load-bearing audit artefact. The AIBOM is not a document the compliance team writes after the fact. It is data the engineering team emits at the moment of every relevant event, joined on a trace ID, and stored under cryptographic integrity. The compliance team queries it; the engineering team makes it exist. That boundary is the boundary between AI deployments that survive regulatory scrutiny and the ones that retrofit under it.

FAQs

What is the shortest accurate definition of an AIBOM?

The joined, versioned, integrity-protected record of the seven elements that produced a specific AI output: base model, fine-tune adapters, prompt versions, RAG corpus snapshot, embeddings model, evaluation evidence, and inference-time configuration. Stopping at the model file is the most common mistake.

Is AIBOM legally required yet?

Not by name. EU AI Act Article 11 and Annex IV require technical documentation of training and validation data provenance for high-risk systems from 2 August 2026, which is an AIBOM in substance. NIS2 Article 21 imposes equivalent supply-chain obligations on essential and important entities. CISA's AI-SBOM minimum elements are in draft and expected within 12 months. The artefact is required; the name is converging.

Which standard format should we ship?

CycloneDX 1.6, with the data model designed around the seven elements regardless of what the format natively supports. Be ready to add SPDX 3 with the AI profile when European procurement asks for it. In-toto attestation envelopes signed with Sigstore are the emerging pattern for tamper-evident interchange.

How do we handle the RAG corpus dimension at scale?

Content-hash addressing with a Merkle root over the corpus membership manifest. Every chunk is identified by its content hash; the corpus state at any moment is identified by a single root hash; deltas between roots are stored as chunk-hash diffs. The query "which chunks were in the corpus at time T" reconstructs from a checkpoint root plus replayed deltas, and storage cost grows with churn rather than corpus size.

Do we have to emit an AIBOM record at every inference?

For high-risk workloads under AI Act Articles 11 and 12, yes — the inference-time record is what lets the audit query reconstruct the system as it was at the moment of any specific output. The cost is manageable when the record is a span attribute on the existing OpenTelemetry trace and the heavy elements (corpus snapshot, model file) are referenced by hash rather than copied.

How to engage

If you are building or operating a high-risk AI system that has to survive an EU AI Act Article 11 inspection — or a procurement questionnaire that asks for AIBOM by name — and you want a reference implementation against the seven elements and the architecture above, talk to us at creativeminds.dev/contact.
