Federated AI: When to Keep Models On-Prem, When to Send Data to the Frontier

Key takeaways

Data residency is now the constraint that drives the rest of the AI architecture — NDPA Section 43, GDPR plus EU AI Act Article 10 (enforceable August 2026), and CBN CSAT have closed off the default "send everything to the frontier" pattern for a growing share of regulated workloads.
"Federated AI" is being used as a marketing umbrella for four technically distinct patterns: federated learning (training), on-prem foundation models, hybrid sanitisation, and edge inference. Their legal posture, cost structure, and operational complexity are different.
The economic case for on-prem is rarely "it is cheaper at scale" — frontier price-per-token has fallen ~90% per year for three years. The case is that the data legally cannot leave, so the cost question is on-prem versus not building it at all.
For most regulated enterprises the architecture that audits clean is hybrid: on-prem PII tokenisation and redaction on a 7-13B classifier, frontier inference on sanitised input, re-hydration inside the trust boundary, audit trail covering both surfaces.
The mental model that wins is that the model is a component, not a destination. The architecture is the policy, sanitisation, routing, audit, and residency layer. The model — frontier, on-prem, edge, federated — fits inside that architecture, not the other way around.

Figure 1 — Two zones, one classifier, the trust boundary that decides whether this works

The constraint that now drives the architecture

For most of the last three years, the default enterprise AI architecture was a single sentence: pick a frontier model, send the prompt over HTTPS, parse the response. Everything downstream — retrieval, orchestration, guardrails, evaluation — assumed that the model itself was somewhere else, on someone else's hardware, under someone else's jurisdiction.

That default has quietly stopped working for a large class of customers. The EU AI Act's high-risk obligations under Article 10 become enforceable in August 2026, with documented dataset governance and bias mitigation requirements that are awkward to evidence when the model and its training data live behind a vendor API. Nigeria's NDPA Section 43 prohibits cross-border transfer of personal data unless the receiving country has been certified adequate by the NDPC, and very few countries currently are. The CBN's Risk-Based Cybersecurity Framework for Deposit Money Banks now requires regulated institutions to evidence control over data flows. Sector-specific defence-adjacent work — energy, telecoms, anything touching critical infrastructure — has its own layer on top.

The result is that the architecture conversation now starts with a question it used to end with: where is the data allowed to be when the model sees it? The four patterns below are the honest answer set. None of them is universally right, and the marketing language around them is confused enough that we want to start by separating them.

What "federated AI" actually means in 2026

"Federated AI" is being used as a marketing umbrella for four technically distinct things. The conflation matters because the legal posture, the cost structure, and the operational complexity of each is different.

Federated learning is the original 2017 meaning: a model is trained across decentralised data, only encrypted weight updates leave each node, and a coordinator aggregates them into a global model. The raw data never moves. This is a training pattern, not an inference pattern. It is what you do when you cannot pool the training data — across hospitals, across banks, across regulatory zones.

On-prem foundation models are pre-trained open-weight models — Llama, Mistral, Qwen, DeepSeek — running on customer-controlled hardware. The weights are static; the customer hosts the inference. No training happens unless the customer chooses to fine-tune. This is the pattern most enterprises mean when they say "we want our own model."

Hybrid architectures keep sensitive preprocessing — PII redaction, classification, retrieval over confidential corpora — on-prem, then call a frontier model in the cloud on sanitised input. The frontier model never sees the raw data. The customer wears the data-residency obligation on the on-prem side and gets frontier capability on the cloud side.

Edge inference runs small models — 500M to 7B parameters — on the device that produces the data: a phone, a sensor, a vehicle, a branch-office appliance. Latency and offline operation, not residency, are usually the primary drivers, though residency is a clean side-effect.

These four patterns share the property that data does not flow to a third-party frontier model. They do not share much else. We treat them separately below.

The four real patterns

Pattern 1: On-prem foundation models

The serious open-weight models have closed enough of the gap to make this pattern viable for most production workloads. Llama family models in MoE configurations run at INT4 quantisation on single H100 80GB cards; larger variants need 8× H100s for production-grade throughput. Mistral Small (Apache 2.0) consolidates earlier products into a single deployment with configurable reasoning effort. Qwen ships in sizes from sub-1B to 397B, all Apache 2.0, all natively multimodal. DeepSeek is the reasoning-per-dollar leader.

The serving layer is mature. vLLM is the de facto production runtime — paged attention, continuous batching, tensor parallelism across multiple GPUs. NVIDIA NIM packages a model, an OpenAI-compatible API server, and an optimised inference engine into a single Docker container. AWS Outposts and the newer AI Factories offering drop dedicated AWS hardware into the customer's data centre and expose SageMaker and Bedrock surfaces on top.

The honest constraint is that this is datacenter-scale infrastructure. A single H200 costs $30,000 to $40,000 to buy outright, or roughly $2.50 to $7.00 per GPU-hour on-demand; an H100 is $25,000+ to buy and around $3.11/GPU-hour on average. An 8×H200 rack is $315,000 of GPU alone, before networking, power, and a place to put it. You also need the operational competence — driver versions, CUDA, NVIDIA Container Toolkit, monitoring, model swap procedures. This is not a weekend project.

Pattern 2: Hybrid (on-prem sanitisation, frontier reasoning)

The hybrid pattern is the architecture that ships most often in regulated enterprises. The shape is: an on-prem service receives the raw input, classifies and tokenises sensitive fields, runs the redacted payload against a frontier model, then re-hydrates the response by mapping tokens back to original values inside the trust boundary.

The production version uses three-layer PII detection: regex with checksums for structured identifiers, transformer-based NER (DeBERTa or piranha-style models) for unstructured text, and an LLM-judge pass for edge cases. Tokenisation is tiered: salted hash for one-way comparison, deterministic HMAC for joinable identifiers, NIST 800-38G FF1 format-preserving encryption for fields that need to look real, vaulted tokenisation for fields where the original must come back. Policy enforcement is graduated — allow, warn, block, or route to an approved internal model — rather than a hard block on every match.

The legal posture is the appeal. The frontier model only ever sees synthetic or pseudonymised data; the raw record stays in the regulated jurisdiction. The customer can answer the regulator's question about cross-border transfer with "we transfer tokenised data only, and the mapping vault is in [Frankfurt / Lagos / Stockholm]." The cost posture is favourable too — frontier API calls are dominated by the prompt size, and a sanitised prompt is often shorter than the original.

The cost of being honest about this pattern is that it is not free. You are running an additional inference stack — typically a 7B-13B classifier on a single GPU — and a vault. You are accepting that some queries will produce worse answers because they were over-redacted. And you are now responsible for the redaction model's failure modes, which the regulator will ask about.

Pattern 3: Edge inference

Edge inference is the pattern when the constraint is latency or connectivity rather than residency, though residency comes along for the ride. The model is small enough to run on the device producing the data: a phone, a vehicle telemetry unit, an industrial sensor, a branch-office appliance.

The current baseline is that a Jetson Orin NX runs Phi-class models at INT4 quantisation with sub-100ms inference latency. A 500M-2B parameter model fits in 1-4GB of RAM and runs on consumer mobile silicon. Quantisation from FP32 to INT4 typically reduces model size 8× with a few percentage points of quality loss on classification and short-form generation, which is acceptable for the workloads that justify edge in the first place: command interpretation, classification, simple summarisation, anomaly detection.

The cases where edge wins are specific. Real-time decisions where a 300ms round-trip is unacceptable — factory-floor quality control, voice interfaces, fraud scoring on a payment terminal. Connectivity-constrained environments — rural branches, vehicles, vessels. Workloads where the volume is high enough that per-call cloud inference would be uneconomic even at frontier prices.

The cases where edge loses are also specific. Anything requiring frontier reasoning, long context, or recent world knowledge. Anything where the model needs to be updated frequently — pushing weights to a fleet of devices is its own logistics problem. Anything where the input is genuinely small and rare; the engineering tax is not worth it for a hundred queries a day.

Pattern 4: Federated learning

Federated learning is the narrowest of the four patterns and also the most overhyped. The genuine production use cases are concentrated in healthcare (training across hospital data without pooling), banking fraud (training across institutions without sharing transaction logs), and certain manufacturing telemetry use cases. Production frameworks now combine federated learning with differential privacy, secure multi-party computation, homomorphic encryption, or trusted execution environments to harden the weight-update channel against reconstruction attacks.

The pattern is real, and it works, but it is a training-time architecture. It does not solve the inference-time data-residency problem at all — once you have the global model, you still have to host it somewhere, and that hosting decision is one of the other three patterns. We see federated learning used most often as a complement to on-prem foundation models: train federally across regulated nodes, then deploy the resulting model on-prem at each node.

If you are reaching for federated learning and your problem is not "we need to train across data that legally cannot be pooled," you have probably reached for the wrong pattern.

What the regulators actually require

The regulatory drivers are specific enough that they reward being specific in return.

NDPA (Nigeria) prohibits transfer of personal data to a foreign jurisdiction unless that jurisdiction is on the NDPC adequacy list, or unless an enforceable contractual safeguard (typically a Data Processing Agreement) is in place. The adequacy list is short, so most Nigerian enterprises operate under DPAs that include transfer-specific protections. Penalties run up to ₦10 million or 2% of annual gross revenue, whichever is higher.

GDPR + EU AI Act for high-risk AI systems requires documented data governance under Article 10: design choices, data collection and origin, preparation, assumptions, availability, bias examination, bias mitigation, and known gaps — all maintained throughout the system's lifecycle. Enforcement against high-risk systems begins 2 August 2026. Fines reach €35M or 6% of global turnover. The practical implication is that if you cannot evidence the training-data lineage of the model serving your inference, you cannot use it for a high-risk application.

CBN CSAT (Nigeria, banking) requires Nigerian deposit money banks and payment service banks to implement and evidence cybersecurity controls. Data flow control is a first-class control objective.

Sector-specific defence and energy rules vary by jurisdiction but cluster around the same requirement: certain workloads must run inside a defined trust boundary, with auditable evidence of who touched the data and where the inference occurred.

The pattern across all four is that regulators are no longer satisfied with vendor attestation. They want the customer to evidence the architecture itself.

The honest cost trade-off

The on-prem-versus-frontier economic question has a definite answer, and the crossover point depends on the volume.

Consider a workload running an open-weight INT4 model on a single H100 versus calling a frontier model ($5/$30 per million input/output tokens). An H100 server with networking, power, and ops overhead is roughly $20,000-30,000 per month all-in if you own it, or $1,800-2,900 per month for the GPU rental alone if you do not. Larger MoE deployments on 8×H100 in FP8 run $17,500-23,000 per month for the GPU rental.

At an average mixed I/O of 2,000 input tokens and 500 output tokens per call, the frontier path costs roughly $0.025 per call. The single-H100 open-weight deployment breaks even at around 800,000-1,200,000 calls per month, depending on how aggressively you cost the operational overhead. The 8×H100 deployment breaks even at around 700,000-900,000 calls per month against the same frontier comparison.

Two cautions on this maths. First, frontier price-per-token has fallen at roughly 90% per year for the last three years; the crossover keeps moving against on-prem economics. Second, the all-in cost of on-prem is consistently underestimated — power and cooling at $0.10-0.15/kWh for an H100 server consuming 700W under load is not the headline number, but it is real, and so is the ops team you need to keep the cluster healthy.

The economic case for on-prem is not "it is cheaper at scale" — it sometimes is and sometimes is not. The economic case is "we would have to do this anyway because the data cannot leave, so the cost question is on-prem versus not building it at all."

The quality trade-off, honestly

Open-weight models have closed enough of the gap to make on-prem the right call for most enterprise workloads. The gap is narrow on classification, structured extraction, RAG synthesis, summarisation, and most short-form generation. The medium-sized variants of Llama, Mistral, and Qwen match or approach the frontier on these tasks for production purposes.

The gap is still meaningful at the frontier of reasoning, long-horizon agentic work, and certain multimodal tasks. The leading frontier models set records on agentic benchmarks; the open-weight ecosystem is one tier behind, not two. If your workload genuinely depends on this frontier — and most don't — that gap matters.

The trap to avoid is benchmark-shopping. The right question is not "which model wins on MMLU?" but "does the open-weight option meet our acceptance criteria on our own evaluation set?" The answer for classification, retrieval, and structured extraction is almost always yes. The answer for long-horizon agentic code generation is sometimes no.

The four-quadrant decision logic

The pattern that fits is determined by two axes: data sensitivity and latency requirement.

	Low sensitivity	High sensitivity
Latency-tolerant (>500ms ok)	Frontier API. Default.	On-prem foundation model or hybrid (sanitised frontier).
Latency-bound (<200ms)	Edge inference for the local hop, frontier for the heavy lift.	On-prem foundation model or edge inference. Frontier is out.

Two clarifications on the table. First, "low sensitivity" is rarer than people think — the moment a workload touches PII, financial data, or any sector-regulated information, you are in the high-sensitivity quadrant. Second, the latency axis is about user-facing latency, not model latency; a 30-second background job is latency-tolerant even if the user is waiting for it.

Federated learning sits outside this table because it is a training-time pattern. The right way to read it is: if your high-sensitivity workload also requires training across data that cannot be pooled, federated learning is the training architecture that produces the on-prem model you then deploy.

The architecture pattern that ships

For most regulated enterprises, the default that holds up across audits is hybrid: on-prem preprocessing for sensitive fields, frontier inference on sanitised input, audit trail covering both surfaces.

The reference shape we deploy is straight: a sanitisation service running on customer-controlled hardware (typically a single H100 or A100 hosting a 7-13B classifier and a vault), a policy layer that decides per-field whether to redact, tokenise, or pass through, a frontier API call against the sanitised payload, and a re-hydration step that maps tokens back inside the trust boundary before the response is shown to a user. The audit trail records what was redacted, what was sent, what came back, and who saw the re-hydrated answer.

This pattern wins for three reasons. The regulatory posture is defensible — the frontier model never sees raw data, so the cross-border transfer question is about tokens, not records. The cost posture is favourable — the on-prem footprint is small, the frontier API absorbs the heavy reasoning, and per-call costs scale with sanitised prompt size. The quality posture is acceptable — frontier reasoning is preserved for the parts of the workload that need it, while the residency-sensitive parts run locally.

The teams that build this badly do one of two things: they over-redact (the frontier model gets a payload so stripped that it cannot reason well) or they under-tokenise (joinable identifiers leak through and the regulator finds them). Getting the tier of tokenisation right per field — salted hash for one-way checks, deterministic HMAC for joins, FF1 for fields that need to look real, vault for fields that need to come back — is what separates the architectures that audit clean from the ones that don't.

What this teaches us about enterprise scaling

The shift from "send everything to the frontier" to "decide per field, per workload, per jurisdiction" is the most consequential architectural change in enterprise AI since RAG. It is also the change that most clearly distinguishes teams that have shipped regulated AI from teams that have not.

The teams that shipped frontier-only architectures in earlier years are now retrofitting tokenisation layers. The teams that designed for hybrid from the start are extending into edge and on-prem as the workload mix demands. The pattern that is no longer credible is "we'll figure out residency later" — the regulatory deadlines have moved from theoretical to enforceable, and the architectural debt of bolting on residency after the fact is large enough that we now treat it as a core architecture decision, not a deployment-time toggle.

The right mental model is that the model is a component, not a destination. The architecture is the policy layer, the sanitisation layer, the routing layer, the audit layer, and the residency boundary. The model — frontier, on-prem, edge, federated — fits inside that architecture. Treating it the other way around is the mistake.

FAQs

What does "federated AI" actually mean in 2026?

Four technically distinct things that share only the property that data does not flow to a third-party frontier model. Federated learning is a training pattern where only encrypted weight updates leave each node. On-prem foundation models are pre-trained open-weight models on customer hardware. Hybrid architectures sanitise on-prem then call a frontier API. Edge inference runs small models on the device producing the data. The legal posture, cost, and complexity of each is different.

When does on-prem beat frontier on economics?

At roughly 800,000 to 1,200,000 calls per month on a single H100, or 700,000 to 900,000 calls per month on an 8×H100 MoE deployment, against a $5/$30-per-million-token frontier comparison and a 2,000-in/500-out mixed I/O. Two cautions: frontier prices have fallen ~90% per year for three years, so the crossover keeps moving, and the all-in cost of on-prem (power at $0.10-0.15/kWh, ops, cooling) is consistently underestimated.

Should we use federated learning?

Only if your problem is genuinely "we need to train across data that legally cannot be pooled" — across hospitals, banks, or regulated zones. Federated learning is a training-time architecture. It does not solve inference-time residency. Once you have the global model you still have to host it somewhere, and that hosting decision is one of the other three patterns. We see it used most often as a complement to on-prem foundation models, not a replacement.

What does the hybrid sanitisation pattern look like in production?

Three-layer PII detection (regex with checksums, transformer NER like DeBERTa, an LLM-judge pass for edge cases), tiered tokenisation (salted hash for one-way comparison, deterministic HMAC for joins, NIST 800-38G FF1 for fields that need to look real, vaulted tokenisation for fields that need to come back), and graduated policy enforcement — allow, warn, block, or route to an approved internal model. The frontier model only ever sees synthetic or pseudonymised data; the vault stays in jurisdiction.

Have open-weight models really closed the gap?

For most enterprise workloads, yes — classification, structured extraction, RAG synthesis, summarisation, and short-form generation. The medium-sized variants of Llama, Mistral and Qwen match or approach the frontier on these for production purposes. The gap is still meaningful at long-horizon agentic work and certain multimodal tasks. The right question is not "which model wins on MMLU" but "does the open-weight option meet our acceptance criteria on our own evaluation set."

Companion content

How to engage

If you are deciding between on-prem, hybrid, edge, and federated patterns for a specific regulated workload — and you want the residency question pressure-tested against a real architecture rather than against vendor marketing — we run paid architecture reviews that produce a decision document with a defensible cost, quality, and regulatory posture. Talk to us at creativeminds.dev/contact.