Multi-Agent Orchestration: When CrewAI, LangGraph, and Custom Stop Being the Same Conversation

Key takeaways

CrewAI, LangGraph, Microsoft Agent Framework, OpenAI Agents SDK and custom are not interchangeable bets — each optimises for a different axis, and the differences only become visible at 02:00 when on-call has to reconstruct a failed run.
Four production constraints decide framework choice: state management, observability, error semantics, and deployment model. CrewAI is fastest to prototype; LangGraph is deepest on state and observability; Microsoft Agent Framework is strongest on built-in patterns; the OpenAI Agents SDK is cleanest if your stack is OpenAI-first; and custom is the only honest answer when constraints do not match any framework's defaults.
The production architecture that emerges repeatedly is the same shape regardless of starting framework: explicit state machine with typed state and pluggable checkpointer, custom Python tool layer, OpenTelemetry-grade tracing, and a deployment substrate the platform team can run.
Four concerns sit outside what any framework provides — multi-tenant isolation, secrets rotation in long-running flows, principal scoping as intersection (never union) across handoffs, and audit-trail integration. These decide procurement reviews for banks and regulated operators.
Framework choice is the reversible decision. Platform decisions — state, observability, error semantics, deployment substrate — are not. Optimising for the reversible decision while leaving the irreversible ones implicit is the most common pattern of failure.

Figure 1 — Each option optimises for a different axis · most production systems end up stacking all three plus the platform work no framework owns.

The framework discourse, and the question underneath it

Every quarter, a new multi-agent framework arrives with a launch post, a benchmark, and a community of early adopters convinced the previous one is now obsolete. CrewAI has crossed forty-five thousand stars and powers, by its own count, more than twelve million daily agent executions. LangGraph sits inside the LangChain ecosystem at ninety-seven thousand stars, with LangSmith and LangServe alongside it. Microsoft retired AutoGen into maintenance mode and shipped Agent Framework 1.0 for .NET and Python. OpenAI deprecated Swarm in favour of the Agents SDK, which has since gained a model-native harness and sandbox execution.

The framework discourse is loud. The engineering question underneath it is quieter: when a production multi-agent system fails at 02:00, what does the on-call engineer need to reconstruct what happened, fix the cause, and resume the run without losing state? Read the frameworks against that question and the apparent symmetry between them breaks. They are not the same conversation. They optimise for different things, and the things they optimise for trade against each other.

This piece reads each option on engineering terms. The pattern we recommend is not loyalty to one framework. It is matching framework to constraint, and being honest about what no framework solves.

What each framework actually optimises for

CrewAI optimises for the legibility of agent teams to non-engineers. You define agents with a role, a backstory, and a goal. You assign tasks. The framework handles the coordination loop with prompt templates that surface as readable structure. For a content pipeline, a research workflow, or anything that maps cleanly onto a human team structure, this is the fastest path to a working prototype — often two to four hours from blank file to running crew. The metaphor carries real engineering value because product stakeholders can read the configuration. The cost is that customisation lives behind abstraction. Once you need conditional routing that does not match the task model, or you need to debug a stuck agent in step seven of a twelve-step crew, you are working against the framework, not with it. Observability is being added but remains less mature than what the LangChain ecosystem offers.

LangGraph optimises for the explicit state machine. Agents are nodes in a directed graph with a typed shared state. Transitions are conditional functions. Checkpoints are first-class. The cost is verbosity: what fits on a page in CrewAI takes more code in LangGraph. The benefit is that the verbose code is the documentation. When the system breaks, the graph tells you which node held control, what state it received, and what state it produced. The integration surface around it (LangSmith for tracing, LangServe for deployment, the AgentCore checkpointer for AWS-native persistence) is the broadest in the ecosystem. For stateful, auditable, long-running flows, this is the most battle-tested option.

Microsoft Agent Framework is the production-grade successor to AutoGen. AutoGen contributed the conversation-based, actor-model architecture and the research-grade orchestration patterns: sequential, concurrent, handoff, group chat, and the Magentic-One planner. The 1.0 release promoted these to stable, added streaming, checkpointing, human-in-the-loop approvals, and pause/resume across both .NET and Python, and merged in the enterprise scaffolding from Semantic Kernel. For organisations already on the Microsoft identity and Azure substrate, this is the path of least friction. AutoGen continues for research and prototyping but receives no new features.

OpenAI Agents SDK is what Swarm became. Swarm was deliberately small and deliberately experimental — a reference design for lightweight handoffs between agents. The Agents SDK kept the simplicity and added what production needs: persistent sessions, guardrails, tracing, and a model-native harness that owns the agent loop, tool routing, handoffs, approvals, and run state. The current release added sandbox execution with providers including Cloudflare, E2B, Modal, Daytona, Vercel, Runloop, and Blaxel. For OpenAI-first teams, the SDK is now the supported production path. Swarm itself should be read as a reference design, not a deployment target.

Custom means something specific in the teams we work with. It is not a from-scratch reimplementation. It is a thin state-machine layer in Python, typed with Pydantic, deployed on the team's existing infrastructure-as-code substrate, with prompt-tuning and tool-calling logic written explicitly. Teams reach for custom when the framework abstractions impose a cost the team cannot recover: when the workflow does not match any framework's coordination model, when the audit and compliance surface needs to be inspectable line-by-line, or when the deployment substrate already imposes constraints (regulated environments, air-gapped infrastructure, specific identity providers) that the frameworks treat as edge cases.
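To make "thin state-machine layer" concrete, here is a minimal sketch under stated assumptions. It uses only the standard library, with a frozen dataclass standing in for the Pydantic model, and the step names (`research`, `review`), the `RunState` fields, and the `run` helper are all hypothetical. It is an illustration of the shape, not a reference implementation.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class RunState:
    """Typed state handed from step to step (a Pydantic model in a real stack)."""
    question: str
    draft: str = ""
    approved: bool = False
    history: Tuple[str, ...] = ()


# A step returns the new state plus the name of the next step (None stops the run).
Step = Callable[[RunState], Tuple[RunState, Optional[str]]]


def research(state: RunState) -> Tuple[RunState, Optional[str]]:
    draft = f"notes on: {state.question}"  # placeholder for a model call
    return replace(state, draft=draft, history=state.history + ("research",)), "review"


def review(state: RunState) -> Tuple[RunState, Optional[str]]:
    ok = bool(state.draft)
    next_step = None if ok else "research"  # conditional routing is plain code
    return replace(state, approved=ok, history=state.history + ("review",)), next_step


STEPS: Dict[str, Step] = {"research": research, "review": review}


def run(state: RunState, start: str = "research", max_steps: int = 10) -> RunState:
    """Drive the state machine until a step returns None or the budget runs out."""
    current: Optional[str] = start
    taken = 0
    while current is not None:
        if taken >= max_steps:
            raise RuntimeError("step budget exhausted")
        state, current = STEPS[current](state)
        taken += 1
    return state


final = run(RunState(question="churn drivers"))
```

Because every transition is an ordinary function over an immutable state object, the `history` field doubles as a record of which step held control, which is the line-by-line inspectability that motivates the custom route.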

The four production constraints that decide the choice

Framework choice is decided by four constraints. The frameworks all do tool-calling. They all do orchestration. The differences appear here.

State management asks where conversation state lives, who owns checkpoints, and what happens to in-flight runs on deployment. LangGraph treats state as a typed object that flows through the graph, with pluggable checkpointers — Postgres, Redis, Bedrock AgentCore Memory, DynamoDB with S3 offloading, or Valkey. CrewAI keeps state inside the crew object with optional memory plug-ins. The OpenAI Agents SDK ships persistent sessions. Microsoft Agent Framework gives you checkpointing and pause/resume on long-running workflows. Custom puts state wherever your platform team already runs stateful services. The right question is not "does it have checkpoints" but "what do my checkpoints cost to inspect, replay, and migrate."
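One way to reason about the cost of inspecting, replaying, and migrating checkpoints is to write down the interface you would want from any backend. The sketch below is an assumption of ours, not any framework's API: `Checkpoint`, `Checkpointer`, and `InMemoryCheckpointer` are hypothetical names, and the in-memory store serialises to JSON only to mimic what a database-backed implementation would have to do.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Protocol


@dataclass
class Checkpoint:
    run_id: str
    step: int
    node: str
    state: dict  # must be JSON-serialisable in this sketch


class Checkpointer(Protocol):
    """The contract any storage backend would need to satisfy."""
    def save(self, cp: Checkpoint) -> None: ...
    def latest(self, run_id: str) -> Optional[Checkpoint]: ...
    def history(self, run_id: str) -> List[Checkpoint]: ...


class InMemoryCheckpointer:
    """Keeps rows as JSON strings, the way a Postgres or Redis backend might."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[str]] = {}

    def save(self, cp: Checkpoint) -> None:
        self._rows.setdefault(cp.run_id, []).append(json.dumps(asdict(cp)))

    def history(self, run_id: str) -> List[Checkpoint]:
        return [Checkpoint(**json.loads(row)) for row in self._rows.get(run_id, [])]

    def latest(self, run_id: str) -> Optional[Checkpoint]:
        rows = self.history(run_id)
        return rows[-1] if rows else None


store: Checkpointer = InMemoryCheckpointer()
store.save(Checkpoint("run-1", 0, "research", {"draft": ""}))
store.save(Checkpoint("run-1", 1, "review", {"draft": "v1"}))
resume_from = store.latest("run-1")  # where an interrupted run would pick up
```

If `history` is cheap to call and its rows are readable without the framework loaded, inspection and replay stay cheap too. If the rows are opaque blobs tied to one library version, migration is where the bill arrives.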

Observability asks whether an on-call engineer can reconstruct what happened from logs. Multi-agent failures are not single-call failures. They appear in causal chains across steps where the output of one step conditions the next, tool invocations and their interpretations are non-deterministic, and the failure mode is often a wrong handoff or a corrupted state field rather than an exception. LangSmith is the deepest integration for LangGraph stacks and adds virtually no measurable overhead. Langfuse is the strongest self-hosted option for teams with data residency requirements, at the cost of higher overhead — around fifteen percent in step-level instrumentation. Observability is the constraint that breaks teams who picked frameworks for prototype velocity and now cannot debug production.

Error semantics asks what happens when an agent fails mid-flow. Three sub-questions sit underneath: does the framework retry, does the framework hand off to a fallback agent, and does the framework surface the failure to a human approver. LangGraph treats this as graph design — you express retries, fallbacks, and human-in-the-loop nodes as edges. Microsoft Agent Framework ships the patterns natively across all orchestrations. CrewAI handles retries inside the task abstraction with less control over fallback routing. The OpenAI Agents SDK exposes approvals and tracing as part of the model-native harness. Custom forces you to design this explicitly, which is either the cost or the value depending on your compliance posture.
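The three sub-questions can be sketched as one small policy function. This is a hedged illustration rather than any framework's behaviour: `run_step`, `StepResult`, and the outcome labels are hypothetical, and a production version would add backoff between attempts and distinguish retryable from non-retryable errors.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    outcome: str            # "ok", "fallback_ok", or "needs_human"
    value: Optional[str]
    attempts: int           # attempts made against the primary agent
    error: Optional[str] = None


def run_step(
    primary: Callable[[], str],
    fallback: Optional[Callable[[], str]] = None,
    max_attempts: int = 3,
) -> StepResult:
    """Retry the primary, then try the fallback, then escalate to a human."""
    last_error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return StepResult("ok", primary(), attempt)
        except Exception as exc:  # a real system would narrow this
            last_error = repr(exc)
    if fallback is not None:
        try:
            return StepResult("fallback_ok", fallback(), max_attempts, last_error)
        except Exception as exc:
            last_error = repr(exc)
    # Nothing worked: surface the failure instead of swallowing it.
    return StepResult("needs_human", None, max_attempts, last_error)


calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("model timeout")
    return "done"


def always_fails() -> str:
    raise ValueError("bad tool output")


recovered = run_step(flaky)
degraded = run_step(always_fails, fallback=lambda: "cached answer")
escalated = run_step(always_fails)
```

Whichever framework you use, the useful exercise is to find where each of these three branches lives in it and who gets paged when the last one fires.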

Deployment model asks where the agents run. Single-process is the prototype. Distributed is what most teams build. Serverless is what scales without operational overhead — and is the deployment model Bedrock AgentCore was built for, with LangGraph multi-agent systems deployable directly onto the AgentCore Runtime with checkpointing into Bedrock Session Management Service or ElastiCache Valkey. The OpenAI Agents SDK integrates with Cloudflare, Modal, E2B, Vercel, and Temporal. CrewAI Enterprise runs the same code on the team's own infrastructure. Microsoft Agent Framework runs on Azure with the identity and isolation patterns of that ecosystem. Custom runs wherever the team already runs services. The question is not which deployment model is best. It is which deployment model your existing operational maturity can support.

Decision matrix

The matrix below maps each framework against the four constraints. Strengths are stated first, gaps second. Read it as a starting position for a conversation, not a verdict.

Framework	State management	Observability	Error semantics	Deployment model
CrewAI	In-crew state plus optional memory plug-ins. Fastest to start; least control over storage backend.	Improving but trails LangSmith for tracing depth.	Retries inside the task abstraction; fallback routing is awkward.	Self-host or CrewAI Enterprise. Single-process by default.
LangGraph	Typed state with pluggable checkpointers. Most flexible.	LangSmith integration is the deepest in the ecosystem. Langfuse is the self-hosted alternative.	Express retries, fallbacks, and human-in-the-loop as graph edges. Most explicit.	Self-host, LangServe, or AgentCore Runtime for serverless. Strongest deployment surface.
Microsoft Agent Framework	Checkpointing and pause/resume across orchestrations. Strong on Azure.	Tracing via OpenTelemetry plus Azure Monitor integration.	Patterns ship natively. Human-in-the-loop approvals built in.	Azure-native. .NET and Python parity.
OpenAI Agents SDK	Persistent sessions plus model-native run state.	Tracing in the model-native harness.	Approvals and handoffs first-class; guardrails ship in the SDK.	Sandbox providers include Cloudflare, Modal, E2B, Vercel, Daytona.
Custom (Python + Pydantic + your IaC)	Wherever your platform team already runs stateful services.	Whatever your platform team already runs for tracing.	Whatever your platform team designs explicitly.	Wherever your platform team already deploys services.

The pattern is clear when you read it column by column. LangGraph is the deepest on state and observability. Microsoft Agent Framework is the strongest on built-in orchestration patterns. The OpenAI Agents SDK is the cleanest if your tooling is OpenAI-first. CrewAI is the fastest from blank file to working prototype. Custom is the only honest answer for teams whose constraints do not match any framework's defaults.

The honest production architecture

Most enterprise teams we work with converge on a similar shape, regardless of which framework they started with. It is not the architecture any single framework documents. It is the architecture that survives the four constraints.

The state-machine layer is LangGraph. Not because it is fashionable but because the graph is legible to auditors, the typed state is legible to engineers, and the checkpointer choice (Postgres for self-host, Bedrock AgentCore Memory for AWS-native, ElastiCache Valkey for low-latency) is decoupled from the framework. Teams running on Azure substitute Microsoft Agent Framework here; OpenAI-first teams substitute the Agents SDK. The shape is the same: an explicit state machine, typed state, pluggable storage.

The tool-calling layer is custom Python. Tools are Pydantic-typed functions with explicit input and output contracts, registered into the agent runtime through the framework's tool-binding API. The reason this layer is custom even when the orchestration layer is not is that tools touch the team's existing services — auth, databases, secrets, external APIs — and the framework's tool abstractions are not allowed to dilute the contracts those services already enforce.
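A minimal sketch of what "explicit input and output contracts" can look like, assuming stdlib dataclasses in place of Pydantic and a hypothetical in-process registry. The names `Tool`, `REGISTRY`, `call_tool`, and the `lookup` tool are illustrative; a real stack would register the function through its framework's own tool-binding API instead.

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class LookupInput:
    account_id: str
    limit: int


@dataclass(frozen=True)
class LookupOutput:
    account_id: str
    rows: int


@dataclass(frozen=True)
class Tool:
    name: str
    input_type: type
    output_type: type
    fn: Callable[[Any], Any]


REGISTRY: Dict[str, Tool] = {}


def register(tool: Tool) -> None:
    REGISTRY[tool.name] = tool


def parse(cls: type, payload: dict) -> Any:
    """Reject unknown, missing, or mistyped fields before the tool runs."""
    expected = {f.name: f.type for f in fields(cls)}
    if set(payload) != set(expected):
        raise ValueError(f"{cls.__name__}: expected {sorted(expected)}, got {sorted(payload)}")
    for key, typ in expected.items():
        if not isinstance(payload[key], typ):
            raise TypeError(f"{cls.__name__}.{key} must be {typ.__name__}")
    return cls(**payload)


def call_tool(name: str, payload: dict) -> Any:
    tool = REGISTRY[name]
    result = tool.fn(parse(tool.input_type, payload))
    if not isinstance(result, tool.output_type):
        raise TypeError(f"{name} returned {type(result).__name__}")
    return result


def lookup(args: LookupInput) -> LookupOutput:
    # Placeholder for a call into an existing internal service.
    return LookupOutput(account_id=args.account_id, rows=min(args.limit, 3))


register(Tool("lookup", LookupInput, LookupOutput, lookup))
result = call_tool("lookup", {"account_id": "a1", "limit": 10})
```

The point of validating on both sides is that a model-generated argument payload never reaches the underlying service in a shape the service's own contract would reject.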

The observability surface is LangSmith for teams already on the LangChain ecosystem, Langfuse for teams with self-hosted data residency requirements, or homegrown tracing built on OpenTelemetry for teams whose platform stack already standardises on it. The constraint is not which tool. The constraint is that traces are correlatable from the user-facing request through every agent step, every tool call, every model invocation, and every state mutation.

The deployment substrate is one of three: Bedrock AgentCore Runtime for AWS-native serverless multi-agent deployments, Modal for teams that want serverless Python without committing to a cloud vendor's agent platform, or ECS Fargate (or its equivalent on other clouds) for teams that need long-running container processes with predictable cost. Each maps to a different operational posture. The choice belongs to the platform team, not the agent team.

This is the architecture that emerges, repeatedly, from teams that have stopped building demos and started running production. It is not a framework decision. It is four decisions, each independent, each owned by a different role.

What none of these solve out of the box

Four production concerns sit outside what any framework provides. They are exactly the concerns most likely to surface during a compliance review, a security audit, or a regulator's questionnaire.

Multi-tenant isolation. Frameworks ship with single-tenant defaults. Multi-tenancy demands five identity layers — trigger identity, execution identity, authorization identity, tenant identity, and session identity — with authentication happening before tenant resolution, tenant resolution before session derivation, and session loading before context compilation. Memory operations must be prefixed with tenant-scoped namespace paths. The framework does not enforce this. Your platform team does.
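The ordering and the namespace prefix can be sketched in a few lines. Everything here is hypothetical scaffolding (the token table, the `tenants/.../sessions/...` path layout, and the helper names), and it covers only the user, tenant, and session layers rather than all five identities. Read it as one possible shape under those assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional

TOKENS = {"tok-alice": "alice"}   # hypothetical token -> user table
USER_TENANT = {"alice": "acme"}   # hypothetical user -> tenant table
MEMORY: Dict[str, str] = {}       # stand-in for a shared memory store


@dataclass(frozen=True)
class RequestContext:
    user: str
    tenant: str
    session: str


def authenticate(token: str) -> str:
    if token not in TOKENS:
        raise PermissionError("unknown token")
    return TOKENS[token]


def build_context(token: str, conversation_id: str) -> RequestContext:
    user = authenticate(token)                      # 1. authentication
    tenant = USER_TENANT[user]                      # 2. tenant resolution
    session = f"{tenant}:{user}:{conversation_id}"  # 3. session derivation
    return RequestContext(user, tenant, session)


def memory_key(ctx: RequestContext, key: str) -> str:
    """Every memory operation goes through a tenant-scoped namespace path."""
    if "/" in key or ".." in key:
        raise ValueError("key must not contain path separators")
    return f"tenants/{ctx.tenant}/sessions/{ctx.session}/{key}"


def remember(ctx: RequestContext, key: str, value: str) -> None:
    MEMORY[memory_key(ctx, key)] = value


def recall(ctx: RequestContext, key: str) -> Optional[str]:
    return MEMORY.get(memory_key(ctx, key))


ctx = build_context("tok-alice", "c1")
remember(ctx, "goal", "reduce churn")
other_tenant = RequestContext("bob", "globex", "globex:bob:c1")
```

Because agent code only ever sees `remember` and `recall`, it has no way to name another tenant's path, which is the property a framework's single-tenant defaults do not give you.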

Secrets rotation in long-running flows. An agent run that lasts six hours may begin with one set of credentials and need to rotate to another mid-flow. None of the frameworks has a native answer. The pattern that works is short-lived tokens minted from a secrets broker per tool call, with the broker enforcing the rotation contract.
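A minimal sketch of the per-call minting pattern, assuming a hypothetical in-process `SecretsBroker` with an injectable clock so the six-hour gap can be simulated. A real broker would be an external service; the class, its methods, and the scope strings are illustrative only.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Token:
    value: str
    scope: str
    expires_at: float


class SecretsBroker:
    """Mints short-lived, scope-bound tokens and decides whether they are still valid."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._live: Dict[str, Token] = {}

    def mint(self, scope: str) -> Token:
        token = Token(secrets.token_urlsafe(16), scope, self._clock() + self._ttl)
        self._live[token.value] = token
        return token

    def is_valid(self, value: str, scope: str) -> bool:
        token = self._live.get(value)
        return (token is not None and token.scope == scope
                and self._clock() < token.expires_at)


def call_tool(broker: SecretsBroker, scope: str, tool: Callable[[Token], Token]) -> Token:
    # The agent never holds a long-lived credential: each call gets a fresh token.
    return tool(broker.mint(scope))


now = [0.0]  # fake clock so the example does not sleep
broker = SecretsBroker(ttl_seconds=60, clock=lambda: now[0])
first = call_tool(broker, "crm:read", lambda token: token)
now[0] += 6 * 3600  # six hours into the run
second = call_tool(broker, "crm:read", lambda token: token)
```

The run state never needs to store a credential, so a checkpoint restored hours later cannot resurrect an expired one.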

Principal scoping across handoffs. When agent A hands off to agent B, the effective authority of B must be the strict intersection of B's baseline permissions and the requesting user's permissions — never the union, never B's baseline alone. The runtime evaluates a tuple of claims (user, agent, scopes, audience, tenant, task ID, expiry) at every tool call. Frameworks do not enforce this intersection. Your authorization layer does.
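The intersection rule is easy to state and easy to get wrong, so a small sketch may help. The `Claims` fields follow the tuple named above; the helper names, scope strings, and the `authorize` checks are our assumptions about how an authorization layer might evaluate them.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Claims:
    user: str
    agent: str
    scopes: FrozenSet[str]
    audience: str
    tenant: str
    task_id: str
    expiry: float


def effective_scopes(agent_baseline: Iterable[str], user_scopes: Iterable[str]) -> FrozenSet[str]:
    """Intersection, never union: the agent cannot exceed the user, nor the user the agent."""
    return frozenset(agent_baseline) & frozenset(user_scopes)


def authorize(claims: Claims, *, required_scope: str, audience: str,
              tenant: str, now: float) -> bool:
    """Evaluated at every tool call, not once per run."""
    return (required_scope in claims.scopes
            and claims.audience == audience
            and claims.tenant == tenant
            and now < claims.expiry)


user_scopes = {"crm:read", "billing:read"}
agent_b_baseline = {"crm:read", "crm:write"}

claims_b = Claims(
    user="alice", agent="agent-b",
    scopes=effective_scopes(agent_b_baseline, user_scopes),
    audience="crm-api", tenant="acme", task_id="task-42", expiry=1000.0,
)

can_read = authorize(claims_b, required_scope="crm:read",
                     audience="crm-api", tenant="acme", now=10.0)
can_write = authorize(claims_b, required_scope="crm:write",
                      audience="crm-api", tenant="acme", now=10.0)
```

Here agent B's baseline includes `crm:write`, but the requesting user never had it, so the write is denied after the handoff even though B could perform it on its own authority.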

Audit-trail integration. Production agents act on behalf of identified users in regulated contexts. Every action with an external side-effect needs an audit record that ties the action to the requesting user, the approving agent, the prompt that led to the decision, and the model that generated it. Tracing tools capture the technical execution. Audit-trail integration is a separate concern, owned by compliance, and must be wired in explicitly.
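As a rough sketch of what such a record might carry, the example below ties one side-effecting action to the user, agent, prompt, and model. The field names, the choice to store a SHA-256 hash of the prompt instead of the prompt text, and the `trace_id` link back to tracing are all assumptions of ours; your compliance team owns the actual schema and retention rules.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AuditRecord:
    timestamp: str
    user: str            # the requesting user
    agent: str           # the agent that approved or performed the action
    action: str          # the external side-effect
    prompt_sha256: str   # assumption: hash only, full prompt kept elsewhere
    model: str
    trace_id: str        # link to the technical trace, which stays separate


AUDIT_LOG: List[str] = []  # stand-in for an append-only audit store


def record_action(*, user: str, agent: str, action: str,
                  prompt: str, model: str, trace_id: str) -> AuditRecord:
    record = AuditRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        user=user, agent=agent, action=action,
        prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        model=model, trace_id=trace_id,
    )
    AUDIT_LOG.append(json.dumps(asdict(record), sort_keys=True))
    return record


entry = record_action(
    user="alice", agent="refund-agent", action="refund:issue order=1234",
    prompt="Customer requests refund for order 1234", model="example-model",
    trace_id="trace-abc",
)
```

Calling `record_action` from inside the tool layer, immediately before the side-effect, keeps the audit write on the same code path as the action it describes.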

These are the concerns that decide whether a multi-agent system passes a procurement review for a bank or a regulated operator. None of them are framework features. They are platform-engineering work that sits underneath whatever framework you pick.

What this teaches us about enterprise scaling

The framework discourse trains attention on the wrong variable. Reading CrewAI against LangGraph against Microsoft Agent Framework against OpenAI Agents SDK as competing bets misses that the choice between them is, in production, a smaller decision than the four that surround it: where state lives, what observability you can sustain, how error semantics map to your operational posture, and what deployment substrate your platform team can run.

Enterprise scaling of multi-agent systems is not a framework problem. It is a platform-engineering problem dressed in framework language. The teams that scale these systems pick the framework that costs the least to integrate with the platform decisions they have already made — and then spend the real engineering effort on the four concerns no framework solves: tenancy, secrets, scoping, and audit. The framework choice is reversible. The platform decisions are not. Optimising for the reversible decision while leaving the irreversible ones implicit is the most common pattern of failure we see.

FAQs

Which framework should we start with for a new multi-agent system?

The wrong question. Start with the four constraints — where state lives, what observability you can sustain, how error semantics map to your operational posture, and what deployment substrate your platform team can run. The framework that costs the least to integrate against those decisions is the right one. CrewAI is fastest to prototype; LangGraph is the most defensible for stateful, auditable, long-running flows.

Is AutoGen still a viable choice?

No, not for new production work. Microsoft retired AutoGen into maintenance mode and shipped Agent Framework 1.0 for .NET and Python as the production-grade successor, with the AutoGen actor-model architecture, streaming, checkpointing, human-in-the-loop approvals, and pause/resume promoted to stable. AutoGen continues for research and prototyping but receives no new features. Likewise, Swarm became the OpenAI Agents SDK; Swarm itself is a reference design, not a deployment target.

When does "custom" make sense over a framework?

When the framework abstractions impose a cost you cannot recover: when the workflow does not match any framework's coordination model, when the audit and compliance surface needs to be inspectable line-by-line, or when the deployment substrate already imposes constraints (regulated environments, air-gapped infrastructure, specific identity providers) that the frameworks treat as edge cases. Custom in our usage means a thin state-machine layer in Python typed with Pydantic, on the team's existing IaC substrate. Not a from-scratch reimplementation.

What do frameworks not solve out of the box?

Multi-tenant isolation (five identity layers with authentication before tenant resolution before session derivation), secrets rotation across long-running flows (short-lived tokens minted per tool call from a secrets broker), principal scoping across handoffs (effective authority is the strict intersection of the agent's baseline and the user's permissions, never the union), and audit-trail integration that ties every external side-effect to the requesting user, approving agent, prompt, and model. These are platform-engineering work, not framework features.

What does the observability bar look like in production?

Traces must be correlatable from the user-facing request through every agent step, every tool call, every model invocation, and every state mutation. LangSmith is the deepest for LangGraph stacks with virtually no measurable overhead. Langfuse is the strongest self-hosted option for data-residency requirements at about fifteen percent step-level instrumentation overhead. The constraint is not which tool — it is that an on-call engineer can reconstruct what happened from logs at 02:00.

How to engage

The pattern we recommend is not the framework. It is the four-decision architecture underneath — state, observability, error semantics, deployment — designed before the framework choice and owned by the platform team that will run it. Talk to us at creativeminds.dev/contact.