MCP in Production: What the Spec Doesn't Tell You, and What We Wish We'd Known Before Shipping

It is three minutes past eight on a Wednesday evening and the on-call's phone is buzzing for the third time in twenty minutes. The MCP server has been in production for six weeks. The diagnostics are useless: no spike in CPU, no spike in memory until just before the failure, no obvious error in the logs. By the time she opens the laptop, the symptom has cleared. By the time she goes to bed, she has been paged twice more. By Friday she has identified the cause — a subscription cascade left open by clients that vanished without sending HTTP DELETE — and the fix takes one afternoon. The spec, which is otherwise a careful and considered document, never warned her this would happen. It could not have. The spec describes the wire format. The phone calls describe the operations.

The MCP specification is the closest thing the agent ecosystem has to a settled standard. The 2025-06-18 revision was the inflection point: it removed JSON-RPC batching, mandated OAuth 2.1 with Resource Indicators (RFC 8707), added Elicitation and Structured Tool Outputs, and required the negotiated protocol version in a header on every HTTP request. The November 2025 revision tightened a few more edges around server-sent events and tool annotations. Every major model client now speaks it natively. Every serious agent runtime ships MCP support. The protocol is, in the engineering sense, done.

What the spec gives you is the wire. What the spec does not give you is the road — the patterns that break first under load, the auth landmines, the resource leaks that present as memory pressure but are actually session lifecycles. Six months of running MCP servers and clients in production teaches you the rest, and what we wish we had known before shipping is what follows.

Key takeaways

Streamable HTTP only survives production load with sticky sessions or shared Redis session state — the spec mentions Mcp-Session-Id but does not warn you that proxies without session pinning fail intermittently and unprovably.
OAuth 2.1 with Resource Indicators (RFC 8707) and PKCE is roughly half the work of shipping a remote MCP server; budget for it accordingly, and test that your provider honours Resource Indicators because the wrong-server token failure mode is silent.
Sessions are resources that leak in three patterns — disappeared clients, orphaned SSE streams, and subscription cascades — and the third one looks like a memory issue but is actually connection-pool exhaustion at hour eight.
Tool design is the highest-leverage MCP work: use the new outputSchema field, treat descriptions as prompt engineering, do not trust destructiveHint as a security boundary, and adopt Elicitation to stop the model inventing parameters.
MCP is a wire format, not an architecture — the agent loop, the safety surface, the eval gate, and the kill switch are still your problem and live outside the protocol.

Figure 1: Transport pinning, three session leak patterns, OAuth as half the work. Top: the production topology, with a client reaching two MCP server replicas through a load balancer, the Mcp-Session-Id header pinning routing, the SSE buffering failure mode called out, and the protocol-version-on-every-request requirement. Middle: three session leak patterns (disappeared clients without HTTP DELETE, orphaned SSE streams holding request context, subscription cascades exhausting the connection pool at hour eight), each with symptom and fix. Bottom: three OAuth 2.1 traps (Resource Indicators per RFC 8707 with silent failure if misconfigured, Dynamic Client Registration with no clean ecosystem answer, PKCE on confidential clients).

A One-Way Door at the Transport Layer

The spec defines two transports. Stdio for local processes. Streamable HTTP with optional SSE for remote servers. Most production deployments need remote, and Streamable HTTP became the official remote transport in the March 2025 revision, replacing the old HTTP+SSE pattern. Servers can stream responses SSE-style or batch them, clients can resume disconnected streams, and the protocol behaves correctly under load balancers. Or at least, behaves correctly under load balancers if you have read the small print, which is mostly not in the document.

Two things the spec does not foreground, both of which surface in incident channels by week three of any production deployment.

The first is that Streamable HTTP works cleanly across load balancers only if you pin sessions. The protocol uses an Mcp-Session-Id header that the server sets on its first response to initialize and that the client must echo on every subsequent request. If your load balancer routes the same session ID to different backends, the second backend has no state and rejects with a 404. The fix is sticky sessions or shared session storage in Redis. The spec mentions session IDs. It does not warn you that production deployment without one of these two patterns will fail intermittently and unprovably under load — the kind of failure that disappears the moment you log in to debug it, like a creaking floorboard that goes silent when you crouch to find the loose nail.
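The difference between a shared session store and per-replica memory can be sketched in a few lines. This is a minimal illustration, not a server implementation: SharedSessionStore is a hypothetical in-memory stand-in for Redis, handle_request is an invented helper, and header lookup is simplified to an exact-case dictionary key.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple
import uuid


@dataclass
class SharedSessionStore:
    """Hypothetical stand-in for a shared store such as Redis."""
    _sessions: Dict[str, dict] = field(default_factory=dict)

    def create(self) -> str:
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {"initialized": True}
        return session_id

    def get(self, session_id: str) -> Optional[dict]:
        return self._sessions.get(session_id)


def handle_request(store: SharedSessionStore, headers: Dict[str, str]) -> Tuple[int, str]:
    """Return (status, body) for a non-initialize request on any replica."""
    session_id = headers.get("Mcp-Session-Id")
    if session_id is None:
        return 400, "missing Mcp-Session-Id"
    if store.get(session_id) is None:
        # Unknown session: the client has to initialize again.
        return 404, "unknown session"
    return 200, "ok"


shared = SharedSessionStore()
sid = shared.create()  # replica A answers initialize

# Replica B with the shared store recognises the session.
status_shared, _ = handle_request(shared, {"Mcp-Session-Id": sid})

# Replica B with only its own memory does not.
status_isolated, _ = handle_request(SharedSessionStore(), {"Mcp-Session-Id": sid})
```

With the shared store the second replica answers 200. With isolated memory it answers 404, which is the intermittent failure described above.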

The second is that SSE under streaming responses interacts poorly with proxies that buffer. Cloudflare's default behaviour, AWS Application Load Balancer with the wrong idle-timeout setting, nginx without proxy_buffering off — all three can buffer the SSE stream until the connection closes, which collapses streaming into batch. The symptom is that the client appears to hang for thirty seconds and then receives the whole response at once, like a tap that empties an entire bucket in one rush instead of pouring a steady stream. The fix is explicit no-buffer headers on the response, proxy configuration that respects them, and an ALB idle timeout raised beyond the length of your longest streaming response. None of this is in the spec because none of it is in scope. All of it is in your production incident channel by week three.
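As a rough sketch of the response side, the headers below are ones commonly used to discourage buffering. X-Accel-Buffering is the header nginx reads. Whether any other proxy in your path honours these is an assumption to verify per deployment. The helper names are illustrative.

```python
import json
from typing import Dict


def sse_response_headers() -> Dict[str, str]:
    """Headers for a streaming response that should not be buffered."""
    return {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache, no-transform",
        "X-Accel-Buffering": "no",  # nginx: disable buffering for this response
    }


def sse_frame(payload: dict, event_id: str) -> bytes:
    """Encode one JSON-RPC message as a single SSE event."""
    data = json.dumps(payload, separators=(",", ":"))
    # A blank line terminates the event, so the client can act on it at once.
    return f"id: {event_id}\ndata: {data}\n\n".encode("utf-8")


headers = sse_response_headers()
frame = sse_frame({"jsonrpc": "2.0", "id": 1, "result": {}}, event_id="1")
```

Headers alone do not fix a proxy that ignores them, so the proxy configuration and idle timeout still need checking end to end.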

Half the Work Lives in OAuth

The 2025-06-18 spec made OAuth 2.1 with PKCE the required authorization model for remote MCP servers that implement authorization, classified those servers as OAuth Resource Servers, and required Resource Indicators (RFC 8707) on token requests so tokens get scoped to specific servers. This is the right architecture. It is also the part of the implementation that takes the longest to get right, because OAuth providers in the real world honour standards approximately the way British weather honours the calendar.

The first trap is token audience confusion. A client wants to call several MCP servers — a Notion server, a Linear server, an internal data-tools server. Without Resource Indicators, the same access token could be replayed against any of them, and a compromised server could exfiltrate a token scoped for another. RFC 8707 fixes this by binding the token's audience to a specific resource. The problem is that many existing OAuth providers do not honour Resource Indicators correctly. Auth0, Okta, and Entra all support them, but the configuration is non-obvious and the failure mode — token accepted by the wrong server — is silent. Test this explicitly. The bug that does not announce itself is the bug that ships.
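One way to make the wrong-server failure loud is an explicit audience check on the resource server. The sketch below assumes the token has already been decoded and that its signature, issuer, and expiry were verified elsewhere. THIS_RESOURCE is a hypothetical canonical URI, and audience_matches is an illustrative helper rather than any provider's API.

```python
from typing import Mapping

# Hypothetical canonical URI of this MCP server.
THIS_RESOURCE = "https://mcp.example.com/data-tools"


def audience_matches(claims: Mapping[str, object], expected: str = THIS_RESOURCE) -> bool:
    """True only if the token's `aud` claim names this server.

    Assumes signature, issuer, and expiry were verified upstream.
    """
    aud = claims.get("aud")
    if isinstance(aud, str):
        audiences = [aud]
    elif isinstance(aud, (list, tuple)):
        audiences = [a for a in aud if isinstance(a, str)]
    else:
        audiences = []  # a missing `aud` is a failure, not a pass
    return expected in audiences


own_token = {"aud": THIS_RESOURCE}
foreign_token = {"aud": "https://mcp.example.com/notion"}
```

A test that presents foreign_token to this server and expects a rejection is the explicit check recommended above.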

The second is Dynamic Client Registration. The spec strongly recommends DCR (RFC 7591) so clients can register against new servers automatically. Most consumer OAuth providers do not allow public-client DCR by default. The workaround is a server-side proxy that pre-registers clients, which re-introduces the secret-management problem DCR was meant to solve. We landed on a per-tenant client registry with admin-level DCR, functional but not what the spec optimises for. There is no clean answer here yet. The ecosystem will get there. Today, you are choosing between two compromises.

The third is PKCE on confidential clients. The spec mandates PKCE even for confidential clients, and some providers reject this as a configuration error. The workaround is to configure the provider for "authorisation code with PKCE" explicitly rather than "authorisation code" with a PKCE flag. A minor difference in two menu items that turns into a half-day of debugging if you do not know to look for it.
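For reference, the client side of PKCE with the S256 method is small enough to sketch with the standard library. The function names are illustrative. The derivation itself follows RFC 7636.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    # 32 random bytes give a 43-character URL-safe string,
    # the minimum length RFC 7636 allows.
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) without padding, per RFC 7636."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
```

The client sends the challenge with code_challenge_method=S256 on the authorisation request and the verifier on the token request. A provider configured for the wrong flow typically rejects one of the two.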

The summary is simple. OAuth is half the work of shipping a remote MCP server. Budget for it accordingly.

Sessions Are Resources, and Resources Leak

The session ID model means every connected client holds server-side state — tool calls in flight, subscription state, cached metadata, the open SSE stream. The spec lets sessions terminate explicitly via an HTTP DELETE to the session endpoint or implicitly via timeout. In production, neither end of the protocol terminates cleanly as often as the spec writers might have hoped. Sessions are like coats in a cloakroom: most are picked up, some get forgotten, and a few are claimed by people who never put them there in the first place.

Three leak patterns surface, in roughly increasing order of subtlety.

The first is clients that disappear without DELETE. Browser closed, process killed, network partition. The session lives on the server until the inactivity timer fires. Set the timer too high and sessions accumulate. Set it too low and legitimate idle clients — an IDE waiting for the next user prompt — get dropped and have to re-initialise. Tune to the workload. We landed on 30 minutes for IDE clients, 5 minutes for ephemeral agent runs.
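A per-workload inactivity timer might look like the following sketch. The two timeout values come from the paragraph above. The client class names, the Session shape, and expired_sessions are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Timeouts in seconds; the class names are illustrative.
IDLE_TIMEOUT = {"ide": 30 * 60, "ephemeral": 5 * 60}


@dataclass
class Session:
    client_class: str
    last_seen: float  # timestamp of the last request, in seconds


def expired_sessions(sessions: Dict[str, Session], now: float) -> List[str]:
    """Return ids of sessions idle for longer than their class allows."""
    return [
        sid
        for sid, s in sessions.items()
        if now - s.last_seen > IDLE_TIMEOUT[s.client_class]
    ]


sessions = {
    "a": Session("ide", last_seen=0.0),
    "b": Session("ephemeral", last_seen=0.0),
}
stale_after_10_min = expired_sessions(sessions, now=10 * 60)
stale_after_31_min = expired_sessions(sessions, now=31 * 60)
```

After ten idle minutes only the ephemeral session is reaped. The IDE session survives until it passes thirty.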

The second is orphaned SSE streams. A client opens a stream, the upstream LLM call hangs, and the client times out at the HTTP level, but the server's SSE iterator never finds out. The stream sits in the event loop holding a reference to the request context, so the garbage collector cannot collect it and memory grows. The fix is aggressive write timeouts on the server's SSE stream plus cancellation propagated from the original request context. Both are easy to forget. Both end up on the postmortem.
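A minimal asyncio sketch of that fix: put a timeout on every write to the stream and release the request context in a finally block, so a timeout, a cancellation, or normal completion all clean up. pump_events and its parameters are hypothetical names, and a real server would plug in its framework's write and disconnect hooks.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Tuple


async def pump_events(
    events: AsyncIterator[bytes],
    write: Callable[[bytes], Awaitable[None]],
    write_timeout: float,
    on_close: Callable[[], None],
) -> str:
    """Forward events to the client; stop on a stalled write or cancellation."""
    try:
        async for chunk in events:
            await asyncio.wait_for(write(chunk), timeout=write_timeout)
        return "completed"
    except asyncio.TimeoutError:
        return "write_timeout"
    finally:
        on_close()  # runs on timeout, cancellation, and normal completion


async def _demo() -> Tuple[str, List[str]]:
    closed: List[str] = []

    async def events() -> AsyncIterator[bytes]:
        yield b"data: 1\n\n"

    async def stalled_write(_: bytes) -> None:
        await asyncio.sleep(10)  # simulates a client that stopped reading

    outcome = await pump_events(
        events(), stalled_write, 0.01, lambda: closed.append("closed")
    )
    return outcome, closed


outcome, closed = asyncio.run(_demo())
```

In the demo the stalled write trips the timeout and the cleanup callback still runs, which is the property that keeps the request context from lingering.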

The third is the subscription cascade, the one that woke our on-call at 20:03 on Wednesday. A resource subscription (created with resources/subscribe and delivered as notifications/resources/updated) holds server-side handles to upstream systems. When the session goes away without an explicit unsubscribe, those handles stay open. The leak presents as a memory issue. The actual cause is connection-pool exhaustion: after eight hours of accumulated dead subscriptions, new sessions cannot acquire downstream connections, and the server starts rejecting work. The fix is session-cleanup hooks that release every subscription and downstream connection on session end, plus an audit log that lets you reconstruct which subscriptions belonged to which session. The fix takes an afternoon to implement. Finding the cause takes a week.
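One possible shape for those cleanup hooks uses contextlib.ExitStack to collect a release callback for everything a session opens. SessionResources, its method names, and the audit log tuple format are assumptions for illustration, not part of any MCP SDK.

```python
from contextlib import ExitStack
from typing import Callable, Dict, List, Optional, Tuple


class SessionResources:
    """Tracks what each session opened so it can all be released together."""

    def __init__(self) -> None:
        self._stacks: Dict[str, ExitStack] = {}
        self.audit_log: List[Tuple[str, str, Optional[str]]] = []

    def register(self, session_id: str, name: str, release: Callable[[], None]) -> None:
        stack = self._stacks.setdefault(session_id, ExitStack())
        stack.callback(release)
        self.audit_log.append((session_id, "subscribed", name))

    def end_session(self, session_id: str) -> None:
        stack = self._stacks.pop(session_id, None)
        if stack is not None:
            stack.close()  # runs every release callback, last in, first out
            self.audit_log.append((session_id, "released_all", None))


released: List[str] = []
resources = SessionResources()
resources.register("s1", "orders-feed", lambda: released.append("orders-feed"))
resources.register("s1", "db-conn", lambda: released.append("db-conn"))
resources.end_session("s1")  # call from the DELETE handler and the idle reaper
resources.end_session("s1")  # a second call is a no-op
```

Calling end_session from every termination path (explicit DELETE, inactivity timeout, stream failure) is what closes the leak. The audit log records which subscriptions belonged to which session.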

Tool Design Is Where the User Lives

The spec defines the shape of tools/list and tools/call. It does not tell you how to design tools that an LLM can use reliably, and this is the single highest-leverage piece of MCP work and the one most often under-invested. A well-written tool description is to an LLM what a clear road sign is to a driver — the difference between confident progress and a U-turn.

Output schemas matter as much as input schemas. The 2025-06-18 revision added Structured Tool Outputs via the outputSchema field on the tool definition. Use it. LLMs do better with structured outputs they can compose with than with prose blobs they have to re-parse. A tool that returns { "results": [{ "id": 1, "name": "Acme Corp" }], "total": 1 } lets the agent reason about the result with the structure already present. A tool that returns "Found 1 result: Acme Corp" forces the agent to re-extract the structure it has just been handed, and to do so every single time.
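A tool definition carrying an outputSchema, and a matching result, could look roughly like this. The tool name search_companies and the tool_result helper are hypothetical. The field names inputSchema, outputSchema, structuredContent, and content follow the 2025-06-18 tool shapes.

```python
import json

search_tool = {
    "name": "search_companies",  # hypothetical tool
    "description": "Search companies by name. Returns matching ids and names.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    "outputSchema": {
        "type": "object",
        "properties": {
            "results": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {"type": "integer"},
                        "name": {"type": "string"},
                    },
                    "required": ["id", "name"],
                },
            },
            "total": {"type": "integer"},
        },
        "required": ["results", "total"],
    },
}


def tool_result(structured: dict) -> dict:
    """Build a tools/call result carrying both structured and text forms."""
    return {
        "structuredContent": structured,
        # Serialized copy for clients that only read text content.
        "content": [{"type": "text", "text": json.dumps(structured)}],
    }


result = tool_result({"results": [{"id": 1, "name": "Acme Corp"}], "total": 1})
```

Returning the serialized text alongside the structured form keeps older clients working while newer ones read structuredContent directly.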

Tool descriptions should optimise for the LLM, not for the human reading the documentation. The description field is in the prompt. Treat it as prompt engineering: concrete examples, the typical use case in one sentence, the failure modes the model should expect. Wall-of-text descriptions inflate the prompt without improving model behaviour. Sharp ones reduce both prompt size and tool-call error rate. It is the difference between writing for the manual and writing for the cockpit.

Annotations are useful, but treat destructiveHint and readOnlyHint with suspicion. The June 2025 spec lets servers annotate tools with these flags plus idempotentHint. The spec is explicit that they are advisory and not security boundaries. We have seen agents that treat them as policy. They are not policy. If a tool is genuinely destructive, gate it behind explicit approval: approval gates on agent actions and OPA policy checks are the load-bearing layer here, not the annotation. The annotation is the warning sign on the cliff edge. It is not the fence.
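To make the distinction concrete, here is a sketch of a gate that decides from an operator-maintained policy table and never reads the annotations. The tool names, the POLICY table, and may_execute are all hypothetical. A production gate would more likely call out to a policy engine.

```python
from typing import Dict

# Operator-maintained policy; tool names are hypothetical.
POLICY: Dict[str, str] = {
    "list_projects": "allow",
    "delete_project": "require_approval",
}


def may_execute(tool_name: str, annotations: Dict[str, bool], approved: bool) -> bool:
    """Decide from policy alone; `annotations` is deliberately not consulted."""
    rule = POLICY.get(tool_name, "deny")  # unknown tools are denied by default
    if rule == "allow":
        return True
    if rule == "require_approval":
        return approved
    return False


# A destructive tool whose server mislabels it as read-only.
mislabelled = {"readOnlyHint": True, "destructiveHint": False}
blocked = may_execute("delete_project", mislabelled, approved=False)
allowed = may_execute("delete_project", mislabelled, approved=True)
```

The mislabelled annotations make no difference to the outcome, which is the point: the hint can inform a UI, but the policy table is what gates execution.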

Elicitation is the most under-used feature in the protocol. The June 2025 spec added Elicitation, which lets servers request additional information from the user mid-interaction. This is the right pattern for "I need approval before doing this," "what timezone are you in," "did you mean Q4 2025 or Q4 2026." It avoids the agent inventing an answer to fill a gap in the user's prompt — the way a doctor asks the patient a clarifying question rather than guessing the symptom. Most server implementations have not yet adopted Elicitation. Adopting it pays for itself in fewer hallucinated parameters within a few weeks of production traffic.
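On the wire, an elicitation is a JSON-RPC request from server to client. The sketch below builds one for the timezone case and reads the reply. The helper names and the example values are illustrative. The method name elicitation/create and the action values follow the June 2025 revision.

```python
from typing import Optional


def elicitation_request(request_id: int, message: str, schema: dict) -> dict:
    """Build an elicitation/create request asking the user for input."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "elicitation/create",
        "params": {"message": message, "requestedSchema": schema},
    }


def resolved_value(response: dict, field: str) -> Optional[str]:
    """Return the user's answer, or None if they declined or cancelled."""
    result = response.get("result", {})
    if result.get("action") != "accept":
        return None
    return result.get("content", {}).get(field)


request = elicitation_request(
    7,
    "Which timezone should the report use?",
    {
        "type": "object",
        "properties": {"timezone": {"type": "string"}},
        "required": ["timezone"],
    },
)

accepted = {
    "jsonrpc": "2.0",
    "id": 7,
    "result": {"action": "accept", "content": {"timezone": "Europe/London"}},
}
declined = {"jsonrpc": "2.0", "id": 7, "result": {"action": "decline"}}
```

Treating decline and cancel as "no answer" rather than substituting a default is what keeps the server from inventing the parameter on the user's behalf.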

Version Negotiation as a Contract

The June 2025 revision moved the protocol version from negotiation-only to required on every HTTP request, carried in the MCP-Protocol-Version header. This sounds defensive. It is the difference between intermittent compatibility breakage and reliable rejection. The pattern that works: clients send the version they negotiated at initialize, servers either accept and respond or reject with a 400 and an explicit error. No silent downgrades. No mismatched behaviour mid-session.
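A server-side check for that contract might be sketched as follows. SUPPORTED_VERSIONS and check_protocol_version are hypothetical names, header lookup is simplified to an exact-case key, and falling back to the negotiated version when the header is absent is a design choice in this sketch.

```python
from typing import Mapping, Optional, Tuple

# Versions this replica actually implements (illustrative values).
SUPPORTED_VERSIONS = {"2025-03-26", "2025-06-18"}


def check_protocol_version(
    headers: Mapping[str, str], negotiated: str
) -> Tuple[int, Optional[str]]:
    """Return (status, error) for the version carried on a request."""
    # If the header is absent, fall back to what initialize negotiated.
    version = headers.get("MCP-Protocol-Version", negotiated)
    if version not in SUPPORTED_VERSIONS:
        return 400, f"unsupported protocol version: {version}"
    if version != negotiated:
        return 400, f"version {version} does not match negotiated {negotiated}"
    return 200, None


ok = check_protocol_version({"MCP-Protocol-Version": "2025-06-18"}, "2025-06-18")
too_new = check_protocol_version({"MCP-Protocol-Version": "2025-11-25"}, "2025-06-18")
mismatch = check_protocol_version({"MCP-Protocol-Version": "2025-03-26"}, "2025-06-18")
```

The too_new case is the one worth exercising before a partial upgrade: an older replica should answer 400 with an explicit message rather than attempt the request.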

The operational corollary is uncomfortable for teams running partial upgrades. If multiple server versions sit behind a load balancer, the older versions need to reject newer protocol versions cleanly rather than pretend to support them. Test the rejection paths. They are the failure mode that surfaces only when you do a partial upgrade, which is exactly when you do not want to be debugging the protocol layer.

The Spec Does the Architecture, Operations Does the Rest

MCP is the cleanest agent-tool wire format the ecosystem has produced. It is also operationally demanding in ways the spec does not advertise. Shipping it in production takes the same discipline as shipping any other production protocol: session lifecycle, transport tuning, auth that survives real OAuth providers, observability that lets you triage at 02:00. The spec does the architectural work. Operations does the rest, the way the architect draws the building and the plumber decides which valve sits behind which wall.

Two takeaways hold up for teams adopting MCP now.

Standardise on Streamable HTTP for everything remote, sticky sessions for everything stateful, and an OAuth provider you have configured correctly for Resource Indicators. The combinations of these three that work are fewer than the combinations on offer, and the gap between the working set and the not-quite-working set is the gap between a smooth quarter and a series of midweek pages.

Treat MCP as a substrate for tool composition, not as the destination architecture. The agent loop, the safety surface, the eval gate, the kill switch — these are not in the spec because they are not protocol concerns. They are your concerns. The protocol is a wire format. The architecture is everything else, and the difference between the two is the difference between the on-call who sleeps through Wednesday night and the one who does not.

FAQs

What changed in the 2025-06-18 MCP revision and why does it matter for production?

The June 2025 revision removed JSON-RPC batching, mandated OAuth 2.1 with PKCE and Resource Indicators (RFC 8707), added Structured Tool Outputs and Elicitation, and pinned the protocol version on every HTTP request via the MCP-Protocol-Version header. The combined effect is that compatibility breakage is loud and explicit instead of silent, auth is finally tight enough to use across multi-server clients, and the wire format gives the LLM structured data it can reason about.

Why do streaming MCP responses hang under load balancers and proxies?

Cloudflare default behaviour, AWS ALB with the wrong idle-timeout, and nginx without proxy_buffering off all buffer the SSE stream until the connection closes, which collapses streaming into batch and looks like a thirty-second hang. Fix this with explicit no-buffer response headers, proxy configuration that respects them, and ALB idle timeout raised beyond your longest streaming response.

How do you stop MCP sessions from leaking server resources?

Tune the inactivity timer to the workload (we run 30 minutes for IDE clients, 5 minutes for ephemeral agent runs), set aggressive write timeouts on the server's SSE streams with cancellation propagated from the original request, and add session-cleanup hooks that release every subscription and downstream connection on session end. The subscription cascade is the leak that surfaces at hour eight as connection-pool exhaustion, not as memory pressure.

Are destructiveHint and readOnlyHint safe to use as security boundaries?

No. The June 2025 spec is explicit that these annotations are advisory, not policy. Treat them as model-hint metadata. For genuinely destructive tools, gate execution behind explicit approval workflows: approval gates on agent actions and OPA policy checks are the load-bearing layer, not the annotation.

Should we adopt Elicitation in our MCP servers?

Yes, and it is the most under-used feature in the protocol. Elicitation lets a server request additional information from the user mid-interaction — the right primitive for approval prompts, ambiguity resolution, and clarifying missing parameters. Without it, the model invents an answer to fill the gap. Adopting it pays for itself in fewer hallucinated parameters within a few weeks of production traffic.

How to engage

If your team is shipping MCP servers or clients and hitting the operational edges, we have shipped this in production and can shortcut the learning. Talk to us at creativeminds.dev/contact.