The MCP specification is now the closest thing the agent ecosystem has to a settled standard. The 2025-06-18 revision was the inflection point: it removed JSON-RPC batching, mandated OAuth 2.1 with Resource Indicators (RFC 8707), added Elicitation and Structured Tool Outputs, and required the negotiated protocol version in a header on every HTTP request. The November 2025 revision tightened a few more edges around server-sent events and tool annotations. Every major model client now speaks it natively. Every serious agent runtime ships MCP support.
What the spec gives you is the wire format. What the spec does not give you is the operational shape — the patterns that break first under load, the auth landmines, the resource leaks that look like memory pressure but are actually session lifecycles. This is what six months of running MCP servers and clients in production teaches you, and what we wish we had known before shipping.
Key takeaways
- Streamable HTTP only survives production load with sticky sessions or shared Redis session state. The spec mentions Mcp-Session-Id but does not warn you that proxies without session pinning fail intermittently and in ways that are hard to reproduce.
- OAuth 2.1 with Resource Indicators (RFC 8707) and PKCE is roughly half the work of shipping a remote MCP server; budget for it accordingly, and test that your provider honours Resource Indicators because the wrong-server token failure mode is silent.
- Sessions are resources that leak in three patterns — disappeared clients, orphaned SSE streams, and subscription cascades — and the third one looks like a memory issue but is actually connection-pool exhaustion at hour eight.
- Tool design is the highest-leverage MCP work: use the new outputSchema field, treat descriptions as prompt engineering, do not trust destructiveHint as a security boundary, and adopt Elicitation to stop the model inventing parameters.
- MCP is a wire format, not an architecture — the agent loop, the safety surface, the eval gate, and the kill switch are still your problem and live outside the protocol.
Transport choice is a one-way door
The spec defines two transports. Stdio for local processes. Streamable HTTP with optional SSE for remote servers. Most production deployments need remote. Streamable HTTP became the official remote transport in the March 2025 revision, replacing the old HTTP+SSE pattern. Servers can stream a response over SSE or return it as a single JSON body, clients can resume disconnected streams, and the protocol is designed to work behind load balancers.
Two things the spec does not foreground.
One. Streamable HTTP works cleanly across load balancers only if you pin sessions. The protocol uses an Mcp-Session-Id header that is set on the server's first response to initialize and must be echoed back on every subsequent request from that client. If your load balancer routes the same session ID to different backends, the second backend has no state and rejects with 404. The fix is sticky sessions or shared session storage in Redis. The spec mentions session IDs; it does not warn you that a production deployment without one of these two patterns will fail intermittently under load, in ways that are hard to reproduce.
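A minimal sketch of the shared-storage option, assuming a key-value backend that every replica can reach. The in-memory class stands in for Redis so the example runs on its own; the key prefix, function names, and the 400 for a missing header are illustrative choices, not taken from any SDK.

```python
import json
import secrets
from typing import Optional, Protocol


class SessionBackend(Protocol):
    """Minimal key-value interface; a Redis client could sit behind it."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryBackend:
    """Stand-in for shared storage so the sketch runs without Redis."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def handle_initialize(backend: SessionBackend) -> dict[str, str]:
    """Create a session and return the header the client must echo back."""
    session_id = secrets.token_urlsafe(32)
    backend.set(f"mcp:session:{session_id}", json.dumps({"initialized": True}))
    return {"Mcp-Session-Id": session_id}


def handle_request(headers: dict[str, str], backend: SessionBackend) -> int:
    """Return the HTTP status a replica using this backend would give."""
    session_id = headers.get("Mcp-Session-Id")
    if session_id is None:
        return 400  # non-initialize request without a session header
    if backend.get(f"mcp:session:{session_id}") is None:
        return 404  # this replica cannot find the session: client must re-initialize
    return 200
```

Two replicas that share one backend both answer 200 for the same session ID. A replica with its own private backend answers 404, which is the intermittent failure described above.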
Two. SSE under streaming responses interacts poorly with the proxies in front of it. Cloudflare's default behaviour and nginx without proxy_buffering off can both buffer the SSE stream until the connection closes, which collapses streaming into batch. An AWS Application Load Balancer with too short an idle timeout fails differently: it closes any stream that stays quiet for longer than the timeout. Symptom of buffering: the client appears to hang for thirty seconds, then receives the whole response at once. Fix: explicit no-buffer headers on the response, proxy configuration that respects them, ALB idle timeout raised beyond the longest streaming response. None of this is in the spec because none of it is in scope. All of it is in your production incident channel by week three.
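The response-header side of that fix can be sketched as follows. X-Accel-Buffering: no is the per-response switch nginx honours; whether other proxies respect these headers is something to verify in your own stack, and the function names are hypothetical.

```python
import json
from typing import Optional


def sse_headers() -> dict[str, str]:
    """Headers that ask intermediaries not to buffer or rewrite the stream."""
    return {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache, no-transform",
        "X-Accel-Buffering": "no",  # nginx: disable proxy buffering for this response
    }


def format_sse_event(payload: dict, event_id: Optional[str] = None) -> bytes:
    """Serialise one JSON payload as a single SSE event."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")  # lets a client resume from this event
    lines.append(f"data: {json.dumps(payload, separators=(',', ':'))}")
    # A blank line terminates the event; flush the socket after writing it.
    return ("\n".join(lines) + "\n\n").encode("utf-8")
```

Headers alone do not help if the application server or framework buffers its own output, so flushing after each event is part of the same fix.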
OAuth 2.1 is the hardest part of the protocol
The 2025-06-18 spec made OAuth 2.1 with PKCE mandatory for remote MCP servers, classified them as OAuth Resource Servers, and required Resource Indicators (RFC 8707) on token requests to scope tokens to specific servers. This is the right architecture. It is also the part of the implementation that takes longest to get right.
Three traps we hit:
Token audience confusion. A client wants to call multiple MCP servers — a Notion server, a Linear server, an internal data-tools server. Without Resource Indicators, the same access token could be replayed against any of them, and a compromised server could exfiltrate a token scoped for another. RFC 8707 fixes this by binding the token's audience to a specific resource. The trap: many existing OAuth providers do not honour Resource Indicators correctly. Auth0, Okta, and Entra all support them, but the configuration is non-obvious, and the failure mode (token accepted by the wrong server) is silent. Test this explicitly.
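One way to test this explicitly is to check both ends: the client sends the resource parameter, and the server refuses tokens whose audience is not itself. The sketch below assumes the token is a JWT whose signature has already been verified and that the provider puts the resource URI in the aud claim. The hostnames and function names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit


def canonical_resource(uri: str) -> str:
    """Lower-case scheme and host; drop trailing slash, query, and fragment."""
    parts = urlsplit(uri)
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def authorization_query(client_id: str, redirect_uri: str,
                        resource: str, code_challenge: str) -> str:
    """Query string for the authorization request, including RFC 8707 `resource`."""
    return urlencode({
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "resource": canonical_resource(resource),
        "code_challenge": code_challenge,
        "code_challenge_method": "S256",
    })


def token_is_for_this_server(claims: dict, this_server: str) -> bool:
    """Resource-server check: accept only tokens whose audience names us."""
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    expected = canonical_resource(this_server)
    return any(canonical_resource(a) == expected for a in audiences)
```

A useful integration test mints a token for one server and replays it against another. If the second server accepts it, the provider is not honouring Resource Indicators or the server is not checking the audience.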
Dynamic Client Registration. The spec strongly recommends DCR (RFC 7591) so clients can register against new servers automatically. Most consumer OAuth providers do not allow public-client DCR by default. The workaround is a server-side proxy that pre-registers clients, but this re-introduces the secret-management problem DCR was meant to solve. We landed on a per-tenant client registry with admin-level DCR, which is functional and not what the spec optimises for. There is no clean answer here yet.
PKCE on confidential clients. The spec mandates PKCE even for confidential clients. Some providers reject this as a configuration error. Workaround: configure the provider for "authorisation code with PKCE" explicitly rather than "authorisation code" with a PKCE flag.
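For reference, the client side of PKCE is small. This sketch generates a verifier and derives the S256 challenge as RFC 7636 defines it; the function names are our own.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """32 random bytes encode to 43 URL-safe characters, the RFC 7636 minimum."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) with the padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client sends the challenge with the authorization request and the verifier with the token request. A provider that rejects either on a confidential client is the configuration problem described above.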
The summary is simple. OAuth is half the work of shipping an MCP server. Budget for it accordingly.
Sessions are resources, and resources leak
The session ID model means every connected client holds server-side state. Tool calls in flight, subscription state, cached metadata, the open SSE stream. The spec lets sessions terminate explicitly via an HTTP DELETE to the session endpoint or implicitly via timeout. In production, neither end of the protocol terminates cleanly as often as you would hope.
Three leak patterns we hit, in order of how subtle they are.
One: clients that disappear without DELETE. Browser closed, process killed, network partition. The session lives on the server until the inactivity timer fires. If the timer is set too high, sessions accumulate. If it is set too low, legitimate idle clients (an IDE waiting for the next user prompt) get dropped and have to re-initialise. Tune to the workload. We landed on 30 minutes for IDE clients, 5 minutes for ephemeral agent runs.
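A per-class inactivity sweep can be sketched like this. The timeout values are the ones quoted above; the client-class labels, the class names, and the idea of running sweep on a periodic timer are assumptions for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Timeouts from the text above; the client-class labels are hypothetical.
IDLE_TIMEOUT_SECONDS = {"ide": 30 * 60, "ephemeral": 5 * 60}


@dataclass
class Session:
    client_class: str
    last_seen: float


class SessionTable:
    def __init__(self) -> None:
        self._sessions: dict[str, Session] = {}

    def touch(self, session_id: str, client_class: str,
              now: Optional[float] = None) -> None:
        """Record activity; call on every request that carries the session ID."""
        now = time.monotonic() if now is None else now
        self._sessions[session_id] = Session(client_class, now)

    def sweep(self, now: Optional[float] = None) -> list[str]:
        """Drop sessions idle past their class timeout and return their IDs."""
        now = time.monotonic() if now is None else now
        expired = [
            sid for sid, s in self._sessions.items()
            if now - s.last_seen > IDLE_TIMEOUT_SECONDS[s.client_class]
        ]
        for sid in expired:
            del self._sessions[sid]  # real code would also release resources here
        return expired
```

Passing now explicitly keeps the sweep deterministic in tests; in production the monotonic clock default applies.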
Two: orphaned SSE streams. A client opens a stream, the upstream LLM call hangs, the client times out at HTTP level, and the server's SSE iterator never finds out. The stream sits in the event loop holding a reference to a request context, the GC will not collect it, and memory grows. The fix is aggressive write timeouts on the server's SSE stream plus context cancellation propagation from the original request. Both are easy to forget.
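A sketch of the write-timeout half using asyncio, assuming the framework hands you an awaitable write callable. pump_events and its parameters are hypothetical names, not part of any MCP SDK.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def pump_events(
    events: AsyncIterator[bytes],
    write: Callable[[bytes], Awaitable[None]],
    write_timeout: float,
) -> str:
    """Forward events until the source ends or a single write stalls."""
    try:
        async for chunk in events:
            # A stalled client surfaces here as a timeout, not a parked coroutine.
            await asyncio.wait_for(write(chunk), timeout=write_timeout)
        return "completed"
    except asyncio.TimeoutError:
        return "write_timeout"
    finally:
        # Close the upstream generator so its cleanup runs and references drop.
        # This also runs when the surrounding task is cancelled.
        aclose = getattr(events, "aclose", None)
        if aclose is not None:
            await aclose()
```

The cancellation half is wiring: when the framework reports a client disconnect, cancel the task running pump_events, and the finally block closes the upstream iterator. A hung upstream call needs its own timeout inside the event source.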
Three: subscription cascades. A resources/subscribe subscription, the kind that delivers notifications/resources/updated events, holds server-side handles to upstream systems. When the session goes away without an explicit unsubscribe, those handles stay open. This is the leak that looks like a memory issue but is actually connection-pool exhaustion. Symptom: after eight hours, new sessions fail to acquire downstream connections. Fix: session-cleanup hooks that release every subscription and connection on session end, plus an audit log that lets you reconstruct which subscriptions belonged to which session.
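One way to build such cleanup hooks is a per-session contextlib.ExitStack that collects a release callback for every handle the session acquires. This is a sketch under that assumption; the class name, the resource names, and the list used as an audit log are hypothetical.

```python
from contextlib import ExitStack
from typing import Callable


class SessionResources:
    """Tracks every handle a session acquires so one call releases them all."""

    def __init__(self, session_id: str, audit_log: list) -> None:
        self.session_id = session_id
        self._audit_log = audit_log
        self._stack = ExitStack()

    def track(self, name: str, release: Callable[[], None]) -> None:
        """Register a release callback at the moment the handle is acquired."""
        self._audit_log.append((self.session_id, "acquired", name))
        self._stack.callback(self._release, name, release)

    def _release(self, name: str, release: Callable[[], None]) -> None:
        try:
            release()
        finally:
            self._audit_log.append((self.session_id, "released", name))

    def close(self) -> None:
        """Run every registered callback, most recently acquired first."""
        self._stack.close()
```

Call close from every session-end path: explicit DELETE, the inactivity sweep, and stream cancellation. ExitStack keeps running the remaining callbacks even if one of them raises.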
Tool design is where the user experience lives
The spec defines the shape of tools/list and tools/call. It does not tell you how to design tools that an LLM can use reliably. This is the single highest-leverage piece of MCP work and the one most often under-invested.
Patterns we converged on.
Output schemas matter as much as input schemas. The 2025-06-18 revision added Structured Tool Outputs — outputSchema on the tool definition. Use it. LLMs do better with structured outputs they can compose with than with prose blobs they have to re-parse. A tool that returns { "results": [{ "id": 1, "name": "Acme Corp" }], "total": 1 } lets the agent reason about the result; a tool that returns "Found 1 result: Acme Corp" forces the agent to re-extract structure.
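As an illustration, here is a tool definition with an outputSchema and a result that carries the same data in structuredContent. The tool name, description, and fields are hypothetical. The serialised text block is included for clients that do not read structured content.

```python
import json

# Hypothetical tool definition, shaped like an entry in a tools/list response.
SEARCH_TOOL = {
    "name": "search_accounts",
    "description": "Search CRM accounts by name. Returns at most 20 matches.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    "outputSchema": {
        "type": "object",
        "properties": {
            "results": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "id": {"type": "integer"},
                        "name": {"type": "string"},
                    },
                    "required": ["id", "name"],
                },
            },
            "total": {"type": "integer"},
        },
        "required": ["results", "total"],
    },
}


def tool_result(structured: dict) -> dict:
    """Build a tools/call result with structured data plus a text fallback."""
    return {
        "content": [{"type": "text", "text": json.dumps(structured)}],
        "structuredContent": structured,
        "isError": False,
    }
```

Calling tool_result({"results": [{"id": 1, "name": "Acme Corp"}], "total": 1}) gives the agent the structured form from the example above without losing the text form.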
Tool descriptions should optimise for the LLM, not for human documentation. The description field is in the prompt. Treat it as prompt engineering. Concrete examples, the typical use case in one sentence, the failure modes the model should expect. Wall-of-text descriptions inflate the prompt without improving model behaviour; sharply-written descriptions reduce both prompt size and tool-call error rate.
Annotations are useful, but treat destructiveHint and readOnlyHint with suspicion. Since the March 2025 revision, servers can annotate tools with destructiveHint, readOnlyHint, and idempotentHint. The spec is explicit that these are advisory and not security boundaries. We have seen agents that treat them as policy. They are not policy. If a tool is genuinely destructive, gate it behind explicit approval. The agent-action-approval-gates and OPA-for-agent-action patterns are the load-bearing layer here, not the annotation.
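A sketch of what "not the annotation" means in code: the gate consults a policy the operator owns and never reads the hints. The tool names, the allowlist, and the default-deny choice are assumptions for illustration, not the approval-gate patterns named above.

```python
from dataclasses import dataclass
from typing import Optional

# Operator-owned allowlist (hypothetical): tools that may run unattended.
NO_APPROVAL_NEEDED = frozenset({"search_accounts"})


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def authorize_call(tool_name: str, annotations: dict,
                   approved_by: Optional[str]) -> Decision:
    """Decide from policy alone; `annotations` is accepted but never consulted."""
    if tool_name in NO_APPROVAL_NEEDED:
        return Decision(True, "allowlisted by policy")
    if approved_by is None:
        # Default deny: anything off the allowlist waits for a human.
        return Decision(False, "approval required")
    return Decision(True, f"approved by {approved_by}")
```

A tool that claims readOnlyHint: true still waits for approval unless the policy allowlists it, and a tool that claims destructiveHint: true is not blocked on that basis alone.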
Elicitation is the most under-used feature in the protocol. The June 2025 spec added Elicitation — servers can request additional information from the user mid-interaction. This is the right pattern for "I need approval before doing this," "what timezone are you in," "did you mean Q4 2025 or Q4 2026." It avoids the agent inventing an answer to fill a gap in the user's prompt. Most server implementations have not yet adopted Elicitation. Adopting it pays for itself in fewer hallucinated parameters.
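The quarter example can be sketched as an elicitation/create request and a handler for the reply. The message text and helper names are hypothetical; the request and response fields follow the 2025-06-18 schema as we understand it.

```python
from typing import Optional

ALLOWED_QUARTERS = ("Q4 2025", "Q4 2026")


def build_elicitation_request(request_id: int) -> dict:
    """JSON-RPC request a server sends to ask the user one flat question."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "elicitation/create",
        "params": {
            "message": "Which quarter did you mean?",
            "requestedSchema": {
                "type": "object",
                "properties": {
                    "quarter": {"type": "string", "enum": list(ALLOWED_QUARTERS)},
                },
                "required": ["quarter"],
            },
        },
    }


def quarter_from_response(result: dict) -> Optional[str]:
    """Return the chosen quarter, or None if the user declined or cancelled."""
    if result.get("action") != "accept":
        return None
    quarter = (result.get("content") or {}).get("quarter")
    # Re-validate: the reply crosses a trust boundary like any other input.
    return quarter if quarter in ALLOWED_QUARTERS else None
```

When the handler returns None, the tool should stop or report the gap rather than fall back to a guessed default, which is the behaviour Elicitation is meant to replace.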
Versioning the protocol on every request
The June 2025 revision moved the protocol version from negotiation-only to required on every HTTP request, in the MCP-Protocol-Version header. This sounds merely defensive, but it is the difference between intermittent compatibility breakage and reliable rejection. The pattern that works: clients send the version they negotiated at initialize, and servers either accept (returning a response) or reject with 400 and an explicit error. No silent downgrades, no mismatched behaviour mid-session.
The operational corollary: if you are running multiple server versions in production behind a load balancer, the older versions need to reject newer protocol versions cleanly, not pretend to support them. Test the rejection paths. They are the failure mode that surfaces only when you do a partial upgrade.
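A rejection path that is easy to test can be as small as the sketch below. The supported-version tuple is an example for one hypothetical server build, and the fallback to 2025-03-26 when the header is absent reflects our reading of the spec's backwards-compatibility guidance.

```python
# Versions this (hypothetical) server build implements.
SUPPORTED_VERSIONS = ("2025-03-26", "2025-06-18")
# Version assumed when a client sends no header at all.
FALLBACK_VERSION = "2025-03-26"


def check_protocol_version(headers: dict, supported=SUPPORTED_VERSIONS):
    """Return (status, detail): 200 with the version in force, or 400 with an error."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    version = lowered.get("mcp-protocol-version")
    if version is None:
        return 200, FALLBACK_VERSION
    if version not in supported:
        return 400, f"Unsupported MCP-Protocol-Version: {version}"
    return 200, version
```

During a partial upgrade, run the same checks against the old build with the new version string and confirm it returns 400 rather than a response that merely looks plausible.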
What this teaches us about agent infrastructure
MCP is the cleanest agent-tool wire format the ecosystem has produced. It is also operationally honest in a way the spec doesn't always advertise: shipping it in production demands the same discipline as shipping any other production protocol — session lifecycle, transport tuning, auth that survives real OAuth providers, observability that lets you triage at 02:00. The spec does the architectural work. Operations does the rest.
Two takeaways for teams adopting MCP now.
One. Standardise on Streamable HTTP for everything remote, sticky sessions for everything stateful, and an OAuth provider you have configured correctly for Resource Indicators. The combinations of these three that work are fewer than the combinations on offer.
Two. Treat MCP as a substrate for tool composition, not as the destination architecture. The agent loop, the safety surface, the eval gate, the kill-switch — these are not in the MCP spec because they are not protocol concerns. They are your concerns. The protocol is a wire format. The architecture is everything else.
FAQs
What changed in the 2025-06-18 MCP revision and why does it matter for production?
The June 2025 revision removed JSON-RPC batching, mandated OAuth 2.1 with PKCE and Resource Indicators (RFC 8707), added Structured Tool Outputs and Elicitation, and pinned the protocol version on every HTTP request via the MCP-Protocol-Version header. The combined effect is that compatibility breakage is loud and explicit instead of silent, auth is finally tight enough to use across multi-server clients, and the wire format gives the LLM structured data it can reason about.
Why do streaming MCP responses hang under load balancers and proxies?
Cloudflare's default behaviour and nginx without proxy_buffering off both buffer the SSE stream until the connection closes, which collapses streaming into batch and looks like a thirty-second hang. An AWS ALB with too short an idle timeout closes streams that go quiet. Fix this with explicit no-buffer response headers, proxy configuration that respects them, and ALB idle timeout raised beyond your longest streaming response.
How do you stop MCP sessions from leaking server resources?
Tune the inactivity timer to the workload (we run 30 minutes for IDE clients, 5 minutes for ephemeral agent runs), set aggressive write timeouts on the server's SSE stream with context cancellation propagation, and add session-cleanup hooks that release every subscription and downstream connection on session end. The subscription cascade is the leak that surfaces at hour eight as connection-pool exhaustion, not as memory pressure.
Are destructiveHint and readOnlyHint safe to use as security boundaries?
No. The June 2025 spec is explicit that these annotations are advisory, not policy. Treat them as model-hint metadata. For genuinely destructive tools, gate execution behind explicit approval workflows — the agent-action-approval-gates and OPA-for-agent-action patterns are the load-bearing layer, not the annotation.
Should we adopt Elicitation in our MCP servers?
Yes, and it is the most under-used feature in the protocol. Elicitation lets a server request additional information from the user mid-interaction — the right primitive for approval prompts, ambiguity resolution, and clarifying missing parameters. Without it, the model invents an answer to fill the gap. Adopting it pays for itself in fewer hallucinated parameters within a few weeks of production traffic.
Companion content
- Agent Action Approval Gates
- OPA for AI Agent Action Approval
- Self-Improving Agents: Production Pattern
- Multi-Agent Orchestration: CrewAI vs LangGraph vs Custom
- OpenClaw Architecture: MCP and RAG
How to engage
If your team is shipping MCP servers or clients and hitting the operational edges, we have shipped this in production and can shortcut the learning. Talk to us at creativeminds.dev/contact.
