Short Answer

Use Server-Sent Events (SSE), not WebSockets, unless you specifically need bidirectional streams. Optimize for time-to-first-token — users perceive latency by when text starts appearing, not when it finishes. Buffer tool call arguments across chunks and only parse the JSON when the tool call completes. Send an explicit end-of-stream signal so the client can distinguish success from a dropped connection. And test your streaming with realistic proxies and load balancers in the path — many streaming bugs only surface in production because middleware buffers responses that streamed fine locally.

Streaming is the difference between an LLM app that feels fast and one that feels broken. A user watching text appear starting at 400 milliseconds thinks the model is quick, even if the full response takes 12 seconds to complete. A user staring at a spinner for 5 seconds waiting for a full response — and then getting it instantly — thinks the model is slow, even though the total wait was shorter.

The perception gap is enormous. This guide covers how to build streaming that actually works in production: which transport to pick, how to structure the server, how to handle tool calls that arrive as their own stream, how to survive dropped connections, and the specific mistakes that cause “streaming works locally, breaks in prod.”

Why Streaming Wins on Perceived Latency

LLM generation is autoregressive — the model produces one token at a time, sequentially. The total time to generate a 500-token response is roughly the same whether you stream or not. What changes is when the user sees the first token.

Compare two experiences of the same 8-second generation:

| | Non-streamed | Streamed |
|---|---|---|
| Time to first visible text | ~8s (full response arrives) | ~400ms (first tokens) |
| Time to full response | ~8s | ~8s |
| Perceived speed | Slow: frustrating spinner | Fast: instant feedback |
| Failure UX | User waits, then error | User sees partial output, retry from context |

The wall-clock time is identical. The perceived experience is not. This is why every serious chat interface streams — not because the model is faster, but because users experience the model as faster.

SSE vs WebSockets vs Chunked HTTP: Pick SSE

Three transports can stream LLM output: Server-Sent Events, WebSockets, and plain chunked HTTP responses. Nine times out of ten, you want SSE.

| Transport | Direction | When to use |
|---|---|---|
| SSE (Server-Sent Events) | Server → Client | Default for chat UIs. Native browser support via EventSource. Auto-reconnect. Works through most proxies. Simple to implement. |
| WebSockets | Bidirectional | Only when you need bidirectional streams, e.g. voice-in / voice-out interfaces, collaborative editing with LLM in the loop. |
| Chunked HTTP | Server → Client | Good for non-browser clients (mobile SDKs, CLI tools) or when you don’t need per-event framing. Lower ceremony than SSE. |

Why SSE is the default:

WebSockets have real use cases: voice interfaces, live collaboration, anything where the client also needs to stream input to the server. But they introduce more failure modes: reconnection semantics, message ordering, connection state, keep-alives. Don’t reach for WebSockets unless you actually need bidirectional streams.

Watch out — the “works locally, broken in prod” classic

The single most common streaming bug: middleware buffers the response instead of streaming it. Your local dev server streams fine; behind an older Nginx config or certain CDN setups, the response is buffered until complete and the client sees no streaming at all. Fixes: set the X-Accel-Buffering: no response header (honored by Nginx), disable proxy buffering at every hop (proxy_buffering off in Nginx), and verify with curl -N against the production endpoint so that curl’s own output buffering does not hide the result.
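As a rough illustration of the header side of that fix, here is a minimal sketch of the response headers and event framing for an SSE endpoint. The helper names sseHeaders and formatSseEvent are hypothetical, and the header set is an assumption about a typical Nginx-fronted deployment, not a complete list for every proxy or CDN.

```javascript
// Sketch only: sseHeaders and formatSseEvent are hypothetical helper names.

function sseHeaders() {
  return {
    "Content-Type": "text/event-stream",
    // Discourage caches and intermediaries from storing or rewriting the body.
    "Cache-Control": "no-cache, no-transform",
    // Nginx-specific: asks the proxy not to buffer this response.
    "X-Accel-Buffering": "no",
  };
}

function formatSseEvent(event, data) {
  // One SSE event is an "event:" line, a "data:" line, then a blank line.
  // JSON.stringify never emits raw newlines, so a single data line is enough.
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Usage inside a Node http handler (not executed here):
//   res.writeHead(200, sseHeaders());
//   res.write(formatSseEvent("message", { delta: "Hello" }));
```

Headers only help with proxies that honor them, so the curl check against production is still the real test.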

Optimize for Time-to-First-Token, Not Total Time

The most important metric in an LLM UI is the time between the user submitting a prompt and the first visible token. Total generation time barely moves user perception once streaming has started. The first-token delay dominates.

Common causes of slow first-token latency:

Tip

Send a lightweight “stream started” event as soon as your handler receives the connection — before the first model token arrives. This tells the client to render a “thinking” indicator with useful information (which tool is being called, which model is running) instead of just spinning. It also confirms the connection is healthy end-to-end.
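A minimal sketch of that tip, assuming the model output is available as an async iterable of token strings. The wrapper name withStartEvent, the event shapes, and the ttftMs field are all hypothetical; the wrapper also records time-to-first-token on the first message so the metric can be logged.

```javascript
// Sketch only: withStartEvent and the event shapes are hypothetical.

async function* withStartEvent(tokenStream, meta, now = Date.now) {
  const startedAt = now();
  // Sent before any model output so the client can render a status line.
  yield { event: "started", data: meta };

  let first = true;
  for await (const token of tokenStream) {
    if (first) {
      first = false;
      // Attach time-to-first-token to the first message only.
      yield { event: "message", data: { delta: token, ttftMs: now() - startedAt } };
    } else {
      yield { event: "message", data: { delta: token } };
    }
  }
}
```

The clock is injectable (the now parameter) purely so the timing can be tested deterministically.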

Streaming Tool Calls: Buffer Arguments, Parse at the End

Tool calls (also called function calls) arrive as their own streamed chunks, not as a single “here’s the tool call” message. The model streams the tool name, then the arguments as fragments of a JSON string, and finally a completion signal.

The mistake almost every team makes on their first LLM streaming implementation: trying to JSON-parse partial argument strings mid-stream. That gives you a parse error on every chunk except the last one, which either crashes your handler or generates a wall of error logs.

The correct pattern:

    // Accumulate tool call chunks by index, then parse once the call completes.
    // Assumes the chunk shape used in this guide: choices[0].delta.tool_calls.
    async function handleToolCalls(stream, executeTool) {
      const toolCalls = {}; // keyed by tool_call.index

      for await (const chunk of stream) {
        const choice = chunk.choices?.[0];
        if (!choice) continue; // skip chunks that carry no choice

        for (const tc of choice.delta?.tool_calls ?? []) {
          const idx = tc.index;
          if (!toolCalls[idx]) toolCalls[idx] = { name: '', args: '' };
          if (tc.function?.name) toolCalls[idx].name += tc.function.name;
          if (tc.function?.arguments) toolCalls[idx].args += tc.function.arguments;
        }

        if (choice.finish_reason === 'tool_calls') {
          // Now, and only now, parse arguments and execute tools
          for (const tc of Object.values(toolCalls)) {
            const parsed = JSON.parse(tc.args);
            await executeTool(tc.name, parsed);
          }
        }
      }
    }

You can still update the UI as tool call chunks arrive — showing the tool name and a “calling...” state, for example. You just cannot execute the tool until the arguments string is complete and valid JSON. Trying to parse and execute before completion is how you get race conditions and half-run tools that corrupt your app state.

End-of-Stream Signals and Broken Connections

The default failure mode of streaming: the connection drops mid-generation and the user sees a half-generated response with no error. That’s a terrible UX and one of the most common streaming bugs in production LLM apps.

Fixing it requires three specific things:

1. Send an explicit end-of-stream event

The client cannot tell the difference between “the model finished generating and closed the connection normally” and “the connection dropped mid-generation” without a signal from the server. Send one.

SSE format: the server sends a final “done” event after the last message event.

    event: message
    data: {"delta": "final tokens here"}

    event: done
    data: {"reason": "stop", "usage": {"total_tokens": 421}}

On the client, treat “connection closed without a done event” as a broken stream. Show the partial output with a clear error state and a retry option.
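One way the client side of this could look is sketched below. EventSource only issues GET requests, so clients that POST a prompt often read the SSE body through fetch and parse the framing themselves. The helper createSseParser is hypothetical; it assumes "\n" line endings, JSON payloads, and only the event and data fields.

```javascript
// Sketch only: createSseParser is a hypothetical helper.

function createSseParser(onEvent) {
  let buffer = "";
  let sawDone = false;

  return {
    // Feed raw text as it arrives; chunk boundaries can fall anywhere.
    push(text) {
      buffer += text;
      let sep;
      while ((sep = buffer.indexOf("\n\n")) !== -1) {
        const block = buffer.slice(0, sep);
        buffer = buffer.slice(sep + 2);

        let event = "message"; // default event name when none is given
        const dataLines = [];
        for (const line of block.split("\n")) {
          if (line.startsWith("event:")) event = line.slice(6).trim();
          else if (line.startsWith("data:")) dataLines.push(line.slice(5).trim());
        }
        if (dataLines.length === 0) continue; // comment or keep-alive block

        if (event === "done") sawDone = true;
        onEvent(event, JSON.parse(dataLines.join("\n")));
      }
    },
    // Call when the connection closes; false means the stream was cut short.
    finishedCleanly() {
      return sawDone;
    },
  };
}
```

When the connection closes and finishedCleanly() returns false, the UI can keep the partial text and show the error state with a retry option.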

2. Handle timeouts on the client

Sometimes the server dies without dropping the TCP connection cleanly. The client sees an open connection with no incoming data. Set a per-chunk timeout — if you haven’t received a token in N seconds, treat the stream as broken and surface the error. Fifteen to thirty seconds is a reasonable default for most models; adjust based on your provider’s tail latency.
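A per-chunk timeout can be sketched as a wrapper around any async iterable of chunks. The name withIdleTimeout is hypothetical, and the timeout value is whatever you choose for your provider; this is one possible shape, not the only one.

```javascript
// Sketch only: withIdleTimeout is a hypothetical helper name.

async function* withIdleTimeout(source, idleMs) {
  const iterator = source[Symbol.asyncIterator]();
  try {
    while (true) {
      let timer;
      const timeout = new Promise((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`no chunk received for ${idleMs} ms`)),
          idleMs
        );
      });

      let result;
      try {
        // Whichever settles first wins: the next chunk or the idle timer.
        result = await Promise.race([iterator.next(), timeout]);
      } finally {
        clearTimeout(timer);
      }

      if (result.done) return;
      yield result.value;
    }
  } finally {
    // Ask the underlying stream to clean up. Not awaited, because a
    // stalled source may never settle this call.
    iterator.return?.()?.catch?.(() => {});
  }
}
```

The timer restarts for every chunk, so a slow but steadily producing model does not trip it; only a gap longer than idleMs does.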

3. Preserve partial output for retry

When a stream breaks after producing some output, the user has already seen partial content. Retrying from scratch will produce different output (LLMs are non-deterministic) and feel jarring. Two patterns:

Which pattern is right depends on whether your users are producing short chat messages (restart is fine) or long-form content like code (retry-from-context is better).

Backpressure and Cancellation

LLM streaming is producer-driven — the model produces tokens as fast as it can, and your handler pushes them to the client. If the client is slow to consume, tokens back up in memory. If the user navigates away, the model is still generating and you’re paying for tokens nobody will read.

Cancellation on client disconnect

Every LLM streaming handler should propagate client disconnects to the model. If the user closes the tab, your server should cancel the in-flight generation immediately. In Node this is a 'close' listener such as request.on('close'); in Python with Starlette or FastAPI it’s awaiting request.is_disconnected(); in Go it’s the request context’s Done() channel (r.Context().Done()). If you don’t propagate cancellation, you pay for hundreds of tokens the user never saw, and that cost multiplies across concurrent users on your inference bill.
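For the Node case, a minimal sketch might turn the disconnect into an AbortSignal that is passed to the model call. The helper abortOnDisconnect is hypothetical. Listening on the response object rather than the request is a choice made for this example: the response's 'close' event fires both after a normal end and after a dropped connection, so the sketch checks writableEnded to tell the two apart.

```javascript
// Sketch only: abortOnDisconnect is a hypothetical helper. `res` is a Node
// http.ServerResponse, or anything with .on("close") and .writableEnded.

function abortOnDisconnect(res) {
  const controller = new AbortController();
  res.on("close", () => {
    // "close" also fires after a normal res.end(); only abort when the
    // response was still open at the moment the connection went away.
    if (!res.writableEnded) controller.abort();
  });
  return controller.signal;
}

// Usage (callModel is a placeholder for your provider client):
//   const signal = abortOnDisconnect(res);
//   for await (const token of callModel(prompt, { signal })) { ... }
```

Whether aborting actually stops billing depends on the provider client honoring the signal, which is worth verifying for the SDK you use.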

Backpressure with an event queue

If your streaming layer accumulates events faster than the network can send them, memory fills up. In practice this rarely happens with a single user on a single stream, because network throughput is far higher than model throughput. But if you’re fanning out one generation to multiple consumers (broadcast chat, multi-user session), backpressure matters. Bound the buffer size and drop old events on overflow — a stale token is worse than a missed one.
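The bounded, drop-oldest buffer described above can be sketched per consumer as follows. BoundedEventBuffer is a hypothetical class name, and the dropped counter is an addition of this example so a consumer can tell that it missed events.

```javascript
// Sketch only: BoundedEventBuffer is a hypothetical class name.

class BoundedEventBuffer {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.dropped = 0; // how many events were discarded for this consumer
  }

  push(event) {
    this.items.push(event);
    if (this.items.length > this.capacity) {
      this.items.shift(); // discard the oldest event
      this.dropped += 1;
    }
  }

  // Remove and return everything currently buffered, oldest first.
  drain() {
    const out = this.items;
    this.items = [];
    return out;
  }
}
```

Each slow consumer gets its own buffer, so one stalled connection cannot grow memory without bound or hold back the others.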

Streaming Behind Workers and Queues

The naive pattern — HTTP request → worker → LLM stream → back to HTTP response — only works if your request handler holds the HTTP connection open the entire time. That doesn’t scale past a few hundred concurrent users and doesn’t survive worker restarts. Real production systems decouple the connection from the worker.

Two production patterns:

| Pattern | How it works | When to use |
|---|---|---|
| Pub/sub streaming layer | Long-lived connection layer subscribes to a channel per generation ID. Workers publish token events to the channel; the connection layer forwards them. | Interactive chat where sub-second latency matters |
| Durable event log | Worker writes tokens to a fast KV or event log (Redis Streams, Kafka, similar). Client subscribes and reads from position N. | Long-running jobs where resume-from-position matters more than latency |

Both patterns let workers restart without killing user sessions, let you horizontally scale the connection layer independently of workers, and let you support resume-from-position for very long generations. Both are also real infrastructure work — many teams reach for a managed streaming platform or a durable workflow engine to avoid building this themselves.
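To make resume-from-position concrete, here is a small in-memory stand-in for the durable event log pattern. GenerationLog and its method names are hypothetical, and a real system would back this with a persistent store rather than a Map; the sketch only shows the append and read-from-N contract.

```javascript
// Sketch only: GenerationLog is a hypothetical, in-memory stand-in.

class GenerationLog {
  constructor() {
    this.events = new Map(); // generationId -> array of events
  }

  // Worker side: append an event and return its position in the log.
  append(generationId, event) {
    if (!this.events.has(generationId)) this.events.set(generationId, []);
    const log = this.events.get(generationId);
    log.push(event);
    return log.length - 1;
  }

  // Client side: read everything at or after `position`, plus the
  // position to ask for next time.
  readFrom(generationId, position) {
    const log = this.events.get(generationId) ?? [];
    return { events: log.slice(position), next: log.length };
  }
}
```

A client that reconnects sends the last next value it saw and receives only the events it missed, regardless of which worker produced them.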

Rendering Streamed Markdown and Code Blocks

Rendering plain text as it streams is trivial — just append to a paragraph. Rendering markdown and syntax-highlighted code as it streams is harder, because partial markdown is often malformed.
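One way to avoid feeding malformed partial markdown to the renderer is to hold back the unfinished last line and render only complete lines as markdown. The sketch below assumes that approach; createLineBuffer is a hypothetical helper, and it does not handle constructs that span several lines, such as a code fence that has opened but not yet closed.

```javascript
// Sketch only: createLineBuffer is a hypothetical helper name.

function createLineBuffer() {
  let complete = ""; // text up to and including the last newline
  let pending = "";  // the line still being streamed

  return {
    push(chunk) {
      pending += chunk;
      const lastNewline = pending.lastIndexOf("\n");
      if (lastNewline !== -1) {
        complete += pending.slice(0, lastNewline + 1);
        pending = pending.slice(lastNewline + 1);
      }
      // Render `complete` as markdown and `pending` as plain text.
      return { complete, pending };
    },
    // Call at end of stream: whatever is pending becomes final.
    flush() {
      complete += pending;
      pending = "";
      return { complete, pending };
    },
  };
}
```

The renderer then re-parses complete on each update and appends pending as unformatted text until its newline arrives.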

The common bugs:

The patterns that work:

Observability for Streaming: What to Log

You cannot fix streaming bugs you can’t see. Baseline instrumentation for every LLM stream:

Instrument these before you have a problem. Instrumentation added reactively after a production incident tends to be missing the exact dimension you need for debugging.

The Five Mistakes That Break Streaming in Production

  1. Middleware buffering the response. Works locally, breaks behind Nginx / CDN / load balancer. Disable buffering explicitly at every layer.
  2. Parsing partial tool call JSON. Buffer arguments across chunks. Parse only when the tool call completes.
  3. No end-of-stream signal. The client can’t tell success from a broken connection. Send an explicit done event.
  4. No client disconnect propagation. Users navigate away, model keeps generating, you pay for tokens nobody sees. Cancel on disconnect.
  5. Rendering markdown chunk-by-chunk. Half-open code fences and lists render as garbage. Line-buffer the final line; render complete lines as markdown.

Streaming Is UX Infrastructure

Streaming isn’t about making generation faster. Total tokens per second is unchanged. Streaming is about collapsing the “is anything happening?” window from seconds to milliseconds. That’s a UX transformation, and the difference between a chat app users trust and one they leave. Get the transport right, get the tool call handling right, and get the end-of-stream signal right — and streaming stops being the thing that breaks in production.

Frequently Asked Questions

Why should I stream LLM responses instead of returning them all at once?
Perceived latency. Total generation time is the same either way, but streaming collapses time-to-first-visible-text from seconds to milliseconds. Users experience streaming apps as far faster even when the wall-clock generation is identical. Streaming also handles long responses that would time out at the HTTP layer, and gives you a natural cancellation point when the user navigates away.
Should I use Server-Sent Events (SSE) or WebSockets?
Default to SSE. Simpler (one-way server-to-client), works natively in browsers via EventSource, degrades through HTTP proxies cleanly. WebSockets are overkill unless you specifically need bidirectional interaction like a voice interface. For 90% of chat UIs, SSE is the right default. Common SSE gotcha: some middleware buffers responses — you have to explicitly disable buffering at every layer.
How do I stream tool calls from an LLM?
Tool calls arrive as their own streamed chunks. The model streams the tool name, then the arguments as fragments of a JSON string, then a completion signal. Accumulate arguments across chunks and parse them only when the tool call completes. Never try to JSON-parse partial argument strings: the accumulated string is not valid JSON until the final fragment arrives, so every earlier attempt fails.
What happens if the stream breaks mid-response?
Default failure mode is that the user sees a half-generated response with no error. Fix in three parts: send an explicit end-of-stream signal so the client knows when generation completed cleanly, detect broken streams (connection close without end-of-stream, or a chunk timeout) and show a retry option, and optionally implement resume-from-position for long responses.
How do I handle streaming behind a queue or worker?
Naive “HTTP request → worker → LLM stream → back to HTTP response” only works at small scale and doesn’t survive worker restarts. Production patterns decouple the connection from the worker: either a long-lived connection layer that subscribes to a pub/sub channel per generation ID, or a durable event log that clients read from position N. Building this yourself is real infrastructure work — many teams use a managed streaming SDK or workflow platform to skip it.