Short Answer
Use Server-Sent Events (SSE), not WebSockets, unless you specifically need bidirectional streams. Optimize for time-to-first-token — users perceive latency by when text starts appearing, not when it finishes. Buffer tool call arguments across chunks and only parse the JSON when the tool call completes. Send an explicit end-of-stream signal so the client can distinguish success from a dropped connection. And test your streaming with realistic proxies and load balancers in the path — many streaming bugs only surface in production because middleware buffers responses that streamed fine locally.
Streaming is the difference between an LLM app that feels fast and one that feels broken. A user watching text appear starting at 400 milliseconds thinks the model is quick, even if the full response takes 12 seconds to complete. A user staring at a spinner for 5 seconds waiting for a full response — and then getting it instantly — thinks the model is slow, even though the total wait was shorter.
The perception gap is enormous. This guide covers how to build streaming that actually works in production: which transport to pick, how to structure the server, how to handle tool calls that arrive as their own stream, how to survive dropped connections, and the specific mistakes that cause “streaming works locally, breaks in prod.”
Why Streaming Wins on Perceived Latency
LLM generation is autoregressive — the model produces one token at a time, sequentially. The total time to generate a 500-token response is roughly the same whether you stream or not. What changes is when the user sees the first token.
Compare two experiences of the same 8-second generation:
| Non-streamed | Streamed | |
|---|---|---|
| Time to first visible text | ~8s (full response arrives) | ~400ms (first tokens) |
| Time to full response | ~8s | ~8s |
| Perceived speed | Slow — frustrating spinner | Fast — instant feedback |
| Failure UX | User waits, then error | User sees partial output, retry from context |
The wall-clock time is identical. The perceived experience is not. This is why every serious chat interface streams — not because the model is faster, but because users experience the model as faster.
SSE vs WebSockets vs Chunked HTTP: Pick SSE
Three transports can stream LLM output: Server-Sent Events, WebSockets, and plain chunked HTTP responses. Nine times out of ten, you want SSE.
| Transport | Direction | When to use |
|---|---|---|
| SSE (Server-Sent Events) | Server → Client | Default for chat UIs. Native browser support via EventSource. Auto-reconnect. Works through most proxies. Simple to implement. |
| WebSockets | Bidirectional | Only when you need bidirectional streams — e.g. voice-in / voice-out interfaces, collaborative editing with LLM in the loop. |
| Chunked HTTP | Server → Client | Good for non-browser clients (mobile SDKs, CLI tools) or when you don’t need per-event framing. Lower ceremony than SSE. |
Why SSE is the default:
- Zero client-side dependencies. Every browser has
EventSource. On the server, it’s just an HTTP response withContent-Type: text/event-streamand specific formatting. - Auto-reconnect. Browsers automatically reconnect a dropped SSE stream. You still need to handle mid-generation cutoffs cleanly on the app layer, but the transport gives you resilience for free.
- Proxy-friendly. Because SSE looks like a normal HTTP response, most CDN and load-balancer configurations pass it through. WebSockets often require explicit upgrades and TCP-level plumbing.
- Debuggable. You can hit an SSE endpoint with curl and see the raw event stream. WebSockets require a WebSocket client to inspect.
WebSockets have real use cases — voice interfaces, live collaboration, anything where the client also needs to stream input to the server. But they introduce genuinely more failure modes: reconnection semantics, message ordering, connection state, keep-alives. Don’t reach for WebSockets unless you actually need bidirectional streams.
The single most common streaming bug: middleware buffers the response instead of streaming it. Your local dev server streams fine; behind an older Nginx config or certain CDN setups, the response is buffered until complete and the client sees no streaming at all. Fixes: set X-Accel-Buffering: no in the response, disable proxy buffering, verify with curl against the production endpoint.
Optimize for Time-to-First-Token, Not Total Time
The most important metric in an LLM UI is the time between the user submitting a prompt and the first visible token. Total generation time barely moves user perception once streaming has started. The first-token delay dominates.
Common causes of slow first-token latency:
- Cold model or cold worker. If your inference layer is scale-to-zero, the first request in a cold state pays a startup penalty of seconds. Keep hot workers around for interactive use cases.
- Long system prompts and RAG contexts. The model reads all input before producing any output. A 20,000-token system prompt is a 20,000-token delay to first token. Consider prompt caching if the same prefix repeats across requests.
- Provider queueing. Managed inference providers have latency variance based on demand. Set explicit timeouts and consider a fallback provider for cold-shed events.
- Server-side buffering. The classic. Your code accumulates the response before sending, even though the API is streaming. Flush after every token.
Send a lightweight “stream started” event as soon as your handler receives the connection — before the first model token arrives. This tells the client to render a “thinking” indicator with useful information (which tool is being called, which model is running) instead of just spinning. It also confirms the connection is healthy end-to-end.
Streaming Tool Calls: Buffer Arguments, Parse at the End
Tool calls (also called function calls) arrive as their own streamed chunks — not as a single “here’s the tool call” message. The model streams the tool name, then the arguments as a JSON string character by character, and finally a completion signal.
The mistake almost every team makes on their first LLM streaming implementation: trying to JSON-parse partial argument strings mid-stream. That gives you a parse error on every chunk except the last one, which either crashes your handler or generates a wall of error logs.
The correct pattern:
You can still update the UI as tool call chunks arrive — showing the tool name and a “calling...” state, for example. You just cannot execute the tool until the arguments string is complete and valid JSON. Trying to parse and execute before completion is how you get race conditions and half-run tools that corrupt your app state.
End-of-Stream Signals and Broken Connections
The default failure mode of streaming: the connection drops mid-generation and the user sees a half-generated response with no error. That’s a terrible UX and the most common streaming bug in production LLM apps.
Fixing it requires three specific things:
1. Send an explicit end-of-stream event
The client cannot tell the difference between “the model finished generating and closed the connection normally” and “the connection dropped mid-generation” without a signal from the server. Send one.
On the client, treat “connection closed without a done event” as a broken stream. Show the partial output with a clear error state and a retry option.
2. Handle timeouts on the client
Sometimes the server dies without dropping the TCP connection cleanly. The client sees an open connection with no incoming data. Set a per-chunk timeout — if you haven’t received a token in N seconds, treat the stream as broken and surface the error. Fifteen to thirty seconds is a reasonable default for most models; adjust based on your provider’s tail latency.
3. Preserve partial output for retry
When a stream breaks after producing some output, the user has already seen partial content. Retrying from scratch will produce different output (LLMs are non-deterministic) and feel jarring. Two patterns:
- Retry from context. Keep the assistant message state, append “continue from where you left off” to the conversation, and retry. The model produces a completion that continues the interrupted response.
- Restart with warning. Discard the partial output and restart cleanly, but tell the user: “connection interrupted, regenerating.” This is simpler and safer if partial output is confusing.
Which pattern is right depends on whether your users are producing short chat messages (restart is fine) or long-form content like code (retry-from-context is better).
Backpressure and Cancellation
LLM streaming is producer-driven — the model produces tokens as fast as it can, and your handler pushes them to the client. If the client is slow to consume, tokens back up in memory. If the user navigates away, the model is still generating and you’re paying for tokens nobody will read.
Cancellation on client disconnect
Every LLM streaming handler should propagate client disconnects to the model. If the user closes the tab, your server should cancel the in-flight generation immediately. In Node this is request.on('close'); in Python with an async framework it’s checking request.is_disconnected(); in Go it’s ctx.Done(). If you don’t propagate cancellation, you pay for hundreds of tokens the user never saw — and multiply that by concurrent users to see the impact on your inference bill.
Backpressure with an event queue
If your streaming layer accumulates events faster than the network can send them, memory fills up. In practice this rarely happens with a single user on a single stream, because network throughput is far higher than model throughput. But if you’re fanning out one generation to multiple consumers (broadcast chat, multi-user session), backpressure matters. Bound the buffer size and drop old events on overflow — a stale token is worse than a missed one.
Streaming Behind Workers and Queues
The naive pattern — HTTP request → worker → LLM stream → back to HTTP response — only works if your request handler holds the HTTP connection open the entire time. That doesn’t scale past a few hundred concurrent users and doesn’t survive worker restarts. Real production systems decouple the connection from the worker.
Two production patterns:
| Pattern | How it works | When to use |
|---|---|---|
| Pub/sub streaming layer | Long-lived connection layer subscribes to a channel per generation ID. Workers publish token events to the channel; the connection layer forwards them. | Interactive chat where sub-second latency matters |
| Durable event log | Worker writes tokens to a fast KV or event log (Redis Streams, Kafka, similar). Client subscribes and reads from position N. | Long-running jobs where resume-from-position matters more than latency |
Both patterns let workers restart without killing user sessions, let you horizontally scale the connection layer independently of workers, and let you support resume-from-position for very long generations. Both are also real infrastructure work — many teams reach for a managed streaming platform or a durable workflow engine to avoid building this themselves.
Rendering Streamed Markdown and Code Blocks
Rendering plain text as it streams is trivial — just append to a paragraph. Rendering markdown and syntax-highlighted code as it streams is harder, because partial markdown is often malformed.
The common bugs:
- Half-open code fences. The model streams
```javascriptbut not the closing fence yet. Naive markdown parsers render everything below as code until the end of the stream. - Partial lists. A partially-streamed
-may not yet be a valid list item to your renderer. - Broken links. Half-streamed
[text](without the closing paren. - Reflow flicker. Every re-render triggers a layout shift because the block just changed shape.
The patterns that work:
- Line-buffered rendering. Keep the last incomplete line as plain text; only apply markdown to complete lines. This is what most polished chat UIs do — watch how the “streaming” part is unstyled while the previous lines are already formatted.
- Smart code-fence detection. When a fence opens without a language, wait for the language on the same line before applying syntax highlighting. When a fence never closes and generation ends, treat it as complete.
- Debounced re-highlight. Syntax-highlighting on every token is expensive. Batch to every N tokens or every M milliseconds — the user won’t notice.
- CSS containment. Wrap the streaming message in
contain: layoutorcontent-visibility: autoto prevent re-layout from cascading up the page.
Observability for Streaming: What to Log
You cannot fix streaming bugs you can’t see. Baseline instrumentation for every LLM stream:
- Time to first token (TTFT). The primary UX metric. Track p50, p95, p99.
- Inter-token latency. Time between tokens after the first. High variance here means the stream is stuttering.
- Total generation time. Wall-clock time from prompt sent to done event.
- Disconnect reason. Why did the stream end? Normal completion, client disconnect, server-side timeout, provider error. Each one is a different failure class.
- Tokens generated vs delivered. If the model generated 800 tokens but the client only received 600, you had backpressure or a mid-stream disconnect.
- Model, provider, region. Slow streams are often provider-specific. You need these dimensions to debug.
Instrument these before you have a problem. Reactive instrumentation added after a production incident is always missing the exact dimension you need for the debug.
The Five Mistakes That Break Streaming in Production
- Middleware buffering the response. Works locally, breaks behind Nginx / CDN / load balancer. Disable buffering explicitly at every layer.
- Parsing partial tool call JSON. Buffer arguments across chunks. Parse only when the tool call completes.
- No end-of-stream signal. The client can’t tell success from a broken connection. Send an explicit done event.
- No client disconnect propagation. Users navigate away, model keeps generating, you pay for tokens nobody sees. Cancel on disconnect.
- Rendering markdown chunk-by-chunk. Half-open code fences and lists render as garbage. Line-buffer the final line; render complete lines as markdown.
Streaming Is UX Infrastructure
Streaming isn’t about making generation faster. Total tokens per second is unchanged. Streaming is about collapsing the “is anything happening?” window from seconds to milliseconds. That’s a UX transformation, and the difference between a chat app users trust and one they leave. Get the transport right, get the tool call handling right, and get the end-of-stream signal right — and streaming stops being the thing that breaks in production.
Companies building serious LLM systems
Browse open ML/AI engineering roles at companies actually shipping LLM apps to production — not just AI startups you’ve read about, but the teams whose infrastructure choices you’re making right now.
Browse ML/AI roles → Browse AI tools & skills →