Should I use Server-Sent Events (SSE) or WebSockets for streaming?

Default to SSE for LLM streaming. SSE is simpler (one-way server-to-client), works natively in browsers via EventSource, degrades cleanly through HTTP proxies and load balancers, and matches the mental model of LLM output: the server sends chunks, the client renders them. WebSockets are overkill unless you specifically need bidirectional interaction — for example, a voice interface where the client streams audio input while the server streams tokens back. WebSockets also introduce genuinely more failure modes: reconnection logic, message ordering, connection state to manage. For 90% of chat UIs, SSE is the right default. The main gotcha with SSE is that some middleware (older proxies, some CDN configs) buffer responses instead of streaming them — you have to explicitly disable buffering at every layer.

How do I handle streaming when the LLM is behind a queue or worker?

This is where streaming architectures get harder. The naive pattern ('HTTP request → worker → LLM stream → back to HTTP response') only works if your request handler holds the connection open the entire time. That doesn't scale past a few hundred concurrent users and doesn't survive worker restarts. The two production patterns are: (1) A long-lived connection layer (SSE server or WebSocket server) that subscribes to a pub/sub channel per generation ID. Workers publish token events to the channel; the connection layer forwards them to the client. This decouples the worker from the client connection and lets workers restart without dropping user requests. (2) A durable event log: the worker writes token chunks to a fast store or log (Redis Streams, Kafka, or similar) and the client polls or subscribes for updates. Pattern 1 is lower-latency and preferred for interactive chat; pattern 2 is more resilient and preferred for very long-running jobs. Building this yourself is real infrastructure work, so many teams use a managed streaming SDK or a durable workflow platform to skip it.

Streaming LLM Responses: The Complete Engineering Guide (2026)

Q: Why should I stream LLM responses instead of returning them all at once?

Two reasons: perceived latency and infrastructure ergonomics. A user staring at a spinner for 8 seconds waiting for a full response experiences the app as slow, even when the total generation time is exactly the same as streaming. A user watching text stream in starting at 400 milliseconds experiences the app as fast, because the wait-for-something time collapses. Beyond UX, streaming also lets you handle very long responses that would otherwise time out at the HTTP layer, gives you a natural place to cancel in-flight generations when the user navigates away, and lets your frontend start rendering markdown and code blocks incrementally rather than parsing a huge blob at once. The only reason not to stream is if you specifically need the full response server-side before doing anything with it — for example, running a JSON validator on structured output before returning to the client.

Q: How do I stream tool calls from an LLM?

Tool calls arrive as their own streamed chunks, not as a single 'here's the tool call' message. The model streams the tool name first, then the arguments as a JSON string in small fragments, and finally a signal that the tool call is complete. Your streaming handler needs to accumulate the arguments string across chunks and parse it only when the tool call completes. Do not try to JSON-parse partial argument strings mid-stream: a partial string is almost never valid JSON, so nearly every chunk before the last will fail to parse. The common pattern is to buffer tool call chunks into a per-call state object keyed by index, then execute the tool when a 'tool call finished' event arrives. You can still render the tool call in the UI as it streams (for example, showing the tool name and a progress indicator); you just can't execute it until it's complete.

Q: What happens if the stream breaks mid-response?

The default failure mode is that the user sees a half-generated response and no error. That's a terrible UX and the single most common streaming bug in production LLM apps. The fix has three parts: (1) Send an explicit end-of-stream signal so the client knows the difference between 'connection closed after successful completion' and 'connection dropped mid-generation'. (2) On the client, detect a broken stream (connection close without end-of-stream, or a timeout since the last chunk) and show a clear error state with a retry option. (3) Optionally, implement resume-from-position for very long responses — most streaming APIs don't natively support this, but you can approximate it by keeping the full assistant message in state so you can retry from where you left off. Also: instrument this. You cannot fix broken streams you can't see. Log every stream disconnect event with a reason code.

Short Answer

Use Server-Sent Events (SSE), not WebSockets, unless you specifically need bidirectional streams. Optimize for time-to-first-token — users perceive latency by when text starts appearing, not when it finishes. Buffer tool call arguments across chunks and only parse the JSON when the tool call completes. Send an explicit end-of-stream signal so the client can distinguish success from a dropped connection. And test your streaming with realistic proxies and load balancers in the path — many streaming bugs only surface in production because middleware buffers responses that streamed fine locally.

Streaming is the difference between an LLM app that feels fast and one that feels broken. A user watching text appear starting at 400 milliseconds thinks the model is quick, even if the full response takes 12 seconds to complete. A user staring at a spinner for 5 seconds waiting for a full response — and then getting it instantly — thinks the model is slow, even though the total wait was shorter.

The perception gap is enormous. This guide covers how to build streaming that actually works in production: which transport to pick, how to structure the server, how to handle tool calls that arrive as their own stream, how to survive dropped connections, and the specific mistakes that cause “streaming works locally, breaks in prod.”

Why Streaming Wins on Perceived Latency

LLM generation is autoregressive — the model produces one token at a time, sequentially. The total time to generate a 500-token response is roughly the same whether you stream or not. What changes is when the user sees the first token.

Compare two experiences of the same 8-second generation:

Metric	Non-streamed	Streamed
Time to first visible text	~8s (full response arrives)	~400ms (first tokens)
Time to full response	~8s	~8s
Perceived speed	Slow — frustrating spinner	Fast — instant feedback
Failure UX	User waits, then error	User sees partial output, retry from context

The wall-clock time is identical. The perceived experience is not. This is why every serious chat interface streams — not because the model is faster, but because users experience the model as faster.

SSE vs WebSockets vs Chunked HTTP: Pick SSE

Three transports can stream LLM output: Server-Sent Events, WebSockets, and plain chunked HTTP responses. Nine times out of ten, you want SSE.

Transport	Direction	When to use
SSE (Server-Sent Events)	Server → Client	Default for chat UIs. Native browser support via EventSource. Auto-reconnect. Works through most proxies. Simple to implement.
WebSockets	Bidirectional	Only when you need bidirectional streams — e.g. voice-in / voice-out interfaces, collaborative editing with LLM in the loop.
Chunked HTTP	Server → Client	Good for non-browser clients (mobile SDKs, CLI tools) or when you don’t need per-event framing. Lower ceremony than SSE.

Why SSE is the default:

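Zero client-side dependencies. Every modern browser ships EventSource. On the server, it’s just an HTTP response with Content-Type: text/event-stream and specific formatting. One caveat: EventSource only issues GET requests and cannot set custom headers, so chat UIs that POST a prompt usually read the same text/event-stream body with fetch instead.
Auto-reconnect. EventSource automatically reconnects a dropped stream and sends the last event ID it saw, if the server set one. You still need to handle mid-generation cutoffs cleanly on the app layer (a blind reconnect can kick off a fresh generation), but the transport gives you basic resilience for free.
Proxy-friendly. Because SSE looks like a normal HTTP response, most CDN and load-balancer configurations pass it through. WebSockets depend on an HTTP Upgrade handshake that every proxy in the path must be configured to pass.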
Debuggable. You can hit an SSE endpoint with curl and see the raw event stream. WebSockets require a WebSocket client to inspect.

WebSockets have real use cases — voice interfaces, live collaboration, anything where the client also needs to stream input to the server. But they introduce genuinely more failure modes: reconnection semantics, message ordering, connection state, keep-alives. Don’t reach for WebSockets unless you actually need bidirectional streams.

Watch out — the “works locally, broken in prod” classic

The single most common streaming bug: middleware buffers the response instead of streaming it. Your local dev server streams fine; behind an older Nginx config or certain CDN setups, the response is buffered until complete and the client sees no streaming at all. Fixes: set the X-Accel-Buffering: no response header, disable proxy buffering (proxy_buffering off in Nginx), and verify with curl -N against the production endpoint.
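As a rough sketch of the server side, the snippet below collects the response headers that ask intermediaries not to buffer, plus a small helper that serializes one SSE event. The names SSE_HEADERS and formatSseEvent are made up for this example, and the commented usage assumes Node’s built-in http module.

```javascript
// Response headers for an SSE endpoint. X-Accel-Buffering is Nginx-specific;
// other proxies and CDNs have their own switches for response buffering.
const SSE_HEADERS = {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache, no-transform',
  'X-Accel-Buffering': 'no',
};

// Serialize one SSE event. Multi-line payloads need one data: line per line,
// and a blank line terminates the event.
function formatSseEvent(event, data) {
  const payload = typeof data === 'string' ? data : JSON.stringify(data);
  const dataLines = payload
    .split('\n')
    .map((line) => `data: ${line}`)
    .join('\n');
  return `event: ${event}\n${dataLines}\n\n`;
}

// Hypothetical usage inside a Node http handler:
//   res.writeHead(200, SSE_HEADERS);
//   res.flushHeaders(); // send headers now, before the first token
//   res.write(formatSseEvent('message', { delta: 'Hello' }));
```

Headers only cover the layers that honor them, so the curl check against production is still the real test.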

Optimize for Time-to-First-Token, Not Total Time

The most important metric in an LLM UI is the time between the user submitting a prompt and the first visible token. Total generation time barely moves user perception once streaming has started. The first-token delay dominates.

Common causes of slow first-token latency:

Cold model or cold worker. If your inference layer is scale-to-zero, the first request in a cold state pays a startup penalty of seconds. Keep hot workers around for interactive use cases.
Long system prompts and RAG contexts. The model processes all of its input before producing any output, so a 20,000-token system prompt adds 20,000 tokens of prefill work ahead of the first token. Consider prompt caching if the same prefix repeats across requests.
Provider queueing. Managed inference providers have latency variance based on demand. Set explicit timeouts and consider a fallback provider for load-shedding events.
Server-side buffering. The classic. Your code accumulates the response before sending, even though the API is streaming. Flush after every token.

Tip

Send a lightweight “stream started” event as soon as your handler receives the connection — before the first model token arrives. This tells the client to render a “thinking” indicator with useful information (which tool is being called, which model is running) instead of just spinning. It also confirms the connection is healthy end-to-end.
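One way to sketch the “stream started” idea is an async generator that yields a started event before it begins pulling tokens from the model. The function name withStartEvent, the event names, and the shape of meta are assumptions for illustration only.

```javascript
// Wrap a token stream so the client hears from the server before the model does.
// `tokens` is any async iterable of strings; `meta` is whatever the UI wants to
// show while waiting (model name, tool being called, and so on).
async function* withStartEvent(tokens, meta) {
  // Emitted immediately, before the first model token exists.
  yield { event: 'started', data: meta };
  for await (const token of tokens) {
    yield { event: 'message', data: { delta: token } };
  }
}

// Example with a stand-in model stream.
async function demoStartEvent() {
  async function* fakeModel() {
    yield 'Hel';
    yield 'lo';
  }
  const names = [];
  for await (const evt of withStartEvent(fakeModel(), { model: 'example-model' })) {
    names.push(evt.event);
  }
  return names;
}
```

Each yielded object would then be serialized into whatever wire format the transport uses.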

Streaming Tool Calls: Buffer Arguments, Parse at the End

Tool calls (also called function calls) arrive as their own streamed chunks, not as a single “here’s the tool call” message. The model streams the tool name, then the arguments as a JSON string in small fragments, and finally a completion signal.

The mistake almost every team makes on their first LLM streaming implementation: trying to JSON-parse partial argument strings mid-stream. That gives you a parse error on every chunk except the last one, which either crashes your handler or generates a wall of error logs.

The correct pattern:

// Accumulate streamed tool call fragments by index, then execute once complete.
// Assumes OpenAI-style chat completion chunks. `stream` is any async iterable
// of chunks and `executeTool` is your own dispatcher.
async function handleToolCalls(stream, executeTool) {
  const toolCalls = {}; // keyed by tool_call.index

  for await (const chunk of stream) {
    const choice = chunk.choices?.[0];
    if (!choice) continue; // some chunks (e.g. usage-only) carry no choices

    for (const tc of choice.delta?.tool_calls ?? []) {
      const call = (toolCalls[tc.index] ??= { name: '', args: '' });
      if (tc.function?.name) call.name += tc.function.name;
      if (tc.function?.arguments) call.args += tc.function.arguments;
    }

    if (choice.finish_reason === 'tool_calls') {
      // Now, and only now, parse arguments and execute tools
      for (const call of Object.values(toolCalls)) {
        let parsed;
        try {
          parsed = JSON.parse(call.args || '{}');
        } catch (err) {
          throw new Error(`Tool "${call.name}" sent malformed arguments: ${err.message}`);
        }
        await executeTool(call.name, parsed);
      }
    }
  }
}

You can still update the UI as tool call chunks arrive — showing the tool name and a “calling...” state, for example. You just cannot execute the tool until the arguments string is complete and valid JSON. Trying to parse and execute before completion is how you get race conditions and half-run tools that corrupt your app state.

End-of-Stream Signals and Broken Connections

The default failure mode of streaming: the connection drops mid-generation and the user sees a half-generated response with no error. That’s a terrible UX and the most common streaming bug in production LLM apps.

Fixing it requires three specific things:

1. Send an explicit end-of-stream event

The client cannot tell the difference between “the model finished generating and closed the connection normally” and “the connection dropped mid-generation” without a signal from the server. Send one.

: SSE format: the server sends a final "done" event (lines starting with a colon are SSE comments)
event: message
data: {"delta": "final tokens here"}

event: done
data: {"reason": "stop", "usage": {"total_tokens": 421}}

On the client, treat “connection closed without a done event” as a broken stream. Show the partial output with a clear error state and a retry option.
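A minimal client-side sketch of that rule follows. It assumes the event names from the example above (message and done), JSON payloads, and \n line endings; consumeSse is a hypothetical helper, and chunks can be any async iterable of already-decoded text.

```javascript
// Parse SSE text chunks and report whether the stream ended with a done event.
// Chunk boundaries may fall anywhere, including in the middle of an event.
async function consumeSse(chunks, onDelta) {
  let buffer = '';
  let sawDone = false;

  for await (const chunk of chunks) {
    buffer += chunk;
    let boundary;
    // A blank line terminates each event.
    while ((boundary = buffer.indexOf('\n\n')) !== -1) {
      const rawEvent = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);

      let eventName = 'message';
      const dataLines = [];
      for (const line of rawEvent.split('\n')) {
        if (line.startsWith('event:')) eventName = line.slice(6).trim();
        else if (line.startsWith('data:')) dataLines.push(line.slice(5).replace(/^ /, ''));
      }
      if (dataLines.length === 0) continue; // comment or keep-alive, nothing to parse

      const data = JSON.parse(dataLines.join('\n'));
      if (eventName === 'done') sawDone = true;
      else if (eventName === 'message') onDelta(data.delta);
    }
  }

  // The connection has closed. Without a done event, this is a broken stream.
  return { completed: sawDone };
}
```

When completed comes back false, the UI keeps the partial text and switches to its error-plus-retry state.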

2. Handle timeouts on the client

Sometimes the server dies without dropping the TCP connection cleanly. The client sees an open connection with no incoming data. Set a per-chunk timeout — if you haven’t received a token in N seconds, treat the stream as broken and surface the error. Fifteen to thirty seconds is a reasonable default for most models; adjust based on your provider’s tail latency.
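A per-chunk timeout can be sketched as a wrapper around any async iterable, racing each read against a timer. The names withIdleTimeout and StreamIdleTimeoutError are invented for this example.

```javascript
class StreamIdleTimeoutError extends Error {}

// Re-yield chunks from `source`, but fail if no chunk arrives within `idleMs`.
async function* withIdleTimeout(source, idleMs) {
  const iterator = source[Symbol.asyncIterator]();
  try {
    while (true) {
      let timer;
      const timeout = new Promise((_, reject) => {
        timer = setTimeout(
          () => reject(new StreamIdleTimeoutError(`no chunk for ${idleMs} ms`)),
          idleMs
        );
      });
      let result;
      try {
        // Whichever settles first wins: the next chunk or the idle timer.
        result = await Promise.race([iterator.next(), timeout]);
      } finally {
        clearTimeout(timer);
      }
      if (result.done) return;
      yield result.value;
    }
  } finally {
    // Best-effort cleanup of the underlying stream; errors here are ignored.
    Promise.resolve(iterator.return?.()).catch(() => {});
  }
}
```

The caller catches StreamIdleTimeoutError and treats it exactly like a connection that closed without a done event.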

3. Preserve partial output for retry

When a stream breaks after producing some output, the user has already seen partial content. Retrying from scratch will produce different output (LLMs are non-deterministic) and feel jarring. Two patterns:

Retry from context. Keep the assistant message state, append “continue from where you left off” to the conversation, and retry. The model produces a completion that continues the interrupted response.
Restart with warning. Discard the partial output and restart cleanly, but tell the user: “connection interrupted, regenerating.” This is simpler and safer if partial output is confusing.

Which pattern is right depends on whether your users are producing short chat messages (restart is fine) or long-form content like code (retry-from-context is better).
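The two retry patterns can be sketched as one function that builds the message list for the next request. The role/content message shape and the name buildRetryMessages are assumptions; the continuation wording is just the phrase used above.

```javascript
// Build the conversation to send when retrying after a broken stream.
// strategy: 'continue' keeps the partial output, 'restart' discards it.
function buildRetryMessages(history, partialText, strategy) {
  if (strategy === 'restart' || partialText.length === 0) {
    // Regenerate from scratch; the UI should say the response is being regenerated.
    return { messages: [...history], keepPartial: false };
  }
  return {
    messages: [
      ...history,
      // Show the model what the user has already seen...
      { role: 'assistant', content: partialText },
      // ...and ask it to pick up from there.
      { role: 'user', content: 'Continue from where you left off.' },
    ],
    keepPartial: true,
  };
}
```

With keepPartial set, the client appends the new stream to the text already on screen instead of replacing it.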

Backpressure and Cancellation

LLM streaming is producer-driven — the model produces tokens as fast as it can, and your handler pushes them to the client. If the client is slow to consume, tokens back up in memory. If the user navigates away, the model is still generating and you’re paying for tokens nobody will read.

Cancellation on client disconnect

Every LLM streaming handler should propagate client disconnects to the model. If the user closes the tab, your server should cancel the in-flight generation immediately. In Node this is a 'close' listener on the response (res.on('close')); in Python with Starlette or FastAPI it’s awaiting request.is_disconnected(); in Go it’s the request context’s Done() channel (r.Context().Done()). If you don’t propagate cancellation, you pay for hundreds of tokens the user never saw, and that cost multiplies across concurrent users on your inference bill.
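For the Node case, a sketch of the wiring might look like the following. abortOnClose and pipeTokens are hypothetical helpers; res is assumed to behave like Node’s http.ServerResponse (an on method and a writableEnded flag), and the resulting signal is what you would hand to a fetch call or a provider SDK that accepts an AbortSignal.

```javascript
// Turn "the client went away" into an AbortSignal.
function abortOnClose(res) {
  const controller = new AbortController();
  res.on('close', () => {
    // 'close' also fires after a normal end(); only abort on early disconnects.
    if (!res.writableEnded) controller.abort();
  });
  return controller.signal;
}

// Forward tokens until the model finishes or the client disconnects.
async function pipeTokens(tokens, res, signal) {
  let sent = 0;
  for await (const token of tokens) {
    // Breaking out of for-await also closes the underlying iterator.
    if (signal.aborted) break;
    res.write(token);
    sent += 1;
  }
  return sent;
}
```

Passing the same signal to the upstream request is what actually stops the provider from generating; the loop check only stops your server from forwarding.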

Backpressure with an event queue

If your streaming layer accumulates events faster than the network can send them, memory fills up. In practice this rarely happens with a single user on a single stream, because network throughput is far higher than model throughput. But if you’re fanning out one generation to multiple consumers (broadcast chat, multi-user session), backpressure matters. Bound the buffer size and drop old events on overflow; a stale token is worse than a missed one. Because dropped tokens leave a gap in the text, record how many were dropped so a slow consumer can refetch the full message once it catches up.
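A bounded, drop-oldest buffer per consumer can be sketched in a few lines. BoundedEventBuffer is a made-up name, and the dropped counter is an addition in this sketch so a consumer can tell that its view has a gap.

```javascript
// Fixed-capacity event buffer that discards the oldest event on overflow.
class BoundedEventBuffer {
  constructor(capacity) {
    if (!Number.isInteger(capacity) || capacity < 1) {
      throw new RangeError('capacity must be a positive integer');
    }
    this.capacity = capacity;
    this.events = [];
    this.dropped = 0; // how many events this consumer never received
  }

  push(event) {
    this.events.push(event);
    if (this.events.length > this.capacity) {
      this.events.shift(); // drop the oldest event
      this.dropped += 1;
    }
  }

  // Hand over everything buffered so far and start a fresh batch.
  drain() {
    const batch = this.events;
    this.events = [];
    return batch;
  }
}
```

Each consumer in a fan-out gets its own buffer, so one slow reader cannot grow memory for everyone else.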

Streaming Behind Workers and Queues

The naive pattern — HTTP request → worker → LLM stream → back to HTTP response — only works if your request handler holds the HTTP connection open the entire time. That doesn’t scale past a few hundred concurrent users and doesn’t survive worker restarts. Real production systems decouple the connection from the worker.

Two production patterns:

Pattern	How it works	When to use
Pub/sub streaming layer	Long-lived connection layer subscribes to a channel per generation ID. Workers publish token events to the channel; the connection layer forwards them.	Interactive chat where sub-second latency matters
Durable event log	Worker writes tokens to a fast KV or event log (Redis Streams, Kafka, similar). Client subscribes and reads from position N.	Long-running jobs where resume-from-position matters more than latency

Both patterns let workers restart without killing user sessions, let you horizontally scale the connection layer independently of workers, and let you support resume-from-position for very long generations. Both are also real infrastructure work — many teams reach for a managed streaming platform or a durable workflow engine to avoid building this themselves.
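To make the shape of these patterns concrete, here is an in-memory stand-in that combines an append-only log with live fan-out and read-from-position. It is only an illustration: GenerationLog is a hypothetical class, and a real deployment would put Redis Streams, Kafka, or a similar system behind the same two operations.

```javascript
// One log per generation ID. Workers append; connection handlers subscribe.
class GenerationLog {
  constructor() {
    this.events = [];
    this.subscribers = new Set();
  }

  // Called by the worker for each token event. Returns the event's position.
  append(event) {
    const entry = { position: this.events.length, ...event };
    this.events.push(entry);
    for (const notify of this.subscribers) notify(entry);
    return entry.position;
  }

  // Called by the connection layer. Replays everything from `fromPosition`,
  // then delivers new events as they arrive. Returns an unsubscribe function.
  subscribe(fromPosition, onEvent) {
    for (const entry of this.events.slice(fromPosition)) onEvent(entry);
    this.subscribers.add(onEvent);
    return () => this.subscribers.delete(onEvent);
  }
}
```

A reconnecting client passes the last position it rendered plus one, which is what makes resume-from-position possible without involving the worker.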

Rendering Streamed Markdown and Code Blocks

Rendering plain text as it streams is trivial — just append to a paragraph. Rendering markdown and syntax-highlighted code as it streams is harder, because partial markdown is often malformed.

The common bugs:

Half-open code fences. The model streams ```javascript but not the closing fence yet. Naive markdown parsers render everything below as code until the end of the stream.
Partial lists. A partially-streamed - may not yet be a valid list item to your renderer.
Broken links. Half-streamed [text]( without the closing paren.
Reflow flicker. Every re-render triggers a layout shift because the block just changed shape.

The patterns that work:

Line-buffered rendering. Keep the last incomplete line as plain text; only apply markdown to complete lines. This is what most polished chat UIs do — watch how the “streaming” part is unstyled while the previous lines are already formatted.
Smart code-fence detection. When a fence opens without a language, wait for the language on the same line before applying syntax highlighting. When a fence never closes and generation ends, treat it as complete.
Debounced re-highlight. Syntax-highlighting on every token is expensive. Batch to every N tokens or every M milliseconds — the user won’t notice.
CSS containment. Wrap the streaming message in contain: layout or content-visibility: auto to prevent re-layout from cascading up the page.
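Line-buffered rendering and fence handling can be sketched together as a pure function that splits the streamed text into a part that is safe to hand to a markdown renderer and a trailing part to show as plain text. splitStreamedMarkdown is a hypothetical helper, and it only recognizes backtick fences.

```javascript
const FENCE = '`'.repeat(3); // three backticks, built this way to keep the example readable

// Split streamed text into complete lines (safe to render as markdown) and
// the unfinished last line (render as plain text until its newline arrives).
function splitStreamedMarkdown(text) {
  const lastNewline = text.lastIndexOf('\n');
  let complete = text.slice(0, lastNewline + 1); // empty until the first newline
  const pending = text.slice(lastNewline + 1);

  // An odd number of fence lines means a code block is still open.
  const fenceLines = complete
    .split('\n')
    .filter((line) => line.trimStart().startsWith(FENCE));
  const insideFence = fenceLines.length % 2 === 1;

  // Close the fence for rendering only; the stored message text is untouched.
  if (insideFence) complete += FENCE + '\n';

  return { complete, pending, insideFence };
}
```

The renderer calls this on every update (or on a debounce), so the temporary closing fence disappears as soon as the model streams the real one.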

Observability for Streaming: What to Log

You cannot fix streaming bugs you can’t see. Baseline instrumentation for every LLM stream:

Time to first token (TTFT). The primary UX metric. Track p50, p95, p99.
Inter-token latency. Time between tokens after the first. High variance here means the stream is stuttering.
Total generation time. Wall-clock time from prompt sent to done event.
Disconnect reason. Why did the stream end? Normal completion, client disconnect, server-side timeout, provider error. Each one is a different failure class.
Tokens generated vs delivered. If the model generated 800 tokens but the client only received 600, you had backpressure or a mid-stream disconnect.
Model, provider, region. Slow streams are often provider-specific. You need these dimensions to debug.

Instrument these before you have a problem. Reactive instrumentation added after a production incident is always missing the exact dimension you need for the debug.
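A small per-stream recorder covers the timing metrics and the disconnect reason from the list above. This is a sketch with invented names (createStreamMetrics, and the field names in the returned record); the clock is injectable so it can be tested without real delays.

```javascript
// Create one recorder per stream, call onToken() for every token delivered,
// and call finish(reason) exactly once when the stream ends for any reason.
function createStreamMetrics(now = Date.now) {
  const startedAt = now();
  let firstTokenAt = null;
  let lastTokenAt = null;
  let maxGapMs = 0;
  let tokens = 0;

  return {
    onToken() {
      const t = now();
      if (firstTokenAt === null) firstTokenAt = t;
      else maxGapMs = Math.max(maxGapMs, t - lastTokenAt);
      lastTokenAt = t;
      tokens += 1;
    },
    finish(disconnectReason) {
      return {
        ttftMs: firstTokenAt === null ? null : firstTokenAt - startedAt,
        maxInterTokenMs: maxGapMs,
        totalMs: now() - startedAt,
        tokens,
        disconnectReason, // e.g. 'completed', 'client_disconnect', 'timeout'
      };
    },
  };
}
```

The returned record is what you would attach model, provider, and region to before sending it to your metrics pipeline.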

The Five Mistakes That Break Streaming in Production

Middleware buffering the response. Works locally, breaks behind Nginx / CDN / load balancer. Disable buffering explicitly at every layer.
Parsing partial tool call JSON. Buffer arguments across chunks. Parse only when the tool call completes.
No end-of-stream signal. The client can’t tell success from a broken connection. Send an explicit done event.
No client disconnect propagation. Users navigate away, model keeps generating, you pay for tokens nobody sees. Cancel on disconnect.
Rendering markdown chunk-by-chunk. Half-open code fences and lists render as garbage. Line-buffer the final line; render complete lines as markdown.

Streaming Is UX Infrastructure

Streaming isn’t about making generation faster. Total tokens per second is unchanged. Streaming is about collapsing the “is anything happening?” window from seconds to milliseconds. That’s a UX transformation, and the difference between a chat app users trust and one they leave. Get the transport right, get the tool call handling right, and get the end-of-stream signal right — and streaming stops being the thing that breaks in production.

Frequently Asked Questions

Why should I stream LLM responses instead of returning them all at once?

Perceived latency. Total generation time is the same either way, but streaming collapses time-to-first-visible-text from seconds to milliseconds. Users experience streaming apps as far faster even when the wall-clock generation is identical. Streaming also handles long responses that would time out at the HTTP layer, and gives you a natural cancellation point when the user navigates away.

Should I use Server-Sent Events (SSE) or WebSockets?

Default to SSE. Simpler (one-way server-to-client), works natively in browsers via EventSource, degrades through HTTP proxies cleanly. WebSockets are overkill unless you specifically need bidirectional interaction like a voice interface. For 90% of chat UIs, SSE is the right default. Common SSE gotcha: some middleware buffers responses — you have to explicitly disable buffering at every layer.

How do I stream tool calls from an LLM?

Tool calls arrive as their own streamed chunks. The model streams the tool name, then the arguments as a JSON string in small fragments, then a completion signal. Accumulate arguments across chunks and parse them only when the tool call completes. Never try to JSON-parse partial argument strings: a partial string is almost never valid JSON, so nearly every chunk before the last will fail to parse.

What happens if the stream breaks mid-response?

Default failure mode is that the user sees a half-generated response with no error. Fix in three parts: send an explicit end-of-stream signal so the client knows when generation completed cleanly, detect broken streams (connection close without end-of-stream, or a chunk timeout) and show a retry option, and optionally implement resume-from-position for long responses.

How do I handle streaming behind a queue or worker?

Naive “HTTP request → worker → LLM stream → back to HTTP response” only works at small scale and doesn’t survive worker restarts. Production patterns decouple the connection from the worker: either a long-lived connection layer that subscribes to a pub/sub channel per generation ID, or a durable event log that clients read from position N. Building this yourself is real infrastructure work — many teams use a managed streaming SDK or workflow platform to skip it.