Streaming LLM Responses Through Your Existing REST API: Patterns That Actually Work
Your frontend wants tokens as they generate. Your API gateway only speaks JSON. Here's how to bridge the two without rewriting the stack you spent two years getting stable.
A product manager watched a competitor's bot stream tokens in real time and asked why ours waits eight seconds and then dumps the whole answer. Your backend already has a REST API, a load balancer, an API gateway, and three middlewares that all buffer responses. None of that was designed for streaming, and none of it wants to start now.
Here's what's worked for me, including the part where nginx silently buffered everything and made me question my career choices.
The protocol question
Streaming LLM responses has three viable wire formats: WebSockets, Server-Sent Events (SSE), and NDJSON over chunked HTTP. Pick by what your infrastructure already supports, not what looks coolest on Twitter. SSE is usually the answer. It rides on plain HTTP, survives most proxies, and reconnects automatically. WebSockets are overkill for one-way streaming. NDJSON works, but you'll write a custom client.
Think of SSE as a one-way phone line. The server keeps talking until it hangs up, the client just listens. That's exactly what an LLM token stream needs.
SSE in a Node Express handler
app.post("/api/chat/stream", async (req, res) => {
res.writeHead(200, {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache, no-transform",
Connection: "keep-alive",
"X-Accel-Buffering": "no", // disable nginx proxy buffering
});
const stream = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: req.body.messages,
stream: true,
});
try {
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content;
if (delta) res.write(`data: ${JSON.stringify({ delta })}\n\n`);
}
res.write("data: [DONE]\n\n");
} catch (err) {
res.write(`event: error\ndata: ${JSON.stringify({ message: err.message })}\n\n`);
} finally {
res.end();
}
});
Why it works
X-Accel-Buffering: no is the magic line. Without it, nginx (and most reverse proxies) holds your response until it's "big enough." Took me two days of staring at logs to find that one. If a proxy in your chain ignores the header, proxy_buffering off; in the matching nginx location block is the equivalent config-side fix. The data: prefix and double newline are the SSE wire format; browsers parse it natively via EventSource, but EventSource only issues GET requests, so a POST endpoint like this one gets read with a streaming fetch on the client (sketch below). Wrapping each token in a JSON envelope means you can add metadata later (token counts, function calls, citations) without breaking the client.
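Here is a minimal sketch of that client-side reader, assuming the handler above. The names streamChat and onDelta are illustrative, not from any SDK, and it deliberately skips the event: error frames.

// Browser-side consumer for the POST SSE endpoint above.
async function streamChat(messages, onDelta) {
  const res = await fetch("/api/chat/stream", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const events = buffer.split("\n\n"); // SSE events end with a blank line
    buffer = events.pop(); // keep any trailing partial event for the next read
    for (const event of events) {
      if (!event.startsWith("data: ")) continue; // this sketch ignores error frames
      const data = event.slice(6);
      if (data === "[DONE]") return;
      onDelta(JSON.parse(data).delta);
    }
  }
}

Call it with your message history and a callback that appends each delta to the page, e.g. streamChat(history, (t) => { output.textContent += t; }).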
When to use SSE
Standard chat UIs, agent traces, log streams. Anything where the server initiates and the client consumes. Default to SSE for chat features bolted onto an existing REST stack. You'll save weeks compared to introducing WebSockets and a brand new auth path.
When to reach for WebSockets instead
When the user needs to interrupt generation, send mid-stream context (a follow-up correction), or when you already run a WebSocket server for other features. Also when you're doing bidirectional voice or video in the same channel. Otherwise SSE is the lower-friction option, and the one your ops team will actually let you ship.
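If you do go that way, the interrupt path is the part SSE can't give you. A rough sketch with the ws package and the same OpenAI client; the { type: "stop" } message shape and the port are assumptions, not a standard.

import { WebSocketServer } from "ws";
import OpenAI from "openai";

const openai = new OpenAI();
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  let controller = null;
  ws.on("message", async (raw) => {
    const msg = JSON.parse(raw);
    if (msg.type === "stop") {
      controller?.abort(); // user hit "stop generating"
      return;
    }
    controller = new AbortController();
    try {
      const stream = await openai.chat.completions.create(
        { model: "gpt-4o-mini", messages: msg.messages, stream: true },
        { signal: controller.signal } // lets the abort cancel the upstream call
      );
      for await (const chunk of stream) {
        const delta = chunk.choices[0]?.delta?.content;
        if (delta) ws.send(JSON.stringify({ delta }));
      }
      ws.send(JSON.stringify({ done: true }));
    } catch (err) {
      if (!controller.signal.aborted) ws.send(JSON.stringify({ error: err.message }));
    }
  });
});

The extra machinery buys you exactly one thing: the client can talk back mid-stream, which SSE simply doesn't model.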
The data path
sequenceDiagram
participant C as Client (EventSource)
participant N as nginx / API gateway
participant S as Your API
participant L as LLM provider
C->>N: POST /api/chat/stream
N->>S: forward (no buffering)
S->>L: chat.completions.create(stream=true)
loop tokens
L-->>S: delta chunk
S-->>N: data: {"delta":"..."}
N-->>C: data: {"delta":"..."}
end
L-->>S: [DONE]
S-->>C: data: [DONE]
S->>S: close connection
Conclusion
Before you write any streaming code, curl your endpoint through production with --no-buffer and confirm bytes flush in real time. If they don't, no client-side trick will save you; fixing the proxy buffering is the change you need to ship first. Don't ask how I know.
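The smoke test is one command; the hostname and payload here are placeholders for your own.

curl --no-buffer -X POST https://api.example.com/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"say hi slowly"}]}'

If the data: lines trickle out one at a time, the hard part is done. If the whole answer lands at once after a pause, the proxy is still buffering.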