Streaming Responses (SSE): Real-Time Token Delivery for AI Applications


Your AI chatbot takes 4 seconds to generate a 500-token response. Without streaming, the user sees nothing for 4 seconds, then the entire response appears at once. They wonder if the app is broken. Some click away. The perceived latency is 4 seconds.

With streaming, the first token appears after 200ms. The user sees words flowing in real-time, like someone typing. They start reading immediately. The total generation time is still 4 seconds, but the perceived latency is 200ms. The user experience transforms from “is this thing working?” to “this is fast and responsive.”

Streaming is not an optimization of generation speed. It is a UX pattern that masks latency by delivering partial results as they are produced. For LLM applications, where generation takes 2-30 seconds depending on response length, streaming is the difference between a usable product and an abandoned one.

What streaming actually is

LLMs generate tokens sequentially - one at a time. Without streaming, the API buffers all tokens and returns the complete response when generation finishes. With streaming, each token (or small batch of tokens) is sent to the client immediately after generation.

The protocol most commonly used is Server-Sent Events (SSE) - a simple HTTP-based protocol for server-to-client streaming over a single connection.

graph LR
  subgraph nostream["Without Streaming"]
      N1["Request"] --> N2["Wait 4s..."]
      N2 --> N3["Complete response<br/>(all at once)"]
  end
  subgraph stream["With Streaming"]
      S1["Request"] --> S2["Token 1 (200ms)"]
      S2 --> S3["Token 2 (220ms)"]
      S3 --> S4["Token 3 (240ms)"]
      S4 --> S5["... tokens flow ..."]
      S5 --> S6["Done (4s total)"]
  end

  style N2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style S2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style S3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style S4 fill:#E1F5EE,stroke:#0F6E56,color:#085041

How SSE works

Server-Sent Events is a web standard (originally published by the W3C, now maintained as part of the WHATWG HTML Living Standard) for unidirectional server-to-client communication over HTTP. The client makes a standard HTTP request, and the server holds the connection open, sending data chunks as they become available.

The protocol format

HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-123","choices":[{"delta":{"content":" world"}}]}

data: {"id":"chatcmpl-123","choices":[{"delta":{"content":"!"}}]}

data: [DONE]

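Each event is made of one or more field lines (here, data:) and is terminated by a blank line, that is, \n\n. The [DONE] sentinel is not part of SSE itself; it is a convention popularized by the OpenAI API to signal stream completion.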
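The framing rules above can be sketched as a small incremental parser. This is a minimal illustration, not a complete SSE implementation: the SSEParser name is hypothetical, only the data field is handled (id, event, and retry are ignored), and lone \r line endings are not supported.

```python
from typing import Iterator, List


class SSEParser:
    """Incrementally split a text/event-stream body into `data` payloads."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, text: str) -> Iterator[str]:
        """Accept a decoded chunk of any size; yield the data of each complete event."""
        # Network chunks do not align with events, so keep unparsed text around.
        self._buffer = (self._buffer + text).replace("\r\n", "\n")
        while "\n\n" in self._buffer:
            raw_event, self._buffer = self._buffer.split("\n\n", 1)
            data_lines: List[str] = []
            for line in raw_event.split("\n"):
                if line.startswith(":"):
                    continue  # comment line, often used as a keep-alive
                field, _, value = line.partition(":")
                if field == "data":
                    # One leading space after the colon is not part of the value.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                # Multiple data lines in one event are joined with newlines.
                yield "\n".join(data_lines)


parser = SSEParser()
for payload in parser.feed('data: {"content": "Hello"}\n\ndata: [DONE]\n\n'):
    print(payload)
```

The important design point is the buffer: an event may arrive split across two reads, so nothing is emitted until the terminating blank line has been seen.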

Client-side implementation

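// Browser: using EventSource (GET only; no request body or custom headers)
const eventSource = new EventSource('/api/chat?message=hello');
eventSource.onmessage = (event) => {
  if (event.data === '[DONE]') {
    // Close explicitly; otherwise EventSource reconnects when the server ends the stream
    eventSource.close();
    return;
  }
  const chunk = JSON.parse(event.data);
  const token = chunk.choices[0]?.delta?.content;
  // Some chunks carry no text (for example role-only or final chunks)
  if (token) appendToUI(token);
};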

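// Or using fetch with ReadableStream (more control: POST bodies, custom headers)
const response = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message: 'hello', stream: true }),
});
if (!response.ok || !response.body) {
  throw new Error(`Chat request failed with status ${response.status}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // { stream: true } keeps multi-byte characters intact across chunk boundaries
  const text = decoder.decode(value, { stream: true });
  // A network chunk may hold several events or only part of one, so
  // processSSEChunk (not shown) must buffer until it sees a blank line
  processSSEChunk(text);
}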

Server-side implementation

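# FastAPI streaming endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI()
client = AsyncOpenAI()  # async client; reads OPENAI_API_KEY from the environment


class ChatRequest(BaseModel):
    # Assumed request schema: a list of {"role": ..., "content": ...} dicts
    messages: list[dict]


@app.post("/api/chat")
async def chat(request: ChatRequest):
    async def generate():
        stream = await client.chat.completions.create(
            model="gpt-4o",
            messages=request.messages,
            stream=True,
        )
        async for chunk in stream:
            # Skip chunks without choices or without text content
            if chunk.choices and chunk.choices[0].delta.content:
                yield f"data: {chunk.model_dump_json()}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "Connection": "keep-alive"},
    )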

Streaming with tool calls

Tool calling complicates streaming because the model might generate a tool call mid-stream. The client needs to handle both text tokens and tool call events:

async def stream_with_tools(stream):
    # format_sse, accumulate_tool_call, execute_tools and continue_stream
    # are application-defined helpers (not shown)
    pending_tool_calls = {}

    async for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        if delta.content:
            # Regular text token - send to client
            yield format_sse({"type": "text", "content": delta.content})

        if delta.tool_calls:
            # Tool call arrives as fragments - accumulate them
            for tc in delta.tool_calls:
                accumulate_tool_call(pending_tool_calls, tc)

    # After stream ends, if there are tool calls:
    if pending_tool_calls:
        # Execute tools
        results = await execute_tools(pending_tool_calls)
        # Start a new stream with tool results
        async for chunk in continue_stream(results):
            yield format_sse({"type": "text", "content": chunk})

graph TD
  subgraph flow["Streaming with Tools"]
      T1["Stream text tokens"]
      T2["Model decides to call tool"]
      T3["Pause stream"]
      T4["Execute tool"]
      T5["Resume streaming with tool result in context"]
      T6["Stream remaining response"]
  end

  T1 --> T2
  T2 --> T3
  T3 --> T4
  T4 --> T5
  T5 --> T6

  style T1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style T3 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style T4 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style T6 fill:#E1F5EE,stroke:#0F6E56,color:#085041
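
The accumulation step is the part that tends to trip people up, because a streamed tool call arrives as fragments: the id and function name typically come first, and the JSON arguments follow as string pieces that only parse once joined. The sketch below shows one way an accumulator might look. It assumes OpenAI-style deltas represented as plain dicts with an index, an optional id, and a function object; the function names and the two-argument signature are illustrative assumptions, not part of any SDK.

```python
import json
from typing import Any, Dict, List


def accumulate_tool_call(pending: Dict[int, Dict[str, str]], delta: Dict[str, Any]) -> None:
    """Merge one streamed tool-call fragment into `pending`, keyed by its index."""
    slot = pending.setdefault(delta["index"], {"id": "", "name": "", "arguments": ""})
    if delta.get("id"):
        slot["id"] = delta["id"]
    function = delta.get("function") or {}
    if function.get("name"):
        slot["name"] = function["name"]
    # Arguments arrive as raw JSON text fragments; concatenate, do not parse yet.
    slot["arguments"] += function.get("arguments") or ""


def finalize_tool_calls(pending: Dict[int, Dict[str, str]]) -> List[Dict[str, Any]]:
    """Parse the joined argument strings once the stream has ended."""
    return [
        {"id": call["id"], "name": call["name"], "arguments": json.loads(call["arguments"] or "{}")}
        for _, call in sorted(pending.items())
    ]


pending: Dict[int, Dict[str, str]] = {}
fragments = [
    {"index": 0, "id": "call_1", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"city": '}},
    {"index": 0, "function": {"arguments": '"Paris"}'}},
]
for fragment in fragments:
    accumulate_tool_call(pending, fragment)
print(finalize_tool_calls(pending))
```

Keying by index matters when the model requests several tools in one turn, since fragments for different calls can be interleaved.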

Where streaming breaks or gets interesting

Structured output and streaming

If you need valid JSON output, streaming individual tokens means the client receives partial JSON that cannot be parsed until complete. Solutions:

  • Buffer until a complete JSON object is available, then emit
  • Use streaming JSON parsers that handle partial objects
  • Stream the reasoning/explanation but buffer the structured output section
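
The first option, buffering until a complete JSON object is available, can be sketched with the standard library alone. This is a minimal illustration under two assumptions: the top-level values are objects or arrays (a bare number such as 12 could be emitted before its remaining digits arrive), and re-parsing the buffer on every token is acceptable for the response sizes involved. The JSONObjectBuffer name is hypothetical.

```python
import json
from typing import Any, Iterator


class JSONObjectBuffer:
    """Collect streamed text and emit each JSON value once it is complete."""

    def __init__(self) -> None:
        self._buffer = ""
        self._decoder = json.JSONDecoder()

    def feed(self, token: str) -> Iterator[Any]:
        self._buffer += token
        while True:
            candidate = self._buffer.lstrip()
            if not candidate:
                self._buffer = ""
                return
            try:
                # raw_decode parses one value and reports where it ended.
                value, end = self._decoder.raw_decode(candidate)
            except json.JSONDecodeError:
                self._buffer = candidate
                return  # incomplete so far; wait for more tokens
            self._buffer = candidate[end:]
            yield value


buffer = JSONObjectBuffer()
for token in ['{"name"', ': "Ada", ', '"age": 36}']:
    for obj in buffer.feed(token):
        print(obj)
```

Nothing reaches the consumer until the closing brace arrives, which trades some perceived latency for the guarantee that every emitted value is valid.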

Markdown rendering during stream

Streaming markdown creates rendering issues: a partial **bold without the closing ** renders incorrectly mid-stream. Solutions: buffer until syntax pairs are complete, use incremental markdown parsers, or apply formatting only after sentence boundaries.
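
As a rough sketch of the first option, the buffer below holds back text from an unclosed ** until its closing pair arrives. It handles only the ** marker and is an illustration of the idea rather than a markdown parser; the BoldSafeBuffer name is hypothetical, and other constructs (code fences, links, single-asterisk emphasis) would need their own rules.

```python
class BoldSafeBuffer:
    """Release streamed text only up to the last point where ** pairs are balanced."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, token: str) -> str:
        self._pending += token
        text = self._pending
        safe_end = len(text)
        # A trailing lone "*" may be the first half of "**"; hold it back.
        if text.endswith("*") and not text.endswith("**"):
            safe_end -= 1
        # An odd number of "**" markers means the last one is still open.
        if text[:safe_end].count("**") % 2 == 1:
            safe_end = text[:safe_end].rfind("**")
        released, self._pending = text[:safe_end], text[safe_end:]
        return released

    def flush(self) -> str:
        """Call at end of stream to release whatever is still held back."""
        released, self._pending = self._pending, ""
        return released


buffer = BoldSafeBuffer()
for token in ["This is ", "**very", " important", "** text"]:
    print(repr(buffer.feed(token)))
print(repr(buffer.flush()))
```

The cost is that bold spans appear all at once rather than word by word, which is usually less jarring than raw asterisks flickering in and out.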

Connection drops and retries

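SSE connections can drop (mobile networks, load balancer timeouts). Unlike the WebSocket API, the browser's EventSource reconnects automatically; a client that reads the stream with fetch has to implement reconnection itself. Include id fields in events so clients can resume from the last received event: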

id: 42
data: {"content": "Hello"}

id: 43
data: {"content": " world"}

On reconnection the client sends the header Last-Event-ID: 43, and the server can resume from that point (if it buffers recent events).
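
The server-side half of this, buffering recent events so they can be replayed, might look like the sketch below. It is an in-memory illustration for a single stream: EventReplayBuffer and format_event are hypothetical names, ids are assumed to be increasing integers, and each data payload is assumed to fit on one line.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def format_event(event_id: int, data: str) -> str:
    """Render one SSE event with an id field (data must be a single line)."""
    return f"id: {event_id}\ndata: {data}\n\n"


class EventReplayBuffer:
    """Keep the most recent events so a reconnecting client can catch up."""

    def __init__(self, max_events: int = 1000) -> None:
        self._events: Deque[Tuple[int, str]] = deque(maxlen=max_events)
        self._next_id = 1

    def append(self, data: str) -> str:
        """Store a new event and return its wire format for immediate sending."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, data))
        return format_event(event_id, data)

    def replay_after(self, last_event_id: Optional[str]) -> List[str]:
        """Return buffered events newer than the client's Last-Event-ID header."""
        try:
            last_seen = int(last_event_id) if last_event_id else 0
        except ValueError:
            last_seen = 0  # unparseable header: replay everything still buffered
        return [format_event(i, d) for i, d in self._events if i > last_seen]


buffer = EventReplayBuffer()
for data in ['{"content": "Hello"}', '{"content": " world"}', '{"content": "!"}']:
    buffer.append(data)
print(buffer.replay_after("2"))
```

Because the deque is bounded, a client that reconnects after the buffer has rolled over will miss events; a real service would need to decide whether to restart the response or signal the gap.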

Proxy and infrastructure issues

Many reverse proxies (Nginx, CloudFront) buffer responses by default, defeating streaming. You need:

  • Nginx: proxy_buffering off; (or send an X-Accel-Buffering: no response header from the application)
  • Cloudflare: responses stream automatically for SSE content-type
  • AWS ALB: supports streaming natively
  • Vercel/Netlify: use edge functions for streaming support

Rate limiting and backpressure

If the client cannot process tokens as fast as the server sends them, you need backpressure mechanisms. In practice, LLM token generation (50-100 tokens/second) is slow enough that this is rarely an issue for individual streams, but matters for multiplexed connections serving many users.
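
One simple backpressure mechanism is a bounded queue between the producer and the consumer: when the queue is full, the producer is suspended until the consumer catches up. The sketch below demonstrates this with asyncio; the function names, queue size, and delay are arbitrary illustrative choices.

```python
import asyncio
from typing import List, Tuple


async def producer(queue: asyncio.Queue, tokens: List[str], depths: List[int]) -> None:
    for token in tokens:
        # put() suspends here whenever the queue is full: that is the backpressure.
        await queue.put(token)
        depths.append(queue.qsize())
    await queue.put(None)  # sentinel: no more tokens


async def slow_consumer(queue: asyncio.Queue, delay_s: float) -> List[str]:
    received: List[str] = []
    while True:
        token = await queue.get()
        if token is None:
            return received
        await asyncio.sleep(delay_s)  # simulate a slow client or network write
        received.append(token)


async def demo(max_buffered: int = 8) -> Tuple[List[str], int]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffered)
    depths: List[int] = []
    tokens = [f"tok{i}" for i in range(50)]
    producer_task = asyncio.create_task(producer(queue, tokens, depths))
    received = await slow_consumer(queue, delay_s=0.001)
    await producer_task
    return received, max(depths)


received, max_depth = asyncio.run(demo())
print(len(received), "tokens delivered; queue never held more than", max_depth)
```

The bound caps memory per stream, which is what matters when one process multiplexes many users.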

Real-world streaming implementations

  • ChatGPT - streams responses token-by-token with thinking indicators and tool call pauses
  • Claude - streams with explicit event types: content_block_start, content_block_delta, content_block_stop
  • Vercel AI SDK - provides useChat() and useCompletion() hooks that handle SSE parsing, state management, and UI rendering
  • LangChain - streaming callbacks that emit tokens through the chain pipeline
  • Cursor - streams code edits as diffs applied to the editor in real-time

How to apply in practice

Always stream for user-facing responses. There is no good reason to make users wait for complete responses in interactive applications. The implementation cost is minimal and the UX improvement is dramatic.

Show a thinking indicator before first token. The time between request and first token (TTFT) is still perceived latency. Show a subtle animation or “thinking…” state during this window to signal the system is working.

Handle the “empty stream” case. Sometimes the model produces no output (error, safety filter, empty response). Set a timeout for first token (10-15 seconds) and show an error state if nothing arrives.
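
A first-token timeout can be sketched as a thin wrapper around the token stream. This is a minimal illustration assuming the tokens arrive as an async iterator of strings; with_first_token_timeout and EmptyStreamError are hypothetical names, and the 15-second default simply mirrors the range suggested above.

```python
import asyncio
from typing import AsyncIterator, List


class EmptyStreamError(Exception):
    """Raised when no token arrives in time or the stream ends without output."""


async def with_first_token_timeout(
    stream: AsyncIterator[str], timeout_s: float = 15.0
) -> AsyncIterator[str]:
    iterator = stream.__aiter__()
    try:
        # Only the first token is guarded; later tokens may take as long as they need.
        first = await asyncio.wait_for(iterator.__anext__(), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise EmptyStreamError(f"no token within {timeout_s}s") from None
    except StopAsyncIteration:
        raise EmptyStreamError("stream ended without any token") from None
    yield first
    async for token in iterator:
        yield token


async def fake_model() -> AsyncIterator[str]:
    # Stand-in for a real model stream.
    for token in ["Hello", " world"]:
        yield token


async def main() -> List[str]:
    return [token async for token in with_first_token_timeout(fake_model(), 0.5)]


print(asyncio.run(main()))
```

Catching EmptyStreamError in the endpoint is then the natural place to emit an error event that the UI can turn into an error state.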

Buffer for post-processing. If you need to validate, filter, or transform the response (guardrails, PII redaction), you cannot stream raw model output directly. Stream to a buffer, apply processing, then stream processed output to the client with a small delay.
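
To make the buffering idea concrete, here is a sketch that holds tokens until a sentence boundary, redacts email addresses, and only then releases the text. The regular expressions are deliberately simple and are assumptions for illustration, not a production PII detector; redact_stream is a hypothetical name.

```python
import re
from typing import Iterable, Iterator

# Simplified patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def redact_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer tokens into sentences, redact each sentence, then emit it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence and safe to process.
        for sentence in parts[:-1]:
            # Whitespace between sentences is normalized to a single space.
            yield EMAIL.sub("[REDACTED]", sentence) + " "
        buffer = parts[-1]
    if buffer:
        yield EMAIL.sub("[REDACTED]", buffer)


tokens = ["Contact ", "alice@", "example.com", " today. ", "Thanks", "!"]
print(list(redact_stream(tokens)))
```

Note that the address is split across two tokens ("alice@" and "example.com"), which is exactly why filtering token by token would miss it.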

Log complete responses for monitoring. Streaming makes logging harder - you need to reassemble the full response from chunks for analytics, eval, and debugging. Accumulate chunks server-side alongside streaming to the client.
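
Accumulating alongside streaming can be done with a small wrapper generator. The sketch below assumes tokens arrive as an async iterator of strings; stream_and_log and the on_complete callback are hypothetical names standing in for whatever logging or analytics sink the application uses.

```python
import asyncio
from typing import AsyncIterator, Callable, List


async def stream_and_log(
    stream: AsyncIterator[str], on_complete: Callable[[str], None]
) -> AsyncIterator[str]:
    """Pass tokens through unchanged while collecting them for logging."""
    parts: List[str] = []
    try:
        async for token in stream:
            parts.append(token)
            yield token
    finally:
        # Runs on normal completion and also if the consumer stops early,
        # so partial responses are recorded too.
        on_complete("".join(parts))


async def fake_model() -> AsyncIterator[str]:
    # Stand-in for a real model stream.
    for token in ["Hel", "lo", "!"]:
        yield token


async def main() -> None:
    async for token in stream_and_log(fake_model(), lambda text: print("logged:", text)):
        print("sent:", token)


asyncio.run(main())
```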

FAQ

Q: SSE vs WebSockets for LLM streaming - which should I use?

SSE for most LLM applications. SSE is simpler (plain HTTP, no protocol upgrade, automatic reconnection in EventSource), unidirectional (which is all you need - server sends tokens, client receives), and sufficient for token streaming. WebSockets make sense when you need bidirectional real-time communication (collaborative editing, real-time multiplayer) or when you want to send control messages, such as steering or cancelling a generation, over the same connection. With SSE you cancel by closing the connection instead. For pure LLM streaming, SSE is the right choice.

Q: How do I cancel a streaming response if the user navigates away or clicks “stop”?

Client-side: close the EventSource or abort the fetch request. Server-side: detect the closed connection and cancel the LLM generation (most SDKs support AbortController or stream cancellation). This saves tokens and compute. Implementation:

const controller = new AbortController();
const response = await fetch('/api/chat', { signal: controller.signal });
// User clicks stop:
controller.abort();
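
On the server side, an async framework typically reacts to a disconnect by closing or cancelling the generator that produces the response. The sketch below shows how a relay generator might notice that and release the upstream model call. It is an illustration under assumptions: relay and on_cancel are hypothetical names, and how a particular framework surfaces disconnects should be checked in its documentation.

```python
import asyncio
from typing import AsyncIterator, Callable, List


async def relay(
    upstream: AsyncIterator[str], on_cancel: Callable[[], None]
) -> AsyncIterator[str]:
    """Forward tokens; if the consumer goes away, trigger upstream cancellation."""
    try:
        async for token in upstream:
            yield token
    except (asyncio.CancelledError, GeneratorExit):
        # Reached when the task is cancelled or the generator is closed early.
        on_cancel()
        raise


async def fake_model() -> AsyncIterator[str]:
    # Stand-in for a long-running model stream.
    for i in range(100):
        await asyncio.sleep(0)
        yield f"t{i}"


async def main() -> List[str]:
    events: List[str] = []
    stream = relay(fake_model(), lambda: events.append("upstream cancelled"))
    events.append(await stream.__anext__())  # client receives one token...
    await stream.aclose()                    # ...then disconnects
    return events


print(asyncio.run(main()))
```

In a real endpoint, on_cancel would close the SDK's stream object or cancel the task driving it, so the provider stops generating tokens nobody will read.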

Q: Does streaming affect the quality of the response?

No. Streaming does not change how tokens are generated: the model runs the same decoding process whether streaming is on or off, so any run-to-run variation comes from sampling, not from the delivery mode. Streaming is purely a delivery mechanism. The only difference: without streaming, if the connection drops at 90% completion, you get nothing. With streaming, you get 90% of the response. This makes streaming more resilient to network issues.

Interview questions

Q: Design the streaming architecture for a multi-turn AI chatbot that uses RAG and tool calling. Users should see smooth token-by-token streaming even when tools are being called.

Architecture: (1) First phase: retrieve context (non-streaming, fast) → show “searching…” indicator to user. (2) Second phase: stream LLM response. If the model outputs text, stream directly to client. If the model outputs a tool call, pause the visible stream, show “looking up information…” indicator, execute the tool, then resume streaming with the tool result injected. (3) Handle gracefully: if tool execution takes >3 seconds, show progress. If it fails, stream an explanation. (4) Client implementation: use ReadableStream with state machine (streaming_text → tool_call_pending → streaming_text). (5) Edge cases: model calls multiple tools in sequence - show appropriate status for each. Model interleaves text and tool calls - buffer small text segments before tool calls to avoid jarring UX of one word → pause → one word.

Q: Your streaming endpoint works locally but responses arrive in large chunks (not token-by-token) in production. What is wrong?

Almost certainly response buffering by an intermediary. Diagnose by layer: (1) Check Nginx/reverse proxy config - needs proxy_buffering off; or an X-Accel-Buffering: no response header sent by the application. (2) Check CDN/edge (CloudFront, Cloudflare) - may need specific streaming configuration. (3) Check application framework - some frameworks or middleware buffer response bodies. Ensure you are using streaming response types (StreamingResponse in FastAPI, not regular Response). (4) Check compression - gzip compression can buffer until a complete compression block is ready. Either disable compression for SSE endpoints or keep it and flush after each event. (5) Check the container or platform layer - an ingress controller, sidecar proxy, or platform router in front of the container can add its own buffering. Fix: test each layer independently (curl -N directly to the app server, then through each proxy layer) to isolate where buffering occurs.