Next.js 15 Streaming AI Chat: 6 Production Patterns from SSE to React Server Actions
Why AI Chat Always Feels Laggy
You open an AI chat page, type a question, and stare at an empty response box for 8 seconds. The LLM finally dumps a wall of text, but by then the user has already assumed the page crashed. Under the traditional request-response model, perceived latency = network latency + LLM time-to-first-token + full generation time, so the page feels frozen. Streaming pushes tokens to the client as they are generated, so perceived latency shrinks to network latency + time-to-first-token: users can see the first characters in roughly half a second, which for long replies can cut the perceived wait by 80% or more.
Next.js 15 provides a complete toolchain for streaming AI chat, from low-level SSE to high-level React Server Actions. This article walks you through SSE streaming response → React Server Actions + AI → Vercel AI SDK integration → multi-model routing with fallback → conversation state and context management → production deployment and performance optimization — 6 production patterns from protocol to launch.
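The latency formula above can be checked with a few lines of arithmetic. This is a minimal sketch: the `LatencyProfile` type and all timings are hypothetical, chosen only to match the 8-second and 0.5-second figures in the opening scenario.

```typescript
// All timings are illustrative assumptions, not measurements.
interface LatencyProfile {
  networkMs: number;    // round-trip network latency
  firstTokenMs: number; // LLM time-to-first-token
  generationMs: number; // time to generate the rest of the reply
}

// Request-response: nothing is shown until the whole reply exists.
function blockingPerceivedMs(p: LatencyProfile): number {
  return p.networkMs + p.firstTokenMs + p.generationMs;
}

// Streaming: the first token is rendered as soon as it arrives.
function streamingPerceivedMs(p: LatencyProfile): number {
  return p.networkMs + p.firstTokenMs;
}

const example: LatencyProfile = { networkMs: 100, firstTokenMs: 400, generationMs: 7500 };
console.log(blockingPerceivedMs(example));  // 8000
console.log(streamingPerceivedMs(example)); // 500
```

Total generation time is unchanged; streaming only moves the moment at which the user first sees output.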
Core Takeaways
- SSE is the optimal transport protocol for LLM streaming output; Next.js Route Handlers support it natively
- React Server Actions eliminate the need for hand-written API endpoints — type-safe, zero boilerplate
- Vercel AI SDK unifies streaming interfaces across OpenAI, Anthropic, Google, and more
- Multi-model routing enables cost optimization and disaster fallback — auto-switch to Claude when GPT-4o is down
- Conversation state management must distinguish short-term context from long-term memory to avoid token explosion
- Production deployment must handle concurrent connections, timeouts, backpressure, and error recovery
Table of Contents
- Streaming AI Chat Architecture Overview
- Pattern 1: SSE Streaming Response Implementation
- Pattern 2: React Server Actions + AI
- Pattern 3: Vercel AI SDK Integration
- Pattern 4: Multi-Model Routing and Fallback
- Pattern 5: Conversation State and Context Management
- Pattern 6: Production Deployment and Performance Optimization
- 5 Common Pitfalls and Solutions
- 10 Common Error Troubleshooting
- Advanced Optimization Techniques
- Comparison: SSE vs WebSocket vs Long Polling
- Recommended Online Tools
- Summary
Streaming AI Chat Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│               Next.js 15 AI Chat Architecture               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────┐   SSE/Stream    ┌──────────────────────┐      │
│  │  Client  │ ◄────────────── │   Route Handler /    │      │
│  │  Chat UI │                 │   Server Action      │      │
│  │          │ ──────────────► │                      │      │
│  └──────────┘   POST request  └──────────┬───────────┘      │
│                                          │                  │
│                            ┌─────────────┼──────────┐       │
│                            │             │          │       │
│                            ▼             ▼          ▼       │
│                       ┌──────────┐ ┌──────────┐ ┌────────┐  │
│                       │  OpenAI  │ │ Anthropic│ │ Local  │  │
│                       │  GPT-4o  │ │  Claude  │ │ Ollama │  │
│                       └──────────┘ └──────────┘ └────────┘  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │               Conversation State Layer                │  │
│  │  ┌─────────┐  ┌──────────┐  ┌──────────────────┐      │  │
│  │  │ Context │  │ History  │  │ Long-term Memory │      │  │
│  │  │ Window  │  │ Store    │  │ (Vector DB)      │      │  │
│  │  └─────────┘  └──────────┘  └──────────────────┘      │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Technology Selection Decision Tree
Need streaming AI chat?
├── Quick prototype → Vercel AI SDK (Pattern 3)
├── Full control → SSE + Route Handler (Pattern 1)
├── Type safety first → React Server Actions (Pattern 2)
├── Multi-model needs → Multi-model routing (Pattern 4)
└── Enterprise production → All combined + state management (Pattern 5+6)
Pattern 1: SSE Streaming Response Implementation
SSE (Server-Sent Events) is the standard transport protocol for LLM streaming output. Next.js 15 Route Handlers natively support ReadableStream, enabling direct SSE-formatted streaming responses.
Basic SSE Route Handler
// app/api/chat/sse/route.ts
import { NextRequest } from 'next/server';
export const runtime = 'edge';
export const maxDuration = 60;
interface ChatMessage {
role: 'system' | 'user' | 'assistant';
content: string;
}
export async function POST(request: NextRequest) {
const { messages }: { messages: ChatMessage[] } = await request.json();
const encoder = new TextEncoder();
const stream = new ReadableStream({
async start(controller) {
try {
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
},
body: JSON.stringify({
model: 'gpt-4o',
messages,
stream: true,
}),
});
if (!response.ok) {
const errorData = await response.text();
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ error: errorData })}\n\n`)
);
controller.close();
return;
}
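const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = ''; // holds a partial line when an SSE event is split across chunks
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const parts = buffer.split('\n');
buffer = parts.pop() ?? ''; // keep the incomplete trailing line for the next read
const lines = parts.filter((line) => line.startsWith('data: '));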
for (const line of lines) {
const data = line.slice(6);
if (data === '[DONE]') {
controller.enqueue(encoder.encode('data: [DONE]\n\n'));
continue;
}
try {
const parsed = JSON.parse(data);
const content = parsed.choices?.[0]?.delta?.content;
if (content) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ content })}\n\n`)
);
}
} catch {
// skip malformed chunks
}
}
}
controller.close();
} catch (error) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ error: String(error) })}\n\n`)
);
controller.close();
}
},
});
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache',
Connection: 'keep-alive',
},
});
}
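To make the wire format concrete, here is a minimal sketch of SSE framing and incremental parsing. `encodeSSE` and `createSSEParser` are illustrative helper names, not part of Next.js or any SDK. The parser only handles the single-line `data:` frames used in this article (no CRLF line endings, `event:` fields, or multi-line `data:` payloads).

```typescript
// Encode one payload as an SSE frame: a "data: " line followed by a blank line.
function encodeSSE(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Incremental parser: feed it decoded text chunks, get back complete data payloads.
// An incomplete trailing line stays in the buffer until the next chunk arrives.
function createSSEParser(): (chunk: string) => string[] {
  let buffer = '';
  return (chunk: string): string[] => {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // last element is an incomplete line (or '')
    return lines
      .filter((line) => line.startsWith('data: '))
      .map((line) => line.slice(6));
  };
}

// A frame split across two network chunks is still parsed exactly once.
const parse = createSSEParser();
const frame = encodeSSE({ content: 'Hello' });
const first = parse(frame.slice(0, 10)); // 'data: {"co' has no newline yet
const second = parse(frame.slice(10));
console.log(first);  // []
console.log(second); // [ '{"content":"Hello"}' ]
```

Network chunk boundaries do not line up with SSE event boundaries, so a parser that splits each chunk on its own can silently drop tokens. Carrying the remainder forward avoids that.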
Client-Side SSE Consumer
// components/ChatSSE.tsx
'use client';
import { useState, useRef, useCallback } from 'react';
interface Message {
id: string;
role: 'user' | 'assistant';
content: string;
}
export default function ChatSSE() {
const [messages, setMessages] = useState<Message[]>([]);
const [input, setInput] = useState('');
const [isStreaming, setIsStreaming] = useState(false);
const abortControllerRef = useRef<AbortController | null>(null);
const sendMessage = useCallback(async () => {
if (!input.trim() || isStreaming) return;
const userMessage: Message = {
id: crypto.randomUUID(),
role: 'user',
content: input.trim(),
};
const assistantMessage: Message = {
id: crypto.randomUUID(),
role: 'assistant',
content: '',
};
setMessages((prev) => [...prev, userMessage, assistantMessage]);
setInput('');
setIsStreaming(true);
const abortController = new AbortController();
abortControllerRef.current = abortController;
try {
const response = await fetch('/api/chat/sse', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
messages: [...messages, userMessage].map((m) => ({
role: m.role,
content: m.content,
})),
}),
signal: abortController.signal,
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
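const reader = response.body!.getReader();
const decoder = new TextDecoder();
let buffer = ''; // holds a partial line when an SSE event is split across chunks
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const parts = buffer.split('\n');
buffer = parts.pop() ?? ''; // keep the incomplete trailing line for the next read
const lines = parts.filter((line) => line.startsWith('data: '));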
for (const line of lines) {
const data = line.slice(6);
if (data === '[DONE]') continue;
try {
const parsed = JSON.parse(data);
if (parsed.error) {
console.error('SSE error:', parsed.error);
continue;
}
if (parsed.content) {
setMessages((prev) =>
prev.map((m) =>
m.id === assistantMessage.id
? { ...m, content: m.content + parsed.content }
: m
)
);
}
} catch {
// skip malformed data
}
}
}
} catch (error) {
if ((error as Error).name !== 'AbortError') {
console.error('Stream error:', error);
}
} finally {
setIsStreaming(false);
abortControllerRef.current = null;
}
}, [input, messages, isStreaming]);
const stopStreaming = useCallback(() => {
abortControllerRef.current?.abort();
}, []);
return (
<div className="flex flex-col h-screen max-w-3xl mx-auto p-4">
<div className="flex-1 overflow-y-auto space-y-4 mb-4">
{messages.map((msg) => (
<div
key={msg.id}
className={`p-3 rounded-lg ${
msg.role === 'user'
? 'bg-blue-600 text-white ml-auto max-w-[80%]'
: 'bg-gray-100 text-gray-900 mr-auto max-w-[80%]'
}`}
>
<pre className="whitespace-pre-wrap font-sans text-sm">{msg.content}</pre>
</div>
))}
</div>
<div className="flex gap-2">
<input
type="text"
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyDown={(e) => e.key === 'Enter' && !e.shiftKey && sendMessage()}
placeholder="Type a message..."
className="flex-1 px-4 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500"
disabled={isStreaming}
/>
<button
onClick={isStreaming ? stopStreaming : sendMessage}
className={`px-4 py-2 rounded-lg font-medium ${
isStreaming
? 'bg-red-500 hover:bg-red-600 text-white'
: 'bg-blue-600 hover:bg-blue-700 text-white'
}`}
>
{isStreaming ? 'Stop' : 'Send'}
</button>
</div>
</div>
);
}
Pattern 2: React Server Actions + AI
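React Server Actions remove the need for hand-written API endpoints. You define async functions in a file marked 'use server' and call them from the client, either through useActionState for form submissions or as a direct function call. Arguments and return values are typed end to end, with no fetch or routing boilerplate.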
Server Action Definition
// app/actions/chat-action.ts
'use server';
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
interface ChatState {
messages: Array<{ role: 'user' | 'assistant'; content: string }>;
error?: string;
}
export async function chatAction(
prevState: ChatState,
formData: FormData
): Promise<ChatState> {
const userInput = formData.get('message') as string;
if (!userInput?.trim()) {
return { ...prevState, error: 'Message cannot be empty' };
}
const newMessages = [
...prevState.messages,
{ role: 'user' as const, content: userInput.trim() },
];
try {
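// Awaiting result.text waits for the full completion, so this action returns
// the whole reply at once. See the streaming variant below for incremental output.
const result = await streamText({
model: openai('gpt-4o'),
messages: newMessages,
});
const text = await result.text;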
return {
messages: [
...newMessages,
{ role: 'assistant' as const, content: text },
],
};
} catch (error) {
return {
...prevState,
messages: newMessages,
error: `AI call failed: ${String(error)}`,
};
}
}
Streaming Server Action
// app/actions/streaming-chat-action.ts
'use server';
import { createStreamableValue } from 'ai/rsc';
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
export async function streamingChatAction(messages: Array<{ role: 'user' | 'assistant'; content: string }>) {
const streamableValue = createStreamableValue('');
(async () => {
try {
const result = await streamText({
model: openai('gpt-4o'),
messages,
});
for await (const chunk of result.textStream) {
streamableValue.update(chunk);
}
streamableValue.done();
} catch (error) {
streamableValue.error(String(error));
}
})();
return streamableValue.value;
}
Client-Side Streaming Server Action Consumer
// components/ChatServerAction.tsx
'use client';
import { readStreamableValue } from 'ai/rsc';
import { streamingChatAction } from '@/app/actions/streaming-chat-action';
import { useState, useCallback } from 'react';
interface Message {
id: string;
role: 'user' | 'assistant';
content: string;
}
export default function ChatServerAction() {
const [messages, setMessages] = useState<Message[]>([]);
const [input, setInput] = useState('');
const [isStreaming, setIsStreaming] = useState(false);
const handleSubmit = useCallback(async () => {
if (!input.trim() || isStreaming) return;
const userMessage: Message = {
id: crypto.randomUUID(),
role: 'user',
content: input.trim(),
};
const assistantMessage: Message = {
id: crypto.randomUUID(),
role: 'assistant',
content: '',
};
const updatedMessages = [...messages, userMessage];
setMessages([...updatedMessages, assistantMessage]);
setInput('');
setIsStreaming(true);
try {
const streamValue = await streamingChatAction(
updatedMessages.map((m) => ({ role: m.role, content: m.content }))
);
for await (const chunk of readStreamableValue(streamValue)) {
if (chunk) {
setMessages((prev) =>
prev.map((m) =>
m.id === assistantMessage.id
? { ...m, content: m.content + chunk }
: m
)
);
}
}
} catch (error) {
console.error('Server Action error:', error);
} finally {
setIsStreaming(false);
}
}, [input, messages, isStreaming]);
return (
<div className="flex flex-col h-screen max-w-3xl mx-auto p-4">
<div className="flex-1 overflow-y-auto space-y-4 mb-4">
{messages.map((msg) => (
<div
key={msg.id}
className={`p-3 rounded-lg ${
msg.role === 'user'
? 'bg-blue-600 text-white ml-auto max-w-[80%]'
: 'bg-gray-100 text-gray-900 mr-auto max-w-[80%]'
}`}
>
<pre className="whitespace-pre-wrap font-sans text-sm">{msg.content}</pre>
</div>
))}
</div>
<div className="flex gap-2">
<input
type="text"
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyDown={(e) => e.key === 'Enter' && !e.shiftKey && handleSubmit()}
placeholder="Type a message..."
className="flex-1 px-4 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500"
disabled={isStreaming}
/>
<button
onClick={handleSubmit}
disabled={isStreaming}
className="px-4 py-2 bg-blue-600 hover:bg-blue-700 text-white rounded-lg font-medium disabled:opacity-50"
>
{isStreaming ? 'Generating...' : 'Send'}
</button>
</div>
</div>
);
}
Pattern 3: Vercel AI SDK Integration
The Vercel AI SDK (the ai package) is the approach Vercel recommends for streaming AI chat in Next.js. It unifies streaming interfaces across multiple LLM providers and provides ready-to-use hooks like useChat.
Installation and Configuration
npm install ai @ai-sdk/react @ai-sdk/openai @ai-sdk/anthropic @ai-sdk/google zod
Route Handler (AI SDK Version)
// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
export const runtime = 'edge';
export const maxDuration = 60;
export async function POST(request: Request) {
const { messages } = await request.json();
const result = streamText({
model: openai('gpt-4o'),
system: 'You are a helpful AI assistant. Answer questions concisely and accurately.',
messages,
maxTokens: 4096,
temperature: 0.7,
});
return result.toDataStreamResponse();
}
useChat Hook (Minimal Implementation)
// components/ChatAISDK.tsx
'use client';
import { useChat } from '@ai-sdk/react';
export default function ChatAISDK() {
const { messages, input, handleInputChange, handleSubmit, isLoading, stop } =
useChat({
api: '/api/chat',
onError: (error) => {
console.error('Chat error:', error);
},
onFinish: (message) => {
console.log('Finished:', message.content.length, 'chars');
},
});
return (
<div className="flex flex-col h-screen max-w-3xl mx-auto p-4">
<div className="flex-1 overflow-y-auto space-y-4 mb-4">
{messages.map((msg) => (
<div
key={msg.id}
className={`p-3 rounded-lg ${
msg.role === 'user'
? 'bg-blue-600 text-white ml-auto max-w-[80%]'
: 'bg-gray-100 text-gray-900 mr-auto max-w-[80%]'
}`}
>
<div className="whitespace-pre-wrap text-sm">{msg.content}</div>
</div>
))}
{isLoading && messages[messages.length - 1]?.role === 'user' && (
<div className="bg-gray-100 text-gray-900 mr-auto max-w-[80%] p-3 rounded-lg">
<div className="flex items-center gap-2 text-sm text-gray-500">
<div className="animate-pulse">●</div>
<div className="animate-pulse delay-75">●</div>
<div className="animate-pulse delay-150">●</div>
</div>
</div>
)}
</div>
<form onSubmit={handleSubmit} className="flex gap-2">
<input
value={input}
onChange={handleInputChange}
placeholder="Type a message..."
className="flex-1 px-4 py-2 border border-gray-300 rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500"
/>
<button
type={isLoading ? 'button' : 'submit'}
onClick={isLoading ? stop : undefined}
className={`px-4 py-2 rounded-lg font-medium ${
isLoading
? 'bg-red-500 hover:bg-red-600 text-white'
: 'bg-blue-600 hover:bg-blue-700 text-white'
}`}
>
{isLoading ? 'Stop' : 'Send'}
</button>
</form>
</div>
);
}
AI SDK Advanced Configuration
// app/api/chat/advanced/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';
import { z } from 'zod';
export async function POST(request: Request) {
const { messages, conversationId } = await request.json();
const result = streamText({
model: openai('gpt-4o'),
system: 'You are a helpful AI assistant.',
messages,
maxTokens: 4096,
temperature: 0.7,
tools: {
searchWeb: {
description: 'Search the web for up-to-date information',
parameters: z.object({
query: z.string().describe('Search query'),
}),
execute: async ({ query }) => {
const res = await fetch(
`https://api.search.example.com/search?q=${encodeURIComponent(query)}`
);
return res.json();
},
},
},
onChunk: async ({ chunk }) => {
if (chunk.type === 'tool-call') {
console.log(`[Chat ${conversationId}] Tool call:`, chunk.toolName);
}
},
onFinish: async ({ response, usage }) => {
console.log(`[Chat ${conversationId}] Tokens:`, usage);
},
});
return result.toDataStreamResponse({
headers: {
'X-Conversation-Id': conversationId,
},
});
}
Pattern 4: Multi-Model Routing and Fallback
In production, you can't rely on a single LLM provider. Multi-model routing enables cost optimization and disaster fallback — auto-switch to Claude when GPT-4o is down, use GPT-4o-mini for simple questions to save costs.
Model Router
// lib/ai/model-router.ts
import { LanguageModelV1 } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { google } from '@ai-sdk/google';
type ModelTier = 'fast' | 'standard' | 'premium';
interface ModelConfig {
model: LanguageModelV1;
name: string;
tier: ModelTier;
maxRetries: number;
timeoutMs: number;
costPer1kTokens: number;
}
const MODEL_REGISTRY: Record<ModelTier, ModelConfig[]> = {
fast: [
{
model: openai('gpt-4o-mini'),
name: 'gpt-4o-mini',
tier: 'fast',
maxRetries: 2,
timeoutMs: 15000,
costPer1kTokens: 0.00015,
},
{
model: anthropic('claude-3-5-haiku-20241022'),
name: 'claude-3.5-haiku',
tier: 'fast',
maxRetries: 2,
timeoutMs: 15000,
costPer1kTokens: 0.00025,
},
],
standard: [
{
model: openai('gpt-4o'),
name: 'gpt-4o',
tier: 'standard',
maxRetries: 2,
timeoutMs: 30000,
costPer1kTokens: 0.005,
},
{
model: anthropic('claude-sonnet-4-20250514'),
name: 'claude-sonnet-4',
tier: 'standard',
maxRetries: 2,
timeoutMs: 30000,
costPer1kTokens: 0.003,
},
],
premium: [
{
model: openai('o3'),
name: 'o3',
tier: 'premium',
maxRetries: 1,
timeoutMs: 60000,
costPer1kTokens: 0.03,
},
{
model: anthropic('claude-opus-4-20250514'),
name: 'claude-opus-4',
tier: 'premium',
maxRetries: 1,
timeoutMs: 60000,
costPer1kTokens: 0.015,
},
],
};
interface RouteDecision {
model: LanguageModelV1;
modelName: string;
tier: ModelTier;
fallbackChain: string[];
}
export function routeModel(
complexity: 'simple' | 'medium' | 'complex',
preferredProvider?: 'openai' | 'anthropic' | 'google'
): RouteDecision {
const tierMap: Record<string, ModelTier> = {
simple: 'fast',
medium: 'standard',
complex: 'premium',
};
const tier = tierMap[complexity];
const models = MODEL_REGISTRY[tier];
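// Match on the SDK's provider id (e.g. 'openai.chat', 'anthropic.messages');
// model names such as 'gpt-4o' never start with the provider name.
const preferred = preferredProvider
? models.find((m) => m.model.provider.startsWith(preferredProvider))
: models[0];
const selected = preferred || models[0];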
const fallbackChain = models
.filter((m) => m.name !== selected.name)
.map((m) => m.name);
return {
model: selected.model,
modelName: selected.name,
tier,
fallbackChain,
};
}
export { MODEL_REGISTRY };
export type { ModelConfig, ModelTier };
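routeModel expects the caller to supply a complexity level. One way to derive it on the server is a cheap heuristic over the user's message. The sketch below is an assumption, not part of the router above: `classifyComplexity`, its keyword list, and its word-count thresholds are hypothetical and would need tuning against real traffic.

```typescript
type Complexity = 'simple' | 'medium' | 'complex';

// Hypothetical signals: keywords that tend to indicate multi-step reasoning.
const COMPLEX_HINTS = /\b(prove|derive|architecture|refactor|step[- ]by[- ]step|trade-?offs?)\b/i;
// Hypothetical signal: punctuation that suggests the message contains code.
const CODE_HINTS = /[{};]|=>/;

function classifyComplexity(message: string): Complexity {
  const words = message.trim().split(/\s+/).filter(Boolean).length;
  if (COMPLEX_HINTS.test(message) || words > 200) return 'complex';
  if (CODE_HINTS.test(message) || words > 40) return 'medium';
  return 'simple';
}

console.log(classifyComplexity('What is SSE?'));                               // simple
console.log(classifyComplexity('Fix this: const x = () => { return 1; }'));    // medium
console.log(classifyComplexity('Explain the trade-offs of SSE vs WebSocket')); // complex
```

A heuristic like this costs nothing per request. A misclassification only means a cheaper or more expensive model than ideal, so it is a reasonable starting point before investing in a learned classifier.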
Streaming Route Handler with Fallback
// app/api/chat/routed/route.ts
import { streamText } from 'ai';
import { routeModel, MODEL_REGISTRY } from '@/lib/ai/model-router';
import { NextRequest } from 'next/server';
export const runtime = 'edge';
export const maxDuration = 60;
interface ChatRequest {
messages: Array<{ role: string; content: string }>;
complexity?: 'simple' | 'medium' | 'complex';
provider?: 'openai' | 'anthropic';
}
export async function POST(request: NextRequest) {
const body: ChatRequest = await request.json();
const { messages, complexity = 'medium', provider } = body;
const route = routeModel(complexity, provider);
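try {
// Note: streamText returns immediately and reports provider failures through the
// stream, so this catch only sees errors thrown synchronously (e.g. invalid options).
// Catching an outage before the response starts requires waiting for the first chunk.
const result = streamText({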
model: route.model,
system: 'You are a helpful AI assistant.',
messages,
maxTokens: 4096,
abortSignal: request.signal,
});
return result.toDataStreamResponse({
headers: {
'X-Model-Name': route.modelName,
'X-Model-Tier': route.tier,
'X-Fallback-Chain': route.fallbackChain.join(','),
},
});
} catch (error) {
const fallbackModelName = route.fallbackChain[0];
const allModels = Object.values(MODEL_REGISTRY).flat();
const fallback = allModels.find((m) => m.name === fallbackModelName);
if (!fallback) {
return new Response(
JSON.stringify({ error: 'All models unavailable' }),
{ status: 503, headers: { 'Content-Type': 'application/json' } }
);
}
const result = streamText({
model: fallback.model,
system: 'You are a helpful AI assistant.',
messages,
maxTokens: 4096,
});
return result.toDataStreamResponse({
headers: {
'X-Model-Name': fallback.name,
'X-Model-Tier': fallback.tier,
'X-Fallback-Used': 'true',
},
});
}
}
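Fallback is only safe before any output has reached the client. The sketch below shows one way to express that, independent of any SDK: each provider is modeled as a factory returning an async iterable of text chunks, and a provider is committed to only after it yields its first chunk. `StreamFactory`, `streamWithFallback`, and the two demo providers are hypothetical names. In a real handler each factory would wrap a provider call.

```typescript
// A factory starts one provider call and returns its text chunks.
type StreamFactory = () => AsyncIterable<string>;

async function* streamWithFallback(factories: StreamFactory[]): AsyncGenerator<string> {
  let lastError: unknown = new Error('no models configured');
  for (const factory of factories) {
    const iterator = factory()[Symbol.asyncIterator]();
    // Wait for the first chunk: a failure before any output can still fall back.
    const first = await iterator.next().then(
      (result) => ({ ok: true as const, result }),
      (error: unknown) => ({ ok: false as const, error }),
    );
    if (!first.ok) {
      lastError = first.error;
      continue; // try the next provider in the chain
    }
    const result = first.result;
    if (result.done) return; // provider finished without output
    yield result.value;
    // Committed: later errors propagate instead of switching models mid-reply.
    while (true) {
      const next = await iterator.next();
      if (next.done) return;
      yield next.value;
    }
  }
  throw new Error(`All models unavailable: ${String(lastError)}`);
}

// Demo providers (placeholders for real model calls).
async function* failingProvider(): AsyncGenerator<string> {
  throw new Error('primary model is down');
}
async function* workingProvider(): AsyncGenerator<string> {
  yield 'Hel';
  yield 'lo';
}

async function collect(factories: StreamFactory[]): Promise<string> {
  let text = '';
  for await (const chunk of streamWithFallback(factories)) text += chunk;
  return text;
}

collect([failingProvider, workingProvider]).then((text) => console.log(text)); // Hello
```

Once the first chunk has been sent, the sketch deliberately stops falling back, because splicing two models' output into one reply is usually worse than surfacing the error.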
Pattern 5: Conversation State and Context Management
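One of the core challenges of AI chat is context management. Because every request resends the conversation history, per-request token consumption grows linearly with the number of turns, and the cumulative cost of a conversation grows roughly quadratically. You must distinguish the short-term context window from long-term memory.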
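A back-of-the-envelope model shows how quickly untrimmed history adds up. This is a simplified sketch under a stated assumption: every turn (question plus answer) adds the same number of tokens, and the whole history is resent each time. The function names and the 200-token figure are hypothetical.

```typescript
// Prompt size of request number `turn` when the full history is resent:
// approximated as turn * tokensPerTurn.
function promptTokensForTurn(turn: number, tokensPerTurn: number): number {
  return turn * tokensPerTurn;
}

// Total prompt tokens billed over a conversation of `turns` requests.
function cumulativePromptTokens(turns: number, tokensPerTurn: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += promptTokensForTurn(turn, tokensPerTurn);
  }
  return total; // equals tokensPerTurn * turns * (turns + 1) / 2
}

console.log(promptTokensForTurn(50, 200));    // 10000
console.log(cumulativePromptTokens(10, 200)); // 11000
console.log(cumulativePromptTokens(50, 200)); // 255000
```

Under this model, five times as many turns costs about 23 times as many prompt tokens, which is why the context window needs a hard budget and older turns need to be summarized or dropped.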
Conversation State Management
// lib/ai/conversation-state.ts
import { Redis } from '@upstash/redis';
const redis = new Redis({
url: process.env.REDIS_URL!,
token: process.env.REDIS_TOKEN!,
});
interface ConversationMessage {
id: string;
role: 'system' | 'user' | 'assistant';
content: string;
timestamp: number;
tokenCount: number;
}
interface ConversationState {
id: string;
userId: string;
title: string;
messages: ConversationMessage[];
summary?: string;
totalTokens: number;
createdAt: number;
updatedAt: number;
}
const MAX_CONTEXT_TOKENS = 8000;
const SUMMARY_THRESHOLD = 6000;
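// Rough heuristic (about 3.5 characters per token for English text); use a real tokenizer when accuracy matters
function estimateTokenCount(text: string): number {
return Math.ceil(text.length / 3.5);
}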
export async function getConversation(conversationId: string): Promise<ConversationState | null> {
const data = await redis.get<ConversationState>(
`conversation:${conversationId}`
);
return data;
}
export async function saveConversation(state: ConversationState): Promise<void> {
state.updatedAt = Date.now();
await redis.set(`conversation:${state.id}`, JSON.stringify(state), {
ex: 86400 * 30,
});
}
export async function createContextWindow(
conversationId: string
): Promise<ConversationMessage[]> {
const conversation = await getConversation(conversationId);
if (!conversation) return [];
const messages = conversation.messages;
const totalTokens = messages.reduce((sum, m) => sum + m.tokenCount, 0);
if (totalTokens <= MAX_CONTEXT_TOKENS) {
return messages;
}
if (conversation.summary) {
const summaryMessage: ConversationMessage = {
id: 'summary',
role: 'system',
content: `Summary of previous conversation:\n${conversation.summary}`,
timestamp: Date.now(),
tokenCount: estimateTokenCount(conversation.summary),
};
const recentMessages: ConversationMessage[] = [];
let currentTokens = summaryMessage.tokenCount;
for (let i = messages.length - 1; i >= 0; i--) {
if (currentTokens + messages[i].tokenCount > MAX_CONTEXT_TOKENS) break;
recentMessages.unshift(messages[i]);
currentTokens += messages[i].tokenCount;
}
return [summaryMessage, ...recentMessages];
}
const result: ConversationMessage[] = [];
let currentTokens = 0;
for (let i = messages.length - 1; i >= 0; i--) {
if (currentTokens + messages[i].tokenCount > MAX_CONTEXT_TOKENS) break;
result.unshift(messages[i]);
currentTokens += messages[i].tokenCount;
}
return result;
}
export async function generateSummary(
conversationId: string,
messages: ConversationMessage[]
): Promise<string> {
const conversationText = messages
.map((m) => `${m.role}: ${m.content}`)
.join('\n');
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
},
body: JSON.stringify({
model: 'gpt-4o-mini',
messages: [
{
role: 'system',
content: 'Summarize the key points and conclusions of this conversation in 2-3 sentences.',
},
{ role: 'user', content: conversationText },
],
max_tokens: 200,
}),
});
const data = await response.json();
const summary = data.choices?.[0]?.message?.content || '';
const conversation = await getConversation(conversationId);
if (conversation) {
conversation.summary = summary;
await saveConversation(conversation);
}
return summary;
}
export { estimateTokenCount, MAX_CONTEXT_TOKENS, SUMMARY_THRESHOLD };
export type { ConversationMessage, ConversationState };
Conversation Management Route Handler
// app/api/chat/managed/route.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { NextRequest } from 'next/server';
import {
getConversation,
saveConversation,
createContextWindow,
generateSummary,
estimateTokenCount,
SUMMARY_THRESHOLD,
} from '@/lib/ai/conversation-state';
import type { ConversationMessage, ConversationState } from '@/lib/ai/conversation-state';
export const runtime = 'edge';
export const maxDuration = 60;
export async function POST(request: NextRequest) {
const { message, conversationId, userId } = await request.json();
let conversation = await getConversation(conversationId);
if (!conversation) {
conversation = {
id: conversationId,
userId,
title: message.slice(0, 30),
messages: [],
totalTokens: 0,
createdAt: Date.now(),
updatedAt: Date.now(),
};
}
const userMsg: ConversationMessage = {
id: crypto.randomUUID(),
role: 'user',
content: message,
timestamp: Date.now(),
tokenCount: estimateTokenCount(message),
};
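conversation.messages.push(userMsg);
conversation.totalTokens += userMsg.tokenCount;
// Persist before building the context: createContextWindow and generateSummary
// both re-read the conversation from Redis and would otherwise miss this message.
await saveConversation(conversation);
if (conversation.totalTokens > SUMMARY_THRESHOLD && !conversation.summary) {
generateSummary(conversationId, conversation.messages.slice(0, -3)).catch(
console.error
);
}
const contextMessages = await createContextWindow(conversationId);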
const result = streamText({
model: openai('gpt-4o'),
system: 'You are a helpful AI assistant. Answer questions concisely and accurately.',
messages: contextMessages.map((m) => ({
role: m.role as 'system' | 'user' | 'assistant',
content: m.content,
})),
maxTokens: 4096,
onFinish: async ({ text, usage }) => {
const assistantMsg: ConversationMessage = {
id: crypto.randomUUID(),
role: 'assistant',
content: text,
timestamp: Date.now(),
// completionTokens counts only the reply; totalTokens would also include the prompt
tokenCount: usage?.completionTokens || estimateTokenCount(text),
};
conversation!.messages.push(assistantMsg);
conversation!.totalTokens += assistantMsg.tokenCount;
await saveConversation(conversation!);
},
});
return result.toDataStreamResponse();
}
Pattern 6: Production Deployment and Performance Optimization
Concurrent Connection Management
// lib/ai/connection-pool.ts
import { Redis } from '@upstash/redis';
const redis = new Redis({
url: process.env.REDIS_URL!,
token: process.env.REDIS_TOKEN!,
});
const MAX_CONCURRENT_CONNECTIONS = 100;
const CONNECTION_TTL_SECONDS = 120;
export async function acquireConnection(userId: string): Promise<boolean> {
const key = `conn:${userId}`;
const current = await redis.incr(key);
if (current === 1) {
await redis.expire(key, CONNECTION_TTL_SECONDS);
}
if (current > MAX_CONCURRENT_CONNECTIONS) {
await redis.decr(key);
return false;
}
return true;
}
export async function releaseConnection(userId: string): Promise<void> {
const key = `conn:${userId}`;
const current = await redis.decr(key);
if (current <= 0) {
await redis.del(key);
}
}
export async function getConnectionCount(userId: string): Promise<number> {
const count = await redis.get<number>(`conn:${userId}`);
return count || 0;
}
Rate Limiting Middleware
// lib/ai/rate-limiter.ts
import { Redis } from '@upstash/redis';
import { NextRequest, NextResponse } from 'next/server';
const redis = new Redis({
url: process.env.REDIS_URL!,
token: process.env.REDIS_TOKEN!,
});
interface RateLimitConfig {
windowMs: number;
maxRequests: number;
}
const RATE_LIMITS: Record<string, RateLimitConfig> = {
free: { windowMs: 60000, maxRequests: 10 },
pro: { windowMs: 60000, maxRequests: 60 },
enterprise: { windowMs: 60000, maxRequests: 300 },
};
export async function rateLimitMiddleware(
request: NextRequest,
userId: string,
tier: string = 'free'
): Promise<NextResponse | null> {
const config = RATE_LIMITS[tier] || RATE_LIMITS.free;
const key = `rate:${userId}:${Math.floor(Date.now() / config.windowMs)}`;
const current = await redis.incr(key);
if (current === 1) {
await redis.expire(key, Math.ceil(config.windowMs / 1000));
}
if (current > config.maxRequests) {
return NextResponse.json(
{
error: 'Too many requests. Please try again later.',
retryAfter: config.windowMs / 1000,
},
{
status: 429,
headers: {
'Retry-After': String(Math.ceil(config.windowMs / 1000)),
'X-RateLimit-Limit': String(config.maxRequests),
'X-RateLimit-Remaining': '0',
},
}
);
}
return null;
}
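The limiter above uses fixed windows keyed by `Math.floor(Date.now() / windowMs)`. The sketch below isolates that arithmetic to show two properties of the scheme: the time left until the window resets is usually shorter than a full window, and two requests close together can land in different buckets. `windowBucket` and `secondsUntilReset` are hypothetical helper names.

```typescript
// Fixed-window bucket index, matching the key scheme rate:${userId}:${bucket}.
function windowBucket(nowMs: number, windowMs: number): number {
  return Math.floor(nowMs / windowMs);
}

// Seconds until the current window rolls over (rounded up, at least 1).
function secondsUntilReset(nowMs: number, windowMs: number): number {
  const windowEnd = (windowBucket(nowMs, windowMs) + 1) * windowMs;
  return Math.max(1, Math.ceil((windowEnd - nowMs) / 1000));
}

// 45 s into a 60 s window: the client only needs to wait 15 s, not 60 s.
console.log(secondsUntilReset(45_000, 60_000)); // 15

// Requests at 59 s and 61 s fall into different buckets, so a burst that
// straddles the boundary can briefly exceed maxRequests per rolling minute.
console.log(windowBucket(59_000, 60_000), windowBucket(61_000, 60_000)); // 0 1
```

Returning the remaining time in Retry-After lets well-behaved clients retry sooner. If the boundary burst matters for your workload, a sliding-window or token-bucket limiter is the usual alternative.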
Production Chat API with Full Middleware Stack
// app/api/chat/production/route.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { NextRequest, NextResponse } from 'next/server';
import { acquireConnection, releaseConnection } from '@/lib/ai/connection-pool';
import { rateLimitMiddleware } from '@/lib/ai/rate-limiter';
import { routeModel } from '@/lib/ai/model-router';
export const runtime = 'edge';
export const maxDuration = 60;
export async function POST(request: NextRequest) {
const body = await request.json();
const { messages, userId = 'anonymous', complexity = 'medium' } = body;
// 1. Rate limit check
const rateLimitResponse = await rateLimitMiddleware(request, userId);
if (rateLimitResponse) return rateLimitResponse;
// 2. Connection limit check
const acquired = await acquireConnection(userId);
if (!acquired) {
return NextResponse.json(
{ error: 'Too many concurrent connections. Please try again later.' },
{ status: 429 }
);
}
try {
// 3. Model routing
const route = routeModel(complexity);
// 4. Streaming generation
const result = streamText({
model: route.model,
system: 'You are a helpful AI assistant.',
messages,
maxTokens: 4096,
abortSignal: request.signal,
});
// 5. Wrap response to ensure connection cleanup
const response = result.toDataStreamResponse({
headers: {
'X-Model-Name': route.modelName,
'X-Request-Id': crypto.randomUUID(),
},
});
// tee() yields two independent branches; reusing response.body twice would lock the same stream
const [body1, body2] = response.body!.tee();
const finalResponse = new Response(body1, {
status: response.status,
headers: response.headers,
});
const reader = body2.getReader();
(async () => {
try {
while (true) {
const { done } = await reader.read();
if (done) break;
}
} finally {
await releaseConnection(userId);
}
})();
return finalResponse;
} catch (error) {
await releaseConnection(userId);
return NextResponse.json(
{ error: `AI service error: ${String(error)}` },
{ status: 500 }
);
}
}
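Step 5 needs the connection slot released exactly once, whether the stream completes, fails, or the client disconnects. As an alternative sketch that uses only the standard Web Streams API, a wrapper stream can run a cleanup callback on all three paths. `withCleanup` and `fromChunks` are hypothetical helper names, and the counter stands in for releaseConnection.

```typescript
// Wrap a stream so that `cleanup` runs once on completion, error, or cancel.
function withCleanup<T>(
  source: ReadableStream<T>,
  cleanup: () => Promise<void> | void,
): ReadableStream<T> {
  const reader = source.getReader();
  let cleaned = false;
  const runOnce = async (): Promise<void> => {
    if (cleaned) return;
    cleaned = true;
    await cleanup();
  };
  return new ReadableStream<T>({
    async pull(controller) {
      try {
        const result = await reader.read();
        if (result.done) {
          await runOnce();
          controller.close();
          return;
        }
        controller.enqueue(result.value);
      } catch (error) {
        await runOnce();
        controller.error(error);
      }
    },
    // Called when the consumer (for example a disconnected client) cancels.
    async cancel(reason) {
      await runOnce();
      await reader.cancel(reason);
    },
  });
}

// Demo source: a stream that emits fixed chunks and closes.
function fromChunks(chunks: string[]): ReadableStream<string> {
  return new ReadableStream<string>({
    start(controller) {
      for (const chunk of chunks) controller.enqueue(chunk);
      controller.close();
    },
  });
}

async function demo(): Promise<{ text: string; released: number }> {
  let released = 0;
  const reader = withCleanup(fromChunks(['a', 'b']), () => { released++; }).getReader();
  let text = '';
  while (true) {
    const result = await reader.read();
    if (result.done) break;
    text += result.value;
  }
  return { text, released };
}

demo().then((r) => console.log(r.text, r.released)); // ab 1
```

In the route handler this would look like `new Response(withCleanup(response.body!, () => releaseConnection(userId)), { status: response.status, headers: response.headers })`, which avoids holding a second branch of the stream in memory.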
Health Check and Monitoring
// app/api/health/ai/route.ts
import { NextResponse } from 'next/server';
interface HealthCheck {
service: string;
status: 'healthy' | 'degraded' | 'down';
latencyMs: number;
error?: string;
}
async function checkOpenAI(): Promise<HealthCheck> {
const start = Date.now();
try {
const res = await fetch('https://api.openai.com/v1/models', {
headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
signal: AbortSignal.timeout(5000),
});
return {
service: 'openai',
status: res.ok ? 'healthy' : 'degraded',
latencyMs: Date.now() - start,
};
} catch (error) {
return {
service: 'openai',
status: 'down',
latencyMs: Date.now() - start,
error: String(error),
};
}
}
async function checkAnthropic(): Promise<HealthCheck> {
const start = Date.now();
try {
const res = await fetch('https://api.anthropic.com/v1/models', {
headers: {
'x-api-key': process.env.ANTHROPIC_API_KEY!,
'anthropic-version': '2023-06-01',
},
signal: AbortSignal.timeout(5000),
});
return {
service: 'anthropic',
status: res.ok ? 'healthy' : 'degraded',
latencyMs: Date.now() - start,
};
} catch (error) {
return {
service: 'anthropic',
status: 'down',
latencyMs: Date.now() - start,
error: String(error),
};
}
}
export async function GET() {
const checks = await Promise.all([checkOpenAI(), checkAnthropic()]);
const overallStatus = checks.every((c) => c.status === 'healthy')
? 'healthy'
: checks.some((c) => c.status === 'healthy')
? 'degraded'
: 'down';
return NextResponse.json({
status: overallStatus,
timestamp: new Date().toISOString(),
checks,
});
}
5 Common Pitfalls and Solutions
Pitfall 1: SSE Connections Buffered by Nginx/CDN
Symptom: Streaming responses arrive all at once instead of incrementally.
Cause: Nginx enables proxy_buffering by default, and some CDNs buffer or compress responses before forwarding them.
Solution:
# nginx.conf
location /api/chat {
proxy_pass http://nextjs_backend;
proxy_buffering off;
proxy_cache off;
proxy_set_header Connection '';
proxy_http_version 1.1;
chunked_transfer_encoding on;
proxy_read_timeout 300s;
}
// Add response headers in Route Handler
return new Response(stream, {
headers: {
'Content-Type': 'text/event-stream',
'Cache-Control': 'no-cache, no-transform',
'X-Accel-Buffering': 'no',
Connection: 'keep-alive',
},
});
Pitfall 2: Node.js Native Modules Unavailable in Edge Runtime
Symptom: Deploying to Vercel Edge Functions throws Module not found errors.
Cause: Edge Runtime doesn't support Node.js modules like net, fs, child_process.
Solution: Use Edge-compatible alternatives — @upstash/redis instead of ioredis, fetch instead of http.
Pitfall 3: useChat Messages State Out of Sync with External State
Symptom: You maintain a separate messages copy outside useChat, but they diverge.
Cause: useChat manages its own internal messages state; external modifications don't reflect internally.
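Solution: Treat useChat as the source of truth. Push completed messages to the external store in the onFinish callback, and load external history into the hook with setMessages.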
const { messages, setMessages, ...rest } = useChat({
onFinish: (message) => {
externalStore.update(message);
},
});
useEffect(() => {
setMessages(loadedMessages);
}, [loadedMessages, setMessages]);
Pitfall 4: Stream Interruption Without Recovery
Symptom: Network jitter drops the SSE connection, and received content is lost.
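Cause: A fetch-based stream does not resume on its own. The browser's EventSource can reconnect and send Last-Event-ID, but it only issues GET requests and the server still has to replay missed events, so POST-based chat streams need their own recovery logic.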
Solution: Cache received content client-side; on reconnect, include existing content as context.
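async function recoverStream(
  conversationId: string,
  lastReceivedContent: string
) {
  const response = await fetch('/api/chat/recover', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      conversationId,
      lastContent: lastReceivedContent,
    }),
  });
  // Surface HTTP failures instead of returning an error body as if it were a stream
  if (!response.ok) {
    throw new Error(`Recovery request failed: ${response.status}`);
  }
  return response.body;
}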
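On the server side, the recovery endpoint still has to turn the cached partial reply into a prompt. The following is a minimal sketch, assuming the handler has already loaded the conversation history and that the model is asked to continue rather than restart. `buildRecoveryMessages`, the `ChatMessage` type, and the wording of the continuation instruction are hypothetical and not part of any SDK.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Hypothetical helper: build the message list for a "continue" request.
function buildRecoveryMessages(
  history: ChatMessage[],
  lastContent: string
): ChatMessage[] {
  // Nothing arrived before the drop: replay the original request unchanged.
  if (lastContent.trim() === '') return [...history];
  return [
    ...history,
    // Replay the partial answer so the model sees what the user already has.
    { role: 'assistant', content: lastContent },
    {
      role: 'user',
      content:
        'The previous response was cut off. Continue exactly where it stopped, without repeating earlier text.',
    },
  ];
}
```

The client can then append the recovered stream to the cached text instead of replacing it. Whether a model continues cleanly mid-sentence varies, so it may be worth testing this against each provider in use.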
Pitfall 5: Memory Leaks from Unclosed SSE Connections
Symptom: Server memory grows continuously until OOM.
Cause: ReadableStream and AbortController not properly closed.
Solution: Ensure all streams have timeouts and cleanup mechanisms.
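let timeout: ReturnType<typeof setTimeout> | undefined;
const stream = new ReadableStream({
  start(controller) {
    // Hard cap: close the stream if it is still open after 60 seconds
    timeout = setTimeout(() => controller.close(), 60000);
    // ... streaming logic (call clearTimeout(timeout) before closing normally)
  },
  cancel() {
    // Runs when the client disconnects; a value returned from start() is not a cleanup hook
    clearTimeout(timeout);
  },
});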
10 Common Error Troubleshooting
| # | Error Message | Cause | Solution |
|---|---|---|---|
| 1 | TypeError: response.body is null | Route Handler not returning a streaming response | Confirm returning ReadableStream or using toDataStreamResponse() |
| 2 | AI_APICallError: 429 Too Many Requests | LLM API rate limiting | Implement exponential backoff or switch to fallback model |
| 3 | AI_APICallError: context_length_exceeded | Conversation history exceeds model context window | Implement context window trimming or summary compression |
| 4 | Error: Invalid SSE data | Incorrect SSE data format | Check data: prefix and \n\n delimiters |
| 5 | AbortError: The operation was aborted | User cancelled or request timed out | Properly handle AbortSignal, clean up resources |
| 6 | Error: Cannot read properties of undefined (reading 'delta') | LLM returned non-standard chunk format | Add defensive parsing, skip malformed chunks |
| 7 | RuntimeError: Edge Runtime does not support Node.js API | Using Node.js API in Edge Runtime | Switch to Node.js Runtime or use Edge-compatible libraries |
| 8 | Error: Maximum call stack size exceeded | Recursive stream processing causing stack overflow | Use iteration instead of recursion for chunk processing |
| 9 | TypeError: Failed to execute 'fetch' on 'Window' | Browser CORS restrictions | Ensure API route and frontend share the same origin, or configure CORS headers |
| 10 | Error: Stream ended unexpectedly | Server-side stream closed prematurely | Add heartbeat mechanism for connection health, implement auto-reconnect |
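Rows 4 and 6 both come down to parsing. As a rough sketch of the framing (events separated by a blank line, payload lines prefixed with `data:`), the hypothetical `parseSSEBuffer` below splits a text buffer into complete events, keeps the unfinished tail for the next chunk, and skips payloads that are not valid JSON. It assumes `\n` line endings and JSON payloads, and it treats `[DONE]` as an OpenAI-style end sentinel; other providers may signal completion differently.

```typescript
interface ParsedSSE {
  events: unknown[]; // successfully parsed JSON payloads
  rest: string;      // incomplete trailing data to prepend to the next chunk
  done: boolean;     // true once a [DONE] sentinel was seen
}

// Hypothetical helper: parse whatever complete SSE events the buffer contains.
function parseSSEBuffer(buffer: string): ParsedSSE {
  const blocks = buffer.split('\n\n');
  // The last piece has no terminating blank line yet, so it is not a full event.
  const rest = blocks.pop() ?? '';
  const events: unknown[] = [];
  let done = false;

  for (const block of blocks) {
    const data = block
      .split('\n')
      .filter((line) => line.startsWith('data:'))
      .map((line) => line.slice(5).replace(/^ /, '')) // drop "data:" and one optional space
      .join('\n');

    if (data === '') continue; // comment line or heartbeat, nothing to parse
    if (data === '[DONE]') {
      done = true;
      continue;
    }
    try {
      events.push(JSON.parse(data));
    } catch {
      // Malformed chunk: skip it rather than crash the whole stream
    }
  }
  return { events, rest, done };
}
```

A caller would keep `rest` between reads and pass `rest + nextChunk` into the next call, which avoids the split-mid-event failures that usually surface as "Invalid SSE data".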
Advanced Optimization Techniques
1. Streaming Markdown Rendering
// components/StreamMarkdown.tsx
'use client';
import { memo, useMemo } from 'react';
interface StreamMarkdownProps {
content: string;
isStreaming: boolean;
}
const StreamMarkdown = memo(function StreamMarkdown({
content,
isStreaming,
}: StreamMarkdownProps) {
const html = useMemo(() => {
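// Escape HTML first: model output is untrusted and ends up in dangerouslySetInnerHTML
const escaped = content
.replace(/&/g, '&amp;')
.replace(/</g, '&lt;')
.replace(/>/g, '&gt;');
return escaped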
.replace(/```(\w*)\n([\s\S]*?)```/g, '<pre><code class="language-$1">$2</code></pre>')
.replace(/`([^`]+)`/g, '<code>$1</code>')
.replace(/\*\*([^*]+)\*\*/g, '<strong>$1</strong>')
.replace(/\*([^*]+)\*/g, '<em>$1</em>')
.replace(/^### (.+)$/gm, '<h3>$1</h3>')
.replace(/^## (.+)$/gm, '<h2>$1</h2>')
.replace(/^# (.+)$/gm, '<h1>$1</h1>')
.replace(/\n/g, '<br/>');
}, [content]);
return (
<div className="prose prose-sm max-w-none">
<div dangerouslySetInnerHTML={{ __html: html }} />
{isStreaming && (
<span className="inline-block w-2 h-4 bg-blue-600 animate-pulse ml-0.5" />
)}
</div>
);
});
export default StreamMarkdown;
2. Predictive Prefetching
// lib/ai/prefetch.ts
export function predictNextQuery(messages: Array<{ role: string; content: string }>): string | null {
const lastMessage = messages[messages.length - 1];
if (!lastMessage || lastMessage.role !== 'assistant') return null;
if (lastMessage.content.includes('```')) {
return 'Please run this code for me';
}
if (lastMessage.content.includes('1.') || lastMessage.content.includes('- ')) {
return 'Please explain the first point in detail';
}
return null;
}
3. Streaming Response Caching
// lib/ai/stream-cache.ts
import { Redis } from '@upstash/redis';
const redis = new Redis({
url: process.env.REDIS_URL!,
token: process.env.REDIS_TOKEN!,
});
export async function getCachedStream(queryHash: string): Promise<string | null> {
return redis.get<string>(`stream-cache:${queryHash}`);
}
export async function cacheStreamResponse(
queryHash: string,
response: string,
ttlSeconds: number = 3600
): Promise<void> {
await redis.set(`stream-cache:${queryHash}`, response, { ex: ttlSeconds });
}
export function computeQueryHash(
messages: Array<{ role: string; content: string }>,
model: string
): string {
const raw = JSON.stringify({ messages, model });
let hash = 0;
for (let i = 0; i < raw.length; i++) {
const char = raw.charCodeAt(i);
hash = ((hash << 5) - hash) + char;
hash |= 0;
}
return hash.toString(36);
}
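The 32-bit rolling hash above is compact, but two different conversations can collide on the same key and be served each other's cached answer. Where the route runs on the Node.js runtime, a cryptographic digest is one way to make that practically impossible. The sketch below assumes `node:crypto` is available (it is not assumed to exist on Edge); `computeQueryHashSha256` and the `Msg` type are hypothetical names.

```typescript
import { createHash } from 'node:crypto';

type Msg = { role: string; content: string };

// Hypothetical alternative to the rolling hash: a SHA-256 digest of the request.
function computeQueryHashSha256(messages: Msg[], model: string): string {
  const raw = JSON.stringify({
    model,
    // Keep only the fields that affect the answer, so ids or timestamps
    // attached to messages do not produce different cache keys.
    messages: messages.map(({ role, content }) => ({ role, content })),
  });
  return createHash('sha256').update(raw).digest('hex');
}
```

The resulting 64-character hex string can be used in place of `queryHash` in the cache key. On Edge, the Web Crypto API offers an asynchronous equivalent.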
Comparison: SSE vs WebSocket vs Long Polling
| Dimension | SSE | WebSocket | Long Polling |
|---|---|---|---|
| Direction | Server → Client only | Bidirectional | Server → Client only |
| Protocol | HTTP/1.1+ | ws/wss | HTTP |
| Auto-reconnect | Native browser support | Manual implementation | Each request is new |
| Proxy/CDN | Friendly (standard HTTP) | May be blocked | Fully compatible |
| Binary data | Not supported | Supported | Not supported |
| Frame overhead | Higher (text format) | Low (2 to 14 byte frame header) | High (HTTP headers each time) |
| Browser support | No IE | Broad support | Broad support |
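| Connection limit | 6 per domain (HTTP/1.1); far higher over HTTP/2 | Not bound by the HTTP/1.1 per-domain cap (browsers set their own, higher limits) | Shares the HTTP/1.1 per-domain cap while a poll is open |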
| LLM suitability | ★★★★★ | ★★★ | ★★ |
| Implementation complexity | ★★ | ★★★★ | ★ |
Conclusion: LLM streaming output is a classic one-way push scenario — SSE is the best choice. Only consider WebSocket when you need bidirectional real-time interaction (e.g., voice conversation, real-time collaborative editing).
Related Articles
- Next.js Streaming SSR: 5 Production Patterns from Suspense to Progressive Rendering - Deep dive into Next.js Streaming SSR principles
- Next.js App Router Performance Optimization Guide - Complete App Router performance optimization
- Python SSE Streaming LLM - Backend SSE implementation reference
External Resources
- Vercel AI SDK Documentation - Complete AI SDK API reference
- MDN: Server-Sent Events - SSE protocol specification
Summary
The 6 production patterns for Next.js 15 streaming AI chat each have their ideal use cases:
| Pattern | Use Case | Complexity | Recommendation |
|---|---|---|---|
| SSE Streaming Response | Full protocol control needed | ★★★ | ★★★★ |
| React Server Actions | Type safety first, rapid development | ★★ | ★★★★ |
| Vercel AI SDK | Quick prototyping, multi-LLM support | ★ | ★★★★★ |
| Multi-Model Routing | Production-grade resilience, cost optimization | ★★★★ | ★★★★ |
| Conversation State Management | Long conversations, context-sensitive | ★★★★ | ★★★★ |
| Production Deployment Optimization | Enterprise launch | ★★★★★ | ★★★★★ |
Core recommendation: Start with Vercel AI SDK for rapid validation, gradually introduce multi-model routing and state management, then finalize production deployment. Next.js 15 streaming AI chat is no longer a technical challenge — the key is choosing the right pattern and avoiding common pitfalls.