Most AI chatbot tutorials leave you with a toy that breaks under real traffic. You follow along, get a text box that sends messages to the OpenAI API, and call it a day. Then a real user sends 50 messages in a row and your API bill spikes to $200. Or the response takes 30 seconds because you are not streaming. Or the conversation forgets what was said three messages ago because nobody implemented memory.

This tutorial is different. We are going to build and deploy a production AI chatbot in 30 minutes that handles the things tutorials usually skip: streaming responses (text appears word by word), conversation memory (it remembers what you said), rate limiting (so one user cannot blow up your API costs), and proper error handling (so the chatbot does not crash when the API is slow or down).

I have built chatbots for three client projects in the last six months. Two of them were customer support bots, one was an internal knowledge base assistant. The pattern I am about to share is the one we use in production at CODERCOPS. It is not a demo -- it is a real architecture that handles thousands of conversations.

What We Are Building

Here is the feature list:

  • Chat interface with a message list and input field
  • Streaming responses -- text appears progressively, not all at once
  • Conversation memory -- the AI remembers the full conversation context
  • System prompt customization -- make it a customer support bot, coding assistant, or anything else
  • Rate limiting -- prevent API cost explosions from excessive use
  • Error handling and retry logic -- graceful degradation when things go wrong
  • Mobile-responsive UI -- works on phones and tablets
  • Token counting -- tracks usage and manages the context window
  • One-click deployment to Vercel

Tech Stack

  • Next.js 15 (App Router) -- the framework
  • Anthropic Claude API (claude-sonnet-4-20250514) -- the AI model
  • Vercel AI SDK -- handles streaming and message management
  • Tailwind CSS -- styling
  • Vercel -- deployment

Total cost: $0/month for the infrastructure (Vercel free tier). You only pay for Claude API usage based on actual conversations.

Step 1: Set Up the Project

Create a new Next.js project with everything we need:

npx create-next-app@latest ai-chatbot --typescript --tailwind --app --src-dir --use-npm
cd ai-chatbot

Install the AI dependencies:

npm install ai @anthropic-ai/sdk

The ai package is the Vercel AI SDK, which provides React hooks for chat interfaces and server-side streaming helpers. In this tutorial we hand-roll the streaming so you can see exactly how it works, but the SDK's useChat hook is a drop-in alternative once you understand the mechanics. The @anthropic-ai/sdk package is Anthropic's official TypeScript SDK for the Claude API.

Get Your API Key

Go to console.anthropic.com, create an account, and generate an API key. You will need to add a payment method -- Claude API charges per token, and we will discuss costs later.

Create a .env.local file:

ANTHROPIC_API_KEY=sk-ant-your-key-here

Never commit this file. Add .env.local to your .gitignore (Next.js does this by default).

Step 2: Create the API Route with Streaming

This is the server-side endpoint that receives messages from the chat UI and streams back Claude's response. Create the file at src/app/api/chat/route.ts:

// src/app/api/chat/route.ts
import Anthropic from '@anthropic-ai/sdk';
import { NextRequest, NextResponse } from 'next/server';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

// Simple in-memory rate limiter
const rateLimitMap = new Map<string, { count: number; resetTime: number }>();
const RATE_LIMIT = 20; // messages per window
const RATE_WINDOW = 60 * 1000; // 1 minute window

function checkRateLimit(ip: string): boolean {
  const now = Date.now();
  const record = rateLimitMap.get(ip);

  if (!record || now > record.resetTime) {
    rateLimitMap.set(ip, { count: 1, resetTime: now + RATE_WINDOW });
    return true;
  }

  if (record.count >= RATE_LIMIT) {
    return false;
  }

  record.count++;
  return true;
}

// Clean up expired rate limit entries every 5 minutes (best-effort --
// serverless instances may be recycled before this timer ever fires)
setInterval(() => {
  const now = Date.now();
  for (const [key, value] of rateLimitMap.entries()) {
    if (now > value.resetTime) {
      rateLimitMap.delete(key);
    }
  }
}, 5 * 60 * 1000);

// System prompt -- customize this for your use case
const SYSTEM_PROMPT = `You are a helpful AI assistant for CODERCOPS, a modern tech agency that builds web applications, AI integrations, and SaaS products.

Your role:
- Answer questions about web development, AI, and technology
- Help users understand CODERCOPS services
- Be friendly, direct, and technically accurate
- If you do not know something, say so honestly
- Keep responses concise unless the user asks for detail

Important:
- Do not make up information about CODERCOPS that you do not know
- Do not provide legal, medical, or financial advice
- If the user needs to talk to a human, direct them to the contact page`;

export async function POST(req: NextRequest) {
  try {
    // Rate limiting -- x-forwarded-for can be a comma-separated chain of
    // proxies, so take the first entry (the client address)
    const ip =
      req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ||
      req.headers.get('x-real-ip') ||
      'unknown';

    if (!checkRateLimit(ip)) {
      return NextResponse.json(
        { error: 'Too many requests. Please wait a moment before sending another message.' },
        { status: 429 }
      );
    }

    // Parse the request body
    const { messages } = await req.json();

    if (!messages || !Array.isArray(messages) || messages.length === 0) {
      return NextResponse.json(
        { error: 'Messages array is required' },
        { status: 400 }
      );
    }

    // Trim conversation to stay within the context window
    // Claude's context window is 200K tokens, but we limit to save costs
    const MAX_MESSAGES = 40; // Keep last 40 messages (20 exchanges)
    let trimmedMessages = messages.slice(-MAX_MESSAGES);

    // The Messages API requires the conversation to start with a user
    // message, so drop any leading assistant message left over from trimming
    while (trimmedMessages.length > 0 && trimmedMessages[0].role !== 'user') {
      trimmedMessages = trimmedMessages.slice(1);
    }

    // Format messages for the Anthropic API
    const formattedMessages = trimmedMessages.map((msg: { role: string; content: string }) => ({
      role: msg.role as 'user' | 'assistant',
      content: msg.content,
    }));

    // Create the streaming response
    const stream = await anthropic.messages.stream({
      model: 'claude-sonnet-4-20250514',
      max_tokens: 1024,
      system: SYSTEM_PROMPT,
      messages: formattedMessages,
    });

    // Create a ReadableStream that sends chunks as they arrive
    const readableStream = new ReadableStream({
      async start(controller) {
        try {
          for await (const event of stream) {
            if (
              event.type === 'content_block_delta' &&
              event.delta.type === 'text_delta'
            ) {
              const chunk = event.delta.text;
              controller.enqueue(new TextEncoder().encode(chunk));
            }
          }
          controller.close();
        } catch (error) {
          controller.error(error);
        }
      },
    });

    return new Response(readableStream, {
      headers: {
        'Content-Type': 'text/plain; charset=utf-8',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
      },
    });
  } catch (error: any) {
    console.error('Chat API error:', error);

    // Handle specific Anthropic API errors
    if (error?.status === 429) {
      return NextResponse.json(
        { error: 'The AI service is currently busy. Please try again in a moment.' },
        { status: 429 }
      );
    }

    if (error?.status === 401) {
      return NextResponse.json(
        { error: 'API configuration error. Please contact support.' },
        { status: 500 }
      );
    }

    return NextResponse.json(
      { error: 'Something went wrong. Please try again.' },
      { status: 500 }
    );
  }
}

Let me break down what this route does:

  1. Rate limiting: Tracks requests per IP address. Each IP gets 20 messages per minute. This prevents a single user from burning through your API credits.

  2. Message trimming: We keep only the last 40 messages to manage context window size and costs. Each message costs tokens, and because the full history is resent on every request, cumulative cost grows quadratically with conversation length.

  3. Streaming: Instead of waiting for the full response and sending it at once, we stream text chunks as Claude generates them. This makes the chatbot feel responsive -- the first word appears in ~200ms instead of waiting 2-5 seconds for the full response.

  4. Error handling: We catch specific API errors (rate limits, auth failures) and return user-friendly error messages instead of crashing.
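On the client (built in step 3), reading this stream is a short decode loop. The loop can be exercised in isolation; here the streamed body is simulated with an in-memory ReadableStream, so no API call is involved:

```typescript
// Accumulate a streamed text/plain body the same way the chat UI does:
// read chunks, decode, append.
async function readStream(body: ReadableStream<Uint8Array>): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let full = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true handles multi-byte characters split across chunks
    full += decoder.decode(value, { stream: true });
  }
  return full;
}

// A fake stream standing in for the API route's response body
const fakeBody = new ReadableStream<Uint8Array>({
  start(controller) {
    for (const chunk of ['Hello', ', ', 'world!']) {
      controller.enqueue(new TextEncoder().encode(chunk));
    }
    controller.close();
  },
});

readStream(fakeBody).then((text) => console.log(text)); // prints "Hello, world!"
```

The same loop works unchanged against the real /api/chat response, because the route emits plain UTF-8 text chunks.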

Step 3: Build the Chat UI

Now let us build the frontend. We will create a chat component that handles message display, user input, streaming responses, and loading states.

The Chat Component

// src/components/Chat.tsx
'use client';

import { useState, useRef, useEffect, FormEvent } from 'react';

interface Message {
  id: string;
  role: 'user' | 'assistant';
  content: string;
}

export default function Chat() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [input, setInput] = useState('');
  const [isLoading, setIsLoading] = useState(false);
  const [error, setError] = useState<string | null>(null);
  const messagesEndRef = useRef<HTMLDivElement>(null);
  const inputRef = useRef<HTMLTextAreaElement>(null);

  // Auto-scroll to the latest message
  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
  }, [messages]);

  // Auto-resize textarea
  useEffect(() => {
    if (inputRef.current) {
      inputRef.current.style.height = 'auto';
      inputRef.current.style.height = `${inputRef.current.scrollHeight}px`;
    }
  }, [input]);

  async function handleSubmit(e: FormEvent) {
    e.preventDefault();
    if (!input.trim() || isLoading) return;

    const userMessage: Message = {
      id: Date.now().toString(),
      role: 'user',
      content: input.trim(),
    };

    setMessages((prev) => [...prev, userMessage]);
    setInput('');
    setIsLoading(true);
    setError(null);

    // Create a placeholder for the assistant's response
    const assistantMessageId = (Date.now() + 1).toString();
    const assistantMessage: Message = {
      id: assistantMessageId,
      role: 'assistant',
      content: '',
    };
    setMessages((prev) => [...prev, assistantMessage]);

    try {
      const response = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          messages: [...messages, userMessage].map(({ role, content }) => ({
            role,
            content,
          })),
        }),
      });

      if (!response.ok) {
        const errorData = await response.json().catch(() => null);
        throw new Error(
          errorData?.error || `Request failed with status ${response.status}`
        );
      }

      // Read the streaming response
      const reader = response.body?.getReader();
      const decoder = new TextDecoder();

      if (!reader) {
        throw new Error('No response stream available');
      }

      let fullContent = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        const chunk = decoder.decode(value, { stream: true });
        fullContent += chunk;

        // Update the assistant message with the accumulated content
        setMessages((prev) =>
          prev.map((msg) =>
            msg.id === assistantMessageId
              ? { ...msg, content: fullContent }
              : msg
          )
        );
      }
    } catch (err: any) {
      setError(err.message || 'Failed to send message');
      // Remove the empty assistant message on error
      setMessages((prev) =>
        prev.filter((msg) => msg.id !== assistantMessageId)
      );
    } finally {
      setIsLoading(false);
    }
  }

  function handleKeyDown(e: React.KeyboardEvent<HTMLTextAreaElement>) {
    if (e.key === 'Enter' && !e.shiftKey) {
      e.preventDefault();
      handleSubmit(e);
    }
  }

  return (
    <div className="flex flex-col h-[100dvh] max-w-3xl mx-auto">
      {/* Header */}
      <header className="flex items-center justify-between px-4 py-3 border-b border-gray-200 dark:border-gray-700 bg-white dark:bg-gray-900">
        <div className="flex items-center gap-3">
          <div className="w-8 h-8 rounded-full bg-indigo-600 flex items-center justify-center">
            <span className="text-white text-sm font-bold">C</span>
          </div>
          <div>
            <h1 className="text-sm font-semibold text-gray-900 dark:text-white">
              CODERCOPS Assistant
            </h1>
            <p className="text-xs text-gray-500 dark:text-gray-400">
              Powered by Claude
            </p>
          </div>
        </div>
        <div className="flex items-center gap-1.5">
          <span className="w-2 h-2 rounded-full bg-green-500"></span>
          <span className="text-xs text-gray-500 dark:text-gray-400">Online</span>
        </div>
      </header>

      {/* Messages */}
      <div className="flex-1 overflow-y-auto px-4 py-6 space-y-6 bg-gray-50 dark:bg-gray-950">
        {messages.length === 0 && (
          <div className="text-center py-20">
            <h2 className="text-lg font-semibold text-gray-900 dark:text-white mb-2">
              How can I help you today?
            </h2>
            <p className="text-sm text-gray-500 dark:text-gray-400 max-w-md mx-auto">
              Ask me about web development, AI integration, or CODERCOPS
              services. I am here to help.
            </p>
            <div className="flex flex-wrap justify-center gap-2 mt-6">
              {[
                'What services does CODERCOPS offer?',
                'How do I build a real-time app?',
                'Explain edge functions vs serverless',
              ].map((suggestion) => (
                <button
                  key={suggestion}
                  onClick={() => setInput(suggestion)}
                  className="px-3 py-1.5 text-xs rounded-full border border-gray-300 dark:border-gray-600 text-gray-700 dark:text-gray-300 hover:bg-gray-100 dark:hover:bg-gray-800 transition-colors"
                >
                  {suggestion}
                </button>
              ))}
            </div>
          </div>
        )}

        {messages.map((message) => (
          <div
            key={message.id}
            className={`flex ${
              message.role === 'user' ? 'justify-end' : 'justify-start'
            }`}
          >
            <div
              className={`max-w-[85%] rounded-2xl px-4 py-3 ${
                message.role === 'user'
                  ? 'bg-indigo-600 text-white'
                  : 'bg-white dark:bg-gray-800 text-gray-900 dark:text-gray-100 border border-gray-200 dark:border-gray-700'
              }`}
            >
              <div className="text-sm leading-relaxed whitespace-pre-wrap">
                {message.content}
                {isLoading &&
                  message.role === 'assistant' &&
                  message.content === '' && (
                    <span className="inline-flex gap-1">
                      <span className="w-1.5 h-1.5 rounded-full bg-gray-400 animate-bounce [animation-delay:0ms]"></span>
                      <span className="w-1.5 h-1.5 rounded-full bg-gray-400 animate-bounce [animation-delay:150ms]"></span>
                      <span className="w-1.5 h-1.5 rounded-full bg-gray-400 animate-bounce [animation-delay:300ms]"></span>
                    </span>
                  )}
              </div>
            </div>
          </div>
        ))}

        {error && (
          <div className="flex justify-center">
            <div className="bg-red-50 dark:bg-red-900/20 text-red-600 dark:text-red-400 text-sm px-4 py-2 rounded-lg">
              {error}
            </div>
          </div>
        )}

        <div ref={messagesEndRef} />
      </div>

      {/* Input */}
      <div className="border-t border-gray-200 dark:border-gray-700 bg-white dark:bg-gray-900 px-4 py-3">
        <form onSubmit={handleSubmit} className="flex items-end gap-3">
          <textarea
            ref={inputRef}
            value={input}
            onChange={(e) => setInput(e.target.value)}
            onKeyDown={handleKeyDown}
            placeholder="Type a message..."
            rows={1}
            className="flex-1 resize-none rounded-xl border border-gray-300 dark:border-gray-600 bg-gray-50 dark:bg-gray-800 px-4 py-2.5 text-sm text-gray-900 dark:text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-indigo-500 focus:border-transparent max-h-32"
            disabled={isLoading}
          />
          <button
            type="submit"
            disabled={isLoading || !input.trim()}
            className="flex-shrink-0 w-10 h-10 rounded-xl bg-indigo-600 text-white flex items-center justify-center hover:bg-indigo-700 disabled:opacity-50 disabled:cursor-not-allowed transition-colors"
          >
            <svg
              xmlns="http://www.w3.org/2000/svg"
              viewBox="0 0 20 20"
              fill="currentColor"
              className="w-5 h-5"
            >
              <path d="M3.105 2.289a.75.75 0 00-.826.95l1.414 4.925A1.5 1.5 0 005.135 9.25h6.115a.75.75 0 010 1.5H5.135a1.5 1.5 0 00-1.442 1.086l-1.414 4.926a.75.75 0 00.826.95 28.896 28.896 0 0015.293-7.154.75.75 0 000-1.115A28.897 28.897 0 003.105 2.289z" />
            </svg>
          </button>
        </form>
        <p className="text-xs text-gray-400 mt-2 text-center">
          AI can make mistakes. Verify important information.
        </p>
      </div>
    </div>
  );
}

The Main Page

// src/app/page.tsx
import Chat from '@/components/Chat';

export default function Home() {
  return <Chat />;
}

Update the Layout

Update src/app/layout.tsx to support dark mode and set the metadata:

// src/app/layout.tsx
import type { Metadata } from 'next';
import './globals.css';

export const metadata: Metadata = {
  title: 'CODERCOPS Assistant - AI Chatbot',
  description: 'Chat with our AI assistant about web development, AI, and technology.',
};

export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    <html lang="en" className="dark">
      <body className="bg-white dark:bg-gray-950">{children}</body>
    </html>
  );
}

Step 4: Add Conversation Memory

The conversation memory is already handled by the architecture above. Here is how it works:

  1. The messages state in the Chat component stores the full conversation history
  2. On each new message, we send the entire conversation history to the API route
  3. The API route forwards all messages to Claude, which uses them as context
  4. Claude sees the full conversation and responds with awareness of everything said before

The Sliding Window Problem

Claude's context window is large (200K tokens for Sonnet), but every token costs money. A conversation with 100 messages could cost 10-20x more per response than a conversation with 5 messages, because Claude processes the full context on every request.
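To see why, it helps to put rough numbers on "processes the full context on every request". This sketch assumes ~200 tokens per message, an illustrative figure rather than a measurement:

```typescript
// Each turn resends the entire history, so *cumulative* input tokens
// across a conversation grow quadratically with its length.
// 200 tokens per message is an assumption for illustration.
const TOKENS_PER_MESSAGE = 200;

function cumulativeInputTokens(exchanges: number): number {
  let total = 0;
  for (let turn = 1; turn <= exchanges; turn++) {
    // On turn N the history holds 2N - 1 messages: N - 1 full
    // exchanges plus the new user message
    total += (2 * turn - 1) * TOKENS_PER_MESSAGE;
  }
  return total;
}

console.log(cumulativeInputTokens(5));  // 5,000 tokens over 5 exchanges
console.log(cumulativeInputTokens(50)); // 500,000 tokens over 50 exchanges
```

A conversation ten times longer costs roughly a hundred times as many input tokens in total, which is exactly what the sliding window below is there to cap.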

Our API route handles this with the sliding window:

const MAX_MESSAGES = 40;
const trimmedMessages = messages.slice(-MAX_MESSAGES);

This keeps the last 40 messages (20 user-assistant exchanges). Older messages are dropped. For most chatbot use cases, this is sufficient -- users rarely reference something said 20+ exchanges ago.

Token Counting

If you want more precise control, you can count tokens instead of messages. Here is a simple token estimation function:

// src/lib/tokens.ts

// Rough estimation: 1 token per 4 characters for English text
// This is an approximation. For exact counts, use Anthropic's tokenizer.
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

export function estimateConversationTokens(
  messages: Array<{ role: string; content: string }>
): number {
  return messages.reduce(
    (total, msg) => total + estimateTokens(msg.content) + 4, // +4 for role tokens
    0
  );
}

// Trim messages to stay within a token budget
export function trimToTokenBudget(
  messages: Array<{ role: string; content: string }>,
  maxTokens: number
): Array<{ role: string; content: string }> {
  let totalTokens = 0;
  const trimmed: Array<{ role: string; content: string }> = [];

  // Work backwards from the most recent message
  for (let i = messages.length - 1; i >= 0; i--) {
    const msgTokens = estimateTokens(messages[i].content) + 4;
    if (totalTokens + msgTokens > maxTokens) break;
    totalTokens += msgTokens;
    trimmed.unshift(messages[i]);
  }

  return trimmed;
}

You can then use this in the API route:

import { trimToTokenBudget } from '@/lib/tokens';

// Keep conversation under 4000 input tokens
const trimmedMessages = trimToTokenBudget(messages, 4000);

Step 5: Rate Limiting in Detail

The in-memory rate limiter in our API route is a starting point rather than a guarantee. Vercel runs serverless functions as multiple independent instances, each with its own memory, so the limit is enforced per instance, not globally. For a low-traffic chatbot that is usually good enough; for strict global limits, move the counters to a shared store such as Upstash Redis. Let me explain the design choices:

Request-Based vs Token-Based Rate Limiting

Our simple limiter counts requests (20 per minute per IP). This prevents abuse but does not account for message length. A user sending twenty "hi" messages costs much less than a user sending twenty 2000-word messages.

For production, you should add token-based limiting:

// Token-based rate limiter
const tokenLimitMap = new Map<string, { tokens: number; resetTime: number }>();
const TOKEN_LIMIT = 10000; // tokens per window
const TOKEN_WINDOW = 60 * 60 * 1000; // 1 hour window

function checkTokenLimit(ip: string, estimatedTokens: number): boolean {
  const now = Date.now();
  const record = tokenLimitMap.get(ip);

  if (!record || now > record.resetTime) {
    tokenLimitMap.set(ip, {
      tokens: estimatedTokens,
      resetTime: now + TOKEN_WINDOW,
    });
    return true;
  }

  if (record.tokens + estimatedTokens > TOKEN_LIMIT) {
    return false;
  }

  record.tokens += estimatedTokens;
  return true;
}

Cost Estimation Per Conversation

Here is a realistic cost breakdown using Claude Sonnet pricing (as of early 2026):

  • Input tokens: $3 per million tokens
  • Output tokens: $15 per million tokens
Conversation Length          Est. Input Tokens   Est. Output Tokens   Cost
5 exchanges (10 messages)    ~2,000              ~1,500               $0.03
10 exchanges (20 messages)   ~6,000              ~3,000               $0.06
20 exchanges (40 messages)   ~15,000             ~6,000               $0.14

At 100 conversations per day (averaging 10 exchanges each): approximately $180/month.

At 1,000 conversations per day: approximately $1,800/month.

These numbers assume Claude Sonnet. If you switch to Claude Haiku for simpler queries, costs drop by roughly 90%. A common pattern is to use Haiku for simple questions and Sonnet for complex ones -- a routing layer that checks message complexity.
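That routing layer can start as a simple heuristic rather than a trained classifier. This sketch is one possible approach; the length threshold and keyword list are guesses you would tune against real traffic:

```typescript
// Naive complexity router: short, plain questions go to Haiku, everything
// else to Sonnet. Threshold and keywords are illustrative, not tuned.
const SIMPLE_MAX_LENGTH = 120;
const COMPLEX_HINTS = /\b(debug|refactor|architecture|compare|step by step)\b/i;

function pickModel(userMessage: string): string {
  const simple =
    userMessage.length <= SIMPLE_MAX_LENGTH &&
    !userMessage.includes('```') && // code blocks suggest a complex query
    !COMPLEX_HINTS.test(userMessage);
  return simple ? 'claude-haiku-4-20250514' : 'claude-sonnet-4-20250514';
}

console.log(pickModel('What are your hours?'));                        // routes to Haiku
console.log(pickModel('Refactor this API route to support tool use')); // routes to Sonnet
```

Wiring it in is one line: swap the hard-coded model string in the API route for pickModel(lastUserMessage).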

Step 6: Production Hardening

Before deploying, let us add the finishing touches that separate a demo from a production app.

Environment Variable Validation

// src/lib/env.ts
export function validateEnv() {
  if (!process.env.ANTHROPIC_API_KEY) {
    throw new Error(
      'ANTHROPIC_API_KEY is not set. Add it to your .env.local file.'
    );
  }

  if (!process.env.ANTHROPIC_API_KEY.startsWith('sk-ant-')) {
    throw new Error(
      'ANTHROPIC_API_KEY does not look like a valid Anthropic key. It should start with sk-ant-'
    );
  }
}

Call this at the top of your API route:

import { validateEnv } from '@/lib/env';
validateEnv();

Retry Logic

Sometimes the Claude API returns a 529 (overloaded) error. Instead of showing the user an error, retry the request:

// src/lib/retry.ts
export async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 3,
  baseDelay: number = 1000
): Promise<T> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      lastError = error;

      // Only retry on specific errors
      const retryableStatuses = [429, 500, 529];
      if (!retryableStatuses.includes(error?.status)) {
        throw error;
      }

      // Exponential backoff: 1s, 2s, 4s
      const delay = baseDelay * Math.pow(2, attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}

Use it in the API route:

const stream = await withRetry(() =>
  anthropic.messages.stream({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: SYSTEM_PROMPT,
    messages: formattedMessages,
  })
);

Input Sanitization

Never trust user input. Add basic sanitization before sending to the API:

// src/lib/sanitize.ts
export function sanitizeInput(input: string): string {
  // Remove any null bytes
  let sanitized = input.replace(/\0/g, '');

  // Trim whitespace
  sanitized = sanitized.trim();

  // Limit length (roughly 2000 tokens worth of text)
  const MAX_LENGTH = 8000;
  if (sanitized.length > MAX_LENGTH) {
    sanitized = sanitized.substring(0, MAX_LENGTH);
  }

  return sanitized;
}

Analytics Tracking

Add simple analytics to understand how the chatbot is being used:

// In your API route, after processing the request:
function logAnalytics(data: {
  messageCount: number;
  estimatedInputTokens: number;
  responseTime: number;
  error?: string;
}) {
  // In production, send this to your analytics service
  // For now, just log it
  console.log('[chatbot-analytics]', JSON.stringify(data));
}

In production, send these events to your analytics tool (PostHog, Mixpanel, or even a Supabase table). The key metrics to track:

  • Messages per conversation -- are users engaged or dropping off after 1-2 messages?
  • Response time -- is the chatbot fast enough?
  • Error rate -- how often do API calls fail?
  • Token usage per conversation -- are costs under control?
  • Common questions -- what do users ask about most? (Use this to improve the system prompt)

Step 7: Deploy to Vercel

Deploying to Vercel is straightforward.

Push to GitHub

git init
git add .
git commit -m "Production AI chatbot with Claude API"
git remote add origin https://github.com/your-username/ai-chatbot.git
git push -u origin main

Deploy with Vercel CLI

npx vercel --prod

Or connect your GitHub repo to Vercel through the dashboard for automatic deployments on every push.

Set Environment Variables in Vercel

Go to your project in the Vercel dashboard:

  1. Navigate to Settings > Environment Variables
  2. Add ANTHROPIC_API_KEY with your API key value
  3. Set it for Production, Preview, and Development environments

Vercel encrypts environment variables at rest and in transit. Your API key is safe.

Custom Domain (Optional)

In the Vercel dashboard, go to Settings > Domains and add your custom domain. Vercel handles SSL certificates automatically.

chat.yourdomain.com -> your-chatbot.vercel.app

Your chatbot is now live at your custom domain with:

  • Automatic HTTPS
  • Global CDN
  • Serverless API route (scales to zero when not in use)
  • Preview deployments for every pull request

Cost Analysis: What This Actually Costs to Run

Let me be transparent about costs, because most tutorials conveniently skip this.

Infrastructure Costs

Component                      Monthly Cost
Vercel Hosting (Hobby tier)    $0
Vercel Hosting (Pro tier)      $20
Custom domain                  $10-15/year
Total infrastructure           $0-$20/month

Claude API Costs

This is where the real cost is. It scales with usage.

Daily Conversations   Avg Messages/Conv   Estimated Monthly Cost
10                    8                   ~$15
50                    8                   ~$75
100                   8                   ~$150
500                   8                   ~$750
1,000                 8                   ~$1,500

Cost Optimization Tips

  1. Use Claude Haiku for simple queries. Route simple questions ("What are your hours?") to Haiku ($0.25/M input tokens) and complex questions to Sonnet ($3/M input tokens). This can cut costs by 50-70%.

  2. Aggressive context trimming. Keep only the last 10-20 messages instead of 40. Most conversations do not need deep context.

  3. Cache common responses. If 30% of conversations start with "What does CODERCOPS do?", cache that response and serve it without an API call.

  4. Set max_tokens appropriately. Do not set max_tokens: 4096 if your typical response is 200 tokens. Lower max_tokens does not save money directly (you pay for actual output tokens), but it prevents runaway responses.

  5. Implement usage caps per user. After 50 messages in a session, show a message: "You have reached the message limit. Please contact us for more help." This prevents single users from generating disproportionate costs.
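Tip 5 can be sketched as a counter keyed by session ID. The cap value and the Map-based storage are illustrative; in production the counter would live in a database or KV store so it survives across serverless instances:

```typescript
// Per-session message cap. A Map works for a single-instance sketch;
// swap in Redis or a database table for real deployments.
const SESSION_CAP = 50;
const sessionCounts = new Map<string, number>();

function allowMessage(sessionId: string): boolean {
  const count = (sessionCounts.get(sessionId) ?? 0) + 1;
  sessionCounts.set(sessionId, count);
  return count <= SESSION_CAP;
}

// The 51st message in a session is rejected
for (let i = 0; i < 50; i++) allowMessage('session-a');
console.log(allowMessage('session-a')); // false
console.log(allowMessage('session-b')); // true
```

When allowMessage returns false, return the "contact us" message from the API route instead of calling Claude at all, so the capped request costs you nothing.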

Extending the Chatbot

Add Tool Use (Function Calling)

Claude supports tool use, which lets the chatbot perform actions -- look up data, submit forms, or query APIs. Here is how you would add a "check project status" tool:

const tools = [
  {
    name: 'check_project_status',
    description: 'Check the status of a client project by project ID',
    input_schema: {
      type: 'object' as const,
      properties: {
        project_id: {
          type: 'string',
          description: 'The project ID to look up',
        },
      },
      required: ['project_id'],
    },
  },
];

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024,
  system: SYSTEM_PROMPT,
  messages: formattedMessages,
  tools,
});

When Claude decides to use a tool, it returns a tool_use content block. You execute the tool, send the result back, and Claude incorporates it into its response. This is how you build chatbots that can actually DO things, not just talk.
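That round trip looks roughly like this. The tool_use and tool_result block shapes follow the Anthropic Messages API; lookupProjectStatus is a hypothetical stand-in for your real backend:

```typescript
// Minimal shapes of the content blocks involved in a tool round trip
type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; id: string; name: string; input: Record<string, unknown> };

// Hypothetical backend lookup -- replace with your real data source
function lookupProjectStatus(projectId: string): string {
  return `Project ${projectId} is in review.`;
}

// Turn tool_use blocks from Claude's response into the tool_result
// blocks you send back on the next request
function handleToolUse(blocks: ContentBlock[]) {
  return blocks
    .filter((b): b is Extract<ContentBlock, { type: 'tool_use' }> => b.type === 'tool_use')
    .map((b) => ({
      type: 'tool_result' as const,
      tool_use_id: b.id,
      content: lookupProjectStatus(String(b.input.project_id)),
    }));
}

const results = handleToolUse([
  { type: 'tool_use', id: 'tu_1', name: 'check_project_status', input: { project_id: 'P-42' } },
]);
console.log(results[0].content); // "Project P-42 is in review."
```

You send these tool_result blocks back as the content of the next user message, and Claude folds the result into its final answer.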

Add RAG (Retrieval-Augmented Generation)

For a chatbot that answers questions about your documentation or knowledge base:

  1. Chunk your documents into 500-token segments
  2. Generate embeddings using an embedding model
  3. Store embeddings in a vector database (Supabase has pgvector built in)
  4. On each user question, find the most relevant chunks
  5. Include them in the system prompt as context

const relevantDocs = await searchDocuments(userQuery);
const context = relevantDocs.map(doc => doc.content).join('\n\n');
const context = relevantDocs.map(doc => doc.content).join('\n\n');

const systemPromptWithContext = `${SYSTEM_PROMPT}

Here is relevant documentation to help answer the user's question:

${context}

Use this information to provide accurate answers. If the documentation does not cover the question, say so.`;

This turns your chatbot from a general assistant into a domain-specific expert that knows your product inside and out.
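Step 1 of the list above can start as a simple paragraph packer, reusing the rough 4-characters-per-token estimate from earlier. A minimal sketch; real pipelines usually add overlap between chunks and split oversized paragraphs:

```typescript
// Pack paragraphs into ~500-token chunks using the rough
// 1-token-per-4-characters estimate. Paragraphs longer than the
// budget are kept whole in this sketch.
function chunkDocument(text: string, maxTokens = 500): string[] {
  const maxChars = maxTokens * 4;
  const chunks: string[] = [];
  let current = '';

  for (const para of text.split(/\n\s*\n/)) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n\n${para}` : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const chunks = chunkDocument('a'.repeat(1500) + '\n\n' + 'b'.repeat(1500), 500);
console.log(chunks.length); // 2 -- each paragraph lands in its own chunk
```

Each chunk then gets embedded and stored; at query time you embed the question, find the nearest chunks, and splice them into the system prompt as shown above.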

Multi-Model Support

You do not have to use only Claude. Add model selection to let users or your routing logic choose the best model:

type ModelId = 'claude-sonnet' | 'claude-haiku';

const MODEL_MAP: Record<ModelId, string> = {
  'claude-sonnet': 'claude-sonnet-4-20250514',
  'claude-haiku': 'claude-haiku-4-20250514',
};

// In the API route:
const modelId = (body.model as ModelId) || 'claude-sonnet';
const model = MODEL_MAP[modelId];

Security Considerations

A few important security notes before you ship this to production:

Never Expose Your API Key

The API key should only exist in server-side code (the API route). The client-side Chat component never sees it. If you inspect network requests in the browser, you should see requests going to /api/chat, not to api.anthropic.com.

Content Filtering

Claude has built-in content safety, but you should add your own layer for your specific use case:

const BLOCKED_PATTERNS = [
  /ignore.*previous.*instructions/i,
  /you are now/i,
  /pretend you are/i,
  /act as/i,
];

function containsPromptInjection(input: string): boolean {
  return BLOCKED_PATTERNS.some((pattern) => pattern.test(input));
}

This is not foolproof -- prompt injection is a hard problem. But it catches the obvious attempts.

CORS Configuration

If you are embedding the chatbot on a different domain, configure CORS properly:

// In your API route:
const allowedOrigins = ['https://yourdomain.com', 'https://chat.yourdomain.com'];

const origin = req.headers.get('origin');
if (origin && !allowedOrigins.includes(origin)) {
  return NextResponse.json({ error: 'Forbidden' }, { status: 403 });
}

Audit Logging

Log every conversation for compliance and debugging. Store logs with the user's IP (hashed), timestamp, message content, and response. Make sure your privacy policy covers this.
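Hashing the IP can use Node's built-in crypto module. AUDIT_SALT is a placeholder name; load a real secret from your environment so the hashes cannot be brute-forced from the address space alone:

```typescript
import { createHash } from 'node:crypto';

// Hash the IP with a server-side salt so stored logs are not directly
// reversible to an address. AUDIT_SALT is a placeholder -- use a real
// secret in production.
const AUDIT_SALT = process.env.AUDIT_SALT ?? 'dev-only-salt';

function auditLogEntry(ip: string, message: string, response: string) {
  return {
    ipHash: createHash('sha256').update(AUDIT_SALT + ip).digest('hex'),
    timestamp: new Date().toISOString(),
    message,
    response,
  };
}

const entry = auditLogEntry('203.0.113.7', 'Hi', 'Hello!');
console.log(entry.ipHash.length); // 64 hex characters
```

Write these entries to your database or log drain from the API route; the same hash lets you correlate a user's messages without storing the raw address.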

Test It Locally

Before deploying, run the project locally:

npm run dev

Open http://localhost:3000 and try:

  1. Send a simple message -- verify streaming works (text should appear word by word)
  2. Send 5-6 messages -- verify conversation memory (the AI should remember previous context)
  3. Open the browser console and check for errors
  4. Try sending 25+ messages quickly -- verify rate limiting kicks in
  5. Resize the browser window -- verify the UI is responsive on mobile

If everything works, deploy it.

The Bottom Line

You now have a production AI chatbot running on Claude API with streaming, memory, rate limiting, and error handling. The total time to set this up is about 30 minutes if you type fast, maybe 45 minutes if you take your time reading through the code.

The key takeaways:

  • Streaming is non-negotiable. Users expect immediate feedback. A 3-second wait for a response feels broken; a 200ms wait with streaming text feels responsive.
  • Rate limiting protects your wallet. Without it, one user can generate hundreds of dollars in API costs in a single session.
  • Memory management is a cost control lever. More context = better responses = higher costs. Find the right balance for your use case.
  • Claude's safety is built-in but not sufficient. Add your own content filtering and input sanitization for production.

This is the same architecture we use at CODERCOPS for client chatbot projects. The pattern scales from a simple FAQ bot to a complex customer support agent with tool use and RAG. Start simple, measure usage, and add complexity as needed.


Need a Custom AI Chatbot for Your Business?

At CODERCOPS, we build production AI chatbots and assistants for businesses -- from simple FAQ bots to complex agents that integrate with your internal systems, databases, and APIs. We use Claude, GPT, and open-source models depending on the use case.

If you want an AI chatbot that actually works in production (not just a demo), let's talk. We can have a working prototype in your hands within a week.

For more AI integration tutorials and engineering deep-dives, check out the CODERCOPS blog.
