Cloudflare Workers AI: Running Models at the Edge Without a GPU Bill

Most AI inference setups look like this: your app makes a network call to an API (OpenAI, Anthropic, Google), waits for the response, and returns it to the user. The latency is determined by the distance between your server and the model provider’s datacenter, plus whatever the model takes to generate a response.

Cloudflare Workers AI runs inference in Cloudflare’s network, which spans 330+ cities worldwide. When a user in Singapore hits your app, the inference runs in a datacenter near Singapore, not in US-East-1. For latency-sensitive applications, that gap matters.

What’s in the Catalog

Workers AI hosts open-weight models rather than frontier models. You won’t find GPT-4o or Claude here. What you do find:

Text generation:

Llama 3.1 (8B, 70B)
Mistral 7B Instruct
Phi-3 Mini and Medium
DeepSeek models (7B)
Qwen 2.5 variants

Text embedding:

bge-base-en-v1.5
bge-large-en-v1.5
multilingual-e5-large-instruct

Speech-to-text:

Whisper (large-v3-turbo)

Image generation:

Stable Diffusion XL
Flux

Vision:

LLaVA 1.5 7B (image + text input)
resnet-50 for image classification

Translation:

m2m100-1.2B (100 languages)

The catalog isn’t exhaustive, but it covers most common AI tasks: chat completions, classification, RAG embeddings, speech transcription, and image understanding.

Getting Started

Workers AI is available inside any Cloudflare Worker. Scaffold one with the create-cloudflare CLI:

npm create cloudflare@latest my-worker -- --type hello-world
cd my-worker

Bind the AI in your wrangler.toml:

name = "my-worker"
main = "src/index.ts"
compatibility_date = "2024-09-23"

[ai]
binding = "AI"

Running a text generation model:

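// The AI binding declared in wrangler.toml. The Ai type comes from
// @cloudflare/workers-types (or the types generated by `wrangler types`).
interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Expects a JSON body of the form { "text": "..." }.
    const { text } = await request.json() as { text: string };

    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant. Be concise.",
        },
        {
          role: "user",
          content: text,
        },
      ],
      // Cap the length of the generated reply.
      max_tokens: 512,
    });

    return Response.json(response);
  },
};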

Streaming the response:

const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: text }],
  stream: true,
});

return new Response(stream, {
  headers: { "Content-Type": "text/event-stream" },
});
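
On the consuming side, the stream arrives as server-sent events. The following is a minimal sketch of pulling the generated text out of a block of events. It assumes each event is a `data:` line carrying JSON with a `response` field and that the stream ends with `data: [DONE]`; check the model's documentation for the exact payload. `extractTokens` is a hypothetical helper name, not part of any Cloudflare API.

```typescript
// Hypothetical helper: extracts generated text from a block of SSE lines.
// Assumes events look like `data: {"response":"..."}` and end with `data: [DONE]`.
function extractTokens(sseText: string): { text: string; done: boolean } {
  let text = "";
  let done = false;

  for (const rawLine of sseText.split("\n")) {
    const line = rawLine.trim();
    if (!line.startsWith("data:")) continue; // skip blank lines and comments

    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      done = true;
      break;
    }

    try {
      const event = JSON.parse(payload) as { response?: string };
      text += event.response ?? "";
    } catch {
      // Ignore lines that are not complete JSON (e.g. an event split across chunks).
    }
  }

  return { text, done };
}
```

A real client would also need to buffer partial lines across network chunks; this sketch only handles complete lines.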

Embeddings for Semantic Search

The embedding endpoint integrates well with Vectorize (Cloudflare’s vector database):

// Generate an embedding for a search query
const queryEmbedding = await env.AI.run(
  "@cf/baai/bge-base-en-v1.5",
  { text: [query] }
);

// Search Vectorize for similar documents
const results = await env.VECTORIZE.query(queryEmbedding.data[0], {
  topK: 5,
  returnMetadata: true,
});

return Response.json(results.matches);

This combination — Workers AI for embeddings, Vectorize for storage, Workers for the API — is a self-contained semantic search stack that runs entirely in Cloudflare’s network. You don’t need to provision any external services.
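
The query path above assumes the index is already populated. Here is a minimal sketch of the indexing side, assuming documents carry an `id` and `text` (a hypothetical shape) and that embeddings come back in the same order as the input texts. `toVectorRecords` is an illustrative helper; the commented call shows where it might feed the index's `upsert` method.

```typescript
// Hypothetical document shape for illustration.
interface Doc {
  id: string;
  text: string;
}

// Record shape for a vector index entry: id, vector values, and metadata.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: { text: string };
}

// Pairs each document with its embedding, assuming both arrays share the same order.
function toVectorRecords(docs: Doc[], embeddings: number[][]): VectorRecord[] {
  if (docs.length !== embeddings.length) {
    throw new Error(`Expected ${docs.length} embeddings, got ${embeddings.length}`);
  }
  return docs.map((doc, i) => ({
    id: doc.id,
    values: embeddings[i],
    metadata: { text: doc.text },
  }));
}

// Inside a Worker, the records might be produced and stored roughly like this:
//   const out = await env.AI.run("@cf/baai/bge-base-en-v1.5", {
//     text: docs.map((d) => d.text),
//   });
//   await env.VECTORIZE.upsert(toVectorRecords(docs, out.data));
```

Storing the source text as metadata lets the query side return readable matches without a second lookup, at the cost of a larger index.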

Speech-to-Text with Whisper

Whisper large-v3-turbo via Workers AI:

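export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const formData = await request.formData();
    const audioFile = formData.get("audio");
    if (!(audioFile instanceof File)) {
      return Response.json({ error: "Missing audio file" }, { status: 400 });
    }

    // whisper-large-v3-turbo takes base64-encoded audio. Encode in slices so
    // String.fromCharCode is never called with too many arguments at once.
    const bytes = new Uint8Array(await audioFile.arrayBuffer());
    let binary = "";
    for (let i = 0; i < bytes.length; i += 0x8000) {
      binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
    }

    const transcript = await env.AI.run(
      "@cf/openai/whisper-large-v3-turbo",
      { audio: btoa(binary) }
    );

    return Response.json({ text: transcript.text });
  },
};

The audio input is the file's bytes, base64-encoded as a string (the older @cf/openai/whisper model takes an array of byte values instead). The output includes the text transcription. This is a clean way to add transcription to a Worker without routing audio to a separate service.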
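
To exercise this Worker, a client sends the audio as multipart form data under the `audio` field, matching what the handler reads. The following is a minimal sketch, assuming a placeholder Worker URL and a runtime that provides the standard `Request`, `FormData`, and `Blob` globals. `buildTranscriptionRequest` is an illustrative name.

```typescript
// Builds a multipart POST whose "audio" field matches the handler's formData.get("audio").
// The URL passed in is a placeholder; substitute your deployed Worker's address.
function buildTranscriptionRequest(
  workerUrl: string,
  audio: Blob,
  filename = "recording.mp3",
): Request {
  const form = new FormData();
  form.append("audio", audio, filename);
  // The runtime sets the multipart Content-Type header (with boundary) itself.
  return new Request(workerUrl, { method: "POST", body: form });
}

// Usage (hypothetical URL):
//   const req = buildTranscriptionRequest("https://my-worker.example.workers.dev", blob);
//   const { text } = (await (await fetch(req)).json()) as { text: string };
```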

Pricing

Workers AI charges in “neurons” — a unit Cloudflare uses to normalize compute across different model sizes.

Free tier: 10,000 neurons/day (resets daily). Roughly 10,000 short Llama 3.1 8B completions, or 100,000+ embeddings.

Paid: $0.011 per 1,000 neurons on Workers Paid plans ($5/month).

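For comparison, if generating 1,000 tokens with Llama 3.1 8B costs roughly 20 neurons, then $0.011 (1,000 neurons) buys you about 50,000 tokens. Neuron rates vary by model and differ between input and output tokens, so check Cloudflare's pricing page for current figures.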
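
The arithmetic can be sketched as a small calculator. The price and free allowance are the figures quoted in this section, and the sketch assumes the allowance is subtracted before billing. The neurons-per-1,000-tokens rate is a parameter because it varies by model, and the function names are illustrative.

```typescript
const USD_PER_1K_NEURONS = 0.011; // paid rate quoted in this section
const FREE_NEURONS_PER_DAY = 10_000; // daily free allowance quoted in this section

// Cost in USD for one day's usage, after subtracting the free allowance.
function dailyCostUsd(neuronsUsed: number): number {
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1K_NEURONS;
}

// Tokens a budget buys, given an assumed neuron cost per 1,000 tokens for the model.
function tokensForBudget(usd: number, neuronsPer1kTokens: number): number {
  const neurons = (usd / USD_PER_1K_NEURONS) * 1000;
  return (neurons / neuronsPer1kTokens) * 1000;
}
```

Under these assumptions, a day that uses 10,000 neurons or fewer costs nothing, and doubling the per-token neuron rate halves what a given budget buys.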

For classification tasks, content moderation, or embedding generation at moderate scale, the numbers are favorable. For replacing a primary chat API, you’re looking at both latency and model quality trade-offs.

Where It Fits Well

Content moderation and classification. Running a smaller model to flag content before it reaches your database is a legitimate use case. The 8B models are good at binary or category classification tasks. Running this at the edge means moderation happens before the request reaches your origin.

Semantic search in a Worker. If you’re building search into an edge app, using Workers AI for embeddings and Vectorize for the index is the path of least resistance.

AI-powered middleware. Rewriting or augmenting API responses — summarizing, translating, extracting structured data — fits the Worker model well. The edge co-location means added latency from the AI call is minimized.

Prototyping and demos. The free tier is generous enough to prototype without a credit card. For internal demos, hackathons, or initial client proofs-of-concept, it works.
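
The content moderation case can be sketched as a prompt builder plus a strict parser for the model's reply. The label set, prompt wording, and helper names below are assumptions for illustration; the message format follows the chat example earlier in the article.

```typescript
const LABELS = ["allow", "flag"] as const;
type Label = (typeof LABELS)[number];

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Builds a chat prompt that asks the model for a single-word verdict.
function buildModerationMessages(content: string): ChatMessage[] {
  return [
    {
      role: "system",
      content: `Classify the user's text. Reply with exactly one word: ${LABELS.join(" or ")}.`,
    },
    { role: "user", content },
  ];
}

// Maps the raw model reply to a label; anything unrecognized is flagged for review.
function parseVerdict(reply: string): Label {
  const word = reply.trim().toLowerCase().replace(/[^a-z]/g, "");
  return word === "allow" ? "allow" : "flag";
}

// In a Worker (sketch, assuming the reply text is on `response`):
//   const out = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
//     messages: buildModerationMessages(text),
//     max_tokens: 5,
//   });
//   const verdict = parseVerdict(out.response ?? "");
```

The parser treats any unexpected reply as a flag, so a malformed model response leads to review rather than silent approval.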

What It Doesn’t Do Well

Frontier model quality. Llama 3.1 8B is capable, but it’s not GPT-4o or Claude Sonnet. For tasks that need strong reasoning, complex instruction-following, or accurate function calling at scale, hosted frontier APIs perform better.

Long context. Workers AI model context windows are generally smaller than what frontier APIs offer. If you’re working with large documents, you’ll hit limits faster.

Predictable latency under load. Workers AI’s “serverless” nature means cold starts exist. For apps where response time needs to be consistently below 200ms on the first token, this is a variable to monitor.

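Worker execution limits. Workers have a CPU time limit. Time spent awaiting an inference call generally does not count as CPU time, but long generations on a 70B model keep the request open longer and are more exposed to timeouts and client disconnects. Pages Functions run on the same runtime, so they share these limits. The 8B models are the safer default inside a Worker; larger models are better streamed or called from your origin.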
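
Returning to the long-context limit: one common workaround is to split large documents into overlapping chunks and process each separately. The following is a minimal sketch that uses character counts as a rough stand-in for tokens (an assumption, since real limits are measured in tokens). `chunkText` and its parameters are illustrative.

```typescript
// Splits text into chunks of at most maxChars, each overlapping the previous by `overlap` chars.
function chunkText(text: string, maxChars: number, overlap = 0): string[] {
  if (maxChars <= 0) throw new Error("maxChars must be positive");
  if (overlap < 0 || overlap >= maxChars) {
    throw new Error("overlap must be >= 0 and smaller than maxChars");
  }

  const chunks: string[] = [];
  const step = maxChars - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to matter for summarization and embedding quality.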

The Practical Picture

Workers AI makes most sense as part of a broader Cloudflare-native stack, or for applications where the edge-inference latency improvement justifies the model quality trade-off. It’s not a drop-in replacement for OpenAI or Anthropic; the model catalog and quality tier are different.

For teams already using Cloudflare Workers for their APIs and Vectorize for vector search, adding Workers AI for embeddings or lightweight inference is a natural extension — same platform, same billing, no new credentials to manage.

For teams whose primary concern is model capability rather than latency, the hosted frontier providers are still the right call. Both have their place.
