Cloudflare Workers AI: Running Models at the Edge Without a GPU Bill
Workers AI gives you access to a catalog of open-weight models — Llama, Mistral, Whisper, embedding models — running in Cloudflare's network. Here's what's actually useful, what the limitations are, and when it makes sense.
Anurag Verma
Most AI inference setups look like this: your app makes a network call to an API (OpenAI, Anthropic, Google), waits for the response, and returns it to the user. The latency is determined by the distance between your server and the model provider’s datacenter, plus whatever the model takes to generate a response.
Cloudflare Workers AI runs inference in Cloudflare’s network, which spans 330+ cities worldwide. When a user in Singapore hits your app, the inference runs in a datacenter near Singapore, not in US-East-1. For latency-sensitive applications, that gap matters.
What’s in the Catalog
Workers AI hosts open-weight models rather than frontier models. You won’t find GPT-4o or Claude here. What you do find:
Text generation:
- Llama 3.1 (8B, 70B)
- Mistral 7B Instruct
- Phi-3 Mini and Medium
- DeepSeek models (7B)
- Qwen 2.5 variants
Text embedding:
- bge-base-en-v1.5
- bge-large-en-v1.5
- multilingual-e5-large-instruct
Speech-to-text:
- Whisper (large-v3-turbo)
Image generation:
- Stable Diffusion XL
- Flux
Vision:
- LLaVA 1.5 7B (image + text input)
- resnet-50 for image classification
Translation:
- m2m100-1.2B (100 languages)
The catalog isn’t exhaustive, but it covers most common AI tasks: chat completions, classification, RAG embeddings, speech transcription, and image understanding.
Getting Started
Workers AI is available inside any Cloudflare Worker. With Wrangler:
npm create cloudflare@latest my-worker -- --type hello-world
cd my-worker
Bind the AI in your wrangler.toml:
name = "my-worker"
main = "src/index.ts"
compatibility_date = "2024-09-23"
[ai]
binding = "AI"
Running a text generation model:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { text } = await request.json() as { text: string };
    const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant. Be concise.",
        },
        {
          role: "user",
          content: text,
        },
      ],
      max_tokens: 512,
    });
    return Response.json(response);
  },
};
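The handler above trusts the request body. A minimal sketch of the same call with input validation follows; `parseChatBody`, `handleChat`, and the `AiBinding` interface are hypothetical names introduced here for illustration (in a real project the binding type would come from your generated Worker types), and the error messages are placeholders.

```typescript
// Result of validating the request body: either the prompt text or an error message.
type ParsedBody = { ok: true; text: string } | { ok: false; error: string };

// Checks that the parsed JSON body has a non-empty string `text` field.
function parseChatBody(body: unknown): ParsedBody {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "Body must be a JSON object" };
  }
  const text = (body as { text?: unknown }).text;
  if (typeof text !== "string" || text.trim() === "") {
    return { ok: false, error: "Field 'text' must be a non-empty string" };
  }
  return { ok: true, text };
}

// Hypothetical minimal shape of the AI binding, covering only the call used here.
interface AiBinding {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

async function handleChat(request: Request, env: { AI: AiBinding }): Promise<Response> {
  // request.json() rejects on malformed JSON; treat that as an invalid body.
  const parsed = parseChatBody(await request.json().catch(() => null));
  if (!parsed.ok) {
    return new Response(JSON.stringify({ error: parsed.error }), {
      status: 400,
      headers: { "Content-Type": "application/json" },
    });
  }
  const result = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: parsed.text }],
    max_tokens: 512,
  });
  return new Response(JSON.stringify(result), {
    headers: { "Content-Type": "application/json" },
  });
}
```

Keeping the validation in a pure function makes it easy to unit test without a Worker runtime or a live model.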
Streaming the response:
const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
  messages: [{ role: "user", content: text }],
  stream: true,
});
return new Response(stream, {
  headers: { "Content-Type": "text/event-stream" },
});
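On the receiving side, the client has to pull the generated text out of the event stream. The sketch below assumes each event arrives as a line of the form `data: {"response":"..."}` and that the stream ends with `data: [DONE]`; verify that against the responses you actually receive. `extractTokens` is a hypothetical helper name.

```typescript
// Extracts generated text from a block of server-sent events.
// Assumes events look like `data: {"response":"..."}` and end with `data: [DONE]`.
function extractTokens(sseText: string): string {
  let output = "";
  for (const line of sseText.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed.startsWith("data:")) continue; // skip blank lines and non-data fields
    const payload = trimmed.slice("data:".length).trim();
    if (payload === "[DONE]") break;
    try {
      const event = JSON.parse(payload) as { response?: string };
      output += event.response ?? "";
    } catch {
      // Ignore payloads that are not complete JSON, such as an event split across chunks.
    }
  }
  return output;
}
```

This handles a fully received block of text. A production client would also buffer partial lines across network chunks before parsing them.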
Embeddings for Semantic Search
The embedding endpoint integrates well with Vectorize (Cloudflare’s vector database):
// Generate an embedding for a search query
const queryEmbedding = await env.AI.run(
  "@cf/baai/bge-base-en-v1.5",
  { text: [query] }
);
// Search Vectorize for similar documents
const results = await env.VECTORIZE.query(queryEmbedding.data[0], {
  topK: 5,
  returnMetadata: true,
});
return Response.json(results.matches);
This combination — Workers AI for embeddings, Vectorize for storage, Workers for the API — is a self-contained semantic search stack that runs entirely in Cloudflare’s network. You don’t need to provision any external services.
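The query side only works once documents have been embedded and stored. Here is a rough sketch of the indexing step, assuming the embedding model returns one vector per input string in `data` and that the index accepts `{ id, values, metadata }` records through `upsert`. The `EmbeddingAi`, `VectorIndex`, and `Doc` types and both function names are hypothetical and cover only what this example needs.

```typescript
// Hypothetical minimal shapes for the two bindings, covering only the calls used below.
interface EmbeddingAi {
  run(model: string, inputs: { text: string[] }): Promise<{ data: number[][] }>;
}
interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, string>;
}
interface VectorIndex {
  upsert(vectors: VectorRecord[]): Promise<unknown>;
}
interface Doc {
  id: string;
  text: string;
}

// Pairs each document with its embedding, in order.
function toVectors(docs: Doc[], embeddings: number[][]): VectorRecord[] {
  if (docs.length !== embeddings.length) {
    throw new Error("Expected one embedding per document");
  }
  return docs.map((doc, i) => ({
    id: doc.id,
    values: embeddings[i],
    metadata: { text: doc.text },
  }));
}

// Embeds a batch of documents and writes them to the index; returns how many were stored.
async function indexDocuments(ai: EmbeddingAi, index: VectorIndex, docs: Doc[]): Promise<number> {
  if (docs.length === 0) return 0;
  const result = await ai.run("@cf/baai/bge-base-en-v1.5", {
    text: docs.map((d) => d.text),
  });
  const vectors = toVectors(docs, result.data);
  await index.upsert(vectors);
  return vectors.length;
}
```

Storing the source text as metadata means a query can return readable snippets without a second lookup, at the cost of a larger index.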
Speech-to-Text with Whisper
Whisper large-v3-turbo via Workers AI:
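export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const formData = await request.formData();
    const audioFile = formData.get("audio");
    if (!(audioFile instanceof File)) {
      return Response.json({ error: "Missing 'audio' file field" }, { status: 400 });
    }
    const bytes = new Uint8Array(await audioFile.arrayBuffer());
    // Base64-encode in chunks so String.fromCharCode never gets a huge argument list.
    let binary = "";
    for (let i = 0; i < bytes.length; i += 0x8000) {
      binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
    }
    const transcript = await env.AI.run("@cf/openai/whisper-large-v3-turbo", {
      audio: btoa(binary),
    });
    return Response.json({ text: transcript.text });
  },
};
The audio input for the turbo model is the file contents as a base64-encoded string, as far as Cloudflare's model documentation describes it at the time of writing; the older @cf/openai/whisper model instead takes the bytes as an array of numbers ([...new Uint8Array(audioBuffer)]). Output is a text transcription. This is a clean way to add transcription to a Worker without routing audio to a separate service.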
Pricing
Workers AI charges in “neurons” — a unit Cloudflare uses to normalize compute across different model sizes.
Free tier: 10,000 neurons/day (resets daily). Roughly 10,000 short Llama 3.1 8B completions, or 100,000+ embeddings.
Paid: $0.011 per 1,000 neurons on Workers Paid plans ($5/month).
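As a rough guide, generating 1,000 tokens with Llama 3.1 8B costs on the order of 20 neurons, though the exact rate depends on the model and differs between input and output tokens. At that rate, 1,000 neurons ($0.011) buys you about 50,000 tokens.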
For classification tasks, content moderation, or embedding generation at moderate scale, the numbers are favorable. For replacing a primary chat API, you’re looking at both latency and model quality trade-offs.
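To turn the figures above into a budget, a small estimator helps. This sketch uses the two numbers quoted in this section (10,000 free neurons per day, $0.011 per 1,000 neurons) and assumes that the free daily allocation is deducted before billing and does not roll over. Both the assumption and the constants should be checked against current pricing, and `estimateMonthlyUsd` is a hypothetical helper name.

```typescript
// Figures quoted in this article; check current pricing before relying on them.
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1000_NEURONS = 0.011;

// Estimates the monthly usage charge for a steady daily neuron consumption.
// Assumes the free daily allocation is deducted first and unused neurons do not roll over.
// The flat Workers Paid plan fee is not included.
function estimateMonthlyUsd(neuronsPerDay: number, days = 30): number {
  const billablePerDay = Math.max(0, neuronsPerDay - FREE_NEURONS_PER_DAY);
  return (billablePerDay * days * USD_PER_1000_NEURONS) / 1000;
}
```

For example, a workload that burns 50,000 neurons a day has 40,000 billable neurons daily, or 1.2 million over 30 days, which comes to about $13.20 under these assumptions.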
Where It Fits Well
Content moderation and classification. Running a smaller model to flag content before it reaches your database is a legitimate use case. The 8B models are good at binary or category classification tasks. Running this at the edge means moderation happens before the request reaches your origin.
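A minimal sketch of that moderation check follows. The three labels, the system prompt, and the names `parseLabel`, `moderate`, and `TextAi` are all assumptions made for illustration; the example also assumes the model's reply arrives in a `response` string field.

```typescript
const LABELS = ["allow", "flag", "block"] as const;
type Label = (typeof LABELS)[number];

// Maps free-form model output onto a known label. Anything unrecognized becomes
// "flag" so that ambiguous answers go to review instead of being silently allowed.
function parseLabel(modelOutput: string): Label {
  const firstWord = modelOutput.toLowerCase().match(/[a-z]+/)?.[0];
  const match = LABELS.find((label) => label === firstWord);
  return match ?? "flag";
}

// Hypothetical minimal shape of the AI binding for text generation.
interface TextAi {
  run(model: string, inputs: Record<string, unknown>): Promise<{ response?: string }>;
}

// Asks the model for a one-word verdict on a piece of user content.
async function moderate(ai: TextAi, content: string): Promise<Label> {
  const result = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      {
        role: "system",
        content: "Classify the user's message as allow, flag, or block. Reply with one word.",
      },
      { role: "user", content },
    ],
    max_tokens: 5,
  });
  return parseLabel(result.response ?? "");
}
```

Matching only the first word is deliberately strict: a reply such as "I would allow this" falls back to "flag" rather than being read as approval.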
Semantic search in a Worker. If you’re building search into an edge app, using Workers AI for embeddings and Vectorize for the index is the path of least resistance.
AI-powered middleware. Rewriting or augmenting API responses — summarizing, translating, extracting structured data — fits the Worker model well. The edge co-location means added latency from the AI call is minimized.
Prototyping and demos. The free tier is generous enough to prototype without a credit card. For internal demos, hackathons, or initial client proofs-of-concept, it works.
What It Doesn’t Do Well
Frontier model quality. Llama 3.1 8B is capable, but it’s not GPT-4o or Claude Sonnet. For tasks that need strong reasoning, complex instruction-following, or accurate function calling at scale, hosted frontier APIs perform better.
Long context. Workers AI model context windows are generally smaller than what frontier APIs offer. If you’re working with large documents, you’ll hit limits faster.
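One common workaround is to split a large document into pieces and process them separately. The sketch below chunks by character count, which is only a rough stand-in for tokens; the `chunkText` name and the paragraph-first strategy are illustrative choices, not something Workers AI prescribes.

```typescript
// Splits text into chunks of at most maxChars characters, breaking on paragraph
// boundaries where possible. Character count is a rough stand-in for tokens.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars < 1) throw new Error("maxChars must be at least 1");
  const chunks: string[] = [];
  let current = "";
  for (const paragraph of text.split(/\n{2,}/)) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    // A single paragraph longer than the limit is hard-split.
    let rest = paragraph;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be summarized or embedded on its own, with the partial results combined afterwards.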
Predictable latency under load. Workers AI’s “serverless” nature means cold starts exist. For apps where response time needs to be consistently below 200ms on the first token, this is a variable to monitor.
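If a slow response is worse than a degraded one, the inference call can be wrapped in a deadline. This is a generic sketch, not a Workers AI feature: `withTimeout` is a hypothetical helper, and it only stops waiting for the result without cancelling the underlying request.

```typescript
// Resolves with the task's result, or with `fallback` if the task takes longer than `ms`.
// A rejected task still rejects; only slowness is replaced by the fallback.
async function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([task, timeout]);
  } finally {
    clearTimeout(timer); // avoid leaving a pending timer once the race settles
  }
}
```

A caller might wrap `env.AI.run(...)` with a deadline of a second or two and fall back to a cached or canned answer, while logging how often the fallback fires to track the latency variance described above.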
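Worker execution limits. Workers have a CPU time limit. Time spent waiting on the inference call generally does not count as CPU time, but a long generation from a 70B model still holds the request open for a long while, which makes timeouts and client disconnects more likely. The 8B models are the safer default inside a Worker. For larger models, stream the response, or call Workers AI from your origin instead. Pages Functions run on the same Workers runtime, so they are subject to the same limits.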
The Practical Picture
Workers AI makes most sense as part of a broader Cloudflare-native stack, or for applications where the edge-inference latency improvement justifies the model quality trade-off. It’s not a drop-in replacement for OpenAI or Anthropic; the model catalog and quality tier are different.
For teams already using Cloudflare Workers for their APIs and Vectorize for vector search, adding Workers AI for embeddings or lightweight inference is a natural extension — same platform, same billing, no new credentials to manage.
For teams whose primary concern is model capability rather than latency, the hosted frontier providers are still the right call. Both have their place.