Last year we built an AI feature for a client's Django application. The standard advice said: spin up a separate FastAPI service for model inference, have Django call it via HTTP, keep concerns separated. Clean architecture. Best practices.
So we did. And then we measured the latency.
The FastAPI service ran inference in 45ms. The network round-trip between Django and FastAPI (both running on the same Kubernetes cluster) added 80ms. Authentication between the services added another 15ms. JSON serialization of the request and deserialization of the response added 12ms. The "clean architecture" overhead was longer than the actual AI work.
We moved the model into the Django process the next week. Total inference time dropped from 152ms to 50ms. We deleted a Dockerfile, a Kubernetes deployment, a service mesh configuration, an internal API client library, and a set of integration tests that existed solely to verify the two services could talk to each other.
This is not always the right call. There are legitimate reasons to run models in a separate service. But the default assumption that "ML must be a microservice" is costing teams time, money, and latency for no good reason. Here is when the monolith wins, and how to do it properly.
## The Case Against the Microservice Split
The microservice approach sounds right in a conference talk: "Separate your ML inference from your application logic. Scale them independently. Use the best tool for each job."
In practice, here is what "separate" actually means for a typical team:
| Concern | Monolith (Django) | Microservice (Django + FastAPI) |
|---|---|---|
| Repositories | 1 | 2 |
| Deployment pipelines | 1 | 2 |
| Docker images | 1 | 2 |
| Authentication systems | 1 (Django auth) | 2 (Django auth + service-to-service auth) |
| API contracts to maintain | 0 (function calls) | 1 (internal REST/gRPC API) |
| Integration tests | Standard Django tests | Cross-service integration tests |
| Monitoring dashboards | 1 | 2 (plus inter-service latency) |
| Infrastructure cost | 1 server/pod | 2 servers/pods (ML service needs GPU or high memory) |
| Debugging a request | One log stream | Distributed tracing across two services |
For a team of 3-8 engineers working on a product where AI is a feature (not the entire product), the microservice split doubles your operational surface area. Every deployment is coordinated across two services. Every schema change in the ML response requires updating two codebases. Every on-call incident requires checking two sets of logs.
The actual cost comparison from our client project:
| Item | Microservice Setup | Monolith Setup | Monthly Savings |
|---|---|---|---|
| Kubernetes pods | 4 (2 Django + 2 FastAPI) | 3 (Django with more memory) | $85 |
| CI/CD pipeline minutes | 2x (two builds, two deployments) | 1x | $25 |
| Engineer hours on infra | ~8 hrs/month (service mesh, auth, debugging) | ~2 hrs/month | ~$600 (at loaded cost) |
| Latency overhead | 107ms per AI request | 5ms (function call) | N/A (but impacts UX) |
The $710/month difference is not huge for a well-funded team. But the 6 engineer-hours difference is. Those hours compound -- they are 6 hours not spent on features, not spent improving the model, not spent talking to users.
## Loading Models in Django
The most common question: "If I load a 500MB model in Django, won't it eat all my memory?"
Yes, if you do it wrong. Here is how to do it right.
### The `AppConfig.ready()` Pattern
Django's `AppConfig.ready()` method runs once when the application starts. This is where you load your models:
```python
# ai_features/apps.py
import logging

from django.apps import AppConfig

logger = logging.getLogger(__name__)


class AiFeaturesConfig(AppConfig):
    name = "ai_features"

    def ready(self):
        # Import here to avoid circular imports
        from .model_registry import ModelRegistry

        ModelRegistry.initialize()
        logger.info("AI models loaded successfully")
```

```python
# ai_features/model_registry.py
import logging
from pathlib import Path

import torch
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)


class ModelRegistry:
    """Singleton registry for ML models. Loaded once at startup."""

    _models = {}
    _initialized = False

    @classmethod
    def initialize(cls):
        if cls._initialized:
            return
        model_dir = Path(__file__).parent / "models"

        # Load embedding model (~120MB)
        cls._models["embeddings"] = SentenceTransformer(
            str(model_dir / "all-MiniLM-L6-v2"),
            device="cpu",  # or "cuda" if you have a GPU
        )

        # Load classification model (~85MB)
        cls._models["classifier"] = torch.jit.load(
            str(model_dir / "content_classifier.pt"),
            map_location="cpu",
        )
        cls._models["classifier"].eval()  # Set to inference mode

        cls._initialized = True
        logger.info(f"Loaded {len(cls._models)} models")

    @classmethod
    def get(cls, name: str):
        if not cls._initialized:
            raise RuntimeError("ModelRegistry not initialized. Check AppConfig.ready()")
        if name not in cls._models:
            raise KeyError(f"Unknown model: {name}. Available: {list(cls._models.keys())}")
        return cls._models[name]
```

```python
# ai_features/views.py
import torch
from django.http import JsonResponse

from .model_registry import ModelRegistry

# `tokenize` and `LABELS` are assumed to be defined elsewhere in the app
# (the classifier's tokenizer and its ordered label list)


def classify_content(request):
    text = request.POST["content"]
    model = ModelRegistry.get("classifier")
    with torch.no_grad():
        prediction = model(tokenize(text))
    category = LABELS[prediction.argmax().item()]
    return JsonResponse({"category": category, "confidence": prediction.max().item()})
```

The model loads once when gunicorn starts, and all workers share the parent process's memory via copy-on-write (when using `--preload`). A 500MB model does not use 500MB per worker -- it uses 500MB total.
### Model Size Guidelines
Not every model belongs in your Django process. Here is our rule of thumb:
| Model Size | Strategy | Example |
|---|---|---|
| < 100MB | Load in Django process | Scikit-learn models, small transformers, TF-Lite |
| 100MB - 1GB | Load in Django with `--preload` | Sentence transformers, medium classification models |
| 1GB - 5GB | Consider separate process on same machine | Larger transformer models, image models |
| > 5GB | Separate service or managed inference (SageMaker, Replicate) | Large language models, Stable Diffusion |
The dividing line is not about capability -- it is about memory and startup time. A 2GB model takes 15-30 seconds to load, which means slow deploys and slow autoscaling. Above 5GB, you are typically looking at GPU inference, and that justifies a separate service because GPU instances have different scaling characteristics.
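Before committing to a row of that table, it is worth measuring rather than guessing. A minimal sketch of a load-time and memory probe -- `measure_load` is a hypothetical helper, and `tracemalloc` only tracks Python-heap allocations, so native tensor memory (e.g. PyTorch) will be undercounted; for that, check process RSS instead:

```python
# Rough probe: how long does a model take to load, and how much Python-heap
# memory does it add? Use the numbers to pick a row in the size table.
# NOTE: tracemalloc misses native (C/CUDA) allocations -- treat as a lower bound.
import time
import tracemalloc


def measure_load(loader):
    """Run `loader()` once, returning (model, seconds, peak_heap_bytes)."""
    tracemalloc.start()
    start = time.monotonic()
    model = loader()
    elapsed = time.monotonic() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return model, elapsed, peak


# Stand-in loader for illustration; swap in something like
# lambda: SentenceTransformer("all-MiniLM-L6-v2") for a real measurement.
model, secs, peak_bytes = measure_load(lambda: bytearray(10_000_000))
print(f"loaded in {secs:.2f}s, peak heap ~{peak_bytes / 1e6:.0f}MB")
```

If the measured load time pushes past your deploy or autoscaling budget, that is the signal to move down the table toward a separate process or service.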
## Streaming LLM Responses from Django
This is the pattern that most teams are asking about in 2026: streaming responses from language models. The user asks a question, the LLM generates tokens one at a time, and you want to stream them to the browser as they are produced.
Django handles this beautifully with StreamingHttpResponse and async views:
```python
# ai_features/views.py
import anthropic
from django.http import StreamingHttpResponse

client = anthropic.AsyncAnthropic()  # Uses ANTHROPIC_API_KEY env var

# `get_conversation_history` is assumed to be an async helper defined elsewhere


async def chat_response(request):
    user_message = request.POST["message"]
    conversation = await get_conversation_history(request.user, request.POST["conversation_id"])

    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6-20250514",
            max_tokens=1024,
            messages=[
                *conversation,
                {"role": "user", "content": user_message},
            ],
        ) as stream:
            async for text in stream.text_stream:
                # Server-Sent Events format
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingHttpResponse(
        generate(),
        content_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # Disable nginx buffering
        },
    )
```

On the frontend, this pairs perfectly with HTMX's SSE extension or a simple EventSource:
```javascript
// Minimal JS for streaming -- works with Django's SSE response
const source = new EventSource(`/ai/chat/?${params}`);
const output = document.getElementById("chat-output");

source.onmessage = (event) => {
  if (event.data === "[DONE]") {
    source.close();
    return;
  }
  output.textContent += event.data;
};
```

Critical deployment detail: you need ASGI for streaming. WSGI (gunicorn with sync workers) buffers the entire response before sending it. Use uvicorn or daphne, or use gunicorn with uvicorn workers:
```shell
gunicorn myproject.asgi:application \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 4 \
  --timeout 120  # LLM responses can take a while
```

For more on the ASGI vs. WSGI decision, see our async Django assessment.
## Building a RAG Pipeline in Django
Retrieval-Augmented Generation (RAG) is the most common AI pattern we implement for clients. The typical setup: take a user's question, find relevant documents, and feed them to an LLM for a grounded answer. Here is how we build this entirely within Django.
### Step 1: Vector Storage with pgvector
If you are already running PostgreSQL (and you are, because you are using Django), add pgvector:
```sql
-- One-time setup (run as superuser or via migration)
CREATE EXTENSION IF NOT EXISTS vector;
```

```python
# documents/models.py
from django.db import models
from pgvector.django import IvfflatIndex, VectorField


class Document(models.Model):
    title = models.CharField(max_length=300)
    content = models.TextField()
    embedding = VectorField(dimensions=384)  # Matches all-MiniLM-L6-v2 output
    source_url = models.URLField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        indexes = [
            # IVFFlat index for fast cosine-similarity search
            # (a plain models.Index would create a btree, which pgvector cannot use)
            IvfflatIndex(
                name="document_embedding_idx",
                fields=["embedding"],
                lists=100,  # number of clusters; tune for your data size
                opclasses=["vector_cosine_ops"],
            ),
        ]
```

### Step 2: Background Embedding Generation
When documents are created or updated, generate embeddings using the Tasks framework:
```python
# documents/tasks.py
from django.tasks import task

from ai_features.model_registry import ModelRegistry
from .models import Document


@task(queue="embeddings")
def generate_embedding(document_id: int):
    doc = Document.objects.get(id=document_id)
    model = ModelRegistry.get("embeddings")

    # Generate embedding from document content
    embedding = model.encode(doc.content, normalize_embeddings=True)

    # Store as list (pgvector handles the conversion)
    doc.embedding = embedding.tolist()
    doc.save(update_fields=["embedding", "updated_at"])
```

```python
# documents/signals.py
from django.db.models.signals import post_save
from django.dispatch import receiver

from .models import Document


@receiver(post_save, sender=Document)
def queue_embedding(sender, instance, update_fields=None, **kwargs):
    # Skip saves made by the embedding task itself, or the signal would
    # re-enqueue the task in a loop
    if update_fields and "embedding" in update_fields:
        return
    from .tasks import generate_embedding

    generate_embedding.enqueue(document_id=instance.id)
```

### Step 3: RAG Query View
```python
# ai_features/views.py
import anthropic
from django.http import StreamingHttpResponse
from pgvector.django import CosineDistance

from ai_features.model_registry import ModelRegistry
from documents.models import Document

client = anthropic.AsyncAnthropic()


async def ask_question(request):
    question = request.POST["question"]

    # 1. Embed the question
    embedder = ModelRegistry.get("embeddings")
    question_embedding = embedder.encode(question, normalize_embeddings=True).tolist()

    # 2. Find relevant documents (vector similarity search, async iteration)
    relevant_docs = [
        doc
        async for doc in Document.objects.annotate(
            distance=CosineDistance("embedding", question_embedding)
        ).order_by("distance")[:5]
    ]

    # 3. Build context from retrieved documents
    context = "\n\n---\n\n".join(
        f"Source: {doc.title}\n{doc.content}" for doc in relevant_docs
    )

    # 4. Stream LLM response with context
    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6-20250514",
            max_tokens=1024,
            system=f"Answer based on the following documents:\n\n{context}",
            messages=[{"role": "user", "content": question}],
        ) as stream:
            async for text in stream.text_stream:
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingHttpResponse(
        generate(),
        content_type="text/event-stream",
    )
```

The entire RAG pipeline -- embedding generation, vector search, LLM streaming -- runs within Django. No separate services. No internal API calls. No message queues (except Django Tasks for background embedding generation, which uses your existing PostgreSQL).
## The Django AI Toolkit in 2026
One of the questions we get asked most: "Which AI framework should I use with Django?" Here is our honest assessment:
| Tool | What It Does | When to Use It | When to Skip It |
|---|---|---|---|
| `anthropic` SDK | Direct Claude API calls | Always (for LLM features) | Never -- it is the cleanest SDK |
| `pgvector` + `django-pgvector` | Vector similarity search in PostgreSQL | RAG, semantic search, recommendations | If you need >10M vectors (consider Pinecone/Weaviate) |
| `sentence-transformers` | Local embedding generation | When you embed frequently or need offline capability | If you only embed occasionally (use API embeddings instead) |
| `torch` (PyTorch) | Custom model inference | Classification, NER, image models | If you only need LLM features (use API) |
| `django-tasks` | Background ML jobs | Embedding generation, batch inference, model retraining | If jobs need complex workflows (use Celery) |
| `langchain` | LLM application framework | -- | Almost always skip it (see below) |
The anti-LangChain opinion: I know this is controversial, but after building six production LLM applications, we stopped using LangChain entirely. The abstraction layer adds complexity without proportional value. The anthropic SDK is clean, well-documented, and does exactly what you need. LangChain's chain abstraction makes simple things verbose and complex things opaque. When something breaks in a LangChain pipeline, you are debugging through five layers of abstraction to find a prompt that needs one word changed.
Direct SDK calls with Django views give you:

- Debuggability -- you can read every line of the request/response flow
- Type safety -- the `anthropic` SDK has excellent type hints
- Flexibility -- no framework opinions about prompt structure or chain composition
- Fewer dependencies -- LangChain pulls in 50+ transitive dependencies
Build your own thin wrapper around the SDK if you need shared patterns. It will be 50 lines of code that you understand completely, instead of a framework you understand partially.
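As a sketch of what that thin wrapper might look like -- `LLMCaller` and its defaults are illustrative names, not a library; only the `client.messages.create` call shape comes from the `anthropic` SDK:

```python
# Hypothetical ~25-line wrapper: shared model defaults and message assembly,
# with the SDK client injected so it stays trivial to test.
from dataclasses import dataclass, field


@dataclass
class LLMCaller:
    client: object                            # e.g. anthropic.AsyncAnthropic()
    model: str = "claude-sonnet-4-6-20250514"
    max_tokens: int = 1024
    history: list = field(default_factory=list)

    def build_messages(self, user_message: str) -> list:
        """Conversation history plus the new user turn, in Messages API shape."""
        return [*self.history, {"role": "user", "content": user_message}]

    async def ask(self, user_message: str, system: str = "") -> str:
        """One non-streaming completion; callers own retries and logging."""
        resp = await self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            system=system,
            messages=self.build_messages(user_message),
        )
        return resp.content[0].text
```

Because the client is a constructor argument, unit tests can pass a stub and assert on `build_messages` without any network calls.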
## Production Patterns
### Gunicorn `--preload` for Shared Model Memory
This is critical for serving models from Django. Without --preload, each gunicorn worker loads its own copy of the model:
```shell
# BAD: Each worker loads the model separately
# 4 workers x 500MB model = 2GB RAM
gunicorn myproject.wsgi:application --workers 4

# GOOD: Model loaded once in master process, shared via copy-on-write
# 4 workers sharing 1 copy = ~500MB RAM + small per-worker overhead
gunicorn myproject.wsgi:application --workers 4 --preload
```

The `--preload` flag loads your application code (including `AppConfig.ready()` and model loading) in the master process before forking workers. Workers share the parent's memory pages via the OS copy-on-write mechanism. As long as workers only read the model weights (which they should -- inference is read-only), memory is shared.
### Request Timeouts
LLM API calls can take 5-30 seconds. Default gunicorn timeout is 30 seconds. Streaming responses need even more time. Configure accordingly:
```python
# gunicorn.conf.py
timeout = 120           # Seconds before killing a worker
graceful_timeout = 30   # Seconds to finish in-progress requests on shutdown
keepalive = 5           # Seconds to wait for next request on keep-alive connection

# For ASGI with uvicorn workers
worker_class = "uvicorn.workers.UvicornWorker"
workers = 4
```

### Rate Limiting
Protect your AI endpoints from abuse. We use django-ratelimit with per-user limits:
```python
from django_ratelimit.decorators import ratelimit


@ratelimit(key="user", rate="20/m", method="POST", block=True)
async def chat_response(request):
    # ... streaming LLM response
    pass
```

Twenty requests per minute per user is generous for a chat interface. Adjust based on your LLM costs and expected usage. At ~$0.003 per Claude Sonnet request, 20 requests/minute/user can add up fast if someone writes a script against your endpoint.
### Monitoring
Track your AI-specific metrics:
```python
# ai_features/middleware.py
import time
import logging

logger = logging.getLogger("ai_metrics")


class AIMetricsMiddleware:
    AI_PATHS = ["/ai/", "/api/chat/", "/api/search/"]

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        is_ai = any(request.path.startswith(p) for p in self.AI_PATHS)
        if not is_ai:
            return self.get_response(request)

        start = time.monotonic()
        response = self.get_response(request)
        duration = time.monotonic() - start

        logger.info(
            "ai_request",
            extra={
                "path": request.path,
                "duration_ms": round(duration * 1000),
                "status": response.status_code,
                "user_id": getattr(request.user, "id", None),
                "model": response.get("X-AI-Model", "unknown"),
            },
        )
        return response
```

## When to Actually Use a Separate Service
I have spent this entire post arguing for the monolith approach. Let me be fair about when the microservice split is genuinely better:
- GPU inference -- if your model needs a GPU and your Django app does not, separate them. GPU instances are expensive and scale differently.
- Model > 5GB -- loading a 7B parameter model into your Django process is not practical. Use a dedicated inference server (vLLM, TGI, or a managed service).
- Different scaling profiles -- if your AI endpoint gets 100x more traffic than your web app during peak hours, independent scaling makes sense.
- Team boundaries -- if you have a dedicated ML team that deploys on their own schedule, a separate service with a stable API contract reduces coordination overhead.
- Multiple consumers -- if the same model serves Django, a mobile app backend, and a data pipeline, a shared inference service avoids duplication.
For everyone else -- and that is most teams building AI features into existing Django applications -- keep it in the monolith until the pain of doing so is real, not theoretical.
This is the fourth post in our Django in 2026 series. Previously: Async Django -- an honest assessment. Next up: DRF vs FastAPI -- an honest comparison.
Building AI features into a Django application? At CODERCOPS we have shipped production ML systems ranging from simple classification to full RAG pipelines, all within Django. Whether you are evaluating architectures or debugging latency issues, we can help. Check out our post on building AI agents that actually work or browse our other engineering deep dives for more production-tested patterns.