Last year we built an AI feature for a client's Django application. The standard advice said: spin up a separate FastAPI service for model inference, have Django call it via HTTP, keep concerns separated. Clean architecture. Best practices.
So we did. And then we measured the latency.
The FastAPI service ran inference in 45ms. The network round-trip between Django and FastAPI (both running on the same Kubernetes cluster) added 80ms. Authentication between the services added another 15ms. JSON serialization of the request and deserialization of the response added 12ms. The "clean architecture" overhead was longer than the actual AI work.
We moved the model into the Django process the next week. Total inference time dropped from 152ms to 50ms. We deleted a Dockerfile, a Kubernetes deployment, a service mesh configuration, an internal API client library, and a set of integration tests that existed solely to verify the two services could talk to each other.
This is not always the right call. There are legitimate reasons to run models in a separate service. But the default assumption that "ML must be a microservice" is costing teams time, money, and latency for no good reason. Here is when the monolith wins, and how to do it properly.
## The Case Against the Microservice Split
The microservice approach sounds right in a conference talk: "Separate your ML inference from your application logic. Scale them independently. Use the best tool for each job."
In practice, here is what "separate" actually means for a typical team:
| Concern | Monolith (Django) | Microservice (Django + FastAPI) |
|---|---|---|
| Repositories | 1 | 2 |
| Deployment pipelines | 1 | 2 |
| Docker images | 1 | 2 |
| Authentication systems | 1 (Django auth) | 2 (Django auth + service-to-service auth) |
| API contracts to maintain | 0 (function calls) | 1 (internal REST/gRPC API) |
| Integration tests | Standard Django tests | Cross-service integration tests |
| Monitoring dashboards | 1 | 2 (plus inter-service latency) |
| Infrastructure cost | 1 server/pod | 2 servers/pods (ML service needs GPU or high memory) |
| Debugging a request | One log stream | Distributed tracing across two services |
For a team of 3-8 engineers working on a product where AI is a feature (not the entire product), the microservice split doubles your operational surface area. Every deployment is coordinated across two services. Every schema change in the ML response requires updating two codebases. Every on-call incident requires checking two sets of logs.
The actual cost comparison from our client project:
| Item | Microservice Setup | Monolith Setup | Monthly Savings |
|---|---|---|---|
| Kubernetes pods | 4 (2 Django + 2 FastAPI) | 3 (Django with more memory) | $85 |
| CI/CD pipeline minutes | 2x (two builds, two deployments) | 1x | $25 |
| Engineer hours on infra | ~8 hrs/month (service mesh, auth, debugging) | ~2 hrs/month | ~$600 (at loaded cost) |
| Latency overhead | 107ms per AI request | 5ms (function call) | N/A (but impacts UX) |
The $710/month difference is not huge for a well-funded team. But the 6 engineer-hours difference is. Those hours compound -- they are 6 hours not spent on features, not spent improving the model, not spent talking to users.
## Loading Models in Django
The most common question: "If I load a 500MB model in Django, won't it eat all my memory?"
Yes, if you do it wrong. Here is how to do it right.
### The `AppConfig.ready()` Pattern
Django's `AppConfig.ready()` method runs once when the application starts. This is where you load your models:
```python
# ai_features/apps.py
import logging

from django.apps import AppConfig

logger = logging.getLogger(__name__)


class AiFeaturesConfig(AppConfig):
    name = "ai_features"

    def ready(self):
        # Import here to avoid circular imports
        from .model_registry import ModelRegistry

        ModelRegistry.initialize()
        logger.info("AI models loaded successfully")
```

```python
# ai_features/model_registry.py
import logging
from pathlib import Path

import torch
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)


class ModelRegistry:
    """Singleton registry for ML models. Loaded once at startup."""

    _models = {}
    _initialized = False

    @classmethod
    def initialize(cls):
        if cls._initialized:
            return
        model_dir = Path(__file__).parent / "models"

        # Load embedding model (~120MB)
        cls._models["embeddings"] = SentenceTransformer(
            str(model_dir / "all-MiniLM-L6-v2"),
            device="cpu",  # or "cuda" if you have a GPU
        )

        # Load classification model (~85MB)
        cls._models["classifier"] = torch.jit.load(
            str(model_dir / "content_classifier.pt"),
            map_location="cpu",
        )
        cls._models["classifier"].eval()  # Set to inference mode

        cls._initialized = True
        logger.info(f"Loaded {len(cls._models)} models")

    @classmethod
    def get(cls, name: str):
        if not cls._initialized:
            raise RuntimeError("ModelRegistry not initialized. Check AppConfig.ready()")
        if name not in cls._models:
            raise KeyError(f"Unknown model: {name}. Available: {list(cls._models.keys())}")
        return cls._models[name]
```

```python
# ai_features/views.py
import torch
from django.http import JsonResponse

from .model_registry import ModelRegistry

# `tokenize` and `LABELS` are assumed to be defined elsewhere in the app
# (the classifier's tokenizer and its ordered label list)


def classify_content(request):
    text = request.POST["content"]
    model = ModelRegistry.get("classifier")
    with torch.no_grad():
        prediction = model(tokenize(text))
    category = LABELS[prediction.argmax().item()]
    return JsonResponse({"category": category, "confidence": prediction.max().item()})
```

The model loads once when gunicorn starts, and all workers share the parent process's memory via copy-on-write (when using `--preload`). A 500MB model does not use 500MB per worker -- it uses 500MB total.
### Model Size Guidelines
Not every model belongs in your Django process. Here is our rule of thumb:
| Model Size | Strategy | Example |
|---|---|---|
| < 100MB | Load in Django process | Scikit-learn models, small transformers, TF-Lite |
| 100MB - 1GB | Load in Django with `--preload` | Sentence transformers, medium classification models |
| 1GB - 5GB | Consider separate process on same machine | Larger transformer models, image models |
| > 5GB | Separate service or managed inference (SageMaker, Replicate) | Large language models, Stable Diffusion |
The dividing line is not about capability -- it is about memory and startup time. A 2GB model takes 15-30 seconds to load, which means slow deploys and slow autoscaling. Above 5GB, you are typically looking at GPU inference, and that justifies a separate service because GPU instances have different scaling characteristics.
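Before committing to a row of that table, it is worth measuring rather than guessing. A minimal sketch of a load-time and memory probe -- `measure_load` is a hypothetical helper, and `tracemalloc` only tracks Python-heap allocations, so native tensor memory (e.g. PyTorch) will be undercounted; for that, check process RSS instead:

```python
# Rough probe: how long does a model take to load, and how much Python-heap
# memory does it add? Use the numbers to pick a row in the size table.
# NOTE: tracemalloc misses native (C/CUDA) allocations -- treat as a lower bound.
import time
import tracemalloc


def measure_load(loader):
    """Run `loader()` once, returning (model, seconds, peak_heap_bytes)."""
    tracemalloc.start()
    start = time.monotonic()
    model = loader()
    elapsed = time.monotonic() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return model, elapsed, peak


# Stand-in loader for illustration; swap in something like
# lambda: SentenceTransformer("all-MiniLM-L6-v2") for a real measurement.
model, secs, peak_bytes = measure_load(lambda: bytearray(10_000_000))
print(f"loaded in {secs:.2f}s, peak heap ~{peak_bytes / 1e6:.0f}MB")
```

If the measured load time pushes past your deploy or autoscaling budget, that is the signal to move down the table toward a separate process or service.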
## Streaming LLM Responses from Django
This is the pattern that most teams are asking about in 2026: streaming responses from language models. The user asks a question, the LLM generates tokens one at a time, and you want to stream them to the browser as they are produced.
Django handles this beautifully with StreamingHttpResponse and async views:
```python
# ai_features/views.py
import anthropic
from django.http import StreamingHttpResponse

client = anthropic.AsyncAnthropic()  # Uses ANTHROPIC_API_KEY env var

# `get_conversation_history` is assumed to be an async helper defined elsewhere


async def chat_response(request):
    user_message = request.POST["message"]
    conversation = await get_conversation_history(request.user, request.POST["conversation_id"])

    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6-20250514",
            max_tokens=1024,
            messages=[
                *conversation,
                {"role": "user", "content": user_message},
            ],
        ) as stream:
            async for text in stream.text_stream:
                # Server-Sent Events format
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingHttpResponse(
        generate(),
        content_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # Disable nginx buffering
        },
    )
```

On the frontend, this pairs perfectly with HTMX's SSE extension or a simple EventSource:
```javascript
// Minimal JS for streaming -- works with Django's SSE response
const source = new EventSource(`/ai/chat/?${params}`);
const output = document.getElementById("chat-output");

source.onmessage = (event) => {
  if (event.data === "[DONE]") {
    source.close();
    return;
  }
  output.textContent += event.data;
};
```

Critical deployment detail: you need ASGI for streaming. WSGI (gunicorn with sync workers) buffers the entire response before sending it. Use uvicorn or daphne, or use gunicorn with uvicorn workers:
```shell
gunicorn myproject.asgi:application \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 4 \
  --timeout 120  # LLM responses can take a while
```

For more on the ASGI vs. WSGI decision, see our async Django assessment.
## Building a RAG Pipeline in Django
Retrieval-Augmented Generation (RAG) is the most common AI pattern we implement for clients. The typical setup: take a user's question, find relevant documents, and feed them to an LLM for a grounded answer. Here is how we build this entirely within Django.
### Step 1: Vector Storage with pgvector
If you are already running PostgreSQL (and you are, because you are using Django), add pgvector:
```sql
-- One-time setup (run as superuser or via migration)
CREATE EXTENSION IF NOT EXISTS vector;
```

```python
# documents/models.py
from django.db import models
from pgvector.django import IvfflatIndex, VectorField


class Document(models.Model):
    title = models.CharField(max_length=300)
    content = models.TextField()
    embedding = VectorField(dimensions=384)  # Matches all-MiniLM-L6-v2 output
    source_url = models.URLField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)

    class Meta:
        indexes = [
            # IVFFlat index for fast cosine-similarity search
            # (a plain models.Index would create a btree, which pgvector cannot use)
            IvfflatIndex(
                name="document_embedding_idx",
                fields=["embedding"],
                lists=100,  # number of clusters; tune for your data size
                opclasses=["vector_cosine_ops"],
            ),
        ]
```

### Step 2: Background Embedding Generation
When documents are created or updated, generate embeddings using the Tasks framework:
```python
# documents/tasks.py
from django.tasks import task

from ai_features.model_registry import ModelRegistry
from .models import Document


@task(queue="embeddings")
def generate_embedding(document_id: int):
    doc = Document.objects.get(id=document_id)
    model = ModelRegistry.get("embeddings")

    # Generate embedding from document content
    embedding = model.encode(doc.content, normalize_embeddings=True)

    # Store as list (pgvector handles the conversion)
    doc.embedding = embedding.tolist()
    doc.save(update_fields=["embedding", "updated_at"])
```

```python
# documents/signals.py
from django.db.models.signals import post_save
from django.dispatch import receiver

from .models import Document


@receiver(post_save, sender=Document)
def queue_embedding(sender, instance, update_fields=None, **kwargs):
    # Skip saves made by the embedding task itself, or the signal would
    # re-enqueue the task in a loop
    if update_fields and "embedding" in update_fields:
        return
    from .tasks import generate_embedding

    generate_embedding.enqueue(document_id=instance.id)
```

### Step 3: RAG Query View
```python
# ai_features/views.py
import anthropic
from django.http import StreamingHttpResponse
from pgvector.django import CosineDistance

from ai_features.model_registry import ModelRegistry
from documents.models import Document

client = anthropic.AsyncAnthropic()


async def ask_question(request):
    question = request.POST["question"]

    # 1. Embed the question
    embedder = ModelRegistry.get("embeddings")
    question_embedding = embedder.encode(question, normalize_embeddings=True).tolist()

    # 2. Find relevant documents (vector similarity search, async iteration)
    relevant_docs = [
        doc
        async for doc in Document.objects.annotate(
            distance=CosineDistance("embedding", question_embedding)
        ).order_by("distance")[:5]
    ]

    # 3. Build context from retrieved documents
    context = "\n\n---\n\n".join(
        f"Source: {doc.title}\n{doc.content}" for doc in relevant_docs
    )

    # 4. Stream LLM response with context
    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6-20250514",
            max_tokens=1024,
            system=f"Answer based on the following documents:\n\n{context}",
            messages=[{"role": "user", "content": question}],
        ) as stream:
            async for text in stream.text_stream:
                yield f"data: {text}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingHttpResponse(
        generate(),
        content_type="text/event-stream",
    )
```

The entire RAG pipeline -- embedding generation, vector search, LLM streaming -- runs within Django. No separate services. No internal API calls. No message queues (except Django Tasks for background embedding generation, which uses your existing PostgreSQL).
## The Django AI Toolkit in 2026
One of the questions we get asked most: "Which AI framework should I use with Django?" Here is our honest assessment:
| Tool | What It Does | When to Use It | When to Skip It |
|---|---|---|---|
| `anthropic` SDK | Direct Claude API calls | Always (for LLM features) | Never -- it is the cleanest SDK |
| `pgvector` + `django-pgvector` | Vector similarity search in PostgreSQL | RAG, semantic search, recommendations | If you need >10M vectors (consider Pinecone/Weaviate) |
| `sentence-transformers` | Local embedding generation | When you embed frequently or need offline capability | If you only embed occasionally (use API embeddings instead) |
| `torch` (PyTorch) | Custom model inference | Classification, NER, image models | If you only need LLM features (use API) |
| `django-tasks` | Background ML jobs | Embedding generation, batch inference, model retraining | If jobs need complex workflows (use Celery) |
| `langchain` | LLM application framework | -- | Almost always skip it (see below) |
The anti-LangChain opinion: I know this is controversial, but after building six production LLM applications, we stopped using LangChain entirely. The abstraction layer adds complexity without proportional value. The anthropic SDK is clean, well-documented, and does exactly what you need. LangChain's chain abstraction makes simple things verbose and complex things opaque. When something breaks in a LangChain pipeline, you are debugging through five layers of abstraction to find a prompt that needs one word changed.
Direct SDK calls with Django views give you:

- Debuggability -- you can read every line of the request/response flow
- Type safety -- the `anthropic` SDK has excellent type hints
- Flexibility -- no framework opinions about prompt structure or chain composition
- Fewer dependencies -- LangChain pulls in 50+ transitive dependencies
Build your own thin wrapper around the SDK if you need shared patterns. It will be 50 lines of code that you understand completely, instead of a framework you understand partially.
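As a sketch of what that thin wrapper might look like -- `LLMCaller` and its defaults are illustrative names, not a library; only the `client.messages.create` call shape comes from the `anthropic` SDK:

```python
# Hypothetical ~25-line wrapper: shared model defaults and message assembly,
# with the SDK client injected so it stays trivial to test.
from dataclasses import dataclass, field


@dataclass
class LLMCaller:
    client: object                            # e.g. anthropic.AsyncAnthropic()
    model: str = "claude-sonnet-4-6-20250514"
    max_tokens: int = 1024
    history: list = field(default_factory=list)

    def build_messages(self, user_message: str) -> list:
        """Conversation history plus the new user turn, in Messages API shape."""
        return [*self.history, {"role": "user", "content": user_message}]

    async def ask(self, user_message: str, system: str = "") -> str:
        """One non-streaming completion; callers own retries and logging."""
        resp = await self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            system=system,
            messages=self.build_messages(user_message),
        )
        return resp.content[0].text
```

Because the client is a constructor argument, unit tests can pass a stub and assert on `build_messages` without any network calls.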
## Production Patterns
### Gunicorn `--preload` for Shared Model Memory
This is critical for serving models from Django. Without --preload, each gunicorn worker loads its own copy of the model:
```shell
# BAD: Each worker loads the model separately
# 4 workers x 500MB model = 2GB RAM
gunicorn myproject.wsgi:application --workers 4

# GOOD: Model loaded once in master process, shared via copy-on-write
# 4 workers sharing 1 copy = ~500MB RAM + small per-worker overhead
gunicorn myproject.wsgi:application --workers 4 --preload
```

The `--preload` flag loads your application code (including `AppConfig.ready()` and model loading) in the master process before forking workers. Workers share the parent's memory pages via the OS copy-on-write mechanism. As long as workers only read the model weights (which they should -- inference is read-only), memory is shared.
### Request Timeouts
LLM API calls can take 5-30 seconds. Default gunicorn timeout is 30 seconds. Streaming responses need even more time. Configure accordingly:
```python
# gunicorn.conf.py
timeout = 120           # Seconds before killing a worker
graceful_timeout = 30   # Seconds to finish in-progress requests on shutdown
keepalive = 5           # Seconds to wait for next request on keep-alive connection

# For ASGI with uvicorn workers
worker_class = "uvicorn.workers.UvicornWorker"
workers = 4
```

### Rate Limiting
Protect your AI endpoints from abuse. We use django-ratelimit with per-user limits:
```python
from django_ratelimit.decorators import ratelimit


@ratelimit(key="user", rate="20/m", method="POST", block=True)
async def chat_response(request):
    # ... streaming LLM response
    pass
```

Twenty requests per minute per user is generous for a chat interface. Adjust based on your LLM costs and expected usage. At ~$0.003 per Claude Sonnet request, 20 requests/minute/user can add up fast if someone writes a script against your endpoint.
### Monitoring
Track your AI-specific metrics:
```python
# ai_features/middleware.py
import time
import logging

logger = logging.getLogger("ai_metrics")


class AIMetricsMiddleware:
    AI_PATHS = ["/ai/", "/api/chat/", "/api/search/"]

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        is_ai = any(request.path.startswith(p) for p in self.AI_PATHS)
        if not is_ai:
            return self.get_response(request)

        start = time.monotonic()
        response = self.get_response(request)
        duration = time.monotonic() - start

        logger.info(
            "ai_request",
            extra={
                "path": request.path,
                "duration_ms": round(duration * 1000),
                "status": response.status_code,
                "user_id": getattr(request.user, "id", None),
                "model": response.get("X-AI-Model", "unknown"),
            },
        )
        return response
```

## When to Actually Use a Separate Service
I have spent this entire post arguing for the monolith approach. Let me be fair about when the microservice split is genuinely better:
- GPU inference -- if your model needs a GPU and your Django app does not, separate them. GPU instances are expensive and scale differently.
- Model > 5GB -- loading a 7B parameter model into your Django process is not practical. Use a dedicated inference server (vLLM, TGI, or a managed service).
- Different scaling profiles -- if your AI endpoint gets 100x more traffic than your web app during peak hours, independent scaling makes sense.
- Team boundaries -- if you have a dedicated ML team that deploys on their own schedule, a separate service with a stable API contract reduces coordination overhead.
- Multiple consumers -- if the same model serves Django, a mobile app backend, and a data pipeline, a shared inference service avoids duplication.
For everyone else -- and that is most teams building AI features into existing Django applications -- keep it in the monolith until the pain of doing so is real, not theoretical.
This is the fourth post in our Django in 2026 series. Previously: Async Django -- an honest assessment. Next up: DRF vs FastAPI -- an honest comparison.
Building AI features into a Django application? At CODERCOPS we have shipped production ML systems ranging from simple classification to full RAG pipelines, all within Django. Whether you are evaluating architectures or debugging latency issues, we can help. Check out our post on building AI agents that actually work or browse our other engineering deep dives for more production-tested patterns.