Running LLMs Locally with Ollama: A Practical Guide for Developers

Most teams reach for a hosted API when they need an LLM. That makes sense for production apps: you pay per token, you don’t manage GPUs, and you get access to the most capable models. But hosted APIs aren’t always the right tool. Offline-capable tools, privacy-sensitive workflows, development environments without API costs, and tight latency requirements can all push you toward running models locally.

Ollama is the most practical way to do that on a Mac or Linux machine today. It handles model downloads, GGUF format management, and Metal and CUDA acceleration, and it exposes a native REST API alongside an OpenAI-compatible endpoint under /v1.

Installing Ollama and Running Your First Model

# macOS or Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download the installer from ollama.com

Once installed, running a model is one command:

ollama run llama3.1

On first run, Ollama downloads the model (llama3.1:8b is about 4.7 GB). After that it’s cached locally. You get an interactive chat prompt immediately.

For scripting or API access, Ollama runs a server at localhost:11434:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain CORS in one paragraph for a junior developer.",
  "stream": false
}'
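
If "stream" is left at its default of true, the same endpoint returns newline-delimited JSON instead of a single object. A minimal Python sketch of reassembling such a stream, assuming each line is a JSON object carrying a "response" fragment and that the final object has "done" set to true. The helper name collect_stream and the sample lines are illustrative, not captured from a real server:

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the `response` fragments from an NDJSON stream into one string."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of generation
    return "".join(parts)


# Hand-written sample lines standing in for a real streamed response.
sample = [
    '{"model": "llama3.1", "response": "CORS ", "done": false}',
    '{"model": "llama3.1", "response": "is a browser policy.", "done": false}',
    '{"model": "llama3.1", "response": "", "done": true}',
]
print(collect_stream(sample))
```

In a real client the lines would come from iterating over the HTTP response body line by line.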

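Ollama’s native /api/chat endpoint uses its own request format, but the server also exposes an OpenAI-compatible API under /v1 (for example /v1/chat/completions). That means any code that uses the OpenAI SDK can talk to Ollama by changing the base URL and API key: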

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // any string works
});

const response = await client.chat.completions.create({
  model: "llama3.1",
  messages: [
    { role: "user", content: "Write a SQL query to find duplicate emails." }
  ],
});

This compatibility means you can switch between Ollama locally and a cloud provider in production by changing configuration (the base URL, API key, and model name) rather than code.
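
One way to wire up that switch is to resolve the client settings from the environment. A minimal Python sketch, assuming hypothetical variable names (LLM_BASE_URL, LLM_API_KEY, LLM_MODEL) that are not defined by Ollama or any SDK; the defaults point at a local Ollama server and the cloud values are placeholders:

```python
import os
from typing import Mapping


def resolve_llm_config(env: Mapping[str, str] = os.environ) -> dict:
    """Pick OpenAI-client settings from environment variables.

    LLM_BASE_URL, LLM_API_KEY and LLM_MODEL are hypothetical names chosen
    for this sketch; the defaults point at a local Ollama server.
    """
    return {
        "base_url": env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": env.get("LLM_API_KEY", "ollama"),  # any string works locally
        "model": env.get("LLM_MODEL", "llama3.1"),
    }


local = resolve_llm_config({})
cloud = resolve_llm_config({
    "LLM_BASE_URL": "https://api.example.com/v1",  # placeholder URL
    "LLM_API_KEY": "sk-placeholder",               # placeholder key
    "LLM_MODEL": "hosted-model-name",              # placeholder model id
})
print(local["base_url"])
print(cloud["base_url"])
```

The model name usually differs between local and hosted providers, so in practice it tends to travel with the base URL.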

Which Models Are Worth Using

Ollama’s model library covers most of the major open-weight models. The ones worth knowing:

Model	Size	Best for
llama3.1:8b	4.7 GB	General chat, instruction following
llama3.1:70b	40 GB	Higher-quality responses, needs 48+ GB RAM
mistral:7b	4.1 GB	Fast, good at reasoning and code
phi4:14b	9.1 GB	Strong reasoning, efficient for its size
qwen2.5-coder:7b	4.7 GB	Code completion, code explanation
qwen2.5-coder:32b	19 GB	Near-GPT-4 code quality on M3 Max / workstation GPU
deepseek-r1:8b	4.7 GB	Reasoning with visible chain-of-thought
gemma2:9b	5.4 GB	Multilingual, good for European languages
nomic-embed-text	274 MB	Text embeddings for local RAG

For most development tasks on a MacBook Pro M3 or M4, llama3.1:8b or mistral:7b gives usable results and runs fast enough for interactive use (60-80 tokens per second). For serious coding work, qwen2.5-coder:32b on an M3 Max or M4 Max is meaningfully better.

The 70B and larger models are only practical if you have 64+ GB of unified memory (for example M2/M3 Ultra chips) or a workstation GPU setup with comparable VRAM. On smaller machines they run, but slowly.

Hardware and What to Expect

Ollama uses:

Apple Silicon: Metal GPU acceleration, full unified memory access. This is the best consumer option for local models. A Mac with 32 GB memory runs 7-8B models very comfortably.
NVIDIA GPUs: CUDA. RTX 4090 (24 GB VRAM) handles 13B models fully in GPU memory.
AMD GPUs: ROCm support on Linux and Windows, for supported cards.
CPU-only: Works but slow. Usable for occasional queries, not interactive chat.

Token generation speed scales roughly linearly with GPU memory bandwidth. Fitting the model entirely in GPU memory (or unified memory on Apple Silicon) is the single biggest factor.
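
A back-of-envelope way to see the bandwidth effect: generating each token reads roughly all of the model weights from memory once, so bandwidth divided by model size gives a ceiling on tokens per second. A rough Python sketch under that assumption, with the model fully resident in GPU or unified memory. The 400 GB/s figure is an illustrative number, not a measurement of any particular machine:

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# 4.7 GB and 40 GB are the llama3.1 sizes from the model table;
# 400 GB/s is an illustrative bandwidth.
ceiling_8b = max_tokens_per_second(4.7, 400)
ceiling_70b = max_tokens_per_second(40, 400)
print(f"8B ceiling:  {ceiling_8b:.0f} tokens/s")
print(f"70B ceiling: {ceiling_70b:.0f} tokens/s")
```

Real throughput lands below these ceilings, but the ratio between the two models is a reasonable first guess at how much slower the larger one will feel.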

The Modelfile: Custom System Prompts and Parameters

Ollama lets you create derived models from a Modelfile. This is useful for pre-loading a system prompt or adjusting generation parameters:

# Modelfile
FROM llama3.1

SYSTEM """
You are a code reviewer for a TypeScript codebase. 
Focus on: type safety, error handling, and performance.
Be concise. Point to specific lines.
Never suggest refactors that aren't asked for.
"""

PARAMETER temperature 0.3
PARAMETER top_p 0.9

ollama create code-reviewer -f Modelfile
ollama run code-reviewer

The resulting model appears in your local model list and can be used through the API like any other model. Teams can commit Modelfiles to their repo and distribute them to engineers who all run the same configured model.
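
The same system prompt and sampling parameters can also be sent per request instead of being baked into a derived model. A minimal Python sketch of such a request body for /api/generate, assuming its "system" and "options" fields; the helper name build_review_request and the prompt wording are hypothetical:

```python
import json

SYSTEM_PROMPT = (
    "You are a code reviewer for a TypeScript codebase. "
    "Focus on: type safety, error handling, and performance. "
    "Be concise. Point to specific lines. "
    "Never suggest refactors that aren't asked for."
)


def build_review_request(code: str, model: str = "llama3.1") -> dict:
    """Build a JSON body for POST /api/generate mirroring the Modelfile above."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": f"Review this code:\n\n{code}",
        "options": {"temperature": 0.3, "top_p": 0.9},
        "stream": False,
    }


body = build_review_request("const x: any = JSON.parse(input);")
print(json.dumps(body, indent=2))
```

Per-request options are handy for experimenting; a committed Modelfile is still the simpler way to keep a whole team on identical settings.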

Local RAG with Ollama

For building retrieval-augmented pipelines locally, the combination of Ollama (for the language model) and a local embedding model gives you a fully offline stack:

ollama pull nomic-embed-text

import ollama
import chromadb

# Generate an embedding locally for a single piece of text
def embed(text):
    response = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return response["embedding"]

documents = [
    "Ollama supports GGUF format models.",
    "Models are cached in ~/.ollama/models/",
]

# Store in Chroma (also local, in-memory)
client = chromadb.Client()
collection = client.create_collection("docs")

# Add documents with embeddings computed by nomic-embed-text
collection.add(
    documents=documents,
    ids=["doc1", "doc2"],
    embeddings=[embed(doc) for doc in documents],
)

# Query: embed the question with the same model so the vector dimensions match
question = "Where are Ollama models stored?"
results = collection.query(query_embeddings=[embed(question)], n_results=2)
context = "\n".join(results["documents"][0])

# Answer
response = ollama.generate(
    model="llama3.1",
    prompt=f"Context: {context}\n\nQuestion: {question}",
)
print(response["response"])

This approach works for developer tools, internal knowledge bases, or client projects where the documents can’t leave the premises.
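
Under the hood, the query step is a nearest-neighbour search over embedding vectors. A toy Python sketch of that ranking using hand-made 3-dimensional vectors. Real nomic-embed-text vectors are much longer, and Chroma’s actual index and default distance metric may differ from the plain cosine similarity used here:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        docs,
        key=lambda doc_id: cosine_similarity(query, docs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors for illustration only; they are not real embeddings.
docs = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.8, 0.3],
}
query = [0.2, 0.9, 0.2]
print(top_k(query, docs))
```

A vector store adds persistence and an index that avoids scanning every document, but the ranking idea is the same.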

When Local Makes Sense

The cases where local inference is the right call:

Development iteration with no API costs. Running a 7B model locally during development is free after the hardware cost. If you’re iterating quickly on prompts, you also never run into rate limits or billing caps.

Privacy requirements. Some clients won’t let you send their data to external APIs: legal documents, medical records, financial data. Ollama gives you a way to serve those clients without self-hosting a GPU cluster.

Offline or air-gapped environments. Field tools that work without internet. Embedded systems. Regulated environments where outbound network access is restricted.

Latency that API round-trips can’t meet. A local 7B model at 60 tokens/second with 10ms time-to-first-token is faster than an API call that hits a cold container across the Atlantic.

When Local Doesn’t Make Sense

Local inference has real limits:

You won’t match GPT-4o or Claude Opus quality with a 7-8B model. For tasks that need frontier model capability, use the API.
Models don’t update automatically. The “Llama 3.1” you ran six months ago is the same model unless you explicitly pull a new version.
Managing model versions across a team requires discipline. If three engineers are using different quantizations of the same model, results will vary.
Production serving at scale requires proper infrastructure. Ollama is a developer tool, not a production inference server. For serving many users, look at vLLM, TGI, or a managed provider.
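
One way to enforce that version discipline is to pin model digests in the repo and compare them with what each engineer has installed. A minimal Python sketch, assuming the JSON shape returned by Ollama’s /api/tags endpoint (a "models" list whose entries carry "name" and "digest"). The pinned values and the helper name find_mismatches are hypothetical:

```python
def find_mismatches(pinned: dict[str, str], tags_response: dict) -> dict[str, str]:
    """Return {model name: problem} for pinned models that are missing or differ."""
    installed = {m["name"]: m["digest"] for m in tags_response.get("models", [])}
    problems = {}
    for name, expected in pinned.items():
        actual = installed.get(name)
        if actual is None:
            problems[name] = "not installed"
        elif actual != expected:
            problems[name] = (
                f"digest {actual[:12]} does not match pinned {expected[:12]}"
            )
    return problems


# Placeholder digests, not real model hashes.
pinned = {"llama3.1:8b": "aaaa1111", "qwen2.5-coder:7b": "bbbb2222"}
tags_response = {
    "models": [
        {"name": "llama3.1:8b", "digest": "aaaa1111"},
        {"name": "qwen2.5-coder:7b", "digest": "cccc3333"},
    ]
}
print(find_mismatches(pinned, tags_response))
```

Pinning explicit tags such as llama3.1:8b rather than a bare model name also reduces the chance that two engineers end up on different variants.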

The pattern that works: use Ollama locally for development and experimentation, and only make the call about local vs cloud for production when you have data on what the workload actually needs.

Multimodal with Llava

Ollama supports vision models through the llava family:

ollama run llava

# Describe an image from the CLI by including its path in the prompt
ollama run llava "What's in this image? ./screenshot.png"

import ollama
import base64

with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "What UI elements are visible in this screenshot?",
        "images": [image_data]
    }]
)
print(response["message"]["content"])

This covers the common case of screenshot analysis, UI testing, or document image extraction without sending screenshots to a cloud provider.

The bottom line on Ollama: it’s the easiest path to local inference on developer hardware, the OpenAI-compatible API means minimal code changes when switching between local and cloud, and it handles enough of the tooling (model management, acceleration, serving) that you don’t need to build your own. For any project where you want a local LLM option, starting with Ollama is the right call.
