Running LLMs Locally with Ollama: A Practical Guide for Developers
Ollama makes it possible to run Llama 3, Mistral, Phi-4, and dozens of other open-weight models on your laptop or server with a single command. Here's what actually works and when local inference makes sense.
Anurag Verma
Most teams reach for a hosted API when they need an LLM. That makes sense for production apps: you pay per token, you don’t manage GPUs, and you get access to the most capable models. But hosted APIs aren’t always the right tool. Offline-capable tools, privacy-sensitive workflows, development environments without API costs, and tight latency requirements can all push you toward running models locally.
Ollama is the most practical way to do that on a Mac, Linux, or Windows machine today. It handles model downloads, GGUF format management, and Metal and CUDA acceleration, and it exposes both a native REST API and an OpenAI-compatible endpoint.
Installing Ollama and Running Your First Model
# macOS or Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com
Once installed, running a model is one command:
ollama run llama3.1
On first run, Ollama downloads the model (llama3.1:8b is about 4.7 GB). After that it’s cached locally. You get an interactive chat prompt immediately.
For scripting or API access, Ollama runs a server at localhost:11434:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain CORS in one paragraph for a junior developer.",
  "stream": false
}'
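The same request can be sketched in Python with only the standard library. This is a minimal illustration, assuming the server is on the default port shown above; the helper names `build_generate_request` and `generate` are made up for this example and are not part of any Ollama package.

```python
import json
import urllib.request

# Default local endpoint from the curl example above.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text.

    Needs a running Ollama server with the model already pulled.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Usage, with the server running:
#   print(generate("llama3.1", "Explain CORS in one paragraph for a junior developer."))
```

With `"stream": false` the server returns one JSON object, so the text can be read from its `response` field in a single step.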
Ollama also serves an OpenAI-compatible API under /v1 (for example /v1/chat/completions), separate from its native /api/chat endpoint. That means any code that uses the OpenAI SDK can talk to Ollama with a two-line change:
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // any string works
});

const response = await client.chat.completions.create({
  model: "llama3.1",
  messages: [
    { role: "user", content: "Write a SQL query to find duplicate emails." }
  ],
});
This compatibility means you can switch between Ollama locally and a cloud provider in production through configuration rather than code: typically the base URL, along with the API key and model name.
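One way to keep that switch out of the code is to read the endpoint settings from the environment and default to the local server. The sketch below is illustrative only: the variable names `LLM_BASE_URL`, `LLM_API_KEY`, and `LLM_MODEL` are hypothetical, not something Ollama or the OpenAI SDK defines.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def resolve_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read endpoint settings from the environment.

    Falls back to a local Ollama server when nothing is set.
    The variable names are hypothetical, chosen for this example.
    """
    return LLMConfig(
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        api_key=env.get("LLM_API_KEY", "ollama"),  # Ollama accepts any string
        model=env.get("LLM_MODEL", "llama3.1"),
    )
```

The resulting values can be passed straight to an OpenAI-compatible client, so a developer laptop and a production deployment differ only in their environment.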
Which Models Are Worth Using
Ollama’s model library covers most of the major open-weight models. The ones worth knowing:
| Model | Size | Best for |
|---|---|---|
| llama3.1:8b | 4.7 GB | General chat, instruction following |
| llama3.1:70b | 40 GB | Higher-quality responses, needs 48+ GB RAM |
| mistral:7b | 4.1 GB | Fast, good at reasoning and code |
| phi4:14b | 9.1 GB | Strong reasoning, efficient for its size |
| qwen2.5-coder:7b | 4.7 GB | Code completion, code explanation |
| qwen2.5-coder:32b | 19 GB | Near-GPT-4 code quality on M3 Max / workstation GPU |
| deepseek-r1:8b | 4.7 GB | Reasoning with visible chain-of-thought |
| gemma2:9b | 5.4 GB | Multilingual, good for European languages |
| nomic-embed-text | 274 MB | Text embeddings for local RAG |
For most development tasks on an M3 or M4 MacBook Pro, llama3.1:8b or mistral:7b gives usable results and runs fast enough not to be annoying (roughly 60-80 tokens per second, depending on the chip). For serious coding work, qwen2.5-coder:32b on an M3 Max or M4 Max is meaningfully better.
The 70B and larger models are only practical if you have 64+ GB of unified memory (for example an M2 or M3 Ultra) or a workstation GPU setup with comparable VRAM. On less hardware they run, but slowly.
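A quick way to sanity-check whether a model is worth pulling is to compare its download size against available memory with some headroom. This is a rough sketch, not a rule from Ollama: the 1.2 headroom factor is an assumption standing in for the KV cache and runtime overhead, and real requirements grow with context length.

```python
def fits_in_memory(model_size_gb: float, memory_gb: float, headroom: float = 1.2) -> bool:
    """Return True if the model weights, scaled by a headroom factor, fit in memory.

    The headroom factor is an illustrative assumption, not a measured value.
    """
    return model_size_gb * headroom <= memory_gb


# Sizes taken from the table above; memory values are example machines.
for name, size_gb in [("llama3.1:8b", 4.7), ("qwen2.5-coder:32b", 19), ("llama3.1:70b", 40)]:
    for memory_gb in (16, 32, 64):
        verdict = "fits" if fits_in_memory(size_gb, memory_gb) else "too large"
        print(f"{name} on {memory_gb} GB: {verdict}")
```

On Apple Silicon only part of unified memory is available to the GPU, so treat a borderline "fits" as a warning rather than a green light.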
Hardware and What to Expect
Ollama uses:
- Apple Silicon: Metal GPU acceleration, full unified memory access. This is the best consumer option for local models. A Mac with 32 GB memory runs 7-8B models very comfortably.
- NVIDIA GPUs: CUDA. RTX 4090 (24 GB VRAM) handles 13B models fully in GPU memory.
- AMD GPUs: ROCm support on Linux and Windows, for supported cards.
- CPU-only: Works but slow. Usable for occasional queries, not interactive chat.
Token generation speed scales roughly linearly with GPU memory bandwidth. Fitting the model entirely in GPU memory (or unified memory on Apple Silicon) is the single biggest factor.
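The bandwidth relationship can be turned into a back-of-the-envelope estimate. The sketch below assumes that generating each token reads every weight once, which gives an upper bound rather than a prediction; the bandwidth figure you pass in has to come from your own hardware's spec sheet.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model.

    Assumes each generated token requires reading all weights once,
    so throughput cannot exceed bandwidth divided by model size.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Example with a placeholder bandwidth of 400 GB/s and the 4.7 GB model from the table.
print(round(max_tokens_per_second(400, 4.7), 1))
```

Measured speeds land below this ceiling, and they drop sharply once part of the model spills out of GPU memory, which is why fitting the model matters more than anything else.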
The Modelfile: Custom System Prompts and Parameters
Ollama lets you create derived models from a Modelfile. This is useful for pre-loading a system prompt or adjusting generation parameters:
# Modelfile
FROM llama3.1
SYSTEM """
You are a code reviewer for a TypeScript codebase.
Focus on: type safety, error handling, and performance.
Be concise. Point to specific lines.
Never suggest refactors that aren't asked for.
"""
PARAMETER temperature 0.3
PARAMETER top_p 0.9
ollama create code-reviewer -f Modelfile
ollama run code-reviewer
The resulting model appears in your local model list and can be used through the API like any other model. Teams can commit Modelfiles to their repo and distribute them to engineers who all run the same configured model.
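Calling the derived model through the native chat endpoint might look like the sketch below. It assumes the `code-reviewer` model created above and the default local port; the `review` and `build_chat_payload` helpers are names invented for this example. No system message is sent because the Modelfile already supplies it.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, user_message: str) -> dict:
    """Build a non-streaming request body for the native /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }


def review(code_snippet: str, model: str = "code-reviewer", timeout: float = 120.0) -> str:
    """Ask the derived model to review a snippet (requires a running server)."""
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(build_chat_payload(model, code_snippet)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["message"]["content"]

# Usage, with the server running:
#   print(review("function add(a, b) { return a + b }"))
```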
Local RAG with Ollama
For building retrieval-augmented pipelines locally, the combination of Ollama (for the language model) and a local embedding model gives you a fully offline stack:
ollama pull nomic-embed-text
import ollama
import chromadb

# Generate an embedding locally for a single piece of text
def embed(text):
    response = ollama.embeddings(model="nomic-embed-text", prompt=text)
    return response["embedding"]

# Store in Chroma (also local, in memory)
client = chromadb.Client()
collection = client.create_collection("docs")

# Add documents with embeddings computed by Ollama
docs = ["Ollama supports GGUF format models.",
        "Models are cached in ~/.ollama/models/"]
collection.add(
    documents=docs,
    ids=["doc1", "doc2"],
    embeddings=[embed(d) for d in docs],
)

# Query with the same embedding model.
# Passing query_texts instead would use Chroma's default embedder, which does not match.
question = "Where are Ollama models stored?"
results = collection.query(query_embeddings=[embed(question)], n_results=2)
context = "\n".join(results["documents"][0])

# Answer
response = ollama.generate(
    model="llama3.1",
    prompt=f"Context: {context}\n\nQuestion: {question}"
)
print(response["response"])
This approach works for developer tools, internal knowledge bases, or client projects where the documents can’t leave the premises.
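To make the retrieval step less of a black box, here is a small standard-library sketch of what a vector store does conceptually: rank stored vectors by similarity to the query vector. Cosine similarity is used for illustration and is not necessarily the metric your Chroma collection is configured with; the toy two-dimensional vectors stand in for real embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors in place of real embeddings.
print(top_k([1.0, 0.0], {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0], "doc3": [1.0, 1.0]}))
```

A brute-force ranking like this is fine for a few thousand chunks; a vector store earns its place once you need indexing, persistence, and metadata filters.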
When Local Makes Sense
The cases where local inference is the right call:
Development iteration with no API costs. Running a 7B model locally during development is free once you own the hardware. If you’re iterating quickly on prompts, you also never run into rate limits or billing caps.
Privacy requirements. Some clients won’t let you send their data to external APIs. Legal documents, medical records, financial data. Ollama gives you a way to serve those clients without self-hosting a GPU cluster.
Offline or air-gapped environments. Field tools that work without internet. Embedded systems. Regulated environments where outbound network access is restricted.
Latency that API round-trips can’t meet. A local 7B model that is already loaded, generating around 60 tokens/second and starting to respond within tens of milliseconds on a short prompt, is faster than an API call that hits a cold container across the Atlantic.
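Rather than trusting a quoted number, you can measure your own setup. A non-streaming /api/generate response includes timing fields reported in nanoseconds; the sketch below turns them into readable figures. The function name `generation_stats` and the sample values are invented for this example.

```python
def generation_stats(response: dict) -> dict:
    """Derive throughput and timing from an /api/generate response.

    Durations in the response are in nanoseconds.
    """
    eval_seconds = response["eval_duration"] / 1e9
    return {
        # Tokens generated divided by the time spent generating them.
        "tokens_per_second": response["eval_count"] / eval_seconds,
        # Time spent processing the prompt before the first token.
        "prompt_eval_seconds": response.get("prompt_eval_duration", 0) / 1e9,
        # Time spent loading the model; near zero when it is already in memory.
        "load_seconds": response.get("load_duration", 0) / 1e9,
    }


# Made-up sample values, shaped like the timing fields of a real response.
sample = {
    "eval_count": 120,
    "eval_duration": 2_000_000_000,
    "prompt_eval_duration": 50_000_000,
    "load_duration": 0,
}
print(generation_stats(sample))
```

Comparing `load_seconds` on a first call and a second call shows how much of the perceived latency is model loading rather than generation.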
When Local Doesn’t Make Sense
Local inference has real limits:
- You won’t match GPT-4o or Claude Opus quality with a 7-8B model. For tasks that need frontier model capability, use the API.
- Models don’t update automatically. The “Llama 3.1” you ran six months ago is the same model unless you explicitly pull a new version.
- Managing model versions across a team requires discipline. If three engineers are using different quantizations of the same model, results will vary.
- Production serving at scale requires proper infrastructure. Ollama is a developer tool, not a production inference server. For serving many users, look at vLLM, TGI, or a managed provider.
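For the version-drift problem, one lightweight option is to pin model digests in the repo and compare them against what each machine has installed. The sketch below works on the JSON returned by Ollama's /api/tags endpoint, which lists local models with their digests; the pinned mapping and the helper name are hypothetical, and the digests are shortened placeholders.

```python
from typing import Dict, Optional


def find_digest_mismatches(expected: Dict[str, str], tags_response: dict) -> Dict[str, Optional[str]]:
    """Compare pinned digests against an /api/tags response.

    Returns the installed digest for every model that differs from the pin,
    or None when the model is not installed at all.
    """
    installed = {m["name"]: m["digest"] for m in tags_response.get("models", [])}
    return {
        name: installed.get(name)
        for name, digest in expected.items()
        if installed.get(name) != digest
    }


# Placeholder data: a pinned mapping a team might commit, and a trimmed response.
pinned = {"llama3.1:latest": "aaa111", "mistral:7b": "bbb222"}
local_tags = {"models": [{"name": "llama3.1:latest", "digest": "aaa111"},
                         {"name": "mistral:7b", "digest": "ccc333"}]}
print(find_digest_mismatches(pinned, local_tags))
```

Running a check like this in a setup script catches the "three engineers, three quantizations" situation before it shows up as inconsistent results.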
The pattern that works: use Ollama locally for development and experimentation, and only make the call about local vs cloud for production when you have data on what the workload actually needs.
Multimodal with LLaVA
Ollama supports vision models through the llava family:
ollama run llava
# Describe an image from the CLI by including its path in the prompt
ollama run llava "What's in this image? ./screenshot.png"
import ollama
import base64

# Read the image and base64-encode it for the API
with open("screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "What UI elements are visible in this screenshot?",
        "images": [image_data]
    }]
)
print(response["message"]["content"])
This covers the common case of screenshot analysis, UI testing, or document image extraction without sending screenshots to a cloud provider.
The bottom line on Ollama: it’s the easiest path to local inference on developer hardware, the OpenAI-compatible API means minimal code changes when switching between local and cloud, and it handles enough of the tooling (model management, acceleration, serving) that you don’t need to build your own. For any project where you want a local LLM option, starting with Ollama is the right call.