A client came to us last quarter wanting to fine-tune GPT-4 on their 50-page employee handbook. They had already gotten quotes from two vendors: $12,000-18,000 for data preparation, training, and deployment. Timeline: 3-4 weeks. They were ready to sign.

We saved them $15,000 and 3 weeks. We took their handbook, chunked it intelligently, built a simple RAG system with Supabase pgvector, and wrote a well-crafted system prompt. Total cost: $800 in development time, $12/month in infrastructure. The system answers employee questions with 96% accuracy. Better than the fine-tuned approach would have delivered, at a fraction of the cost.

But here is the thing -- I am not telling you this to say "RAG is always better" or "fine-tuning is a scam." Sometimes fine-tuning is exactly the right approach. Sometimes plain prompting is all you need. The problem is that most teams have no framework for making this decision. They either default to whatever the latest blog post recommended, or they let a vendor talk them into the most expensive option. This post gives you a clear, data-backed decision framework so you never have to guess.

The Three Approaches, Explained Simply

Before we get into the framework, let me make sure we are on the same page about what each approach actually is. I am going to explain these as if you are a smart business person who understands technology but is not neck-deep in ML research.

Approach 1: Prompt Engineering

What it is: You take a pre-trained model (like Claude or GPT-4) and give it carefully written instructions in the prompt. No training, no external data systems. Just a well-structured prompt with clear instructions, examples, and constraints.

Analogy: Hiring an expert consultant and giving them a detailed brief before each conversation. They bring their existing knowledge; you just guide how they apply it.

# Prompt engineering example
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

system_prompt = """You are a customer support agent for
TechCorp. Follow these rules exactly:

1. Always greet the customer by name if provided
2. For billing questions, reference our pricing tiers:
   - Starter: $29/mo (5 users, 10GB storage)
   - Professional: $99/mo (25 users, 100GB storage)
   - Enterprise: $299/mo (unlimited users, 1TB storage)
3. For technical issues, collect: error message,
   browser/device, steps to reproduce
4. Never promise refunds -- escalate to billing team
5. Respond in 2-3 sentences maximum
6. If unsure, say "Let me connect you with a specialist"

TONE: Professional but warm. Use the customer's name.
"""

user_query = "Hi, I'm Dana. What does the Professional plan include?"

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=system_prompt,
    messages=[{"role": "user", "content": user_query}]
)

When it works: The model already knows the subject matter, and you just need to control format, style, or behavior.

When it breaks: The model does not know your specific data (product details, company policies, proprietary information). You can only fit so much in a system prompt before it gets unwieldy.

Approach 2: Retrieval-Augmented Generation (RAG)

What it is: You store your documents in a searchable database (usually a vector database). When a user asks a question, you search for relevant documents, inject them into the prompt, and let the model answer based on that specific context.

Analogy: Giving the expert consultant a filing cabinet and telling them "look up the answer in these files before responding." They still use their expertise to interpret and synthesize, but the facts come from your documents.

# RAG example (simplified)
# Assumes `client` is an initialized anthropic.Anthropic() instance and
# `vector_db` is your vector store client (e.g. pgvector, Pinecone)
def answer_with_rag(user_query: str) -> str:
    # Step 1: Find relevant documents
    relevant_docs = vector_db.search(
        query=user_query,
        top_k=5
    )

    # Step 2: Build prompt with retrieved context
    context = "\n\n".join([doc.text for doc in relevant_docs])

    prompt = f"""Answer the user's question using ONLY
the information provided below. If the answer is not in
the provided context, say "I don't have information
about that."

CONTEXT:
{context}

USER QUESTION: {user_query}"""

    # Step 3: Generate answer
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.content[0].text

When it works: You have a large body of proprietary documents that the model needs to reference. The information changes frequently. You need citations and traceability.

When it breaks: The retrieval step fails (pulls irrelevant documents), or the information requires deep reasoning that does not map well to "find the right chunk."

Approach 3: Fine-Tuning

What it is: You take a pre-trained model and train it further on your specific data. This modifies the model's weights, essentially teaching it new knowledge or behaviors that become part of the model itself.

Analogy: Sending the expert consultant to a specialized training program. When they come back, they have internalized new knowledge and behaviors. You do not need to brief them every time -- they just know.

# Fine-tuning example (OpenAI format)
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: Prepare training data
training_data = [
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a medical coding assistant."
            },
            {
                "role": "user",
                "content": "Patient presents with acute "
                           "bronchitis, prescribed amoxicillin"
            },
            {
                "role": "assistant",
                "content": "ICD-10: J20.9 (Acute bronchitis, "
                           "unspecified)\nCPT: 99213 "
                           "(Office visit, established patient, "
                           "low complexity)"
            }
        ]
    },
    # ... hundreds more examples
]

# Write the examples out as JSONL (one JSON object per line)
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")

# Step 2: Upload and train
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18"
)

# Step 3: Use the fine-tuned model (the exact model name comes from
# the completed job, e.g. via client.fine_tuning.jobs.retrieve(job.id))
new_query = "Patient seen for type 2 diabetes follow-up"

response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::job-id",
    messages=[{"role": "user", "content": new_query}]
)

When it works: You need the model to learn a specific style, format, or domain-specific reasoning pattern. You have high-quality training data. The task is repeated frequently enough to justify the upfront cost.

When it breaks: You do not have enough quality training data (need 100+ examples minimum, 1000+ for good results). The information changes frequently (you would need to retrain). The task is simple enough that prompting handles it.

The Decision Framework

Here is the framework we use at CODERCOPS when a client comes to us with an AI project. We have refined this over 20+ client engagements.

The Decision Tree

Follow this from top to bottom. The first "Yes" that matches your situation tells you where to start.

START: What does your AI system need to do?

1. Does it need access to your proprietary/private data
   to answer questions?
   |
   ├── YES: How much data?
   |   ├── Under 50 pages → TRY PROMPTING FIRST
   |   |   (Stuff it in the system prompt)
   |   ├── 50-500 pages → RAG
   |   └── 500+ pages → RAG (definitely)
   |
   └── NO: Continue to question 2

2. Does it need to follow a very specific output format
   or style consistently?
   |
   ├── YES: Can you describe the format clearly in
   |   2-3 paragraphs with examples?
   |   ├── YES → TRY PROMPTING FIRST (with few-shot
   |   |   examples)
   |   └── NO → FINE-TUNING (the style is too nuanced
   |       to describe, easier to show by example)
   |
   └── NO: Continue to question 3

3. Does it need domain-specific reasoning that general
   models get wrong?
   |
   ├── YES: Do you have 500+ examples of correct
   |   input/output pairs?
   |   ├── YES → FINE-TUNING
   |   └── NO → RAG + PROMPTING (use domain docs as
   |       context + detailed instructions)
   |
   └── NO: Continue to question 4

4. Is the task well-defined with clear instructions?
   |
   ├── YES → PROMPTING (just write a good prompt)
   └── NO → Start with prompting to figure out what
       works, then add RAG or fine-tuning as needed
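If it helps to see the tree in code, here is the same logic as a plain Python function. The function name, parameter names, and return strings are ours; the thresholds are a direct transcription of the tree above. Treat the output as a starting point, not a verdict.

```python
def recommend_approach(
    needs_private_data: bool,
    data_pages: int = 0,
    needs_strict_format: bool = False,
    format_describable: bool = True,
    needs_domain_reasoning: bool = False,
    labeled_examples: int = 0,
) -> str:
    """Encode the decision tree above; returns a starting-point recommendation."""
    # Q1: does it need proprietary/private data?
    if needs_private_data:
        if data_pages < 50:
            return "prompting (stuff it in the system prompt)"
        return "rag"
    # Q2: strict output format or style?
    if needs_strict_format:
        if format_describable:
            return "prompting (with few-shot examples)"
        return "fine-tuning"
    # Q3: domain-specific reasoning general models get wrong?
    if needs_domain_reasoning:
        if labeled_examples >= 500:
            return "fine-tuning"
        return "rag + prompting"
    # Q4: default -- start simple
    return "prompting"

print(recommend_approach(needs_private_data=True, data_pages=200))  # rag
```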

The "Just Try Prompting First" Rule

Here is a rule that will save you time and money: always try prompting first. Even if you think you need RAG or fine-tuning, spend 2-4 hours crafting a really good prompt with examples and testing it. You will be surprised how often this is enough.

In our experience at CODERCOPS, about 40% of clients who think they need fine-tuning actually just need better prompting. Another 35% need RAG. Only about 25% genuinely need fine-tuning.

The Cost Comparison: Real Numbers

This is the section most articles skip or fill with vague ranges. Here are real numbers based on what we have actually paid across client projects in 2025-2026.

Upfront Costs

| Cost Category | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Data preparation | $0 | $500-3,000 | $2,000-10,000 |
| Infrastructure setup | $0 | $200-500 | $0 (cloud-based) |
| Development time | 4-16 hours ($200-800) | 40-80 hours ($2,000-4,000) | 20-60 hours ($1,000-3,000) |
| Training costs | $0 | $0 | $50-2,000 (depends on model/data size) |
| Testing and evaluation | 2-4 hours ($100-200) | 8-16 hours ($400-800) | 16-40 hours ($800-2,000) |
| Total upfront | $200-1,000 | $3,000-8,000 | $4,000-17,000 |

Monthly Running Costs (at 10,000 queries/day)

| Cost Category | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| LLM API costs | $300-900/mo | $200-600/mo | $150-500/mo |
| Vector DB hosting | $0 | $50-200/mo | $0 |
| Embedding API | $0 | $30-100/mo | $0 |
| Reranking API | $0 | $20-50/mo | $0 |
| Infrastructure | $0 | $100-300/mo | $0-100/mo |
| Maintenance | 2 hrs/mo ($100) | 8 hrs/mo ($400) | 4 hrs/mo ($200) |
| Total monthly | $400-1,000/mo | $800-1,650/mo | $350-800/mo |

Key insight: Fine-tuning has the highest upfront cost but the lowest monthly running cost. Prompting has the lowest upfront cost but the highest per-query cost (because prompts tend to be longer). RAG falls in the middle but requires ongoing infrastructure maintenance.
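A quick break-even sketch makes the trade-off concrete. The example numbers below are simply the midpoints of the ranges in the tables above, not a measurement; plug in your own figures.

```python
def breakeven_months(upfront_a: float, monthly_a: float,
                     upfront_b: float, monthly_b: float) -> float:
    """Months until option A's total cost drops below option B's.

    A is the higher-upfront / lower-monthly option (e.g. fine-tuning);
    B is the cheaper-to-start option (e.g. prompting).
    Returns float('inf') if A never catches up.
    """
    monthly_savings = monthly_b - monthly_a
    extra_upfront = upfront_a - upfront_b
    if monthly_savings <= 0:
        return float("inf")
    return extra_upfront / monthly_savings

# Midpoints of the ranges above: fine-tuning ($10,500 up, $575/mo)
# vs prompting ($600 up, $700/mo)
print(round(breakeven_months(10_500, 575, 600, 700), 1))  # 79.2
```

At those midpoints, fine-tuning takes over six years to pay for itself on running costs alone, which is one more reason to justify it on accuracy rather than cost.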

Cost Per Query Breakdown

| Approach | Avg Input Tokens | Avg Output Tokens | Cost Per Query |
|---|---|---|---|
| Prompting (Claude Sonnet) | 2,000 | 500 | $0.0085 |
| Prompting (Claude Haiku) | 2,000 | 500 | $0.0005 |
| RAG (Claude Sonnet, 5 chunks) | 4,000 | 500 | $0.0145 |
| RAG (Claude Haiku, 5 chunks) | 4,000 | 500 | $0.0009 |
| Fine-tuned GPT-4o-mini | 500 | 500 | $0.0006 |
| Fine-tuned GPT-4o | 500 | 500 | $0.0075 |

Notice: Fine-tuned models use fewer input tokens because the knowledge is baked into the model weights -- you do not need lengthy system prompts or retrieved context. This makes them cheaper per query for high-volume use cases.
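If you want to reproduce these figures for your own traffic, the arithmetic is a one-liner. The $3/$15 per-million-token prices in the example are assumptions for illustration; substitute your provider's current rates.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one query, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Example with assumed prices of $3/M input and $15/M output:
print(round(cost_per_query(2_000, 500, 3.0, 15.0), 4))  # 0.0135
```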

Latency Comparison

Speed matters. A customer-facing chatbot needs to respond in 1-2 seconds. An internal analysis tool can take 10 seconds. Here is what to expect:

| Approach | Time to First Token | Total Response Time (500 tokens) |
|---|---|---|
| Prompting (Claude Sonnet) | 0.3-0.5s | 1.0-2.0s |
| Prompting (Claude Haiku) | 0.1-0.3s | 0.5-1.0s |
| RAG + Claude Sonnet | 0.8-1.5s | 1.5-3.5s |
| RAG + Claude Haiku | 0.5-1.0s | 1.0-2.5s |
| Fine-tuned GPT-4o-mini | 0.2-0.4s | 0.8-1.5s |

The RAG latency tax: RAG adds 0.5-1.5 seconds for the retrieval step (vector search, reranking). This is usually acceptable, but for real-time applications where every millisecond counts, it matters. You can reduce this with caching, pre-retrieval, and faster vector databases.
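Caching is the cheapest of those optimizations. A minimal sketch, assuming you already have some retrieval function to wrap; the stand-in retriever at the bottom exists only to demonstrate the cache hit.

```python
from typing import Callable, Optional

def cached_retriever(retrieve: Callable[[str], list],
                     cache: Optional[dict] = None) -> Callable[[str], list]:
    """Wrap a retrieval function with a simple exact-match cache.

    Repeated queries skip the vector search entirely. A production
    version would bound the cache size and expire stale entries.
    """
    if cache is None:
        cache = {}

    def wrapped(query: str) -> list:
        key = query.strip().lower()  # crude normalization
        if key not in cache:
            cache[key] = retrieve(key)
        return cache[key]

    return wrapped

# Stand-in retriever that records how often it actually runs:
calls = []
search = cached_retriever(lambda q: calls.append(q) or [f"doc for {q}"])
search("Reset my router")
search(" reset my router ")  # cache hit after normalization
print(len(calls))  # 1
```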

Fine-tuning latency advantage: Because fine-tuned models use shorter prompts, they are consistently faster. If latency is your primary constraint, fine-tuning wins.

Accuracy Comparison Across Use Cases

This is the comparison that matters most, and it varies dramatically by use case:

Use Case 1: Customer Support Q&A (Company-Specific)

| Approach | Accuracy | Notes |
|---|---|---|
| Prompting only | 45% | Model does not know company-specific info |
| RAG | 89% | Retrieves relevant policy docs |
| Fine-tuned | 72% | Learned patterns but struggles with rare queries |
| RAG + Fine-tuned | 93% | Best of both worlds |

Winner: RAG. Customer support requires access to specific, frequently updated company data. Fine-tuning alone cannot keep up with policy changes.

Use Case 2: Medical Coding (ICD-10 Classification)

| Approach | Accuracy | Notes |
|---|---|---|
| Prompting only | 61% | General knowledge of medical codes |
| RAG (with ICD-10 database) | 78% | Good retrieval but reasoning is weak |
| Fine-tuned (2000 examples) | 91% | Learned the classification patterns |
| RAG + Fine-tuned | 94% | Retrieval for rare codes + learned reasoning |

Winner: Fine-tuning. Medical coding requires pattern recognition across thousands of codes with nuanced rules. This is exactly what fine-tuning excels at. RAG alone retrieves code descriptions but struggles with the reasoning to select the right one.

Use Case 3: Content Generation (Brand Voice)

| Approach | Accuracy (human rating) | Notes |
|---|---|---|
| Prompting with style guide | 6.2/10 | Follows instructions but feels generic |
| RAG with brand content examples | 6.8/10 | Better but inconsistent |
| Fine-tuned on brand content | 8.5/10 | Consistently matches brand voice |
| Prompting with few-shot examples | 7.1/10 | Good enough for many use cases |

Winner: Fine-tuning. Brand voice is a nuanced, hard-to-describe quality that is much easier to teach by example than to describe in instructions. Fine-tuning on 500-1000 examples of on-brand content produces remarkably consistent voice matching.

Use Case 4: Document Summarization (Generic)

| Approach | Accuracy | Notes |
|---|---|---|
| Prompting only | 88% | Models are already great at this |
| RAG | N/A | Not applicable (input is the document itself) |
| Fine-tuned | 90% | Marginal improvement not worth the cost |

Winner: Prompting. Modern LLMs are excellent summarizers out of the box. Fine-tuning offers marginal improvement at significant cost. Just write a good prompt.

Use Case 5: Code Generation (Company Codebase)

| Approach | Accuracy | Notes |
|---|---|---|
| Prompting only | 55% | Does not know your codebase patterns |
| RAG (with codebase context) | 82% | Retrieves relevant code patterns |
| Fine-tuned on codebase | 75% | Learned patterns but codebase changes |
| RAG + CLAUDE.md prompting | 87% | Our actual approach at CODERCOPS |

Winner: RAG + Prompting. Code generation needs both access to the current codebase (RAG) and understanding of conventions (prompting via CLAUDE.md). Fine-tuning on code is risky because codebases change rapidly.

Hybrid Approaches: When One Is Not Enough

The real world is messy, and often the best solution combines approaches.

RAG + Prompting (Our Most Common Recommendation)

Combine retrieval with a well-crafted system prompt that includes output format instructions, behavioral guidelines, and a few examples. This covers 60-70% of use cases.

system_prompt = """You are a tax advisory assistant.

RULES:
- Only answer based on the provided tax documents
- Always cite the specific regulation or section number
- If information is from before 2025, note that it may
  be outdated
- For questions outside tax scope, decline politely
- Present numbers in tables when comparing scenarios

EXAMPLES OF GOOD RESPONSES:
[include 2-3 examples]
"""

# Retrieved context gets injected into the user message
user_message = f"""Based on the following tax regulations:

{retrieved_context}

{user_question}"""

RAG + Fine-Tuned Model

For high-accuracy applications, you can fine-tune a model on your domain AND give it retrieved context at inference time. This is expensive but delivers the best results.

When we recommend this: Healthcare, legal, financial -- any domain where accuracy is critical and the cost of errors is high.

Prompt Chaining + RAG

Instead of one big RAG query, break the task into steps:

Step 1: Classify the query intent (prompting only)
Step 2: Retrieve relevant documents (RAG)
Step 3: Generate answer with citations (RAG + prompting)
Step 4: Verify answer against sources (prompting only)

This is our standard architecture for production RAG systems at CODERCOPS. Each step can use a different model (Haiku for classification, Sonnet for generation) to optimize cost.
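Sketched in code, the chain is just four small functions wired together. The stubs in the usage example stand in for real LLM and vector DB calls; the step names and fallback message are ours.

```python
from typing import Callable

def run_chain(query: str,
              classify: Callable[[str], str],
              retrieve: Callable[[str, str], list],
              generate: Callable[[str, list], str],
              verify: Callable[[str, list], bool]) -> str:
    """Orchestrate the 4-step chain. Each step is injected, so each
    can be backed by a different model (e.g. Haiku for classify,
    Sonnet for generate) to optimize cost."""
    intent = classify(query)        # Step 1: prompting only
    docs = retrieve(query, intent)  # Step 2: RAG
    answer = generate(query, docs)  # Step 3: RAG + prompting
    if not verify(answer, docs):    # Step 4: prompting only
        return "I couldn't verify that answer against our sources."
    return answer

# Toy stubs to show the flow (real steps would call an LLM / vector DB):
result = run_chain(
    "How do I reset my password?",
    classify=lambda q: "account",
    retrieve=lambda q, intent: ["Doc: passwords reset via Settings > Security"],
    generate=lambda q, docs: "Reset it under Settings > Security.",
    verify=lambda ans, docs: "Security" in docs[0],
)
print(result)  # Reset it under Settings > Security.
```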

5 Real Client Scenarios

Let me walk through five actual client projects and explain what we recommended and why.

Scenario 1: HR Chatbot for Employee Policy Questions

Client: 500-employee SaaS company. 200 pages of HR policies, benefits documents, and procedure guides.

What they wanted: An internal chatbot that employees can ask about policies, benefits, time off, etc.

What they thought they needed: Fine-tuning GPT-4 on their HR documents.

What we recommended: RAG with Claude Haiku.

Why: The information is in documents that change quarterly. Fine-tuning would require retraining every time a policy changes. RAG retrieves the current version automatically. Haiku is fast enough for chatbot interactions and costs 90% less than fine-tuning a larger model.

Result: 94% accuracy, $45/month running cost, 2-week build. Compared to the fine-tuning quote of $15,000 upfront + retraining costs.

Scenario 2: Contract Clause Extraction for a Law Firm

Client: Law firm that reviews 50+ contracts per week. Needs to extract specific clauses (indemnification, termination, non-compete) and flag unusual terms.

What they wanted: An AI system that can read contracts and extract structured data.

What we recommended: Fine-tuned GPT-4o-mini + RAG for reference.

Why: Clause extraction requires understanding nuanced legal language and recognizing patterns that vary across contract styles. The firm provided 1,200 annotated contracts for training. RAG supplements with a reference database of standard clause templates for comparison.

Result: 92% extraction accuracy (vs. 71% with prompting alone). Upfront cost: $14,000. Monthly cost: $200. The fine-tuning was worth it here because the task is complex, the training data was available, and the firm processes enough contracts to justify the investment.

Scenario 3: Product Description Generator for E-Commerce

Client: Online retailer with 5,000 SKUs. Needs product descriptions in their brand voice.

What they wanted: An AI system to generate product descriptions that match their existing style.

What we recommended: Fine-tuning on 800 existing product descriptions.

Why: Brand voice is exactly the kind of nuanced, hard-to-describe pattern that fine-tuning handles best. The client had thousands of well-written descriptions to train on. Once trained, the model generates on-brand descriptions with a short product attributes prompt -- no retrieval needed, no long prompts needed.

Result: 8.7/10 brand voice match rating from the client's marketing team. $0.001 per description. Paid for itself in the first month by replacing the $5/description they were paying to a content agency.

Scenario 4: Technical Support Troubleshooting Bot

Client: IoT device manufacturer with complex troubleshooting procedures.

What they wanted: A bot that guides customers through troubleshooting, step by step.

What we recommended: RAG + prompt chaining.

Why: Troubleshooting is inherently a multi-step, branching process. The correct next step depends on the result of the previous step. We built a RAG system that retrieves the relevant troubleshooting tree, then a prompt chain that walks through it step-by-step with the customer.

Result: 87% of issues resolved without human escalation (up from 34% with their previous keyword-based chatbot). Monthly cost: $180. Fine-tuning would not have worked here because the troubleshooting procedures change with every firmware update.

Scenario 5: Internal Knowledge Search for Engineering Team

Client: 40-person engineering team with 12 years of internal wikis, runbooks, and post-mortems.

What they wanted: "Search that actually works" for their internal knowledge base.

What we recommended: Prompting only. Seriously.

Why: Their knowledge base was only 180 pages of critical content (the rest was outdated or redundant). We helped them curate it down to essentials, organized it clearly, and built a simple Claude-powered search that takes the user's question + the entire curated knowledge base as context. No vector database, no embeddings, no chunking.

Result: 91% accuracy. $0 infrastructure cost (just API calls). $60/month at their query volume. Built in 3 days. Sometimes the simplest solution is the right one.

Common Mistakes

Let me save you from the mistakes we see most often:

Mistake 1: Fine-Tuning When You Should Just Prompt Better

Symptom: "The model is not following my instructions."

Actual problem: Your prompt is vague, missing examples, or poorly structured.

Fix: Before spending money on fine-tuning, try: (1) adding 3-5 few-shot examples in your prompt, (2) being much more specific about what you want, (3) using XML tags to structure your prompt, (4) testing with a stronger model first. We have seen prompting improvements of 20-30% just from better prompt engineering.

Mistake 2: Building RAG When the Context Window Is Big Enough

Symptom: You built a full RAG pipeline for a knowledge base that is 30 pages long.

Actual problem: Over-engineering. You added a vector database, embeddings, chunking, reranking -- all for content that fits in a single prompt.

Fix: Calculate your total content size in tokens. If it is under 50,000 tokens (roughly 50 pages), try stuffing it in the context first. If accuracy is good enough, you just saved yourself weeks of RAG development and $200/month in infrastructure.
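A back-of-the-envelope check, using the common ~4-characters-per-token heuristic for English text. It is only an approximation; use your provider's tokenizer (e.g. tiktoken) for exact counts.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Use your provider's tokenizer for exact counts."""
    return len(text) // 4

def fits_in_context(docs: list[str], budget_tokens: int = 50_000) -> bool:
    """Could this content go directly in a prompt instead of a RAG index?"""
    total = sum(estimate_tokens(d) for d in docs)
    return total <= budget_tokens

handbook = ["page text " * 100] * 30   # stand-in for a 30-page knowledge base
print(fits_in_context(handbook))  # True
```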

Mistake 3: Fine-Tuning on Bad Data

Symptom: Your fine-tuned model performs worse than the base model.

Actual problem: Your training data is inconsistent, contains errors, or does not represent the actual use case well.

Fix: Quality over quantity. 200 perfect examples beat 2,000 mediocre ones. Have domain experts review every training example. Remove contradictory examples. Make sure your training data distribution matches your real query distribution.
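One cheap automated check: flag identical inputs that map to different outputs. A sketch, assuming the OpenAI-style `{"messages": [...]}` format shown earlier; real cleanup also needs human review, since two answers can conflict without being identical inputs.

```python
from collections import defaultdict

def find_contradictions(examples: list[dict]) -> list[str]:
    """Flag user inputs that appear with more than one distinct
    assistant output -- a common source of fine-tuning regressions."""
    outputs_by_input = defaultdict(set)
    for ex in examples:
        user = next(m["content"] for m in ex["messages"] if m["role"] == "user")
        answer = next(m["content"] for m in ex["messages"] if m["role"] == "assistant")
        outputs_by_input[user.strip()].add(answer.strip())
    return [inp for inp, outs in outputs_by_input.items() if len(outs) > 1]

data = [
    {"messages": [{"role": "user", "content": "Refund window?"},
                  {"role": "assistant", "content": "30 days"}]},
    {"messages": [{"role": "user", "content": "Refund window?"},
                  {"role": "assistant", "content": "14 days"}]},
]
print(find_contradictions(data))  # ['Refund window?']
```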

Mistake 4: Not Establishing a Prompting Baseline

Symptom: You cannot tell whether RAG or fine-tuning actually improved anything.

Actual problem: You did not measure how well plain prompting works on your task before adding complexity.

Fix: Always start with a prompting-only baseline. Measure accuracy on 100+ test queries. Then add RAG or fine-tuning and measure again. If the improvement is less than 10%, the added complexity may not be worth it.
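A baseline harness can be this small. The sketch grades by exact match, which only suits short factual answers; free-form responses usually need a fuzzier grader or an LLM judge. `ask_model` is whatever function wraps your prompt-only call.

```python
from typing import Callable

def measure_accuracy(test_set: list[tuple[str, str]],
                     ask_model: Callable[[str], str]) -> float:
    """Fraction of test queries answered correctly (exact match,
    case-insensitive). Swap in a better grader for free-form answers."""
    correct = sum(
        1 for query, expected in test_set
        if ask_model(query).strip().lower() == expected.strip().lower()
    )
    return correct / len(test_set)

# Toy run with a stand-in model:
tests = [("2+2?", "4"), ("capital of France?", "Paris"), ("3*3?", "9")]
fake_model = lambda q: {"2+2?": "4", "capital of France?": "paris"}.get(q, "?")
print(round(measure_accuracy(tests, fake_model), 2))  # 0.67
```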

Mistake 5: Ignoring the Maintenance Cost

Symptom: Your AI system worked great at launch and has been degrading ever since.

Actual problem: Documents changed but the RAG index was not updated. Or the fine-tuned model was trained on data that is now outdated.

Fix: Budget for maintenance from day one.

| Approach | Maintenance Burden |
|---|---|
| Prompting | Low -- update the prompt when needed |
| RAG | Medium -- keep the index fresh, monitor retrieval quality |
| Fine-tuning | High -- retrain periodically, validate on new data |

The "Just Use Claude with a Good Prompt" Baseline

I want to end with something that might be controversial in the AI services industry: for most use cases, a well-crafted prompt with a frontier model like Claude Sonnet or GPT-4o is the right starting point, and often the right ending point.

The AI industry has a bias toward complexity. Vendors want to sell you RAG pipelines and fine-tuning services. Conference talks showcase elaborate multi-model architectures. But the boring truth is that 40% of the AI projects that land on our desk at CODERCOPS are solvable with prompting alone.

Our approach:

  1. Start with prompting. Spend 4-8 hours crafting and testing prompts. Measure accuracy on a test set.
  2. If accuracy is above 85%, ship it. Iterate on the prompt as you get user feedback.
  3. If accuracy is below 85% and you have proprietary data, add RAG. Build incrementally -- start with basic vector search and add hybrid search/reranking only if basic RAG is not accurate enough.
  4. If accuracy is still below target and you have training data, consider fine-tuning. But only for the specific subtask where prompting and RAG fall short, not for the whole system.

This incremental approach saves our clients an average of $8,000-15,000 and 3-6 weeks compared to jumping straight to the most complex solution.

Quick Reference Summary

| Factor | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Upfront cost | $200-1,000 | $3,000-8,000 | $4,000-17,000 |
| Monthly cost (10K queries/day) | $400-1,000 | $800-1,650 | $350-800 |
| Time to deploy | 1-3 days | 2-4 weeks | 3-6 weeks |
| Handles private data | Poorly (limited to prompt) | Excellently | Moderately |
| Handles style/format | Moderately | Poorly | Excellently |
| Information freshness | Limited by model's cutoff; prompt is easy to update | Current (if index updated) | Stale (needs retraining) |
| Latency | Low | Medium | Low |
| Maintenance burden | Low | Medium | High |
| When to use | Well-defined tasks, general knowledge | Private data, changing knowledge | Specific style, domain reasoning |

Need Help Choosing the Right Approach?

At CODERCOPS, we have helped 20+ clients navigate this exact decision. We do not have a bias toward any particular approach -- we recommend whatever delivers the best accuracy at the lowest cost for your specific use case.

If you are building an AI-powered feature and are not sure whether you need RAG, fine-tuning, or just better prompting, let's talk. We will evaluate your use case, test prompting first (on our dime), and give you an honest recommendation with real cost projections.

Not ready for a conversation yet? Browse our blog for more practical guides on building with AI. Every post is based on real client work, not theory.
