A client came to us last quarter wanting to fine-tune GPT-4 on their 50-page employee handbook. They had already gotten quotes from two vendors: $12,000-18,000 for data preparation, training, and deployment. Timeline: 3-4 weeks. They were ready to sign.
We saved them $15,000 and 3 weeks. We took their handbook, chunked it intelligently, built a simple RAG system with Supabase pgvector, and wrote a well-crafted system prompt. Total cost: $800 in development time, $12/month in infrastructure. The system answers employee questions with 96% accuracy. Better than the fine-tuned approach would have delivered, at a fraction of the cost.
But here is the thing -- I am not telling you this to say "RAG is always better" or "fine-tuning is a scam." Sometimes fine-tuning is exactly the right approach. Sometimes plain prompting is all you need. The problem is that most teams have no framework for making this decision. They either default to whatever the latest blog post recommended, or they let a vendor talk them into the most expensive option. This post gives you a clear, data-backed decision framework so you never have to guess.
The Three Approaches, Explained Simply
Before we get into the framework, let me make sure we are on the same page about what each approach actually is. I am going to explain these as if you are a smart business person who understands technology but is not neck-deep in ML research.
Approach 1: Prompt Engineering
What it is: You take a pre-trained model (like Claude or GPT-4) and give it carefully written instructions in the prompt. No training, no external data systems. Just a well-structured prompt with clear instructions, examples, and constraints.
Analogy: Hiring an expert consultant and giving them a detailed brief before each conversation. They bring their existing knowledge; you just guide how they apply it.
```python
# Prompt engineering example
system_prompt = """You are a customer support agent for
TechCorp. Follow these rules exactly:

1. Always greet the customer by name if provided
2. For billing questions, reference our pricing tiers:
   - Starter: $29/mo (5 users, 10GB storage)
   - Professional: $99/mo (25 users, 100GB storage)
   - Enterprise: $299/mo (unlimited users, 1TB storage)
3. For technical issues, collect: error message,
   browser/device, steps to reproduce
4. Never promise refunds -- escalate to billing team
5. Respond in 2-3 sentences maximum
6. If unsure, say "Let me connect you with a specialist"

TONE: Professional but warm. Use the customer's name.
"""

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    system=system_prompt,
    messages=[{"role": "user", "content": user_query}],
)
```

When it works: The model already knows the subject matter, and you just need to control format, style, or behavior.
When it breaks: The model does not know your specific data (product details, company policies, proprietary information). You can only fit so much in a system prompt before it gets unwieldy.
Approach 2: Retrieval-Augmented Generation (RAG)
What it is: You store your documents in a searchable database (usually a vector database). When a user asks a question, you search for relevant documents, inject them into the prompt, and let the model answer based on that specific context.
Analogy: Giving the expert consultant a filing cabinet and telling them "look up the answer in these files before responding." They still use their expertise to interpret and synthesize, but the facts come from your documents.
```python
# RAG example (simplified)
def answer_with_rag(user_query: str) -> str:
    # Step 1: Find relevant documents
    relevant_docs = vector_db.search(
        query=user_query,
        top_k=5,
    )

    # Step 2: Build prompt with retrieved context
    context = "\n\n".join([doc.text for doc in relevant_docs])
    prompt = f"""Answer the user's question using ONLY
the information provided below. If the answer is not in
the provided context, say "I don't have information
about that."

CONTEXT:
{context}

USER QUESTION: {user_query}"""

    # Step 3: Generate answer
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

When it works: You have a large body of proprietary documents that the model needs to reference. The information changes frequently. You need citations and traceability.
When it breaks: The retrieval step fails (pulls irrelevant documents), or the information requires deep reasoning that does not map well to "find the right chunk."
Approach 3: Fine-Tuning
What it is: You take a pre-trained model and train it further on your specific data. This modifies the model's weights, essentially teaching it new knowledge or behaviors that become part of the model itself.
Analogy: Sending the expert consultant to a specialized training program. When they come back, they have internalized new knowledge and behaviors. You do not need to brief them every time -- they just know.
```python
# Fine-tuning example (OpenAI format)
# Step 1: Prepare training data
training_data = [
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a medical coding assistant.",
            },
            {
                "role": "user",
                "content": "Patient presents with acute "
                           "bronchitis, prescribed amoxicillin",
            },
            {
                "role": "assistant",
                "content": "ICD-10: J20.9 (Acute bronchitis, "
                           "unspecified)\nCPT: 99213 "
                           "(Office visit, established patient, "
                           "low complexity)",
            },
        ]
    },
    # ... hundreds more examples
]

# Step 2: Upload and train
file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
)

# Step 3: Use the fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::job-id",
    messages=[{"role": "user", "content": new_query}],
)
```

When it works: You need the model to learn a specific style, format, or domain-specific reasoning pattern. You have high-quality training data. The task is repeated frequently enough to justify the upfront cost.
When it breaks: You do not have enough quality training data (need 100+ examples minimum, 1000+ for good results). The information changes frequently (you would need to retrain). The task is simple enough that prompting handles it.
The Decision Framework
Here is the framework we use at CODERCOPS when a client comes to us with an AI project. We have refined this over 20+ client engagements.
The Decision Tree
Follow this from top to bottom. The first "Yes" that matches your situation tells you where to start.
```
START: What does your AI system need to do?

1. Does it need access to your proprietary/private data
   to answer questions?
   |
   ├── YES: How much data?
   |    ├── Under 50 pages → TRY PROMPTING FIRST
   |    |    (Stuff it in the system prompt)
   |    ├── 50-500 pages → RAG
   |    └── 500+ pages → RAG (definitely)
   |
   └── NO: Continue to question 2

2. Does it need to follow a very specific output format
   or style consistently?
   |
   ├── YES: Can you describe the format clearly in
   |    2-3 paragraphs with examples?
   |    ├── YES → TRY PROMPTING FIRST (with few-shot
   |    |    examples)
   |    └── NO → FINE-TUNING (the style is too nuanced
   |         to describe, easier to show by example)
   |
   └── NO: Continue to question 3

3. Does it need domain-specific reasoning that general
   models get wrong?
   |
   ├── YES: Do you have 500+ examples of correct
   |    input/output pairs?
   |    ├── YES → FINE-TUNING
   |    └── NO → RAG + PROMPTING (use domain docs as
   |         context + detailed instructions)
   |
   └── NO: Continue to question 4

4. Is the task well-defined with clear instructions?
   |
   ├── YES → PROMPTING (just write a good prompt)
   └── NO → Start with prompting to figure out what
        works, then add RAG or fine-tuning as needed
```

The "Just Try Prompting First" Rule
Here is a rule that will save you time and money: always try prompting first. Even if you think you need RAG or fine-tuning, spend 2-4 hours crafting a really good prompt with examples and testing it. You will be surprised how often this is enough.
In our experience at CODERCOPS, about 40% of clients who think they need fine-tuning actually just need better prompting. Another 35% need RAG. Only about 25% genuinely need fine-tuning.
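The decision tree above is mechanical enough to sketch as a small helper function. This is a rough codification under the tree's own thresholds (50/500 pages, 500 examples), not a substitute for judgment; the field and function names are illustrative, not a real API.

```python
# Rough codification of the decision tree. Thresholds come straight
# from the tree above; names here are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Project:
    needs_private_data: bool
    data_pages: int = 0
    needs_strict_format: bool = False
    format_describable: bool = True
    needs_domain_reasoning: bool = False
    labeled_examples: int = 0

def recommend(p: Project) -> str:
    if p.needs_private_data:                      # Question 1
        if p.data_pages < 50:
            return "prompting"                    # stuff it in the system prompt
        return "rag"
    if p.needs_strict_format:                     # Question 2
        return "prompting" if p.format_describable else "fine-tuning"
    if p.needs_domain_reasoning:                  # Question 3
        if p.labeled_examples >= 500:
            return "fine-tuning"
        return "rag + prompting"
    return "prompting"                            # Question 4: start simple

print(recommend(Project(needs_private_data=True, data_pages=200)))  # rag
```

Treat the output as a starting point: the "try prompting first" rule still applies before you commit to whatever the function suggests.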
The Cost Comparison: Real Numbers
This is the section most articles skip or fill with vague ranges. Here are real numbers based on what we have actually paid across client projects in 2025-2026.
Upfront Costs
| Cost Category | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Data preparation | $0 | $500-3,000 | $2,000-10,000 |
| Infrastructure setup | $0 | $200-500 | $0 (cloud-based) |
| Development time | 4-16 hours ($200-800) | 40-80 hours ($2,000-4,000) | 20-60 hours ($1,000-3,000) |
| Training costs | $0 | $0 | $50-2,000 (depends on model/data size) |
| Testing and evaluation | 2-4 hours ($100-200) | 8-16 hours ($400-800) | 16-40 hours ($800-2,000) |
| Total upfront | $200-1,000 | $3,000-8,000 | $4,000-17,000 |
Monthly Running Costs (at 10,000 queries/day)
| Cost Category | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| LLM API costs | $300-900/mo | $200-600/mo | $150-500/mo |
| Vector DB hosting | $0 | $50-200/mo | $0 |
| Embedding API | $0 | $30-100/mo | $0 |
| Reranking API | $0 | $20-50/mo | $0 |
| Infrastructure | $0 | $100-300/mo | $0-100/mo |
| Maintenance | 2 hrs/mo ($100) | 8 hrs/mo ($400) | 4 hrs/mo ($200) |
| Total monthly | $400-1,000/mo | $800-1,650/mo | $350-800/mo |
Key insight: Fine-tuning has the highest upfront cost but the lowest monthly running cost. Prompting has the lowest upfront cost but the highest per-query cost (because prompts tend to be longer). RAG falls in the middle but requires ongoing infrastructure maintenance.
Cost Per Query Breakdown
| Approach | Avg Input Tokens | Avg Output Tokens | Cost Per Query |
|---|---|---|---|
| Prompting (Claude Sonnet) | 2,000 | 500 | $0.0085 |
| Prompting (Claude Haiku) | 2,000 | 500 | $0.0005 |
| RAG (Claude Sonnet, 5 chunks) | 4,000 | 500 | $0.0145 |
| RAG (Claude Haiku, 5 chunks) | 4,000 | 500 | $0.0009 |
| Fine-tuned GPT-4o-mini | 500 | 500 | $0.0006 |
| Fine-tuned GPT-4o | 500 | 500 | $0.0075 |
Notice: Fine-tuned models use fewer input tokens because the knowledge is baked into the model weights -- you do not need lengthy system prompts or retrieved context. This makes them cheaper per query for high-volume use cases.
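The per-query math is simple enough to script yourself. A sketch, using placeholder per-million-token prices (the values in `PRICES` are assumptions for illustration, not current list prices; substitute your provider's real rates):

```python
# Cost per query = input_tokens * input_price + output_tokens * output_price.
# PRICES holds (input $/M tokens, output $/M tokens) -- assumed placeholder
# values, NOT current list prices. Substitute your provider's real rates.
PRICES = {
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def cost_per_query(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# RAG inflates the input side: e.g. 5 retrieved chunks of ~400 tokens each
plain = cost_per_query("haiku", 2_000, 500)
rag = cost_per_query("haiku", 2_000 + 5 * 400, 500)
print(f"plain: ${plain:.4f}  rag: ${rag:.4f}")
```

Multiply the result by your daily query volume and the monthly table above stops being abstract.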
Latency Comparison
Speed matters. A customer-facing chatbot needs to respond in 1-2 seconds. An internal analysis tool can take 10 seconds. Here is what to expect:
| Approach | Time to First Token | Total Response Time (500 tokens) |
|---|---|---|
| Prompting (Claude Sonnet) | 0.3-0.5s | 1.0-2.0s |
| Prompting (Claude Haiku) | 0.1-0.3s | 0.5-1.0s |
| RAG + Claude Sonnet | 0.8-1.5s | 1.5-3.5s |
| RAG + Claude Haiku | 0.5-1.0s | 1.0-2.5s |
| Fine-tuned GPT-4o-mini | 0.2-0.4s | 0.8-1.5s |
The RAG latency tax: RAG adds 0.5-1.5 seconds for the retrieval step (vector search, reranking). This is usually acceptable, but for real-time applications where every millisecond counts, it matters. You can reduce this with caching, pre-retrieval, and faster vector databases.
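Of those mitigations, caching the query embedding is the cheapest win: repeat (or trivially rephrased) queries skip the embedding call entirely. A minimal sketch, where `embed_fn` is a stand-in for whatever embedding API you actually use:

```python
# Minimal query-embedding cache. Repeat queries skip the embedding call.
# `embed_fn` is a placeholder for your real embedding API client.
from functools import lru_cache

def make_cached_embedder(embed_fn, maxsize: int = 10_000):
    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str):
        return embed_fn(normalized)

    def embed(query: str):
        # Normalize first so trivially different phrasings share a slot.
        # (Note: lowercasing can shift embedding semantics slightly.)
        return _cached(query.strip().lower())

    return embed

# Demo with a stand-in embedding function that counts its calls:
calls = []
def fake_embed(text):
    calls.append(text)
    return (len(text),)  # stand-in "vector"

embed = make_cached_embedder(fake_embed)
embed("How do I reset my device?")
embed("how do I reset my device?  ")  # normalizes to the same key
print(len(calls))  # 1 -- the embedding API was only called once
```

Caching helps most for support bots, where a handful of questions dominate the traffic; it does nothing for the vector search itself, which is where faster databases and pre-retrieval come in.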
Fine-tuning latency advantage: Because fine-tuned models use shorter prompts, they are consistently faster. If latency is your primary constraint, fine-tuning wins.
Accuracy Comparison Across Use Cases
This is the comparison that matters most, and it varies dramatically by use case:
Use Case 1: Customer Support Q&A (Company-Specific)
| Approach | Accuracy | Notes |
|---|---|---|
| Prompting only | 45% | Model does not know company-specific info |
| RAG | 89% | Retrieves relevant policy docs |
| Fine-tuned | 72% | Learned patterns but struggles with rare queries |
| RAG + Fine-tuned | 93% | Best of both worlds |
Winner: RAG. Customer support requires access to specific, frequently updated company data. Fine-tuning alone cannot keep up with policy changes.
Use Case 2: Medical Coding (ICD-10 Classification)
| Approach | Accuracy | Notes |
|---|---|---|
| Prompting only | 61% | General knowledge of medical codes |
| RAG (with ICD-10 database) | 78% | Good retrieval but reasoning is weak |
| Fine-tuned (2000 examples) | 91% | Learned the classification patterns |
| RAG + Fine-tuned | 94% | Retrieval for rare codes + learned reasoning |
Winner: Fine-tuning. Medical coding requires pattern recognition across thousands of codes with nuanced rules. This is exactly what fine-tuning excels at. RAG alone retrieves code descriptions but struggles with the reasoning to select the right one.
Use Case 3: Content Generation (Brand Voice)
| Approach | Accuracy (human rating) | Notes |
|---|---|---|
| Prompting with style guide | 6.2/10 | Follows instructions but feels generic |
| RAG with brand content examples | 6.8/10 | Better but inconsistent |
| Fine-tuned on brand content | 8.5/10 | Consistently matches brand voice |
| Prompting with few-shot examples | 7.1/10 | Good enough for many use cases |
Winner: Fine-tuning. Brand voice is a nuanced, hard-to-describe quality that is much easier to teach by example than to describe in instructions. Fine-tuning on 500-1000 examples of on-brand content produces remarkably consistent voice matching.
Use Case 4: Document Summarization (Generic)
| Approach | Accuracy | Notes |
|---|---|---|
| Prompting only | 88% | Models are already great at this |
| RAG | N/A | Not applicable (input is the document itself) |
| Fine-tuned | 90% | Marginal improvement not worth the cost |
Winner: Prompting. Modern LLMs are excellent summarizers out of the box. Fine-tuning offers marginal improvement at significant cost. Just write a good prompt.
Use Case 5: Code Generation (Company Codebase)
| Approach | Accuracy | Notes |
|---|---|---|
| Prompting only | 55% | Does not know your codebase patterns |
| RAG (with codebase context) | 82% | Retrieves relevant code patterns |
| Fine-tuned on codebase | 75% | Learned patterns but codebase changes |
| RAG + CLAUDE.md prompting | 87% | Our actual approach at CODERCOPS |
Winner: RAG + Prompting. Code generation needs both access to the current codebase (RAG) and understanding of conventions (prompting via CLAUDE.md). Fine-tuning on code is risky because codebases change rapidly.
Hybrid Approaches: When One Is Not Enough
The real world is messy, and often the best solution combines approaches.
RAG + Prompting (Our Most Common Recommendation)
Combine retrieval with a well-crafted system prompt that includes output format instructions, behavioral guidelines, and a few examples. This covers 60-70% of use cases.
```python
system_prompt = """You are a tax advisory assistant.

RULES:
- Only answer based on the provided tax documents
- Always cite the specific regulation or section number
- If information is from before 2025, note that it may
  be outdated
- For questions outside tax scope, decline politely
- Present numbers in tables when comparing scenarios

EXAMPLES OF GOOD RESPONSES:
[include 2-3 examples]
"""

# Retrieved context gets injected into the user message
user_message = f"""Based on the following tax regulations:

{retrieved_context}

{user_question}"""
```

RAG + Fine-Tuned Model
For high-accuracy applications, you can fine-tune a model on your domain AND give it retrieved context at inference time. This is expensive but delivers the best results.
When we recommend this: Healthcare, legal, financial -- any domain where accuracy is critical and the cost of errors is high.
Prompt Chaining + RAG
Instead of one big RAG query, break the task into steps:
```
Step 1: Classify the query intent (prompting only)
Step 2: Retrieve relevant documents (RAG)
Step 3: Generate answer with citations (RAG + prompting)
Step 4: Verify answer against sources (prompting only)
```

This is our standard architecture for production RAG systems at CODERCOPS. Each step can use a different model (Haiku for classification, Sonnet for generation) to optimize cost.
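The four steps reduce to plain function composition. In this sketch, `call_model` and `search` are injected stand-ins for your LLM client and vector store (the names are ours, not a real API); the structure is the point -- each stage is small, swappable, and can run on a different model tier:

```python
# Skeleton of the four-step chain. `call_model(tier, prompt)` and
# `search(query, top_k)` are placeholders for your LLM client and
# vector store; injecting them lets each stage use a different model.
def answer_query(question, call_model, search, top_k=5):
    # Step 1: classify intent with a cheap model
    intent = call_model("cheap", f"Classify the intent of: {question}")

    # Step 2: retrieve relevant documents
    docs = search(question, top_k=top_k)
    context = "\n\n".join(docs)

    # Step 3: generate an answer grounded in the retrieved context
    answer = call_model(
        "strong",
        f"Intent: {intent}\nContext:\n{context}\n\nQuestion: {question}",
    )

    # Step 4: verify the answer against the sources
    verdict = call_model(
        "cheap",
        "Does this answer follow from the context? Reply yes/no.\n"
        f"Context:\n{context}\nAnswer: {answer}",
    )
    return answer if verdict.strip().lower().startswith("yes") else None

# Smoke test with stubs standing in for real models:
stub_llm = lambda tier, prompt: "yes" if "yes/no" in prompt else "stub answer"
stub_search = lambda q, top_k: ["doc one", "doc two"]
print(answer_query("What is the refund policy?", stub_llm, stub_search))
```

Returning `None` on a failed verification is the simplest policy; in production you would typically retry generation or escalate to a human instead.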
5 Real Client Scenarios
Let me walk through five actual client projects and explain what we recommended and why.
Scenario 1: HR Chatbot for Employee Policy Questions
Client: 500-employee SaaS company. 200 pages of HR policies, benefits documents, and procedure guides.
What they wanted: An internal chatbot that employees can ask about policies, benefits, time off, etc.
What they thought they needed: Fine-tuning GPT-4 on their HR documents.
What we recommended: RAG with Claude Haiku.
Why: The information is in documents that change quarterly. Fine-tuning would require retraining every time a policy changes. RAG retrieves the current version automatically. Haiku is fast enough for chatbot interactions and costs 90% less than fine-tuning a larger model.
Result: 94% accuracy, $45/month running cost, 2-week build. Compared to the fine-tuning quote of $15,000 upfront + retraining costs.
Scenario 2: Legal Contract Clause Extraction
Client: Law firm that reviews 50+ contracts per week. Needs to extract specific clauses (indemnification, termination, non-compete) and flag unusual terms.
What they wanted: An AI system that can read contracts and extract structured data.
What we recommended: Fine-tuned GPT-4o-mini + RAG for reference.
Why: Clause extraction requires understanding nuanced legal language and recognizing patterns that vary across contract styles. The firm provided 1,200 annotated contracts for training. RAG supplements with a reference database of standard clause templates for comparison.
Result: 92% extraction accuracy (vs. 71% with prompting alone). Upfront cost: $14,000. Monthly cost: $200. The fine-tuning was worth it here because the task is complex, the training data was available, and the firm processes enough contracts to justify the investment.
Scenario 3: Product Description Generator for E-Commerce
Client: Online retailer with 5,000 SKUs. Needs product descriptions in their brand voice.
What they wanted: An AI system to generate product descriptions that match their existing style.
What we recommended: Fine-tuning on 800 existing product descriptions.
Why: Brand voice is exactly the kind of nuanced, hard-to-describe pattern that fine-tuning handles best. The client had thousands of well-written descriptions to train on. Once trained, the model generates on-brand descriptions with a short product attributes prompt -- no retrieval needed, no long prompts needed.
Result: 8.7/10 brand voice match rating from the client's marketing team. $0.001 per description. Paid for itself in the first month by replacing the $5/description they were paying to a content agency.
Scenario 4: Technical Support Troubleshooting Bot
Client: IoT device manufacturer with complex troubleshooting procedures.
What they wanted: A bot that guides customers through troubleshooting, step by step.
What we recommended: RAG + prompt chaining.
Why: Troubleshooting is inherently a multi-step, branching process. The correct next step depends on the result of the previous step. We built a RAG system that retrieves the relevant troubleshooting tree, then a prompt chain that walks through it step-by-step with the customer.
Result: 87% of issues resolved without human escalation (up from 34% with their previous keyword-based chatbot). Monthly cost: $180. Fine-tuning would not have worked here because the troubleshooting procedures change with every firmware update.
Scenario 5: Internal Knowledge Search for Engineering Team
Client: 40-person engineering team with 12 years of internal wikis, runbooks, and post-mortems.
What they wanted: "Search that actually works" for their internal knowledge base.
What we recommended: Prompting only. Seriously.
Why: Their knowledge base was only 180 pages of critical content (the rest was outdated or redundant). We helped them curate it down to essentials, organized it clearly, and built a simple Claude-powered search that takes the user's question + the entire curated knowledge base as context. No vector database, no embeddings, no chunking.
Result: 91% accuracy. $0 infrastructure cost (just API calls). $60/month at their query volume. Built in 3 days. Sometimes the simplest solution is the right one.
Common Mistakes
Let me save you from the mistakes we see most often:
Mistake 1: Fine-Tuning When You Should Just Prompt Better
Symptom: "The model is not following my instructions."
Actual problem: Your prompt is vague, missing examples, or poorly structured.
Fix: Before spending money on fine-tuning, try: (1) adding 3-5 few-shot examples in your prompt, (2) being much more specific about what you want, (3) using XML tags to structure your prompt, (4) testing with a stronger model first. We have seen prompting improvements of 20-30% just from better prompt engineering.
Mistake 2: Building RAG When the Context Window Is Big Enough
Symptom: You built a full RAG pipeline for a knowledge base that is 30 pages long.
Actual problem: Over-engineering. You added a vector database, embeddings, chunking, reranking -- all for content that fits in a single prompt.
Fix: Calculate your total content size in tokens. If it is under 50,000 tokens (roughly 50 pages), try stuffing it in the context first. If accuracy is good enough, you just saved yourself weeks of RAG development and $200/month in infrastructure.
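A quick way to make that call, using the rough rule of thumb that one token is about four characters of English text (for a precise count, use your model provider's tokenizer instead):

```python
# Back-of-envelope check: does the whole knowledge base fit in one prompt?
# Uses the rough ~4 characters-per-token heuristic; use your provider's
# tokenizer for a precise count before committing either way.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def fits_in_context(docs: list[str], budget: int = 50_000) -> bool:
    total = sum(estimate_tokens(d) for d in docs)
    return total <= budget

# ~30 pages at ~3,000 characters per page -> well under the budget
pages = ["x" * 3_000] * 30
print(fits_in_context(pages))  # True -> try prompt stuffing before RAG
```

If this returns True, run your accuracy tests on the stuffed-context version before building anything heavier.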
Mistake 3: Fine-Tuning on Bad Data
Symptom: Your fine-tuned model performs worse than the base model.
Actual problem: Your training data is inconsistent, contains errors, or does not represent the actual use case well.
Fix: Quality over quantity. 200 perfect examples beat 2,000 mediocre ones. Have domain experts review every training example. Remove contradictory examples. Make sure your training data distribution matches your real query distribution.
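The expert review cannot be automated, but the mechanical checks can. Duplicate prompts with different answers are one of the most common sources of contradictions, and they are easy to catch in the chat-format JSONL shown earlier. A minimal sketch:

```python
# Mechanical sanity check for fine-tuning data: flag exact-duplicate
# user prompts that map to *different* assistant answers, so a domain
# expert can resolve the contradiction before training. Field names
# follow the OpenAI chat-format JSONL shown earlier in this post.
from collections import defaultdict

def find_contradictions(examples: list[dict]) -> list[str]:
    answers = defaultdict(set)
    for ex in examples:
        msgs = {m["role"]: m["content"] for m in ex["messages"]}
        answers[msgs["user"]].add(msgs["assistant"])
    return [prompt for prompt, outs in answers.items() if len(outs) > 1]

data = [
    {"messages": [{"role": "user", "content": "code for acute bronchitis"},
                  {"role": "assistant", "content": "ICD-10: J20.9"}]},
    {"messages": [{"role": "user", "content": "code for acute bronchitis"},
                  {"role": "assistant", "content": "ICD-10: J40"}]},  # conflict
]
print(find_contradictions(data))  # ['code for acute bronchitis']
```

Exact matching only catches the blatant cases; near-duplicate detection (normalized text or embedding similarity) catches more, at the cost of false positives to review.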
Mistake 4: Not Establishing a Prompting Baseline
Symptom: You cannot tell whether RAG or fine-tuning actually improved anything.
Actual problem: You did not measure how well plain prompting works on your task before adding complexity.
Fix: Always start with a prompting-only baseline. Measure accuracy on 100+ test queries. Then add RAG or fine-tuning and measure again. If the improvement is less than 10%, the added complexity may not be worth it.
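Measuring that baseline takes very little code. A sketch with an injectable `ask` function (whatever pipeline you are evaluating) and a deliberately crude grader -- substring match works for factual Q&A, while nuanced tasks need an LLM judge or a stricter rubric:

```python
# Baseline harness: run a labeled test set through any pipeline and
# score it. `ask` is injected (prompt-only, RAG, or fine-tuned), so
# the same harness compares approaches apples-to-apples.
def accuracy(ask, test_set: list[tuple[str, str]]) -> float:
    # Grade by substring match; swap in an LLM judge for nuanced tasks.
    correct = sum(
        1 for question, expected in test_set
        if expected.lower() in ask(question).lower()
    )
    return correct / len(test_set)

# Stub pipeline for illustration (a real one would call your model):
stub_ask = lambda q: ("The Starter tier is $29/mo."
                      if "starter" in q.lower() else "I don't know")
tests = [
    ("What does Starter cost?", "$29/mo"),
    ("What does Professional cost?", "$99/mo"),
]
print(accuracy(stub_ask, tests))  # 0.5
```

Run this once with prompting only, record the number, then rerun it after adding RAG or fine-tuning; the delta between runs is the honest measure of whether the added complexity earned its keep.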
Mistake 5: Ignoring the Maintenance Cost
Symptom: Your AI system worked great at launch and has been degrading ever since.
Actual problem: Documents changed but the RAG index was not updated. Or the fine-tuned model was trained on data that is now outdated.
Fix: Budget for maintenance from day one.
| Approach | Maintenance Burden |
|---|---|
| Prompting | Low -- update the prompt when needed |
| RAG | Medium -- keep the index fresh, monitor retrieval quality |
| Fine-tuning | High -- retrain periodically, validate on new data |
The "Just Use Claude with a Good Prompt" Baseline
I want to end with something that might be controversial in the AI services industry: for most use cases, a well-crafted prompt with a frontier model like Claude Sonnet or GPT-4o is the right starting point, and often the right ending point.
The AI industry has a bias toward complexity. Vendors want to sell you RAG pipelines and fine-tuning services. Conference talks showcase elaborate multi-model architectures. But the boring truth is that 40% of the AI projects that land on our desk at CODERCOPS are solvable with prompting alone.
Our approach:
- Start with prompting. Spend 4-8 hours crafting and testing prompts. Measure accuracy on a test set.
- If accuracy is above 85%, ship it. Iterate on the prompt as you get user feedback.
- If accuracy is below 85% and you have proprietary data, add RAG. Build incrementally -- start with basic vector search and add hybrid search/reranking only if basic RAG is not accurate enough.
- If accuracy is still below target and you have training data, consider fine-tuning. But only for the specific subtask where prompting and RAG fall short, not for the whole system.
This incremental approach saves our clients an average of $8,000-15,000 and 3-6 weeks compared to jumping straight to the most complex solution.
Quick Reference Summary
| Factor | Prompting | RAG | Fine-Tuning |
|---|---|---|---|
| Upfront cost | $200-1,000 | $3,000-8,000 | $4,000-17,000 |
| Monthly cost (10K queries/day) | $400-1,000 | $800-1,650 | $350-800 |
| Time to deploy | 1-3 days | 2-4 weeks | 3-6 weeks |
| Handles private data | Poorly (limited to prompt) | Excellently | Moderately |
| Handles style/format | Moderately | Poorly | Excellently |
| Information freshness | Current (edit the prompt anytime) | Current (if index updated) | Stale (needs retraining) |
| Latency | Low | Medium | Low |
| Maintenance burden | Low | Medium | High |
| When to use | Well-defined tasks, general knowledge | Private data, changing knowledge | Specific style, domain reasoning |
Need Help Choosing the Right Approach?
At CODERCOPS, we have helped 20+ clients navigate this exact decision. We do not have a bias toward any particular approach -- we recommend whatever delivers the best accuracy at the lowest cost for your specific use case.
If you are building an AI-powered feature and are not sure whether you need RAG, fine-tuning, or just better prompting, let's talk. We will evaluate your use case, test prompting first (on our dime), and give you an honest recommendation with real cost projections.
Not ready for a conversation yet? Browse our blog for more practical guides on building with AI. Every post is based on real client work, not theory.