We have built 14 AI agent systems for clients in the last 18 months. Nine of the first attempts failed spectacularly. Not "didn't quite work" failed. I mean "billed $2,400 in API costs overnight while stuck in an infinite loop" failed. "Emailed a client's customer complete nonsense" failed. "Confidently called a function that doesn't exist" failed.
Those failures taught us more than any documentation or conference talk ever could. And after burning through them, we now have a set of patterns that reliably produce AI agents that actually work in production -- agents that handle edge cases, degrade gracefully, and don't bankrupt your API budget at 3 AM.
This is not a tutorial on building a toy agent that can search the web and summarize results. This is what we actually do at CODERCOPS when a client needs an agent system that handles real workloads, real users, and real consequences when things go wrong. If you are evaluating agent frameworks, building your first production agent, or trying to figure out why your current agent keeps failing, this post is for you.
The Agent Framework Landscape in 2026 -- An Honest Assessment
Let me save you weeks of evaluation. Here is where the major agent frameworks actually stand, based on our production experience with all of them.
LangGraph
LangGraph is the framework we reach for most often. It models agent workflows as directed graphs with explicit state management. That sounds academic, but in practice it means you can see exactly what your agent is doing, checkpoint its progress, and resume from failures.
What we love: Explicit state management, built-in persistence, human-in-the-loop support baked in, excellent debugging. You can visualize the entire agent flow as a graph.
What burns us: The learning curve is steep. New engineers on our team take 2-3 weeks to get comfortable. The abstraction layers can feel heavy for simple use cases. Documentation, while improved, still has gaps in advanced patterns.
CrewAI
CrewAI takes a role-based approach where you define "agents" with specific personas and let them collaborate. It is great for demos and prototypes.
What we love: Fast to prototype, intuitive mental model, good for non-technical stakeholders to understand.
What burns us: Production reliability is inconsistent. The multi-agent coordination often produces redundant work. Error handling is limited. We have had agents in a "crew" argue with each other in circles. Fine for internal tools, risky for client-facing systems.
Claude Agent SDK
Anthropic's Agent SDK is relatively new but has become our go-to for simpler agent workflows. It integrates tightly with Claude's tool-use capabilities and the Model Context Protocol.
What we love: Clean API, excellent tool-use reliability with Claude models, built-in guardrails, great TypeScript support. The handoff pattern between agents is elegant.
What burns us: Locked into Anthropic's ecosystem. If you need multi-model support or want to swap in GPT for certain tasks, you are writing custom adapters.
AutoGen
Microsoft's AutoGen framework supports multi-agent conversations with code execution.
What we love: Good for research and experimentation. The code execution sandbox is genuinely useful for data analysis agents.
What burns us: Production readiness is questionable. We have had stability issues in long-running workflows. The multi-agent conversation pattern can be unpredictable with complex tasks.
Our Honest Comparison
| Framework | Production Ready | Learning Curve | Debugging | Multi-Model | Best For |
|---|---|---|---|---|---|
| LangGraph | 9/10 | Steep (2-3 weeks) | Excellent | Yes | Complex workflows |
| CrewAI | 5/10 | Easy (2-3 days) | Limited | Yes | Prototypes, internal tools |
| Claude Agent SDK | 8/10 | Moderate (1 week) | Good | No (Claude only) | Claude-native apps |
| AutoGen | 4/10 | Moderate | Fair | Yes | Research, data analysis |
Our default choice: LangGraph for complex multi-step workflows. Claude Agent SDK for simpler agent systems where we are already using Claude. We almost never recommend CrewAI or AutoGen for production client work anymore.
The 5 Patterns That Actually Work in Production
After all those failures, we distilled our approach into five non-negotiable patterns. Every agent system we build includes all five.
Pattern 1: Human-in-the-Loop Checkpoints
This is the single most important pattern. Full stop.
The fantasy of fully autonomous agents is exactly that -- a fantasy. In production, you need explicit points where a human reviews and approves the agent's work before it takes irreversible actions.
Here is how we implement it in LangGraph:
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from typing import TypedDict, Literal
class AgentState(TypedDict):
task: str
research_results: list[str]
draft_output: str
human_approved: bool
final_output: str
def research_node(state: AgentState) -> AgentState:
"""Agent does research -- no approval needed."""
results = perform_research(state["task"])
return {"research_results": results}
def draft_node(state: AgentState) -> AgentState:
"""Agent drafts output -- this goes to human review."""
draft = generate_draft(state["research_results"])
return {"draft_output": draft, "human_approved": False}
def should_continue(state: AgentState) -> Literal["execute", "wait_for_human"]:
"""Route based on whether human has approved."""
if state.get("human_approved"):
return "execute"
return "wait_for_human"
def execute_node(state: AgentState) -> AgentState:
"""Only runs after human approval."""
result = execute_action(state["draft_output"])
return {"final_output": result}
# Build the graph with an interrupt point
graph = StateGraph(AgentState)
graph.add_node("research", research_node)
graph.add_node("draft", draft_node)
graph.add_node("execute", execute_node)
graph.set_entry_point("research")
graph.add_edge("research", "draft")
graph.add_conditional_edges(
    "draft",
    should_continue,
    # There is no "wait_for_human" node -- route that branch to END and
    # resume from the checkpoint once a human sets human_approved
    {"execute": "execute", "wait_for_human": END},
)
graph.add_edge("execute", END)
# The interrupt_before tells LangGraph to pause before execute
app = graph.compile(
checkpointer=MemorySaver(),
interrupt_before=["execute"]
)

The key insight: We categorize every agent action as either "safe" (research, summarizing, drafting) or "dangerous" (sending emails, updating databases, making API calls, spending money). Safe actions run autonomously. Dangerous actions always hit a checkpoint.
For one fintech client, we built an AI research agent that could analyze market data autonomously but required human approval before generating any client-facing reports. This single pattern prevented three incidents in the first month where the agent would have sent reports with incorrect data.
Pattern 2: Structured Output Validation
LLMs generate text. But your systems need structured data. The gap between "the model said the right thing" and "the model returned valid JSON with all required fields" is where agents break.
We enforce structured outputs at every boundary:
from pydantic import BaseModel, Field, field_validator
from typing import Optional
import json
class ResearchResult(BaseModel):
query: str
sources: list[str] = Field(min_length=1)
summary: str = Field(min_length=50, max_length=2000)
confidence: float = Field(ge=0.0, le=1.0)
key_findings: list[str] = Field(min_length=1, max_length=10)
    @field_validator("sources")
    @classmethod
    def validate_sources(cls, v):
for source in v:
if not source.startswith("http"):
raise ValueError(f"Invalid source URL: {source}")
return v
class ToolCallResult(BaseModel):
tool_name: str
success: bool
result: Optional[dict] = None
error: Optional[str] = None
retry_count: int = 0
def validate_agent_output(raw_output: str, schema: type[BaseModel]) -> BaseModel:
"""Validate and parse agent output with retry logic."""
try:
parsed = json.loads(raw_output)
return schema(**parsed)
except (json.JSONDecodeError, ValueError) as e:
# Ask the model to fix its output
correction_prompt = f"""
Your previous output was invalid. Error: {str(e)}
Please fix and return valid JSON matching this schema:
{schema.model_json_schema()}
Original output: {raw_output}
"""
corrected = call_llm(correction_prompt)
        return schema(**json.loads(corrected))

Critical detail: We give the model exactly 2 chances to produce valid output. If it fails twice, we log the failure, return a structured error, and let the calling system handle it. No infinite retry loops. This validation layer catches about 15% of agent outputs that would otherwise cause downstream failures.
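The two-chances rule can be sketched as a small bounded loop, independent of any particular schema. Here `parse_fn` and `fix_fn` are stand-ins for your parser and the correction-prompt call -- not part of any library API:

```python
def validate_with_budget(raw_output, parse_fn, fix_fn, max_attempts=2):
    """Bounded validation: parse, ask the model to fix its output once,
    then give up with a structured error. Never loop forever."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return {"ok": True, "value": parse_fn(raw_output)}
        except Exception as e:
            last_error = str(e)
            if attempt + 1 < max_attempts:
                # One repair round-trip: hand the error back to the model
                raw_output = fix_fn(raw_output, last_error)
    return {"ok": False, "error": last_error, "attempts": max_attempts}
```

In practice you would call it as `validate_with_budget(raw, json.loads, ask_model_to_fix)`, where `ask_model_to_fix` wraps the correction prompt shown above.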
Pattern 3: Tool Call Retry with Exponential Backoff
Tools fail. APIs time out. Rate limits hit. Your agent needs to handle this gracefully, not crash or hallucinate a response.
import asyncio
import logging
from functools import wraps
logger = logging.getLogger(__name__)
def resilient_tool(max_retries: int = 3, base_delay: float = 1.0):
"""Decorator for agent tools with retry logic."""
def decorator(func):
@wraps(func)
async def wrapper(*args, **kwargs):
last_error = None
for attempt in range(max_retries):
try:
result = await func(*args, **kwargs)
return ToolCallResult(
tool_name=func.__name__,
success=True,
result=result,
retry_count=attempt
)
except RateLimitError:
delay = base_delay * (2 ** attempt)
logger.warning(
f"Rate limited on {func.__name__}, "
f"retry {attempt + 1}/{max_retries} "
f"after {delay}s"
)
await asyncio.sleep(delay)
last_error = "Rate limited"
except TimeoutError:
delay = base_delay * (2 ** attempt)
logger.warning(
f"Timeout on {func.__name__}, "
f"retry {attempt + 1}/{max_retries}"
)
await asyncio.sleep(delay)
last_error = "Timeout"
except Exception as e:
logger.error(
f"Unexpected error in {func.__name__}: {e}"
)
last_error = str(e)
break # Don't retry unexpected errors
return ToolCallResult(
tool_name=func.__name__,
success=False,
error=last_error,
retry_count=max_retries
)
return wrapper
return decorator
@resilient_tool(max_retries=3, base_delay=2.0)
async def search_database(query: str) -> dict:
"""Example tool with built-in resilience."""
results = await db.execute(query)
    return {"matches": results, "count": len(results)}

What we learned the hard way: Without this pattern, our fintech research agent would crash on the third API call when the data provider's rate limiter kicked in. With it, the agent gracefully retries and completes its work. The retry count in the result also lets us monitor tool reliability and catch degrading APIs before they become outages.
Pattern 4: Memory Management and Context Pruning
This is the pattern most teams skip, and it is why their agents work in testing but fail in production.
LLMs have context windows. Even Claude's 200K tokens run out when your agent has been running for 30 steps, each with tool calls and results. You need an active memory management strategy.
from typing import TypedDict
class ManagedMemory:
def __init__(self, max_tokens: int = 100000):
self.max_tokens = max_tokens
self.short_term: list[dict] = [] # Recent messages
self.long_term: list[str] = [] # Summarized history
self.facts: dict[str, str] = {} # Extracted key facts
def add_interaction(self, role: str, content: str):
"""Add a new interaction, pruning if necessary."""
self.short_term.append({
"role": role,
"content": content
})
current_tokens = self._estimate_tokens()
if current_tokens > self.max_tokens * 0.8:
self._prune()
def _prune(self):
"""Summarize old interactions and extract key facts."""
if len(self.short_term) < 4:
return
# Take the oldest half of short-term memory
to_summarize = self.short_term[:len(self.short_term) // 2]
self.short_term = self.short_term[len(self.short_term) // 2:]
# Summarize and extract facts
summary = summarize_interactions(to_summarize)
self.long_term.append(summary)
new_facts = extract_key_facts(to_summarize)
self.facts.update(new_facts)
def get_context(self) -> str:
"""Build the context for the next LLM call."""
context_parts = []
if self.facts:
context_parts.append(
"KEY FACTS:\n" +
"\n".join(f"- {k}: {v}" for k, v in self.facts.items())
)
if self.long_term:
context_parts.append(
"PREVIOUS CONTEXT:\n" +
"\n---\n".join(self.long_term[-3:]) # Last 3 summaries
)
context_parts.append(
"RECENT INTERACTIONS:\n" +
"\n".join(
f"{m['role']}: {m['content']}"
for m in self.short_term
)
)
return "\n\n".join(context_parts)
def _estimate_tokens(self) -> int:
"""Rough token estimate (4 chars per token)."""
total_chars = sum(
len(m["content"]) for m in self.short_term
)
total_chars += sum(len(s) for s in self.long_term)
total_chars += sum(
len(k) + len(v) for k, v in self.facts.items()
)
        return total_chars // 4

The three-tier approach: We keep recent interactions in full (short-term), summarize older interactions (long-term), and extract immutable facts (key facts like user preferences, established constraints, confirmed data points). This lets agents run for hundreds of steps without losing important context.
For a client's customer support agent that handled complex multi-turn troubleshooting, this pattern reduced context-related errors by 73% and cut token costs by 40%.
Pattern 5: Graceful Degradation
When an agent fails, it should not just crash. It should fail in a way that is useful.
class DegradationStrategy:
"""Define what to do when different parts of the agent fail."""
def __init__(self):
self.fallbacks = {}
def register_fallback(self, capability: str, fallback_fn):
self.fallbacks[capability] = fallback_fn
async def execute_with_fallback(
self,
capability: str,
primary_fn,
*args,
**kwargs
):
try:
return await primary_fn(*args, **kwargs)
except Exception as e:
logger.warning(
f"Primary {capability} failed: {e}. "
f"Using fallback."
)
if capability in self.fallbacks:
try:
return await self.fallbacks[capability](
*args, **kwargs
)
except Exception as fallback_error:
logger.error(
f"Fallback for {capability} also failed: "
f"{fallback_error}"
)
# Return a structured "I couldn't do this" response
return {
"status": "degraded",
"capability": capability,
"message": (
f"I was unable to complete the "
f"{capability} step. "
f"Here is what I was trying to do and "
f"what you can do manually: ..."
),
"error": str(e),
"manual_steps": get_manual_instructions(capability)
}
# Usage
strategy = DegradationStrategy()
# If real-time data fails, use cached data
strategy.register_fallback(
"market_data",
fetch_cached_market_data
)
# If AI summary fails, return raw data with a template
strategy.register_fallback(
"summarize",
return_raw_with_template
)

The principle: An agent that says "I could not complete step 3, but here is what I did complete and here is how you can finish manually" is infinitely more useful than an agent that silently fails or returns garbage.
The Pitfalls That Will Burn You
Let me walk you through the failures we have seen so you do not have to repeat them.
Pitfall 1: Too Many Tools Confuse the Agent
We built an agent for a client with 23 available tools. It was a disaster. The agent would pick the wrong tool 30% of the time, sometimes calling a "delete" function when it meant to call "archive."
The fix: Limit any single agent to 7-10 tools maximum. If you need more capabilities, use a multi-agent architecture where a router agent delegates to specialized sub-agents with focused tool sets.
# BAD: One agent with everything
tools = [
search, create, update, delete, archive,
restore, export, import_, analyze, summarize,
email, slack, sms, schedule, remind,
format, validate, transform, enrich,
compare, merge, split, filter
] # 23 tools -- the agent will be confused
# GOOD: Router + specialized agents
router_tools = [
delegate_to_research_agent,
delegate_to_action_agent,
delegate_to_communication_agent,
respond_to_user
] # 4 tools -- clear routing decisions
research_agent_tools = [search, analyze, summarize, compare]
action_agent_tools = [create, update, archive, transform]
communication_agent_tools = [email, slack, schedule]

Pitfall 2: No Error Handling for Tool Failures
This one seems obvious but we see it in almost every agent codebase we audit. The agent calls a tool, the tool fails, and the agent either crashes or -- worse -- hallucinates a response as if the tool succeeded.
The fix: Every tool call must return a structured result (success/failure), and the agent's prompt must explicitly instruct it to handle failures.
You have access to the following tools. When a tool call fails,
you MUST:
1. Report the failure to the user
2. Explain what you were trying to do
3. Suggest an alternative approach or manual workaround
4. Do NOT make up or guess the result

Pitfall 3: Hallucinated Function Calls
This is terrifying. The agent "calls" a function that does not exist, or calls a real function with completely fabricated parameters.
We had an agent try to call database.execute_raw_sql("DROP TABLE users") -- a function that existed in the tools but with completely hallucinated parameters. Thank god for our parameter validation layer.
The fix: Validate every parameter of every tool call against a strict schema. Never let raw LLM output reach your systems without validation. Use Pydantic models or JSON Schema validation at the tool boundary.
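As a minimal illustration of validation at the tool boundary, here is a stdlib-only sketch; in production we use Pydantic models as described above, and the `archive_record` tool and its schema are hypothetical:

```python
# Registry of allowed tools and their parameter schemas. Anything the
# model invents -- unknown tools, extra parameters, wrong types -- is
# rejected before it can touch a real system.
TOOL_SCHEMAS = {
    "archive_record": {"record_id": int, "reason": str},
}

def validate_tool_call(tool_name, raw_args):
    """Return (ok, error). Nothing hallucinated gets through."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return False, f"Unknown tool: {tool_name}"
    unknown = set(raw_args) - set(schema)
    if unknown:
        return False, f"Hallucinated parameters: {sorted(unknown)}"
    missing = set(schema) - set(raw_args)
    if missing:
        return False, f"Missing parameters: {sorted(missing)}"
    for name, expected_type in schema.items():
        if not isinstance(raw_args[name], expected_type):
            return False, f"Wrong type for parameter: {name}"
    return True, None
```

The same shape works with Pydantic (`extra="forbid"` plus `model_validate`) or JSON Schema; the point is that the check runs before execution, not after.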
Pitfall 4: Infinite Loops
An agent gets stuck in a cycle: tries something, fails, tries the same thing, fails, tries again. We once had an agent rack up $2,400 in API costs overnight because it was stuck trying to parse a malformed PDF.
The fix: Hard limits on everything.
import time

MAX_STEPS = 25  # Total steps per task
MAX_RETRIES = 3 # Retries per tool call
MAX_TOKENS = 500000 # Total token budget per task
MAX_DURATION = 300 # 5 minutes max wall clock time
MAX_COST = 5.00 # $5 max spend per task
class AgentGuardrails:
def __init__(self):
self.step_count = 0
self.total_tokens = 0
self.total_cost = 0.0
self.start_time = time.time()
def check(self) -> tuple[bool, str]:
"""Returns (can_continue, reason_if_not)."""
if self.step_count >= MAX_STEPS:
return False, f"Hit step limit ({MAX_STEPS})"
if self.total_tokens >= MAX_TOKENS:
return False, f"Hit token limit ({MAX_TOKENS})"
if self.total_cost >= MAX_COST:
return False, f"Hit cost limit (${MAX_COST})"
elapsed = time.time() - self.start_time
if elapsed >= MAX_DURATION:
return False, f"Hit time limit ({MAX_DURATION}s)"
        return True, ""

Pitfall 5: Cost Explosions
Related to infinite loops, but broader. Every LLM call costs money. Every tool call might cost money (API fees, compute). Without budgeting, a single malfunctioning agent can destroy your monthly budget.
The fix: Token-level budgeting per task. We track input tokens, output tokens, and tool call costs separately. We alert at 50% budget, warn at 80%, and hard-stop at 100%.
| Guard | Threshold | Action |
|---|---|---|
| Step count | 25 steps | Terminate with summary |
| Token budget | 500K tokens | Terminate with summary |
| Cost budget | $5 per task | Terminate with summary |
| Wall clock | 5 minutes | Terminate with summary |
| Retry limit | 3 per tool | Skip tool, report failure |
| Error rate | >50% tool failures | Pause and alert human |
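The alert/warn/hard-stop thresholds can be sketched as a per-task budget tracker. The per-1K-token prices below are placeholders, not real rate cards:

```python
class CostBudget:
    """Per-task spend tracker: alert at 50%, warn at 80%, hard-stop at 100%."""

    def __init__(self, limit_usd: float = 5.00,
                 input_price_per_1k: float = 0.003,
                 output_price_per_1k: float = 0.015):
        self.limit = limit_usd
        self.input_price = input_price_per_1k
        self.output_price = output_price_per_1k
        self.spent = 0.0

    def record(self, input_tokens: int, output_tokens: int,
               tool_cost_usd: float = 0.0) -> str:
        """Record one step's spend and return the action to take."""
        self.spent += (input_tokens / 1000) * self.input_price
        self.spent += (output_tokens / 1000) * self.output_price
        self.spent += tool_cost_usd
        fraction = self.spent / self.limit
        if fraction >= 1.0:
            return "hard_stop"
        if fraction >= 0.8:
            return "warn"
        if fraction >= 0.5:
            return "alert"
        return "ok"
```

Tracking input tokens, output tokens, and tool costs separately (rather than one lump sum) is what lets you see *which* component is eating the budget when an agent misbehaves.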
Real CODERCOPS Example: The Fintech Research Agent
Let me walk you through a real system we built. The client is a fintech company that needed to analyze market data, SEC filings, and news articles to generate daily research reports for their portfolio managers.
The Requirements
- Analyze 50-100 data sources daily
- Cross-reference information across sources
- Generate structured reports with citations
- Flag anomalies and significant changes
- Cost target: under $50/day in API costs
- Accuracy target: 95%+ on factual claims
The Architecture
We built a multi-agent system using LangGraph:
[Scheduler Agent]
|
├── [Data Collection Agent]
| ├── SEC Filing Tool
| ├── Market Data API
| └── News Search Tool
|
├── [Analysis Agent]
| ├── Cross-reference Tool
| ├── Anomaly Detection Tool
| └── Trend Analysis Tool
|
├── [Report Generation Agent]
| ├── Template Engine
| ├── Citation Formatter
| └── Chart Generator
|
└── [Quality Check Agent]
├── Fact Verification Tool
├── Consistency Checker
    └── Human Review Queue

What Worked
- The multi-agent split kept each agent focused. The Data Collection Agent had 4 tools, not 20.
- The Quality Check Agent caught 12% of factual errors before they reached portfolio managers.
- Checkpointing let us resume from the Analysis step when the Market Data API had a 2-hour outage, instead of re-running everything.
- Cost budgets kept daily spending to $35-45, well under the $50 target.
What We Had to Fix
- The Analysis Agent initially tried to analyze all 100 sources at once -- context window overflow. We switched to batch processing (10 sources at a time) with incremental summarization.
- Citation accuracy was initially 78% -- the agent would sometimes attribute findings to the wrong source. We fixed this by including source IDs in the structured output and validating them against the actual data.
- The Scheduler Agent would sometimes skip sources it deemed "not relevant" based on the headline. We added a rule that all sources must be at least skimmed before being excluded.
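The batching fix above reduces to a simple fold. Here `summarize_batch` stands in for the actual LLM summarization call:

```python
def analyze_in_batches(sources, summarize_batch, batch_size=10):
    """Incremental summarization: the context window only ever holds one
    batch plus the running summary, never all 100 sources at once."""
    running_summary = ""
    for i in range(0, len(sources), batch_size):
        batch = sources[i:i + batch_size]
        # Each call folds the new batch into the accumulated summary
        running_summary = summarize_batch(running_summary, batch)
    return running_summary
```

The trade-off is that early sources are only represented through summaries by the end of the run, which is why the key-facts tier from Pattern 4 matters here too.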
The Cost Breakdown
| Component | Daily Cost | % of Total |
|---|---|---|
| Data Collection (Claude Haiku) | $8.50 | 22% |
| Analysis (Claude Sonnet) | $18.00 | 46% |
| Report Generation (Claude Sonnet) | $6.00 | 15% |
| Quality Check (Claude Sonnet) | $4.50 | 12% |
| Infrastructure (AWS Lambda + S3) | $2.00 | 5% |
| Total | $39.00 | 100% |
Key cost optimization: We use Claude Haiku for data collection (high volume, low complexity) and Claude Sonnet for analysis and report generation (lower volume, high complexity). Using Sonnet for everything would have cost $85/day.
When NOT to Use Agents
This might be the most valuable section of this entire post. Not every problem needs an agent. In fact, most don't.
Use a simple prompt chain when:
- The steps are fixed and predictable
- There is no branching logic or decision-making
- Each step's output directly feeds the next step
- You do not need the system to "figure out" what to do next
Use a single LLM call when:
- The task fits in one prompt
- You are basically doing text transformation
- The context window is big enough for all your input
Use an agent when:
- The number and order of steps is unpredictable
- The system needs to make decisions based on intermediate results
- Tools might fail and the system needs to adapt
- Human judgment is needed at certain points
- The task involves interacting with multiple external systems
Simple prompt chain: "Summarize this document, then translate
to Spanish, then format as PDF"
→ 3 fixed steps, no decisions needed
Agent needed: "Research this company, determine if they're a
good acquisition target, flag any red flags, and
prepare a briefing. Use whatever sources and
analysis you need."
           → Unknown steps, decisions, multiple tools

A real example: A client came to us wanting an "AI agent" to process invoices. The workflow was: extract data from PDF, validate against their database, flag discrepancies, route for approval. Four steps, always the same, no decisions. We built it as a simple pipeline with structured extraction. It took 3 days instead of 3 weeks, costs 90% less to run, and is far more reliable than an agent would have been.
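The "fixed steps, no decisions" case really is just a pipeline, which is why it is cheaper and more reliable than an agent loop. A sketch, with toy steps standing in for real extraction and validation functions:

```python
def run_chain(steps, initial_input):
    """A prompt chain is a pipeline: each step's output feeds the next,
    with no routing, no tool selection, and no agent loop."""
    result = initial_input
    for step in steps:
        result = step(result)
    return result
```

The invoice system above is this pattern with four steps (extract, validate, flag, route); because the control flow is fixed, every failure mode is a plain exception in a known place rather than an agent improvising.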
Our Production Checklist
Before we deploy any agent system, we run through this checklist:
- Every dangerous action has a human-in-the-loop checkpoint
- All outputs are validated against Pydantic schemas
- Every tool call has retry logic with exponential backoff
- Memory management with context pruning is implemented
- Graceful degradation paths exist for every tool failure
- Hard limits on steps, tokens, cost, and time are enforced
- No single agent has more than 10 tools
- Error rate monitoring and alerting is configured
- Cost tracking with per-task budgets is in place
- The system has been tested with adversarial inputs
- Logging captures full agent trajectories for debugging
- A "kill switch" exists to shut down the agent immediately
The Bigger Picture
AI agents are not magic. They are software systems with probabilistic components. The same engineering discipline that makes traditional software reliable -- error handling, validation, monitoring, testing, graceful degradation -- is what makes agents reliable.
The teams that are shipping successful agent systems are not the ones with the fanciest frameworks or the biggest models. They are the ones with the best engineering practices around those models.
If you are building agent systems and hitting walls, that is normal. We hit those same walls. The patterns in this post are how we got past them.
Ready to Build an Agent System That Actually Works?
At CODERCOPS, we have been through the pain of building production AI agent systems so our clients do not have to. Whether you need a research agent, a workflow automation agent, or a customer-facing AI system, we bring the patterns and guardrails that turn prototypes into production.
If you are evaluating agent architectures or struggling with reliability, let's talk. We will give you an honest assessment of whether you actually need an agent (many clients don't) and the fastest path to production if you do.
Check out our other posts on AI integration patterns for more production-tested approaches to building with LLMs.