What I Actually Learned Building with 5 Agentic AI Frameworks
I explored OpenAI, CrewAI, LangGraph, Autogen, and MCP frameworks by building real projects. This post shares insights, pros, cons, and use cases to help you pick the right one.

TL;DR
I spent weeks building projects with five different Agentic AI frameworks: OpenAI Agents SDK, CrewAI, LangGraph, Autogen, and MCP. Here's the honest truth: there's no perfect framework. Each one clicked for different reasons.
OpenAI SDK and CrewAI were the easiest to pick up. I was productive almost immediately. LangGraph gave me the most control when I needed to map out exactly how my agent should think. MCP solved a real headache when I needed to plug into multiple external services. Autogen? Still wrapping my head around it, but its distributed architecture is seriously impressive for the right use case.
Bottom line: pick the framework that matches your problem, not the one with the most hype.
Introduction
"Which framework should I use?"
I kept asking myself this question when I started diving into agentic AI. The landscape is crowded, and everyone seems to have a strong opinion. So instead of reading more blog posts, I decided to actually build things.
Here's what I worked on:
- OpenAI Agents SDK - A deep research system with output guardrails
- CrewAI - A stock picker with a hierarchical team of agents
- LangGraph - A "Sidekick" assistant that browses the web and evaluates its own work
- Autogen - A world where agents create other agents dynamically
- MCP - A trading floor where AI traders connect to market data, accounts, and search services
What follows is what I actually learned—the good parts, the frustrating bits, and when you'd actually want to use each one.
1. OpenAI Agents SDK — When You Just Want It to Work
The Project: Deep Research with Guardrails
I built a research assistant that plans searches, runs them in parallel, and writes up findings. The twist? It has guardrails that check the output quality before returning results.
What the Code Looks Like
Setting up an agent is dead simple:
from agents import Agent, WebSearchTool, ModelSettings

search_agent = Agent(
    name="Search agent",
    instructions=INSTRUCTIONS,
    tools=[WebSearchTool(search_context_size="low")],
    model="gpt-4o-mini",
    model_settings=ModelSettings(tool_choice="required"),
)

That's it. No ceremony, no boilerplate maze. I defined what I wanted, and it worked.
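The "runs them in parallel" part is ordinary asyncio fan-out. Here's a self-contained sketch of the pattern; `run_search` is a stand-in for the real `Runner.run(search_agent, item.query)` call, and the plan items are illustrative:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class WebSearchItem:
    reason: str
    query: str

async def run_search(item: WebSearchItem) -> str:
    # Stand-in for `await Runner.run(search_agent, item.query)` in the real project.
    await asyncio.sleep(0)
    return f"summary for: {item.query}"

async def perform_searches(plan: list[WebSearchItem]) -> list[str]:
    # Fire every search at once; gather preserves the plan's order.
    return await asyncio.gather(*(run_search(item) for item in plan))

plan = [
    WebSearchItem(reason="context", query="agentic AI frameworks"),
    WebSearchItem(reason="depth", query="OpenAI Agents SDK guardrails"),
]
results = asyncio.run(perform_searches(plan))
```

Each search is its own coroutine, so slow queries don't block fast ones.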
For structured outputs, Pydantic plays nicely:
from pydantic import BaseModel, Field

class WebSearchItem(BaseModel):
    reason: str = Field(description="Why this search matters.")
    query: str = Field(description="The actual search term.")

class WebSearchPlan(BaseModel):
    searches: list[WebSearchItem]

planner_agent = Agent(
    name="PlannerAgent",
    instructions=INSTRUCTIONS,
    model="gpt-4o-mini",
    output_type=WebSearchPlan,  # Type-safe outputs!
)

The Guardrail Feature Stood Out
This is where things got interesting. I could attach guardrails that validate outputs before they're returned:
from agents import output_guardrail, GuardrailFunctionOutput, RunContextWrapper

@output_guardrail
async def guardrail_output_length(ctx: RunContextWrapper, agent: Agent, output: ReportData) -> GuardrailFunctionOutput:
    result = await Runner.run(guardrail_agent, output.markdown_report, context=ctx.context)
    return GuardrailFunctionOutput(
        output_info={"output_too_long": result.final_output.reason},
        tripwire_triggered=False,
    )

writer_agent = Agent(
    name="WriterAgent",
    output_type=ReportData,
    output_guardrails=[guardrail_output_length],
)

I used this to check whether reports were too long, but you could validate anything: tone, accuracy, or safety. It's baked into the framework, not bolted on.
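Stripped of the SDK types, the tripwire idea is just "run a check, return a verdict, optionally halt". A minimal sketch of the concept (the word limit and dict shape are mine, not the SDK's):

```python
MAX_WORDS = 1000  # arbitrary threshold for illustration

def length_guardrail(report: str) -> dict:
    # Mirrors the shape of GuardrailFunctionOutput: some diagnostic info plus
    # a tripwire flag that, when True, stops the result from being returned.
    too_long = len(report.split()) > MAX_WORDS
    return {"output_info": {"output_too_long": too_long}, "tripwire_triggered": too_long}
```

The framework's version wraps this in types and runs it automatically on every output; the logic you write stays this small.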
The Good Stuff
- Fast to learn — I was building within an hour
- Structured outputs — Pydantic integration means type safety without extra work
- Guardrails — Built-in quality control with tripwire mechanisms
- Tracing — Debugging with trace IDs saved me hours
- Async-first — Parallel operations just work
The Rough Edges
- You're in OpenAI's world — Primarily designed for their models
- Fewer knobs to turn — When I wanted fine control, I hit limits
- Moving target — The SDK is newer, docs sometimes lag
When to Reach for It
If you need a research assistant, content generator, or anything that benefits from output validation, and you're already using OpenAI, this framework gets out of your way.
2. CrewAI — Thinking in Teams
The Project: Stock Picker
I wanted a system where different agents have different jobs: one finds trending companies, another researches them, and a final one picks the best investment. CrewAI's "crew" metaphor made this natural to build.
What the Code Looks Like
CrewAI uses decorators and YAML config, which keeps things organized:
from crewai import Agent, Crew, Process, Task
from crewai.project import CrewBase, agent, crew, task

@CrewBase
class StockPicker:
    agents_config = "config/agents.yaml"
    tasks_config = "config/tasks.yaml"

    @agent
    def trending_company_finder(self) -> Agent:
        return Agent(
            config=self.agents_config["trending_company_finder"],
            tools=[SerperDevTool()],
            memory=True,
        )

    @agent
    def financial_researcher(self) -> Agent:
        return Agent(
            config=self.agents_config["financial_researcher"],
            tools=[SerperDevTool()],
        )

    @task
    def find_trending_companies(self) -> Task:
        return Task(
            config=self.tasks_config["find_trending_companies"],
            output_pydantic=TrendingCompanyList,
        )

YAML Config Changed How I Worked
Being able to tweak agent personalities without touching Python felt liberating:
trending_company_finder:
  role: >
    Financial News Analyst that finds trending companies in {sector}
  goal: >
    You read the latest news, then find 2-3 companies that are trending.
  backstory: >
    You are a market expert with a knack for picking out interesting companies.
  llm: openai/gpt-4o-mini
I could iterate on prompts fast, and non-technical teammates could understand what each agent does.
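Under the hood, `kickoff(inputs={...})` fills the `{sector}` placeholders in the YAML before the agents run. Conceptually it's plain string formatting; the dict below stands in for the parsed YAML file:

```python
# Stand-in for the parsed agents.yaml (illustrative, not the full config):
agents_config = {
    "trending_company_finder": {
        "role": "Financial News Analyst that finds trending companies in {sector}",
    }
}

def render(template: str, inputs: dict) -> str:
    # What CrewAI does conceptually with the inputs passed to kickoff().
    return template.format(**inputs)

role = render(agents_config["trending_company_finder"]["role"], {"sector": "Technology"})
```

That's why the same crew can analyze any sector: the prompts are templates, and the run-time inputs fill the blanks.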
The Memory System is No Joke
This blew me away. CrewAI has three types of memory built in:
from crewai.memory import LongTermMemory, ShortTermMemory, EntityMemory
from crewai.memory.storage.rag_storage import RAGStorage
from crewai.memory.storage.ltm_sqlite_storage import LTMSQLiteStorage

@crew
def crew(self) -> Crew:
    return Crew(
        agents=self.agents,
        tasks=self.tasks,
        process=Process.hierarchical,
        manager_agent=manager,
        memory=True,
        long_term_memory=LongTermMemory(
            storage=LTMSQLiteStorage(db_path="./memory/long_term_memory_storage.db")
        ),
        short_term_memory=ShortTermMemory(
            storage=RAGStorage(
                embedder_config={"provider": "openai", "config": {"model": "text-embedding-3-small"}},
                type="short_term",
                path="./memory/",
            )
        ),
        entity_memory=EntityMemory(
            storage=RAGStorage(
                embedder_config={"provider": "openai", "config": {"model": "text-embedding-3-small"}},
                type="short_term",
                path="./memory/",
            )
        ),
    )

Long-term memory persists across sessions. Short-term uses RAG. Entity memory tracks key information about people, companies, and concepts. This is enterprise-grade stuff.
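To make "persists across sessions" concrete: long-term memory is ultimately rows in a SQLite file that outlive the process. The sketch below shows the idea only; CrewAI's `LTMSQLiteStorage` has its own schema, and the table here is made up:

```python
import sqlite3

# Illustrative schema, not CrewAI's actual one. The real store writes to a
# file (e.g. long_term_memory_storage.db) so a new process can read it back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memories (task TEXT, result TEXT)")
conn.execute("INSERT INTO memories VALUES (?, ?)", ("find_trending_companies", "remembered result"))
conn.commit()

row = conn.execute(
    "SELECT result FROM memories WHERE task = ?", ("find_trending_companies",)
).fetchone()
```

Swap `:memory:` for a file path and the "memory" survives restarts, which is exactly what lets a crew recall last week's picks.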
The Good Stuff
- Role-based thinking — Matches how we actually organize teams
- YAML separation — Config lives apart from logic
- Hierarchical processes — Manager-worker patterns built in
- Memory that actually works — Long-term, short-term, entity—all covered
- Solid tool ecosystem — SerperDev and others ready to go
The Rough Edges
- Opinionated — Fighting the structure gets painful
- Debugging hierarchies — When the manager delegates incorrectly, good luck
- Resource hungry — Memory systems add overhead
When to Reach for It
If you're building something that naturally maps to a team of analysts, researchers, writers, and reviewers, CrewAI makes sense. Especially if you need agents to remember things across sessions.
3. LangGraph — When You Need a Map
The Project: Sidekick Assistant
I built an assistant that browses the web, runs Python code, and (here's the key part) evaluates its own work, looping back if the result isn't good enough. LangGraph's graph structure made this feedback loop explicit.
What the Code Looks Like
You literally define nodes and edges:
from typing import Annotated, Any, List, Optional, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

class State(TypedDict):
    messages: Annotated[List[Any], add_messages]
    success_criteria: str
    feedback_on_work: Optional[str]
    success_criteria_met: bool
    user_input_needed: bool

async def build_graph(self):
    graph_builder = StateGraph(State)

    # Nodes = things that happen
    graph_builder.add_node("worker", self.worker)
    graph_builder.add_node("tools", ToolNode(tools=self.tools))
    graph_builder.add_node("evaluator", self.evaluator)

    # Edges = how we move between them
    graph_builder.add_conditional_edges(
        "worker",
        self.worker_router,
        {"tools": "tools", "evaluator": "evaluator"},
    )
    graph_builder.add_edge("tools", "worker")
    graph_builder.add_conditional_edges(
        "evaluator",
        self.route_based_on_evaluation,
        {"worker": "worker", "END": END},
    )
    graph_builder.add_edge(START, "worker")

    self.graph = graph_builder.compile(checkpointer=self.memory)

Reading this, I can see the flow. The worker does something, maybe uses tools, then the evaluator checks it. If it's not good, back to the worker. If it's good, we're done.
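The two router functions aren't shown above. Here's a minimal sketch of what they can look like; the field names match the `State` definition, but the exact logic is my reconstruction, not the project's code:

```python
def worker_router(state: dict) -> str:
    # If the LLM's last message requested tool calls, run the tools node;
    # otherwise hand the draft to the evaluator.
    last_message = state["messages"][-1]
    if getattr(last_message, "tool_calls", None):
        return "tools"
    return "evaluator"

def route_based_on_evaluation(state: dict) -> str:
    # Finish when the success criteria are met or a human needs to step in;
    # otherwise loop back to the worker, which will see the evaluator's feedback.
    if state["success_criteria_met"] or state["user_input_needed"]:
        return "END"
    return "worker"
```

The strings these return are the keys in the `add_conditional_edges` mappings, which is how the graph knows where to go next.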
Self-Evaluation Was the Killer Feature
The evaluator pattern changed how I think about agent quality:
class EvaluatorOutput(BaseModel):
    feedback: str = Field(description="What needs improvement")
    success_criteria_met: bool
    user_input_needed: bool

def evaluator(self, state: State) -> State:
    eval_result = self.evaluator_llm_with_output.invoke(evaluator_messages)
    return {
        "messages": [...],
        "feedback_on_work": eval_result.feedback,
        "success_criteria_met": eval_result.success_criteria_met,
        "user_input_needed": eval_result.user_input_needed,
    }

The agent doesn't just try once; it keeps going until it meets the criteria or needs human help.
Tool Integration Was Rich
Since LangGraph sits on LangChain, I got access to everything:
from playwright.async_api import async_playwright
from langchain_community.agent_toolkits import PlayWrightBrowserToolkit
from langchain_community.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper
from langchain_core.tools import Tool
from langchain_experimental.tools import PythonREPLTool

async def playwright_tools():
    playwright = await async_playwright().start()
    browser = await playwright.chromium.launch(headless=False)
    toolkit = PlayWrightBrowserToolkit.from_browser(async_browser=browser)
    return toolkit.get_tools(), browser, playwright

async def other_tools():
    return [
        Tool(name="send_push_notification", func=push, description="..."),
        Tool(name="search", func=serper.run, description="..."),
        PythonREPLTool(),
        WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper()),
    ]

Browser automation, search, Python execution, and Wikipedia are all plugged in.
The Good Stuff
- Visual thinking — I could draw the flow, then code it
- Total control — Every state transition is explicit
- Built-in checkpointing — Conversation memory just works
- Conditional routing — Dynamic paths based on state
- LangChain ecosystem — Huge tool library available
- Testable — Each node works independently
The Rough Edges
- Conceptual overhead — You need to grok state machines
- More typing — It's verbose compared to simpler frameworks
- Learning curve — Took me longer to feel comfortable
When to Reach for It
If your agent needs to make decisions, branch, loop back, or follow a specific workflow, especially one you need to visualize or debug, LangGraph is worth the extra setup.
4. Autogen — Agents Creating Agents
The Project: Creative Agent World
This one was wild. I built a system where a "Creator" agent generates new agents at runtime. These agents then collaborate on business ideas, sometimes bouncing ideas off each other randomly.
What the Code Looks Like
Autogen uses message passing between routed agents:
import random

from autogen_core import MessageContext, RoutedAgent, message_handler
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_ext.models.openai import OpenAIChatCompletionClient

class Agent(RoutedAgent):
    system_message = """
    You are a creative entrepreneur. Your interests: Healthcare, Education.
    You're drawn to disruptive ideas. You're optimistic but impatient.
    """
    CHANCES_THAT_I_BOUNCE_IDEA_OFF_ANOTHER = 0.5

    def __init__(self, name) -> None:
        super().__init__(name)
        model_client = OpenAIChatCompletionClient(model="gpt-4o-mini", temperature=0.7)
        self._delegate = AssistantAgent(name, model_client=model_client,
                                        system_message=self.system_message)

    @message_handler
    async def handle_message(self, message: Message, ctx: MessageContext) -> Message:
        text_message = TextMessage(content=message.content, source="user")
        response = await self._delegate.on_messages([text_message], ctx.cancellation_token)
        idea = response.chat_message.content
        # 50% chance to bounce the idea off another agent
        if random.random() < self.CHANCES_THAT_I_BOUNCE_IDEA_OFF_ANOTHER:
            recipient = messages.find_recipient()
            response = await self.send_message(Message(content=idea), recipient)
            idea = response.content
        return Message(content=idea)

Dynamic Agent Creation Blew My Mind
The Creator agent writes Python code for new agents, saves it, imports it, and registers it all at runtime:
import importlib

class Creator(RoutedAgent):
    @message_handler
    async def handle_my_message_type(self, message: Message, ctx: MessageContext) -> Message:
        filename = message.content
        agent_name = filename.split(".")[0]
        # LLM generates the new agent's code
        response = await self._delegate.on_messages([text_message], ctx.cancellation_token)
        # Write it to disk
        with open(filename, "w", encoding="utf-8") as f:
            f.write(response.chat_message.content)
        # Import and register it dynamically
        module = importlib.import_module(agent_name)
        await module.Agent.register(self.runtime, agent_name, lambda: module.Agent(agent_name))
        # Get an idea from the new agent
        result = await self.send_message(Message(content="Give me an idea"),
                                         AgentId(agent_name, "default"))
        return Message(content=result.content)

Agents creating agents. Each with a unique personality. Collaborating randomly. It's chaos in the best way.
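One piece I glossed over: the `messages.find_recipient()` helper used for the random bouncing. Conceptually it just picks a random registered peer, something like the sketch below (the agent names are illustrative, and in the real code the chosen name gets wrapped in an `AgentId`):

```python
import random

# Hypothetical names of agents registered with the runtime.
KNOWN_AGENTS = ["agent1", "agent2", "agent3"]

def find_recipient(exclude: str) -> str:
    # Bounce the idea off any peer except yourself.
    candidates = [name for name in KNOWN_AGENTS if name != exclude]
    return random.choice(candidates)
```

The randomness is what makes the collaboration feel organic: no fixed pipeline decides who talks to whom.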
Distributed by Design
from autogen_core import AgentId
from autogen_ext.runtimes.grpc import GrpcWorkerAgentRuntime, GrpcWorkerAgentRuntimeHost

async def main():
    host = GrpcWorkerAgentRuntimeHost(address="localhost:50051")
    host.start()
    worker = GrpcWorkerAgentRuntime(host_address="localhost:50051")
    await worker.start()
    await Creator.register(worker, "Creator", lambda: Creator("Creator"))
    creator_id = AgentId("Creator", "default")
    coroutines = [create_and_message(worker, creator_id, i)
                  for i in range(1, HOW_MANY_AGENTS + 1)]
    await asyncio.gather(*coroutines)

gRPC runtime, parallel agent creation, and distributed execution. This is built for scale.
The Good Stuff
- True distribution — gRPC-based, scales horizontally
- Dynamic creation — Agents spawning agents
- Clean message model — Inter-agent communication is well-designed
- Built for scale — Large multi-agent systems are the use case
- Unique personalities — Each agent can be genuinely different
The Rough Edges
- Steep climb — RoutedAgent, MessageContext, handlers—lots to learn
- Infrastructure weight — gRPC setup isn't trivial
- Docs catching up — Not as mature as others
- Debugging distributed stuff — Tracing across agents is hard
When to Reach for It
Honestly? I'm still figuring this out. If you need dozens of agents talking to each other across machines, Autogen is built for that. For simpler cases, it's probably overkill.
5. MCP — The Universal Plug
The Project: AI Trading Floor
I built a trading system where multiple AI traders connect to market data (Polygon.io), account services, search (Brave), and memory (LibSQL)—all through MCP servers. Different traders even use different LLMs.
What the Code Looks Like
MCP standardizes how agents connect to external tools:
from agents.mcp import MCPServerStdio

trader_mcp_server_params = [
    {"command": "uv", "args": ["run", "accounts_server.py"]},
    {"command": "uv", "args": ["run", "push_server.py"]},
    market_mcp,  # Polygon.io
]

def researcher_mcp_server_params(name: str):
    return [
        {"command": "uvx", "args": ["mcp-server-fetch"]},
        {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-brave-search"],
            "env": {"BRAVE_API_KEY": os.getenv("BRAVE_API_KEY")},
        },
        {
            "command": "npx",
            "args": ["-y", "mcp-memory-libsql"],
            "env": {"LIBSQL_URL": f"file:./memory/{name}.db"},
        },
    ]

Notice the pattern? Each tool is an MCP server. It could be Python (uv run) or Node (npx); it doesn't matter, because the protocol is the same.
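What each of those servers actually exposes is a set of named, callable tools that any client can discover and invoke by name. The real servers use the `mcp` package and speak JSON-RPC over stdio; the core contract reduces to something like this illustrative registry (function and data are made up):

```python
# Illustrative only: a real MCP server is built with the `mcp` package and
# speaks JSON-RPC over stdio. This shows just the tool-registry contract.
TOOLS: dict = {}

def tool(fn):
    # Register a function under its name so clients can discover it.
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_balance(account: str) -> float:
    # Hypothetical accounts-server tool.
    balances = {"trader1": 10_000.0}
    return balances.get(account, 0.0)

def call_tool(name: str, **kwargs):
    # Clients invoke tools generically, by name, never by import.
    return TOOLS[name](**kwargs)
```

Because clients only ever see "a tool named `get_balance` taking an `account` string", the implementation language and process boundary stop mattering.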
Connecting Agents to MCP Servers
from contextlib import AsyncExitStack

class Trader:
    async def run_with_mcp_servers(self):
        async with AsyncExitStack() as stack:
            trader_mcp_servers = [
                await stack.enter_async_context(
                    MCPServerStdio(params, client_session_timeout_seconds=120)
                )
                for params in trader_mcp_server_params
            ]
            async with AsyncExitStack() as research_stack:
                researcher_mcp_servers = [
                    await research_stack.enter_async_context(
                        MCPServerStdio(params, client_session_timeout_seconds=120)
                    )
                    for params in researcher_mcp_server_params(self.name)
                ]
                await self.run_agent(trader_mcp_servers, researcher_mcp_servers)

    async def create_agent(self, trader_mcp_servers, researcher_mcp_servers) -> Agent:
        tool = await get_researcher_tool(researcher_mcp_servers, self.model_name)
        self.agent = Agent(
            name=self.name,
            instructions=trader_instructions(self.name),
            model=get_model(self.model_name),
            tools=[tool],
            mcp_servers=trader_mcp_servers,
        )
        return self.agent

Multi-Model Support Was a Nice Bonus
Different traders can use different LLMs:
def get_model(model_name: str):
    if "/" in model_name:
        return OpenAIChatCompletionsModel(model=model_name, openai_client=openrouter_client)
    elif "deepseek" in model_name:
        return OpenAIChatCompletionsModel(model=model_name, openai_client=deepseek_client)
    elif "grok" in model_name:
        return OpenAIChatCompletionsModel(model=model_name, openai_client=grok_client)
    elif "gemini" in model_name:
        return OpenAIChatCompletionsModel(model=model_name, openai_client=gemini_client)
    else:
        return model_name

model_names = ["gpt-4.1-mini", "deepseek-chat", "gemini-2.5-flash-preview", "grok-3-mini-beta"]

Four traders, four different models, all using the same tools through MCP.
The Good Stuff
- Open standard — Not locked to any vendor
- Growing ecosystem — Brave, Polygon, fetch, memory servers ready to use
- Model freedom — Swap LLMs without changing tool code
- Clean boundaries — Tools are separate processes
- npm/uvx packages — Easy installation for common services
The Rough Edges
- Process overhead — Each server is its own process
- Setup complexity — Juggling multiple servers gets messy
- Debugging across processes — Harder than in-process tools
- Still maturing — Protocol evolving, expect changes
When to Reach for It
When you need to plug into multiple external services and want flexibility on which LLM to use. MCP acts like a universal adapter; once a tool speaks MCP, any agent can use it.
Quick Comparison
| What Matters | OpenAI SDK | CrewAI | LangGraph | Autogen | MCP |
|---|---|---|---|---|---|
| Time to first agent | Fast | Fast | Medium | Slow | Medium |
| Control over flow | Low | Medium | High | High | High |
| Memory support | Basic | Excellent | Good | Basic | Via servers |
| Tool options | Good | Good | Excellent | Good | Growing |
My Rankings (After Actually Building Stuff)
🥇 Tied for First: OpenAI Agents SDK & CrewAI
Both got me productive fast. OpenAI SDK's guardrails and type safety felt polished. CrewAI's role-based thinking and memory system impressed me. If I need something working by the end of the day, I'd reach for one of these.
🥈 Second: LangGraph
The graph approach took longer to click, but once it did, I had so much control. For complex workflows where I need to visualize and debug the agent's decision path, nothing else comes close.
🥉 Third: MCP
Not a framework per se, but a protocol that changes how I think about tools. The ability to mix and match LLMs while sharing the same tools is powerful. As more MCP servers appear, this gets more valuable.
Fourth: Autogen
I'll be honest: I haven't fully cracked this one yet. The distributed architecture and dynamic agent creation are genuinely innovative, but the learning curve is real. For large-scale multi-agent systems, it's probably the right choice. I just need more time with it.
Picking Your Framework
Here's my mental model:
Need something working fast? → OpenAI SDK or CrewAI
Building a team with distinct roles? → CrewAI
Complex workflow with branches and loops? → LangGraph
Multiple external services? → Use MCP alongside your framework
Distributed agents at scale? → Autogen (and clear your calendar)
Mixing different LLMs? → MCP makes this painless
Final Thoughts
None of these frameworks is "the best." They're different tools for different problems.
What surprised me most: the skills transfer. Learning LangGraph's state management helped me think about CrewAI's task flows. Understanding MCP's tool protocol made me appreciate how OpenAI SDK handles tools under the hood.
So pick one and build something. You'll learn more from a weekend project than from reading ten more comparison posts (including this one).
All the code I referenced is on GitHub. Poke around, break things, make them better.
→ Browse the full project on GitHub
Special thanks to Edward Donner for teaching and providing the materials for this course. Connect with him on LinkedIn.
Written after actually building with these frameworks. Your experience might differ. These tools evolve fast—check the official docs for the latest.