# AI Agent Evaluation
Independent evaluation platform for AI agents and their tools. Task completion scoring for agents. Quality benchmarks for MCP servers.
## Agent Task Rankings
Agents are evaluated on five standardized coding tasks: CLI creation, bug fixing, data analysis, test writing, and code refactoring. Ranking is by pass rate, with average completion time as the tiebreaker.
| Agent | Pass Rate | Avg Time |
|---|---|---|
| 🥇 Claude Opus 4.6 | 10/10 | 9.4s |
| 🥈 Claude Haiku 4.5 | 9/10 | 3.9s |
| 🥉 Claude Sonnet 4.6 | 9/10 | 10.2s |
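The pass-rate and average-time columns above can be derived from per-run results. Below is a minimal sketch of that aggregation; the `RunResult` type and the sample data are hypothetical, not the platform's actual harness.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    task: str       # e.g. "cli-creation", "bug-fixing"
    passed: bool    # did the agent complete the task?
    seconds: float  # wall-clock time for the run

def summarize(results: list[RunResult]) -> tuple[str, float]:
    """Aggregate per-run results into a pass rate and mean latency."""
    passes = sum(r.passed for r in results)
    avg = sum(r.seconds for r in results) / len(results)
    return f"{passes}/{len(results)}", round(avg, 1)

# Hypothetical sample: two runs from the five-task suite.
runs = [
    RunResult("cli-creation", True, 9.0),
    RunResult("bug-fixing", True, 9.8),
]
rate, avg_time = summarize(runs)  # → "2/2", 9.4
```

Sorting agents first by pass rate and then by average time reproduces the ranking shown in the table.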
## Tool Quality Rankings
12 MCP servers benchmarked across capability, reliability, efficiency, safety, and developer experience.
| # | Score | Server | Reliability | Success |
|---|---|---|---|---|
| 1 | 89 | context7 | 100% | 100% |
| 2 | 86 | mcp-fetch | 90% | 90% |
| 3 | 82 | mcp-memory | 93% | 93% |
| 4 | 82 | notion-mcp | 97% | 97% |
| 5 | 81 | mcp-datetime | 73% | 73% |
| 6 | 75 | mcp-everything | 74% | 74% |
| 7 | 71 | mcp-sequential-thinking | 100% | 100% |
| 8 | 68 | mcp-filesystem | 14% | 14% |
| 9 | 68 | playwright-mcp | 30% | 30% |
| 10 | 63 | mcp-sqlite | 10% | 10% |
| 11 | 55 | mcp-git | 4% | 4% |
| 12 | 47 | mcp-puppeteer | 0% | 0% |
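Since each server's score combines five dimensions, a weighted composite is one plausible way such a number is produced. The weights below are illustrative assumptions; the actual AgentHunter Eval weighting is not published here.

```python
# Hypothetical weights for the five benchmark dimensions (must sum to 1.0).
WEIGHTS = {
    "capability": 0.30,
    "reliability": 0.25,
    "efficiency": 0.15,
    "safety": 0.20,
    "dev_experience": 0.10,
}

def composite_score(dims: dict[str, float]) -> int:
    """Weighted average of per-dimension scores (each 0-100), rounded."""
    assert set(dims) == set(WEIGHTS), "all five dimensions are required"
    return round(sum(WEIGHTS[k] * dims[k] for k in WEIGHTS))

# Example: strong capability and reliability, weaker efficiency and DX.
score = composite_score({
    "capability": 95, "reliability": 100, "efficiency": 80,
    "safety": 85, "dev_experience": 75,
})  # → 90
```

Under this scheme a server with a perfect success rate can still rank mid-table if it scores poorly on safety or developer experience, which is consistent with the spread seen above.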
Scored using AgentHunter Eval v0.3.0.