## Overview
A cross-LLM, agent-aware benchmark framework. It measures the augmentation gain a toolkit delivers: the leaderboard metric still missing in 2026.
| Metric | Value |
|---|---|
| Models tested | 5 |
| Total runs | 300 |
| Cumulative API cost | $0.11 |
| Best treatment gain | -3.3pp |
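Augmentation gain is the treatment-minus-baseline difference in mean score, reported in percentage points (pp); a negative best gain, as above, means no tested toolkit beat its bare-model baseline. A minimal sketch of the arithmetic, assuming per-run scores in [0, 1]; the function name `augmentation_gain_pp` is illustrative, not BuzzBench's actual API:

```python
from statistics import mean

def augmentation_gain_pp(baseline: list[float], treatment: list[float]) -> float:
    """Augmentation gain in percentage points: treatment mean minus baseline mean.

    `baseline` holds per-run scores for the bare model, `treatment` for the
    same model with the toolkit attached. Negative means the toolkit hurt.
    """
    return (mean(treatment) - mean(baseline)) * 100

# A toolkit that drags mean accuracy from 0.71 down to 0.68 scores -3.0pp.
print(round(augmentation_gain_pp([0.70, 0.72, 0.71], [0.69, 0.67, 0.68]), 1))
```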
## What sets BuzzBench apart
| BuzzBench differentiator | Existing landscape |
|---|---|
| Multi-judge LLM-as-judge scoring, agreement checked with Cohen's κ | LMSys Arena relies on human votes only |
| N≥30 runs and 95% CIs mandatory | Not standardised elsewhere |
| Live per-provider cost tracking in USD | No mainstream benchmark reports it |
| Filesystem-first scenarios with hot reload | OpenCompass locks scenarios into Python config |
| Agent-aware: measures toolkit augmentation gain | BFCL tests function calling in isolation |
| Self-healing pipeline | A BFW innovation; no equivalent elsewhere |
| Public live leaderboard | OpenLLM leaderboard is static, refreshed daily |
| Built-in adversarial scenarios | HELM covers adversarial testing but is heavyweight |
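To make the first row concrete: when two LLM judges score the same runs, Cohen's κ measures how much they agree beyond chance (κ = 1 is perfect agreement, κ = 0 is chance-level). A minimal, self-contained sketch; the pass/fail labels and helper are illustrative, not BuzzBench's actual judging code:

```python
from collections import Counter

def cohens_kappa(judge_a: list[str], judge_b: list[str]) -> float:
    """Cohen's kappa between two judges over the same runs.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected from each judge's label marginals.
    """
    n = len(judge_a)
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Two LLM judges issuing pass/fail verdicts on the same 8 runs.
a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # kappa = 0.47
```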
## Explore
- 🏆 Leaderboard — ranked models with 95% CI bars
- 📋 Scenarios — 30 scenarios across 4 categories
- ⚠️ Regressions — automated detection vs the previous run (see the sketch after this list)
- ℹ️ About — methodology, ADR-049, statistical rigor
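The N≥30 and 95% CI requirements feed directly into regression detection: with 30+ runs per cell, a normal-approximation interval is defensible, and a conservative detector flags a model only when its current interval falls entirely below the previous run's. A minimal sketch under those assumptions; the non-overlap rule and helper names are illustrative, not the framework's actual detector:

```python
from math import sqrt
from statistics import mean, stdev

def ci95(scores: list[float]) -> tuple[float, float]:
    """95% CI for the mean, normal approximation (reasonable for N >= 30)."""
    half = 1.96 * stdev(scores) / sqrt(len(scores))
    return mean(scores) - half, mean(scores) + half

def is_regression(previous: list[float], current: list[float]) -> bool:
    """Flag only when the current CI sits entirely below the previous CI."""
    return ci95(current)[1] < ci95(previous)[0]

# 30 runs per cell; the current run's scores sit well below the previous run's.
prev = [0.80 + 0.002 * i for i in range(30)]
curr = [0.70 + 0.002 * i for i in range(30)]
print(is_regression(prev, curr))  # True
```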