Overview

BuzzBench is a cross-LLM, agent-aware benchmark framework. It measures the augmentation gain a toolkit gives each model: the leaderboard metric still missing in 2026.

- Models tested: 5
- Total runs: 300
- Cumulative API cost: $0.11
- Best treatment gain: -3.3pp

What sets BuzzBench apart

| Differentiator | Existing landscape |
| --- | --- |
| Multi-judge LLM-as-judge + Cohen's κ | LMSys Arena: human judges only |
| N ≥ 30 runs + mandatory 95% CI | Not standardised elsewhere |
| Live per-provider cost in USD | No mainstream benchmark tracks it |
| Hot-reload, filesystem-first scenarios | OpenCompass: closed Python config |
| Agent-aware (toolkit gain) | BFCL: isolated function calling |
| Self-healing pipeline | Unique to BFW |
| Public live leaderboard | OpenLLM: static daily refresh |
| Built-in adversarial testing | HELM: adversarial but heavyweight |
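Two rows above rest on standard statistics: Cohen's κ for inter-judge agreement and a 95% confidence interval over N ≥ 30 runs. A minimal stdlib-only sketch of both, assuming per-run verdicts from two judges and per-run scores (function names are illustrative, not BuzzBench's actual API):

```python
from collections import Counter
import math

def cohens_kappa(judge_a, judge_b):
    """Agreement between two judges' verdicts, corrected for chance."""
    assert len(judge_a) == len(judge_b)
    n = len(judge_a)
    # Observed agreement: fraction of runs where verdicts match.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Expected agreement under independence, from marginal label frequencies.
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

def mean_ci95(scores):
    """Mean with a normal-approximation 95% CI (reasonable for N >= 30)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = 1.96 * math.sqrt(var / n)
    return mean, mean - half, mean + half
```

The treatment gain in percentage points would then be the difference of two such means (toolkit-augmented minus baseline), reported only when the CI is attached.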