## Overview
A cross-LLM, agent-aware benchmark framework. It measures the augmentation gain a toolkit delivers: the leaderboard metric still missing in 2026.
| Metric | Value |
|---|---|
| Models tested | 5 |
| Total runs | 300 |
| Cumulative API cost | $0.11 |
| Best treatment gain | -3.3pp |
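Augmentation gain is the treatment-minus-baseline difference in mean score, reported in percentage points (pp); a negative best gain, as above, means no tested toolkit beat its bare-model baseline. A minimal sketch of the arithmetic, assuming per-run scores in [0, 1]; the function name `augmentation_gain_pp` is illustrative, not BuzzBench's actual API:

```python
from statistics import mean

def augmentation_gain_pp(baseline: list[float], treatment: list[float]) -> float:
    """Augmentation gain in percentage points: treatment mean minus baseline mean.

    `baseline` holds per-run scores for the bare model, `treatment` for the
    same model with the toolkit attached. Negative means the toolkit hurt.
    """
    return (mean(treatment) - mean(baseline)) * 100

# A toolkit that drags mean accuracy from 0.71 down to 0.68 scores -3.0pp.
print(round(augmentation_gain_pp([0.70, 0.72, 0.71], [0.69, 0.67, 0.68]), 1))
```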
## What sets BuzzBench apart
| BuzzBench differentiator | Existing landscape |
|---|---|
| Multi-judge LLM-as-judge scoring, agreement checked with Cohen's κ | LMSys Arena relies on human votes only |
| N≥30 runs and 95% CIs mandatory | Not standardised elsewhere |
| Live per-provider cost tracking in USD | No mainstream benchmark reports it |
| Filesystem-first scenarios with hot reload | OpenCompass locks scenarios into Python config |
| Agent-aware: measures toolkit augmentation gain | BFCL tests function calling in isolation |
| Self-healing pipeline | A BFW innovation; no equivalent elsewhere |
| Public live leaderboard | OpenLLM leaderboard is static, refreshed daily |
| Built-in adversarial scenarios | HELM covers adversarial testing but is heavyweight |
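To make the first row concrete: when two LLM judges score the same runs, Cohen's κ measures how much they agree beyond chance (κ = 1 is perfect agreement, κ = 0 is chance-level). A minimal, self-contained sketch; the pass/fail labels and helper are illustrative, not BuzzBench's actual judging code:

```python
from collections import Counter

def cohens_kappa(judge_a: list[str], judge_b: list[str]) -> float:
    """Cohen's kappa between two judges over the same runs.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected from each judge's label marginals.
    """
    n = len(judge_a)
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Two LLM judges issuing pass/fail verdicts on the same 8 runs.
a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # kappa = 0.47
```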
## Explore
- 🏆 Leaderboard — ranked models with 95% CI bars
- 📋 Scenarios — 30 scenarios across 4 categories
- ⚠️ Regressions — automated detection vs the previous run (see the sketch after this list)
- ℹ️ About — methodology, ADR-049, statistical rigor
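The N≥30 and 95% CI requirements feed directly into regression detection: with 30+ runs per cell, a normal-approximation interval is defensible, and a conservative detector flags a model only when its current interval falls entirely below the previous run's. A minimal sketch under those assumptions; the non-overlap rule and helper names are illustrative, not the framework's actual detector:

```python
from math import sqrt
from statistics import mean, stdev

def ci95(scores: list[float]) -> tuple[float, float]:
    """95% CI for the mean, normal approximation (reasonable for N >= 30)."""
    half = 1.96 * stdev(scores) / sqrt(len(scores))
    return mean(scores) - half, mean(scores) + half

def is_regression(previous: list[float], current: list[float]) -> bool:
    """Flag only when the current CI sits entirely below the previous CI."""
    return ci95(current)[1] < ci95(previous)[0]

# 30 runs per cell; the current run's scores sit well below the previous run's.
prev = [0.80 + 0.002 * i for i in range(30)]
curr = [0.70 + 0.002 * i for i in range(30)]
print(is_regression(prev, curr))  # True
```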