Leaderboard · BuzzBench

5 models · 300 total runs

Sorted by **treatment-gain** · *generated 7/8/2026, 10:58:13 PM*
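| # | Model | Baseline | Treatment | Gain | Cost | Runs |
|---|-------|----------|-----------|------|------|------|
| 1 | `zai:glm-4.6` | 90.0% | 86.7% | -3.3pp | $0.08 | 60 |
| 2 | `claude-cli:claude-opus-4-1` | 93.3% | 86.7% | -6.7pp | $0.0000 | 60 |
| 3 | `ollama:gemma4:e4b` | 53.3% | 43.3% | -10.0pp | $0.0000 | 60 |
| 4 | `zai:glm-4.6` | 86.7% | 0.0% | -86.7pp | $0.03 | 60 |
| 5 | `claude-cli:claude-opus-4-1` | 90.0% | 0.0% | -90.0pp | $0.0000 | 60 |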
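
The ordering above can be reproduced with a minimal sketch that recomputes each gain as treatment minus baseline and sorts from highest to lowest. The tuple layout and the helper name `gain_pp` are illustrative assumptions, not part of BuzzBench. Because the inputs here are the rounded percentages shown in the table, a recomputed gain can differ from the displayed one by 0.1pp (row 2 comes out as -6.6 rather than -6.7, presumably because the page works from unrounded rates).

```python
# Rows copied from the table: (model, baseline %, treatment %).
rows = [
    ("zai:glm-4.6", 90.0, 86.7),
    ("claude-cli:claude-opus-4-1", 93.3, 86.7),
    ("ollama:gemma4:e4b", 53.3, 43.3),
    ("zai:glm-4.6", 86.7, 0.0),
    ("claude-cli:claude-opus-4-1", 90.0, 0.0),
]


def gain_pp(baseline: float, treatment: float) -> float:
    """Gain in percentage points: treatment minus baseline, to one decimal."""
    return round(treatment - baseline, 1)


# Sort by gain, largest (least negative) first, as on the leaderboard.
ranked = sorted(rows, key=lambda r: gain_pp(r[1], r[2]), reverse=True)

for rank, (model, baseline, treatment) in enumerate(ranked, start=1):
    print(f"{rank}\t{model}\t{gain_pp(baseline, treatment):+.1f}pp")
```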

Legend

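Baseline: model alone, no tools
Treatment: model + agent toolkit (Innovation #14 thesis)
Gain: difference in percentage points (treatment − baseline); a negative value means treatment scored below baseline
Bars: Wilson 95% confidence interval around the point estimate (generally asymmetric, so not a simple ± width)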
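
For readers who want to check the bars by hand, here is a minimal sketch of the standard Wilson score interval. It assumes each 60-run entry splits into 30 baseline and 30 treatment runs; the page does not state this, although it is consistent with the displayed rates (27/30 = 90.0%, 26/30 = 86.7%). The function name `wilson_interval` is illustrative, and how BuzzBench itself computes the bars is not shown here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    Returns (lower, upper) as fractions in [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to guard against tiny floating-point overshoot at 0 or 1.
    return max(0.0, center - half), min(1.0, center + half)


# ASSUMPTION: 30 runs per arm (not stated on the page).
RUNS_PER_ARM = 30

low, high = wilson_interval(27, RUNS_PER_ARM)        # a 90.0% baseline
zero_low, zero_high = wilson_interval(0, RUNS_PER_ARM)  # a 0.0% treatment

print(f"27/30: {low:.1%} to {high:.1%}")
print(f" 0/30: {zero_low:.1%} to {zero_high:.1%}")
```

Under that assumption, a 90.0% rate carries an interval of roughly 74% to 97%, which is wider below the point estimate than above it. A 0.0% rate still has a nonzero upper bound, so the interval never collapses to a single point.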