Leaderboard

5 models · 300 total runs

Sorted by treatment-gain · generated 5/9/2026, 11:30:08 PM
#  Model                        Baseline  Treatment  Gain     Cost     Runs
1  zai:glm-4.6                  90.0%     86.7%      -3.3pp   $0.08    60
2  claude-cli:claude-opus-4-1   93.3%     86.7%      -6.7pp   $0.0000  60
3  ollama:gemma4:e4b            53.3%     43.3%      -10.0pp  $0.0000  60
4  zai:glm-4.6                  86.7%     0.0%       -86.7pp  $0.03    60
5  claude-cli:claude-opus-4-1   90.0%     0.0%       -90.0pp  $0.0000  60

Legend

  • Baseline — model alone, no tools
  • Treatment — model + agent toolkit (Innovation #14 thesis)
  • Gain — Δ percentage points (treatment − baseline)
  • Bars — Wilson 95% confidence interval (point estimate ± width)
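
The Gain and Bars definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual code: the function names wilson_interval and gain_pp are hypothetical, and the example of 27 successes in 30 runs is an assumption (the listed percentages are consistent with 30 runs per arm, but the source does not state how the runs are split). Note that a Wilson interval is centered on an adjusted estimate rather than on the raw success rate, so it is generally not symmetric around the point estimate.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    # The interval is centered on a shrunken estimate, not on p itself.
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def gain_pp(baseline_rate: float, treatment_rate: float) -> float:
    """Gain in percentage points: treatment minus baseline."""
    return (treatment_rate - baseline_rate) * 100.0


if __name__ == "__main__":
    # Assumed example: 27 of 30 baseline runs and 26 of 30 treatment runs pass.
    low, high = wilson_interval(27, 30)
    print(f"baseline 95% CI: {low:.3f} to {high:.3f}")
    print(f"gain: {gain_pp(27 / 30, 26 / 30):+.1f}pp")
```

With samples this small the intervals are wide, so gains of a few percentage points between baseline and treatment may not be distinguishable from noise.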