Leaderboard
5 models · 300 total runs
| # | Model | Baseline | Treatment | Gain | Cost | Runs |
|---|---|---|---|---|---|---|
| 1 | zai:glm-4.6 | 90.0% | 86.7% | -3.3pp | $0.08 | 60 |
| 2 | claude-cli:claude-opus-4-1 | 93.3% | 86.7% | -6.7pp | $0.00 | 60 |
| 3 | ollama:gemma4:e4b | 53.3% | 43.3% | -10.0pp | $0.00 | 60 |
| 4 | zai:glm-4.6 | 86.7% | 0.0% | -86.7pp | $0.03 | 60 |
| 5 | claude-cli:claude-opus-4-1 | 90.0% | 0.0% | -90.0pp | $0.00 | 60 |
Legend
- Baseline: model alone, no tools
- Treatment: model + agent toolkit (Innovation #14 thesis)
- Gain: Δ in percentage points (treatment − baseline)
- Bars: Wilson 95% confidence interval around the point estimate (generally not symmetric, so not a simple ± width)
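
As an illustration of the Gain and Bars entries in the legend, here is a minimal Python sketch of how a gain in percentage points and a Wilson 95% interval could be computed. The function names `gain_pp` and `wilson_interval` are hypothetical, and the split of each row's 60 runs into 30 baseline and 30 treatment runs is an assumption; neither comes from the leaderboard itself.

```python
import math


def gain_pp(baseline_rate: float, treatment_rate: float) -> float:
    """Gain in percentage points: treatment minus baseline, both given as fractions."""
    return (treatment_rate - baseline_rate) * 100.0


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    # The interval is centred on a value shrunk toward 0.5, not on p itself.
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # ASSUMPTION: 30 runs per arm, so 90.0% is 27/30 and 86.7% is 26/30.
    runs_per_arm = 30
    baseline_passes, treatment_passes = 27, 26

    gain = gain_pp(baseline_passes / runs_per_arm, treatment_passes / runs_per_arm)
    low, high = wilson_interval(baseline_passes, runs_per_arm)
    print(f"gain: {gain:+.1f}pp")
    print(f"baseline 95% interval: {low:.1%} to {high:.1%}")
```

Under that 30-run assumption, a 90.0% baseline has an interval of roughly 74% to 97%, which is wide enough that small gains such as -3.3pp are hard to distinguish from noise.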