## About
BuzzBench is an open-source benchmark framework that fills a gap in the 2026 LLM evaluation landscape: no existing leaderboard measures the augmentation gain an agent toolkit delivers over a bare-model baseline.
## Statistical rigor
- N ≥ 30 runs per scenario per mode (Wilson 95% CI mandate)
- Two-proportion z-test for significance (α = 0.05)
- Cohen's h for effect-size magnitude (trivial / small / medium / large)
- Multi-judge LLM-as-judge with Cohen's κ ≥ 0.6 inter-rater agreement
- Power analysis available via `powerTwoProportions()`
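The three headline statistics can be sketched in a few lines. This is an illustrative TypeScript sketch, not the `@buzzbench/statistics` API (the function names here are assumptions):

```typescript
// Wilson 95% score interval for a success proportion (z = 1.96).
// More reliable than the normal approximation at small N or extreme p.
function wilsonInterval(successes: number, n: number, z = 1.96): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const margin = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [center - margin, center + margin];
}

// Two-proportion z-test statistic using the pooled proportion.
// |z| > 1.96 is significant at alpha = 0.05 (two-tailed).
function twoProportionZ(s1: number, n1: number, s2: number, n2: number): number {
  const p1 = s1 / n1;
  const p2 = s2 / n2;
  const pooled = (s1 + s2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  return (p1 - p2) / se;
}

// Cohen's h effect size for two proportions (arcsine transform).
// Conventional cutoffs: 0.2 small, 0.5 medium, 0.8 large.
function cohensH(p1: number, p2: number): number {
  return Math.abs(2 * Math.asin(Math.sqrt(p1)) - 2 * Math.asin(Math.sqrt(p2)));
}
```

For example, with N = 30 runs per mode, a 24/30 treatment vs. 15/30 baseline split yields z ≈ 2.44 (significant) and h ≈ 0.64 (medium effect).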
## Methodology
- Scenario authoring — YAML files in `BuzzBench/scenarios/<category>/<id>.yaml` (ADR-037 Cat A, local storage)
- Harness execution — `@buzzbench/core` iterates scenarios × modes (baseline / treatment) × N runs
- Multi-judge scoring — `@buzzbench/judges` aggregates regex / exact / LLM-judge verdicts into a weighted 0..10 score
- Statistical aggregation — `@buzzbench/leaderboard` builds Wilson CI bars + z-test signals
- Regression detection — `detectRegressions(current, previous)` flags pass-rate drops ≥ 5 pp
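The regression gate can be sketched as follows. The record shape (`ScenarioResult`) and the default threshold parameter are assumptions for illustration; only the `detectRegressions(current, previous)` signature and the 5 pp rule come from the methodology above:

```typescript
// Assumed result shape: one pass rate per scenario, aggregated over N runs.
interface ScenarioResult {
  scenarioId: string;
  passRate: number; // success proportion in [0, 1]
}

// Flags every scenario whose pass rate dropped by >= `thresholdPp`
// percentage points relative to the previous leaderboard run.
function detectRegressions(
  current: ScenarioResult[],
  previous: ScenarioResult[],
  thresholdPp = 5,
): string[] {
  const prev = new Map(previous.map((r) => [r.scenarioId, r.passRate]));
  return current
    .filter((r) => {
      const before = prev.get(r.scenarioId);
      // Scenarios with no previous run are skipped, not flagged.
      return before !== undefined && (before - r.passRate) * 100 >= thresholdPp;
    })
    .map((r) => r.scenarioId);
}
```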
## Architecture (8 packages)
| Package | Role |
|---|---|
| `@buzzbench/core` | Harness execution + tool bridge + observers |
| `@buzzbench/scenarios` | YAML loader (zero-dep parser) + validator |
| `@buzzbench/statistics` | Wilson CI, z-test, Cohen's h, power analysis |
| `@buzzbench/judges` | regex / exact / LLM-as-judge + Cohen's κ |
| `@buzzbench/cost-tracker` | USD pricing per provider (14 models at baseline) |
| `@buzzbench/leaderboard` | Aggregator + regression detector + 4 formatters |
| `@buzzbench/cli` | `bfw-bench validate / run / leaderboard` |
| `@buzzbench/ui-dashboard` | React components (table, CI bar, progress, alert) |
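The κ ≥ 0.6 inter-rater gate enforced by the judges layer can be sketched for the two-judge, binary pass/fail case. This is a minimal illustration of Cohen's κ, not the `@buzzbench/judges` implementation:

```typescript
// Cohen's kappa for two judges over aligned binary verdicts:
// chance-corrected agreement, where 1 = perfect and 0 = chance level.
function cohensKappa(judgeA: boolean[], judgeB: boolean[]): number {
  const n = judgeA.length;
  let agree = 0;
  let aPass = 0;
  let bPass = 0;
  for (let i = 0; i < n; i++) {
    if (judgeA[i] === judgeB[i]) agree++;
    if (judgeA[i]) aPass++;
    if (judgeB[i]) bPass++;
  }
  const po = agree / n; // observed agreement
  const pe = // expected agreement by chance, from each judge's base rate
    (aPass / n) * (bPass / n) +
    ((n - aPass) / n) * ((n - bPass) / n);
  return (po - pe) / (1 - pe);
}
```

A benchmark run would compute κ over all scored samples and reject (or re-judge) batches falling below the 0.6 threshold, the conventional "substantial agreement" floor per Landis & Koch (1977).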
## References
- ADR-049 — BuzzBench creation
- ADR-049b — Dedicated standalone app (layer 3)
- ADR-045 — Self-Tooling Agent Framework (Innovation #14)
- Wilson (1927) · Cohen (1988) · Acklam (2003) · Abramowitz & Stegun (1972) · Landis & Koch (1977) · Zheng et al. (2023, MT-Bench)