About

BuzzBench is an open-source benchmark framework that fills a gap in the 2026 LLM evaluation landscape: no existing leaderboard measures the augmentation gain delivered by an agent toolkit, i.e. how much a toolkit lifts a model's pass rate over the same model run bare (the baseline / treatment modes below).

Statistical rigor

  • N ≥ 30 runs per scenario per mode, mandated for stable Wilson 95% CIs
  • Two-proportion z-test for significance (α = 0.05)
  • Cohen's h for effect-size magnitude (trivial / small / medium / large)
  • Multi-judge LLM-as-judge scoring with Cohen's κ ≥ 0.6 inter-rater agreement
  • Power analysis available via powerTwoProportions()
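
These checks reduce to standard proportion statistics. A minimal TypeScript sketch of the underlying math (illustrative names, not necessarily the actual @buzzbench/statistics API, whose signatures may differ):

```ts
// Wilson 95% score interval for x successes in n runs.
function wilsonCI(x: number, n: number, z = 1.96): [number, number] {
  const p = x / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half =
    (z / denom) * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n));
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Two-proportion z-test on pooled variance (baseline vs. treatment pass rates).
function twoProportionZ(x1: number, n1: number, x2: number, n2: number): number {
  const pooled = (x1 + x2) / (n1 + n2);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  return (x1 / n1 - x2 / n2) / se;
}

// Cohen's h: effect size for a difference between two proportions.
function cohensH(p1: number, p2: number): number {
  return 2 * Math.asin(Math.sqrt(p1)) - 2 * Math.asin(Math.sqrt(p2));
}
```

At α = 0.05 a result is significant when |z| > 1.96; |h| near 0.2 / 0.5 / 0.8 maps to the small / medium / large buckets above.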

Methodology

  1. Scenario authoring — YAML files in BuzzBench/scenarios/<category>/<id>.yaml (ADR-037 Cat A, local storage)
  2. Harness execution — @buzzbench/core iterates scenarios × modes (baseline / treatment) × N runs
  3. Multi-judge scoring — @buzzbench/judges aggregates regex / exact / LLM-judge with weighted score 0..10
  4. Statistical aggregation — @buzzbench/leaderboard builds Wilson CI bars + z-test signals
  5. Regression detection — detectRegressions(current, previous) flags drops ≥ 5pp
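
A sketch of step 5, assuming simple per-scenario result records (the types are illustrative; the real detectRegressions in @buzzbench/leaderboard may use different shapes):

```ts
// Illustrative shape; not the actual @buzzbench/leaderboard types.
interface ScenarioResult {
  scenarioId: string;
  passRate: number; // 0..1 across the N runs
}

// Flag any scenario whose pass rate dropped by >= 5 percentage points.
function detectRegressions(
  current: ScenarioResult[],
  previous: ScenarioResult[],
  thresholdPp = 5,
): ScenarioResult[] {
  const prevRates = new Map(
    previous.map((r) => [r.scenarioId, r.passRate] as const),
  );
  return current.filter((r) => {
    const prev = prevRates.get(r.scenarioId);
    return prev !== undefined && (prev - r.passRate) * 100 >= thresholdPp;
  });
}
```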

Architecture (8 packages)

  Package                   Role
  @buzzbench/core           Harness execution + tool bridge + observers
  @buzzbench/scenarios      YAML loader (zero-dep parser) + validator
  @buzzbench/statistics     Wilson CI, z-test, Cohen's h, power analysis
  @buzzbench/judges         regex / exact / LLM-as-judge + Cohen's κ
  @buzzbench/cost-tracker   USD pricing per provider (14 models baseline)
  @buzzbench/leaderboard    Aggregator + regression detector + 4 formatters
  @buzzbench/cli            bfw-bench validate / run / leaderboard
  @buzzbench/ui-dashboard   React components (table, CI bar, progress, alert)
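
The κ ≥ 0.6 gate in @buzzbench/judges compares independent judges' verdicts over the same runs. A minimal sketch of Cohen's κ for binary pass / fail labels (illustrative only; the package's multi-judge aggregation is weighted and richer than this):

```ts
// Cohen's kappa for two judges' binary verdicts on the same runs.
// Illustrative sketch, not the @buzzbench/judges API.
function cohensKappa(judgeA: boolean[], judgeB: boolean[]): number {
  const n = judgeA.length;
  if (n === 0 || n !== judgeB.length) throw new Error("mismatched verdicts");
  let agree = 0;
  let aPass = 0;
  let bPass = 0;
  for (let i = 0; i < n; i++) {
    if (judgeA[i] === judgeB[i]) agree++;
    if (judgeA[i]) aPass++;
    if (judgeB[i]) bPass++;
  }
  const po = agree / n; // observed agreement
  // Chance agreement under independence: both pass or both fail.
  const pa = aPass / n;
  const pb = bPass / n;
  const pe = pa * pb + (1 - pa) * (1 - pb);
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}
```

Judge pairs scoring κ < 0.6 would fail the inter-rater agreement gate.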

License

MIT — see LICENSE. Issues and PRs welcome on GitHub.