Regressions
Automatic detection of significant drops versus the previous run (two-proportion z-test, α = 0.05).
✅ No regressions detected (≥ 2 historical leaderboards required for diff).
Once BuzzBench/results/ contains multiple dated leaderboards, this page will diff the most recent vs previous and surface any model losing ≥ 5 percentage points (configurable via thresholdPP in detectRegressions()).
How regression detection works
- Each bfw-bench run produces a timestamped BenchmarkResult.json
- The leaderboard aggregator builds a snapshot per timestamp
- detectRegressions(current, previous) diffs each model on:
  - baseline-success rate
  - treatment-success rate
  - treatment-gain (Δpp)
- Drops ≥ thresholdPP AND statistically significant (p < α) are flagged
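
The flagging rule above can be illustrated for a single success-rate metric with a minimal sketch. This is not the actual detectRegressions() implementation. The names RateSample, RegressionCheck, and checkRegression are hypothetical, as is the assumption that each snapshot exposes raw success and trial counts. The sketch also assumes a one-sided test (only drops matter) with a pooled standard error; the real code may differ on both points, and treatment-gain would need a different test because it is a difference of two rates.

```typescript
// Hypothetical shape: successes and trials for one model on one metric.
interface RateSample {
  successes: number;
  trials: number;
}

interface RegressionCheck {
  dropPP: number;   // previous rate minus current rate, in percentage points
  pValue: number;   // one-sided p-value for "the current rate is lower"
  flagged: boolean; // true only if both conditions hold
}

// Abramowitz-Stegun 7.1.26 approximation of erf (absolute error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Assumed defaults mirror the page: thresholdPP = 5, alpha = 0.05.
function checkRegression(
  current: RateSample,
  previous: RateSample,
  thresholdPP: number = 5,
  alpha: number = 0.05
): RegressionCheck {
  const pCur = current.successes / current.trials;
  const pPrev = previous.successes / previous.trials;
  const dropPP = (pPrev - pCur) * 100;

  // Pooled proportion under the null hypothesis that both rates are equal.
  const pooled =
    (current.successes + previous.successes) / (current.trials + previous.trials);
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / current.trials + 1 / previous.trials)
  );

  // se is 0 when both runs are all-success or all-failure: nothing to test.
  const pValue = se === 0 ? 1 : 1 - normalCdf((pPrev - pCur) / se);

  return { dropPP, pValue, flagged: dropPP >= thresholdPP && pValue < alpha };
}

// Usage: a 20 pp drop on 100 trials per run is both large and significant.
const example = checkRegression(
  { successes: 60, trials: 100 },
  { successes: 80, trials: 100 }
);
console.log(example.flagged);
```

Requiring both conditions keeps small but statistically significant drops on large samples, as well as large but noisy drops on small samples, from being reported.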