Regressions

Automatic detection of significant drops vs previous run (z-test 2 proportions, α=0.05).

✅ No regressions detected (≥ 2 historical leaderboards required for diff).

Once BuzzBench/results/ contains multiple dated leaderboards, this page will diff the most recent vs previous and surface any model losing ≥ 5 percentage points (configurable via thresholdPP in detectRegressions()).

How regression detection works

  1. Each bfw-bench run produces a timestamped BenchmarkResult.json
  2. The leaderboard aggregator builds a snapshot per timestamp
  3. detectRegressions(current, previous) diffs each model on:
    • baseline-success rate
    • treatment-success rate
    • treatment-gain (Δpp)
  4. Drops ≥ thresholdPP AND statistically significant (p < α) are flagged