How the model works
Plain-English transparency on what predicts every fixture you see.
The pipeline at a glance
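Each prediction goes through three stages. The first two produce probabilities; the third nudges a small subset of fixtures based on a league-cultural signal. The v0.4 entry below is not a fourth stage: it retrains the second-stage stacker with extra features.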
- v0.1 — Dixon-Coles base. A statistical model that estimates each team's attack strength and defence weakness from historical scorelines, weighted toward recent matches.
- v0.2 — XGBoost stacker. A gradient-boosted model that takes v0.1's output as a starting point, then adjusts it using ~77 context features for that specific fixture. Two cohort-specific stackers:
- top cohort — Premier League, Bundesliga, Serie A, La Liga, Ligue 1
- english cohort — Championship, League One, League Two
- v0.3 — draw post-processor. Only fires on coin-flip fixtures (where two outcomes are within ~8% of each other) in the defensive/parity leagues — La Liga, Serie A, Ligue 1, League One. Uses recent form, blown-lead rate, and league position to decide whether the fixture is genuinely draw-prone.
- v0.4 — cross-season residual features. Same architecture as v0.3, with the stacker retrained on 10 extra features that capture rolling 10-match attack and defensive xG residuals (over- or under-finishing relative to xG, including hot-keeper signals). The window backfills from the prior season when the current-season match count is below 10, so early-season cold starts use the tail of last season. This captures the Bournemouth/Chelsea finishing-quality phenomenon that Elo can't see on its own.
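The gate that decides whether the v0.3 draw post-processor looks at a fixture can be sketched as follows. This is a minimal illustration, not the production rule: it assumes "within ~8% of each other" means an absolute gap of 0.08 between the two most likely outcomes, and the names `is_coin_flip` and `post_processor_applies` are hypothetical. The form, blown-lead, and league-position checks that follow the gate are not shown.

```python
# League codes as used in the changelog: La Liga, Serie A, Ligue 1, League One.
DRAW_PRONE_LEAGUES = {"PD", "SA", "FL1", "EL1"}

# Assumption: the ~8% margin is an absolute probability gap.
COIN_FLIP_MARGIN = 0.08


def is_coin_flip(probs, margin=COIN_FLIP_MARGIN):
    """True when the two most likely outcomes are within `margin` of each other.

    probs: dict with keys 'home', 'draw', 'away' that sum to roughly 1.
    """
    top, second = sorted(probs.values(), reverse=True)[:2]
    return (top - second) <= margin


def post_processor_applies(league, probs):
    """The post-processor only considers coin-flip fixtures in the listed leagues."""
    return league in DRAW_PRONE_LEAGUES and is_coin_flip(probs)


tight = {"home": 0.40, "draw": 0.35, "away": 0.25}
clear = {"home": 0.55, "draw": 0.25, "away": 0.20}
print(post_processor_applies("PD", tight))  # gate open
print(post_processor_applies("PL", tight))  # wrong league, gate closed
print(post_processor_applies("PD", clear))  # not a coin flip, gate closed
```

Keeping the league filter and the coin-flip test as two separate checks mirrors the text: the league list is a deliberate hardcode, while the margin is a tunable threshold.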
Important features (plain English)
XGBoost ranks features by how much they reduce error. Here are the ones that drive the most lift, translated into football terms.
- Elo rating differential
- The classic chess-style rating, adapted for football: how much stronger one team is than the other right now. Updates after every match. Includes home advantage (~80 Elo points). Top driver across leagues.
- Expected goals (xG) — last 5 / 10 matches
- The quality of chances each team has been creating and conceding lately. Better than goals because it strips out luck — a team can lose 3-0 having generated more xG than the opponent.
- Form points (last 5 / 10)
- Total points won recently (3 for a win, 1 for a draw, 0 for a loss). A standard measure, but the rolling windows of 5 and 10 give the model both short- and medium-term pictures.
- Form trajectory (5 vs 10)
- The direction of travel: is this team accelerating up or sliding down? Computed as form points from the last 5 matches minus form points from matches 6-10. Captures Bournemouth-style mid-season turnarounds.
- Key player absences
- How many of the team's regular starters are unavailable, weighted by how recently they were starting. Each snapshot is point-in-time, so we don't peek at injury news that came after kickoff.
- Manager tenure
- How long the current head coach has been in charge. Brand-new coaches dampen the model's confidence in form features (uncertainty principle).
- Lineup strength
- When the actual starting XI is available pre-match, we score it against each player's season-long performance. Strong XI vs weak XI shifts the probability accordingly.
- League position context
- Top-6, mid-table, or bottom-6. Used by the draw post-processor in PD/SA/FL1/EL1 (La Liga, Serie A, Ligue 1, League One), where bottom-6 teams that have been blowing leads are statistically more draw-prone.
- Bottle coefficient
- A team's tendency to blow late leads. Computed from goal-by-goal timing data across recent matches. Currently used only by the post-processor: we bench-tested it inside the v0.2 stacker itself, where it didn't add value.
- Attack residual (last 10, cross-season) v0.4
- Σ (goals scored − xG) averaged over the team's last 10 matches, including carry-over from prior season when current count is below 10. Positive means a team has been over-finishing recently (Bournemouth-type clinical streaks); negative means under-finishing (Chelsea-type wasteful spells). Catches what Elo and goal-totals miss.
- Defensive residual (last 10, cross-season) v0.4
- Σ (xG against − goals conceded), same window. Positive means the team has been conceding fewer goals than xG suggests (good defending or hot keeper); negative means leakier than xG implies. Spots persistent goalkeeper form and last-ditch-defending streaks.
- Bookmaker odds (when available)
- Used for the "spicy pick" — finding longer-priced outcomes the model rates higher than the market. Not used by the main outcome prediction.
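Two of the features above reduce to short formulas, sketched here under stated assumptions. The Elo expected score uses the standard logistic curve with a 400-point scale, which is the usual chess convention and not something this page confirms; the 80-point home advantage comes from the text. The function names and the most-recent-first result ordering are hypothetical.

```python
HOME_ADVANTAGE = 80  # Elo points added to the home side, per the feature description

POINTS = {"W": 3, "D": 1, "L": 0}


def elo_expected_home(home_elo, away_elo, home_adv=HOME_ADVANTAGE):
    """Standard Elo expected score for the home side (a win counts 1, a draw 0.5).

    Assumption: the conventional 400-point logistic scale.
    """
    diff = home_elo + home_adv - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400))


def form_points(results):
    """Total points from a sequence of 'W'/'D'/'L' results."""
    return sum(POINTS[r] for r in results)


def form_trajectory(results_recent_first):
    """Points from the last 5 matches minus points from matches 6-10.

    Positive values mean the team is improving; negative values mean it is sliding.
    """
    last_five = results_recent_first[:5]
    previous_five = results_recent_first[5:10]
    return form_points(last_five) - form_points(previous_five)


# A side rated 80 points below its visitors is level once home advantage is added.
print(elo_expected_home(1500, 1580))

# Five straight wins after five straight losses: the largest possible upswing.
print(form_trajectory(list("WWWWWLLLLL")))
```

Note that an Elo expected score is not a win probability on its own, because it folds draws in at half weight. In the pipeline described here the rating differential is an input feature, and the stacker learns how to turn it into W/D/L probabilities.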
How we measure ourselves
- Outcome accuracy — correct W/D/L pick, walk-forward eval (no future data).
- Brier score — penalises overconfident wrong predictions; lower is better.
- Walk-forward validation — the model is retrained at every matchday using only data available at that point in time. No leakage.
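The three measurements above can be sketched in a few lines. This is an illustration under assumptions: it takes "Brier per outcome" to mean the squared error averaged over fixtures and over the three outcomes, and it represents each fixture as a dict with a hypothetical `matchday` key. Under that Brier definition, a forecaster that always says one third for each outcome scores 2/9, about 0.2222, which gives a reference point for the figures in the changelog.

```python
OUTCOMES = ("home", "draw", "away")


def brier_per_outcome(forecasts, results):
    """Mean squared error between forecast probabilities and what happened.

    forecasts: list of dicts keyed by OUTCOMES; results: list of outcome names.
    Assumption: averaged over fixtures AND over the three outcomes.
    """
    total = 0.0
    for probs, actual in zip(forecasts, results):
        for outcome in OUTCOMES:
            hit = 1.0 if outcome == actual else 0.0
            total += (probs[outcome] - hit) ** 2
    return total / (len(results) * len(OUTCOMES))


def accuracy(forecasts, results):
    """Share of fixtures where the most likely outcome was the actual one."""
    picks = [max(OUTCOMES, key=lambda o: probs[o]) for probs in forecasts]
    return sum(pick == actual for pick, actual in zip(picks, results)) / len(results)


def walk_forward_splits(fixtures):
    """Yield (train, test) pairs where train only holds earlier matchdays."""
    matchdays = sorted({f["matchday"] for f in fixtures})
    for day in matchdays[1:]:
        train = [f for f in fixtures if f["matchday"] < day]
        test = [f for f in fixtures if f["matchday"] == day]
        yield train, test


uniform = {"home": 1 / 3, "draw": 1 / 3, "away": 1 / 3}
print(brier_per_outcome([uniform, uniform], ["home", "draw"]))  # about 0.2222
```

The split generator is the part that enforces "no leakage": every test matchday is scored by a model that has only seen strictly earlier matchdays.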
What we deliberately don't do
- No team hardcoding — we don't tell the model that "Manchester City is good". It learns from the data.
- No manager-name features — we use tenure (how long), not identity.
- No betting integration — picks are educational; calibration matters more than accuracy if we ever bet.
- No proprietary tracking data — we use commercial-API stats (API-Football), not StatsBomb event data or Hawk-Eye traces.
Model changelog & transparency
Each fixture in the history page carries the model variant that produced its pick (look for the (i) hover popup). Once a fixture kicks off, its prediction is locked: future model upgrades never retroactively change historical picks.
v0.6 — current production (May 2026)
What changed: Same v0.5 hybrid architecture (XGBoost + LogisticRegression blend) plus 10 nothing-to-play-for binary flags per fixture and a continuous season_phase axis. The flags fire when a team is mathematically locked into title-won, relegated, auto-promoted, playoff-locked-no-auto, or safe-no-climb — encoding end-of-season motivation that bookmakers often misprice.
- Why: Late-season fixtures behave differently from mid-season ones — title-locked teams rotate, relegated teams give up, etc. v0.5 had no explicit signal for this. The flags fire on ~5% of all fixtures and concentrate in the last 5–8 matchdays of each season.
- Walk-forward backtest (2024 + 2025, n=6,607 fixtures): Brier per outcome 0.2042 → 0.2042 (calibration unchanged); accuracy 49.74% → 49.72%. Picks-engine PnL £326.60 → £477.30 on £2,580 staked = ROI +12.91% → +18.50%. The lift is concentrated where the flags fire (flagged-stratum ROI +63.96% → +84.52%) and bleeds through to a small global ROI gain because the flags shift probability mass into better picks-engine targets.
- Honesty note: v0.6's lift is stake-targeting rather than calibration. Brier doesn't move; the model picks a marginally better set of fixtures to bet on. That's where the roughly £150 PnL improvement (£326.60 → £477.30) comes from on the same fixture pool.
- Architecture: Unchanged from v0.5. The flags are inputs to the existing XGB+LR stacker, not a new layer. ADR 0001 still holds.
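To make the "mathematically locked" idea concrete, here is a simplified sketch of two such flags and the season_phase axis. It is an assumption-laden illustration, not the production logic: it ignores tie-breakers such as goal difference, compares against a single rival rather than the full table, and every function and parameter name is hypothetical.

```python
def season_phase(matches_played, season_length):
    """Continuous 0-to-1 position in the season (assumed definition)."""
    return matches_played / season_length


def is_relegated(points, games_left, safety_line_points):
    """True when winning every remaining game still cannot reach the safety line.

    safety_line_points: current points of the lowest-placed safe team.
    Strict inequality, so a possible tie on points is not counted as locked.
    """
    return points + 3 * games_left < safety_line_points


def has_won_title(leader_points, runner_up_points, runner_up_games_left):
    """True when the runner-up cannot reach the leader's current points."""
    return runner_up_points + 3 * runner_up_games_left < leader_points


# 20 points with 3 games left tops out at 29, one short of a 30-point safety line.
print(is_relegated(20, 3, 30))
print(has_won_title(90, 80, 3))
print(season_phase(19, 38))
```

Using strict inequalities keeps the flags conservative: a team that could still draw level on points is treated as having something to play for.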
v0.5 — hybrid stacker (May 6 2026 → May 7 2026)
What changed: Replaced the single XGBoost stacker with an XGBoost + LogisticRegression hybrid blended by live league-relative dominance (ppm_z). LR keeps extrapolating (linearly in log-odds) past XGB's response-curve plateau, which rescues the model on extreme fixtures (Bayern-class dominance) where trees compress predictions.
- Why: XGBoost's tree splits don't extrapolate beyond the training range. Top-league dominators (Bayern, City) get under-rated when their ppm_z is far above any team in the training distribution. LR's sigmoid kept extrapolating; blending recovered the lost confidence.
- Backtest: Walk-forward 2024 + 2025 Brier per outcome 0.2054 → 0.2042 (-0.0012). Picks-engine ROI -4.31% → +12.91% — but most of the ROI gain is the threshold change (v0.4 used flat +10% edge; v0.5 introduced tiered +50% safe / +75% extreme), not the model.
- Lifespan: One day in production before v0.6 promoted on top.
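One way such a blend could work is sketched below. The page does not say how ppm_z maps to a blend weight, so the linear ramp and its `start` and `full` thresholds are assumptions chosen for illustration, as are the function names. The idea shown is only that the LR share grows as dominance moves further from the typical range.

```python
def lr_weight(ppm_z, start=1.5, full=3.0):
    """Weight given to logistic regression, from 0 to 1.

    Assumption: a linear ramp that is 0 below `start` and 1 above `full`.
    """
    x = (abs(ppm_z) - start) / (full - start)
    return min(1.0, max(0.0, x))


def blend(xgb_probs, lr_probs, ppm_z):
    """Mix the two models' W/D/L probabilities and renormalise to sum to 1."""
    w = lr_weight(ppm_z)
    mixed = {k: (1 - w) * xgb_probs[k] + w * lr_probs[k] for k in xgb_probs}
    total = sum(mixed.values())
    return {k: v / total for k, v in mixed.items()}


xgb = {"home": 0.70, "draw": 0.18, "away": 0.12}  # trees plateau here
lr = {"home": 0.84, "draw": 0.10, "away": 0.06}   # LR keeps extrapolating

print(blend(xgb, lr, ppm_z=0.5))   # typical fixture: XGBoost only
print(blend(xgb, lr, ppm_z=2.25))  # halfway up the ramp: even mix
print(blend(xgb, lr, ppm_z=3.5))   # extreme dominance: LR only
```

A ramp rather than a hard switch avoids a jump in predicted probability when a team's ppm_z crosses the threshold between two matchdays.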
v0.4 — cross-season residual stacker (May 2026)
What changed: Same 2-stage architecture as v0.3 (DC base + XGB stacker + draw post-processor rule). Stacker retrained with 10 extra cross-season residual features — rolling 10-match attack residual (goals − xG) and defensive residual (xGA − goals against), per team, with prior-season backfill when current count is below 10.
- Why: The previous stacker had no signal for finishing quality independent of Elo. A team like Bournemouth converting beyond xG, or Chelsea consistently under-finishing, looked the same to the model as their long-run strength suggested. The residual features expose the gap directly.
- Backtest: Cohort-aware walk-forward eval (2024 + 2025) showed top cohort +0.23/+0.00pp, english cohort +0.67/+0.48pp. Net +1.38pp summed across cohort×year cells, zero regressions.
- Architecture: Unchanged from v0.3. ADR 0001 locks the 2-stage pattern through Beta. Future feature additions go INTO the stacker, not as new layers.
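The residual features with prior-season backfill can be sketched as below. This is a minimal version under assumptions: each match is a dict with hypothetical keys `goals_for`, `xg_for`, `goals_against`, and `xg_against`, both lists are ordered oldest first, and both contain only matches played before the fixture being predicted.

```python
def rolling_residuals(current_season, previous_season, window=10):
    """Return (attack_residual, defensive_residual) averaged over the window.

    attack residual    = goals scored - xG          (positive: over-finishing)
    defensive residual = xG against - goals against (positive: conceding less than xG)
    Backfills from the tail of the previous season when the current season
    has fewer than `window` matches. Returns None if no matches are available.
    """
    matches = current_season[-window:]
    missing = window - len(matches)
    if missing > 0 and previous_season:
        matches = previous_season[-missing:] + matches
    if not matches:
        return None
    n = len(matches)
    attack = sum(m["goals_for"] - m["xg_for"] for m in matches) / n
    defence = sum(m["xg_against"] - m["goals_against"] for m in matches) / n
    return attack, defence


def match(gf, xg, ga, xga):
    return {"goals_for": gf, "xg_for": xg, "goals_against": ga, "xg_against": xga}


# Four matches into a new season, so six are borrowed from last season's tail.
this_season = [match(2, 1.0, 1, 1.5)] * 4
last_season = [match(0, 1.0, 2, 1.0)] * 8
print(rolling_residuals(this_season, last_season))
```

Dividing by the number of matches actually used, rather than by the nominal window, keeps the feature on the same scale for a newly promoted side that has little or no prior-season data in the source.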
v0.3 — production May 2026 (briefly)
What changed: Cohort-split XGBoost stackers (top vs english) replaced a single 8-league stacker, plus a league-targeted draw post-processor.
- Why: A single 8-league stacker dragged top-league accuracy down because the noisier English Championship/L1/L2 data competed for the same model capacity. Splitting recovered ~0.5pp on top leagues without losing the lower-tier coverage.
- Behavior: Top stacker trained on PL/BL1/SA/PD/FL1; english stacker on ELC/EL1/EL2. Each fixture dispatched by competition.
- Draw post-processor: Fires only on coin-flip fixtures in PD/SA/FL1/EL1 where both teams have been blowing leads and sit in the bottom half. Cohort analysis showed those leagues genuinely convert chaos into draws (+8pp draw rate), but PL/BL1/ELC do the opposite — adding the rule there hurt accuracy. Targeted, principled, not arbitrary.
- Trade-off honestly noted: Marginal pooled lift (~0.4pp) over v0.1. The gain is concentrated in BL1/PD/FL1 where the stacker has signal; PL is roughly flat. Brier score improved (~0.198 vs 0.205).
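The dispatch-by-competition step described above amounts to a lookup. The sketch below uses the league codes from the text; the `cohort_for` and `predict` names, the fixture dict shape, and the callables standing in for trained stackers are all hypothetical.

```python
TOP_COHORT = {"PL", "BL1", "SA", "PD", "FL1"}
ENGLISH_COHORT = {"ELC", "EL1", "EL2"}


def cohort_for(competition):
    """Map a competition code to the stacker cohort that handles it."""
    if competition in TOP_COHORT:
        return "top"
    if competition in ENGLISH_COHORT:
        return "english"
    raise ValueError(f"no stacker cohort for competition {competition!r}")


def predict(fixture, stackers):
    """Route a fixture to its cohort's stacker.

    stackers: mapping of cohort name to a callable that takes a fixture.
    """
    return stackers[cohort_for(fixture["competition"])](fixture)


# Stand-in stackers that just report which model was used.
stackers = {
    "top": lambda fixture: "top stacker",
    "english": lambda fixture: "english stacker",
}
print(predict({"competition": "BL1"}, stackers))
print(predict({"competition": "EL2"}, stackers))
```

Raising on an unknown code, rather than falling back to one cohort, makes a newly added league fail loudly until someone decides which stacker it belongs to.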
v0.2 — single-cohort stacker (deprecated same-day)
What changed: First stacker ship — one XGBoost model on top of v0.1, trained across all 8 leagues with one shared model.
- Why retired: Per-league eval revealed the lower English tiers were diluting top-league signal. Replaced within hours by the cohort-split v0.3 above.
v0.1 — Dixon-Coles base (Feb 2026 → May 2026)
What it did: Per-league Dixon-Coles fit on attack/defence ratings, blended with a cross-league XGBoost on rolling features (60/40), then wrapped with a multinomial logistic calibration on prior backtest pairs.
- Walk-forward accuracy: ~50% on PL-2025.
- Strong on home/away picks, weak on draws (recall ~5%).
- Now serves as the base layer that v0.2 stacks on top of.
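For readers unfamiliar with Dixon-Coles, the step from scoring rates to W/D/L probabilities can be sketched as follows. It uses the low-score correction from the 1997 paper; the rates `lam` and `mu`, the `rho` value of -0.1, and the 10-goal cutoff are illustrative assumptions. Fitting the ratings, the recency weighting, the 60/40 blend, and the calibration wrapper are not shown.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k goals at the given scoring rate."""
    return exp(-rate) * rate ** k / factorial(k)


def tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment for the four lowest scorelines."""
    if x == 0 and y == 0:
        return 1 - lam * mu * rho
    if x == 0 and y == 1:
        return 1 + lam * rho
    if x == 1 and y == 0:
        return 1 + mu * rho
    if x == 1 and y == 1:
        return 1 - rho
    return 1.0


def outcome_probs(lam, mu, rho=-0.1, max_goals=10):
    """Sum the adjusted score grid into home/draw/away probabilities.

    lam, mu: expected goals for the home and away side.
    """
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise after truncating the grid
    return {"home": home / total, "draw": draw / total, "away": away / total}


print(outcome_probs(lam=1.6, mu=1.1))
```

With a negative `rho`, the correction moves probability toward 0-0 and 1-1 and away from 1-0 and 0-1, which is the paper's answer to plain Poisson models under-counting low-scoring draws.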
What we deliberately tried and rejected
- In-model bottle features: Added 5 lead-blown features into the XGBoost stacker. Cost 0.31pp on 2024, which we put down to a capacity ceiling. Dropped; the signal lives in the post-processor instead.
- Universal draw post-processor (no league hardcode): Tried applying the bottle/form/position rule across all leagues. Lost 1pp on PL. League IS the discriminator: the cultural pattern doesn't generalise.
- Tighter coin-flip thresholds (β=2, thr=0.7): Looked good on 2025, but cost 3.7pp on PD in 2024. Settled on conservative defaults that work on both years.
What's next
- Per-cohort hyperparameter tuning — the english cohort may need different tree depth/regularisation than the top cohort.
- Focal loss / weighted XGBoost — academic literature flags this as the highest-yield untried approach for the underprediction-of-draws problem.
- Bayesian network for mold-breakers — better handling of mid-season team-personality flips (Bournemouth turnaround, Spurs collapse).
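As background on the focal-loss item, the per-example loss is small enough to show directly. This sketch is the standard focal-loss formula applied to the probability given to the true class; the default `gamma` of 2.0 is a common choice in the literature, not a setting from this project, and wiring it into XGBoost as a custom objective is not shown.

```python
from math import log


def focal_loss(prob_true_class, gamma=2.0):
    """Focal loss for one example: -(1 - p)^gamma * log(p).

    p is the probability the model gave to the outcome that actually happened.
    gamma = 0 reduces to ordinary log loss; larger gamma down-weights examples
    the model already gets right with high confidence.
    """
    return -((1 - prob_true_class) ** gamma) * log(prob_true_class)


# A confident, correct home-win forecast contributes far less than under log loss...
print(focal_loss(0.9), focal_loss(0.9, gamma=0.0))

# ...while a draw the model rated at only 25% keeps most of its weight.
print(focal_loss(0.25), focal_loss(0.25, gamma=0.0))
```

The relevance to draws is that they tend to be the outcomes assigned low probability, so they keep a larger share of the training signal once easy examples are down-weighted.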
Two-stage stack inspired by Dixon-Coles (1997) plus modern gradient boosting.