Model Validation | 7 min read | December 4, 2024

Twelve Folds of Evidence: Walk-Forward Validation That Earns Trust

Anchored walk-forward optimization with an expanding window confirms that out-of-sample (OOS) performance stays within 13% of in-sample (IS) performance across 12 independent folds.

Tags: Walk-Forward, Cross-Validation, Out-of-Sample

Beyond Simple Train/Test Splits

A single train/test split proves nothing. Maybe your test set happened to contain a regime that matches your training set. Maybe it avoided the hard periods. S24 uses anchored walk-forward validation with 12 folds to test the strategy across every market regime the data contains.

Each fold uses an expanding training window (anchored at the start of the dataset) and a fixed 6-month test window. Fold 1 trains on months 1-12 and tests on 13-18. Fold 2 trains on months 1-18 and tests on 19-24. Because the training window only grows, every fold trains on all data available before its test period begins, which is how the model would be trained in production.
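The fold layout can be sketched in a few lines of Python. This is a minimal illustration, not the S24 code: the function name `anchored_folds` and its parameters are hypothetical, and the 12-month initial window and 6-month step are extrapolated from the fold 1 and fold 2 examples to all 12 folds, which is an assumption.

```python
def anchored_folds(initial_train_months=12, test_months=6, n_folds=12):
    """Build anchored (expanding-window) walk-forward folds.

    Returns a list of (train_start, train_end, test_start, test_end)
    tuples of 1-based, inclusive month indices. The defaults are
    assumptions extrapolated from the two folds described in the text.
    """
    folds = []
    for k in range(n_folds):
        # Training always starts at month 1 and grows by one test window per fold.
        train_end = initial_train_months + k * test_months
        # The test window starts right after training ends, so it never
        # overlaps the data the model was fitted on.
        test_start = train_end + 1
        test_end = train_end + test_months
        folds.append((1, train_end, test_start, test_end))
    return folds


folds = anchored_folds()
for i, (tr_start, tr_end, te_start, te_end) in enumerate(folds, start=1):
    print(f"Fold {i:2d}: train months {tr_start}-{tr_end}, test months {te_start}-{te_end}")
```

Test windows are consecutive and non-overlapping, while every training window contains all earlier ones.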

OOS Within 13% of IS

The critical metric is the IS-to-OOS degradation ratio. Perfect transfer would mean OOS performance equals IS performance (0% degradation). Heavy overfitting produces OOS well below IS. Across V7's 12 folds, OOS Sharpe ratio averaged 87% of IS Sharpe ratio, meaning 13% degradation.
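As a rough sketch of the arithmetic, degradation is one minus the ratio of OOS Sharpe to IS Sharpe. The function name and the Sharpe values below are hypothetical, chosen only to reproduce the 87% figure; they are not results from the actual folds.

```python
def degradation(is_sharpe, oos_sharpe):
    """Fractional IS-to-OOS degradation; 0.0 means perfect transfer."""
    if is_sharpe <= 0:
        # The ratio is not meaningful when the in-sample Sharpe is zero or negative.
        raise ValueError("IS Sharpe must be positive")
    return 1.0 - oos_sharpe / is_sharpe


# Hypothetical values for illustration only.
is_sharpe = 2.0
oos_sharpe = 1.74
d = degradation(is_sharpe, oos_sharpe)
print(f"OOS/IS = {oos_sharpe / is_sharpe:.0%}, degradation = {d:.0%}")
```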

That 13% degradation is consistent across folds, which is more important than the absolute number. If degradation were 5% on some folds and 40% on others, it would suggest regime-dependent overfitting. Consistent 13% suggests stable model generalization with a predictable performance haircut.
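One way to check consistency is to look at the spread of per-fold degradation, not just its mean. The sketch below uses made-up Sharpe values and a hypothetical helper, `degradation_summary`. Both examples average 13% degradation, but only one of them is stable across folds.

```python
from statistics import mean, pstdev


def degradation_summary(is_sharpes, oos_sharpes):
    """Summarize per-fold degradation (1 - OOS/IS) across folds."""
    if len(is_sharpes) != len(oos_sharpes):
        raise ValueError("need one IS and one OOS Sharpe per fold")
    per_fold = [1.0 - oos / is_s for is_s, oos in zip(is_sharpes, oos_sharpes)]
    return {
        "mean": mean(per_fold),
        "stdev": pstdev(per_fold),  # population std dev across the folds
        "min": min(per_fold),
        "max": max(per_fold),
    }


# Made-up numbers: four folds each, same IS Sharpe everywhere.
consistent = degradation_summary([2.0, 2.0, 2.0, 2.0], [1.76, 1.72, 1.74, 1.74])
uneven = degradation_summary([2.0, 2.0, 2.0, 2.0], [1.90, 1.20, 1.90, 1.96])

for name, summary in (("consistent", consistent), ("uneven", uneven)):
    print(f"{name}: mean={summary['mean']:.2f}, stdev={summary['stdev']:.3f}, "
          f"range={summary['min']:.2f}..{summary['max']:.2f}")
```

The mean alone cannot tell the two cases apart. The standard deviation and range can.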

Why Twelve Folds Is Enough

With 7.5 years of data and 6-month test windows, 12 folds cover the dataset, with training windows that overlap by design and test windows that do not. More folds would mean shorter test windows, making each fold's results statistically less reliable. Fewer folds would mean less coverage of different market regimes. Twelve folds was the balance point. Each fold contains 250-400 trades, enough for statistically meaningful win rate and Sharpe estimates. Every fold separately confirms that the strategy has genuine edge, not just IS performance that fails to transfer. That is 12 separate pieces of out-of-sample evidence, and collectively they support the probability of backtest overfitting (PBO) of 0.112 that S21 calculates.
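To get a feel for why the trade count per fold matters, the sketch below computes the binomial standard error of a win rate. It assumes trades are independent, which real trades often are not, and it uses a hypothetical 50% win rate, not a figure from the strategy.

```python
import math


def win_rate_standard_error(win_rate, n_trades):
    """Binomial standard error of an observed win rate.

    Assumes independent trades; correlated trades would make the true
    uncertainty larger than this estimate.
    """
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must be between 0 and 1")
    if n_trades <= 0:
        raise ValueError("n_trades must be positive")
    return math.sqrt(win_rate * (1.0 - win_rate) / n_trades)


# Hypothetical 50% win rate at the two ends of the per-fold trade range.
for n in (250, 400):
    se = win_rate_standard_error(0.5, n)
    print(f"{n} trades: +/- {1.96 * se:.1%} (approx. 95% interval half-width)")
```

Under these assumptions the uncertainty shrinks with the square root of the trade count, so halving the test window would widen each fold's interval by roughly 40%.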
