Model Validation · 6 min read · October 28, 2024

Your Backtest Is Lying (By a Known Amount)

S42 calculates a realistic 15% performance haircut accounting for data mining, multiple testing, and implementation shortfall.

Haircut · Data Mining Bias · Multiple Testing

The Three Sources of Overstated Performance

Every backtest overstates live performance for three quantifiable reasons. First, data mining bias: we tested many configurations and kept the best one. Second, multiple testing: we evaluated multiple feature sets, thresholds, and model architectures. Third, implementation shortfall: the gap between simulated and actual execution.

S42 quantifies each source and applies a combined haircut to the raw backtest results. Data mining bias contributes approximately a 7% haircut, multiple testing adds 5%, and implementation shortfall adds 3%, for a combined haircut of approximately 15%.
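As a rough illustration of how the combined haircut is applied, here is a minimal sketch. The component percentages come from the figures above; the additive combination is assumed from the way the text sums 7% + 5% + 3% to 15%, and the raw total used in the example is a hypothetical placeholder.

```python
# Combined haircut, assuming the three components simply add,
# as the percentages above imply (7% + 5% + 3% = 15%).
DATA_MINING = 0.07
MULTIPLE_TESTING = 0.05
IMPLEMENTATION_SHORTFALL = 0.03

def apply_haircut(raw_r: float) -> float:
    """Scale a raw backtest total (in R) down by the combined haircut."""
    total_haircut = DATA_MINING + MULTIPLE_TESTING + IMPLEMENTATION_SHORTFALL
    return raw_r * (1.0 - total_haircut)

raw_total_r = 500.0  # hypothetical raw backtest total in R
print(f"{apply_haircut(raw_total_r):.1f}R")  # 425.0R after the 15% haircut
```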

How the Haircut Is Calculated

Data mining bias is estimated using the formula from White's Reality Check: the haircut scales with the logarithm of the number of configurations tested. V7 tested approximately 200 parameter configurations during development, which translates to a 7% expected bias.
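To make the scaling concrete, here is a minimal sketch under stated assumptions: White's Reality Check is a bootstrap procedure, so the closed-form log relation and the calibration constant K below are illustrative stand-ins tuned to reproduce the 7% figure quoted above, not values taken from White (2000).

```python
import math

# Sketch of the log-scaling idea: expected data mining bias grows with the
# log of the number of configurations tried. K is a hypothetical calibration
# chosen so that ~200 configurations map to the 7% figure in the text.
K = 0.07 / math.log(200)

def data_mining_haircut(n_configurations: int) -> float:
    """Expected bias (fraction) from keeping the best of n tested configurations."""
    return K * math.log(n_configurations)

print(f"{data_mining_haircut(200):.1%}")  # ~7.0% for roughly 200 configurations
```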

Multiple testing bias uses a Bonferroni-like correction for the number of independent tests conducted. With 38 features, 6 clusters, and 5 model architectures, the nominal search space is large, but correlation among the tests reduces the effective number of independent tests to approximately 50, yielding a 5% adjustment.
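A small sketch of this adjustment follows, with the assumptions labeled: the text does not specify how the nominal combinations collapse to roughly 50 effective tests, nor how 50 tests map to 5%, so both calibrations below are hypothetical and chosen to reproduce the quoted numbers.

```python
import math

# Sketch of the Bonferroni-style adjustment. The nominal combination count
# (38 features x 6 clusters x 5 architectures) is large, but correlated tests
# are not independent, so the effective count is far smaller. The effective
# count (~50) and the per-log-test penalty are illustrative assumptions.
N_FEATURES, N_CLUSTERS, N_ARCHITECTURES = 38, 6, 5
nominal_tests = N_FEATURES * N_CLUSTERS * N_ARCHITECTURES  # 1140 nominal combinations
effective_tests = 50                                        # assumed after correlation discount

PENALTY_PER_LOG_TEST = 0.05 / math.log(50)  # calibrated so 50 effective tests -> 5%

def multiple_testing_haircut(n_effective: int) -> float:
    """Haircut (fraction) that grows with the log of the effective test count."""
    return PENALTY_PER_LOG_TEST * math.log(n_effective)

print(f"{nominal_tests} nominal -> {multiple_testing_haircut(effective_tests):.1%}")
```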

Implementation shortfall is estimated from the difference between S20's slippage model and zero-slippage results, plus an additional buffer for latency, requoting, and platform-specific issues. The 3% figure is conservative.
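A minimal sketch of how such an estimate could be computed: the specific R totals and the 1% buffer below are hypothetical placeholders, since the text only gives the final 3% figure and does not publish the S20 intermediate numbers.

```python
# Sketch of the implementation-shortfall estimate: the relative gap between a
# zero-slippage backtest and the slippage-model backtest, plus a fixed buffer
# for latency, requoting, and platform-specific issues.
def implementation_shortfall(zero_slippage_r: float,
                             slippage_model_r: float,
                             extra_buffer: float = 0.01) -> float:
    """Shortfall as a fraction of the zero-slippage total, plus a safety buffer."""
    modeled_gap = (zero_slippage_r - slippage_model_r) / zero_slippage_r
    return modeled_gap + extra_buffer

# Hypothetical example: a 2% modeled slippage gap plus a 1% buffer -> 3%.
print(f"{implementation_shortfall(550.0, 539.0):.1%}")
```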

Why 15% Is Acceptable

A 15% haircut on 533.9R means the realistic expected total is approximately 454R over 7.5 years. That is still strongly positive. More importantly, the haircut-adjusted monthly return of approximately 3.7% still supports the FTMO challenge profit target of 10%, reachable in roughly 2.7 months at 3.7%/month.

S42 exists to prevent self-deception. It is easy to fall in love with your backtest results and forget that they are the ceiling of expected performance, not the floor. Applying a systematic haircut keeps expectations grounded and ensures the system is built to succeed even under conservative assumptions.
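As a closing sanity check on the arithmetic above, using only the figures already quoted:

```python
# Worked check of the headline numbers.
raw_total_r = 533.9
haircut = 0.15
adjusted_total_r = raw_total_r * (1 - haircut)   # ~453.8R over 7.5 years

monthly_return = 0.037   # haircut-adjusted monthly return
ftmo_target = 0.10       # FTMO challenge profit target
months_to_target = ftmo_target / monthly_return  # ~2.7 months (ignoring compounding)

print(f"{adjusted_total_r:.0f}R, {months_to_target:.1f} months to the 10% target")
```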