When One Model Is Not Enough: Bootstrap Aggregation for Signal Confidence
A bagged ensemble of L1 models, trained with bootstrap aggregation, that reduces signal variance by 25% and provides prediction confidence intervals.
The Single Model Problem
A single XGBoost model gives you one probability estimate. You have no idea whether that estimate is stable or an artifact of the specific training sample. S17 trains 10 XGBoost models per cluster, each on a different bootstrap sample (random sampling with replacement) of the training data. The ensemble's average prediction is more stable than any single model.
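A minimal sketch of that bagging step, assuming numpy arrays and the xgboost Python package; the hyperparameters and function names are illustrative, not S17's actual configuration:

```python
import numpy as np
import xgboost as xgb

N_MODELS = 10  # models per cluster, per the text

def train_bagged_ensemble(X, y, n_models=N_MODELS, seed=42):
    """Train n_models XGBoost classifiers, each on a bootstrap
    sample (rows drawn with replacement) of the training data."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the original set, with replacement.
        idx = rng.integers(0, n_rows, size=n_rows)
        model = xgb.XGBClassifier(n_estimators=200, max_depth=4,
                                  learning_rate=0.05)  # hypothetical settings
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def ensemble_predict(models, X):
    """The ensemble signal is the mean of the per-model probabilities."""
    probs = np.stack([m.predict_proba(X)[:, 1] for m in models])
    return probs.mean(axis=0)
```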
The key metric is variance reduction. Across all clusters, the bootstrap ensemble reduced signal variance by 25% compared to a single model. This means fewer false signals caused by model sensitivity to specific training examples.
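The text does not specify how the 25% figure is computed; one plausible measurement, shown here as an assumption, compares the variance of a single model's signal series to the variance of the ensemble-mean series over the same bars:

```python
def signal_variance_reduction(single_signal, ensemble_signal):
    """Fractional variance reduction of the ensemble signal relative to
    a single model's signal over the same period (0.25 would match the
    25% figure quoted above)."""
    return 1.0 - np.var(ensemble_signal) / np.var(single_signal)
```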
Confidence Intervals from Disagreement
Beyond the average prediction, the spread across the 10 models provides a natural confidence interval. When all 10 models agree on a strong buy (all above 0.60), the signal is high confidence. When they split (6 above threshold, 4 below), the signal is uncertain.
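A sketch of those disagreement metrics, reusing the (n_models, n_samples) probability array from the training sketch above; the 0.60 threshold is from the text, while the percentile-based interval is an assumed construction:

```python
def ensemble_stats(probs, threshold=0.60):
    """probs: array of shape (n_models, n_samples)."""
    mean = probs.mean(axis=0)                        # the signal itself
    lo, hi = np.percentile(probs, [5, 95], axis=0)   # spread-based interval
    agree_frac = (probs > threshold).mean(axis=0)    # 10/10 vs 6/10 agreement
    return mean, (lo, hi), agree_frac
```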
S17 feeds this disagreement metric into S14's confidence calibration. Signals where the ensemble agrees strongly receive higher calibrated confidence, reinforcing the sizing advantage. Signals where the ensemble disagrees get lower confidence and smaller position sizes.
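S14's calibration itself is not shown in the text; the linear discount below is a hypothetical stand-in that illustrates the direction of the effect, with split ensembles (near 6/10 agreement) driving confidence, and therefore position size, toward zero:

```python
def calibrated_confidence(raw_confidence, agree_frac):
    """Scale confidence by ensemble agreement: 10/10 keeps the raw
    value, 5/10 or worse zeroes it out (hypothetical mapping)."""
    agreement_weight = max(0.0, (agree_frac - 0.5) / 0.5)
    return raw_confidence * agreement_weight
```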
The Computational Trade-Off
Running 10 models per cluster across 6 clusters means 60 total L1 models, which is why the engine reports L1=60 on initialization. The computational cost is 10x a single model, but XGBoost inference is fast enough that all 60 evaluations complete in milliseconds. Trading 10x compute for a 25% variance reduction is clearly worthwhile in a system where stability matters more than raw speed. In production, all 60 models run on every bar, and the ensemble average drives the trading decision.
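A sketch of that per-bar loop, assuming a dict of six fitted ensembles keyed by cluster id; feature construction and cluster routing are outside the text and stubbed here:

```python
def on_bar(ensembles, features_by_cluster):
    """Evaluate all 60 models (6 clusters x 10 models) on the current bar
    and return one ensemble-average signal per cluster."""
    signals = {}
    for cluster_id, models in ensembles.items():
        X = features_by_cluster[cluster_id]           # shape (1, n_features)
        probs = np.array([m.predict_proba(X)[0, 1] for m in models])
        signals[cluster_id] = probs.mean()            # average drives the trade
    return signals
```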