When Your Model Says 70% but Reality Says 55%: Fixing Probability Calibration
How Platt scaling and isotonic regression calibrate L1 probability outputs, reducing average calibration error from 8 to 2 percentage points.
The Calibration Problem
XGBoost outputs numbers between 0 and 1, but they are not reliable probabilities. When a V7 L1 model outputs 0.70, the actual win rate for those signals might be 55%, 62%, or 70%. Without calibration, you cannot trust the probability scores for sizing decisions, confidence ranking, or risk management.
S37 applies Platt scaling (logistic regression on model outputs vs actual outcomes) and isotonic regression to convert raw model outputs into calibrated probabilities. After calibration, a model output of 0.65 corresponds to an actual win rate of 63-67%.
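To make the two methods concrete, here is a minimal from-scratch sketch, not S37's actual code. It assumes the raw model output is fed to the logistic fit directly, simplifies isotonic regression to a step function, and uses an invented toy calibration set. All function names are hypothetical.

```python
import math
from collections import defaultdict

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, outcomes, iters=50):
    """Fit p = sigmoid(a * score + b) by Newton's method on the log loss."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for s, y in zip(scores, outcomes):
            p = sigmoid(a * s + b)
            w = p * (1.0 - p)
            g_a += (p - y) * s      # gradient terms
            g_b += (p - y)
            h_aa += w * s * s       # Hessian terms
            h_ab += w * s
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if abs(det) < 1e-12:
            break
        # Newton step: subtract H^-1 * g (2x2 inverse written out by hand).
        a -= (h_bb * g_a - h_ab * g_b) / det
        b -= (h_aa * g_b - h_ab * g_a) / det
    return a, b

def platt_predict(score, a, b):
    return sigmoid(a * score + b)

def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators; returns sorted (top_score, win_rate) steps."""
    totals = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, outcomes):   # pool tied scores first
        totals[s][0] += y
        totals[s][1] += 1
    blocks = []  # each block: [sum of outcomes, count, highest score in block]
    for s in sorted(totals):
        blocks.append([totals[s][0], totals[s][1], s])
        # Merge backwards while the fitted win rate would decrease.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            sum_y, n, top = blocks.pop()
            blocks[-1][0] += sum_y
            blocks[-1][1] += n
            blocks[-1][2] = top
    return [(top, sum_y / n) for sum_y, n, top in blocks]

def isotonic_predict(score, steps):
    """Step lookup: first block whose top score covers `score`."""
    for top, rate in steps:
        if score <= top:
            return rate
    return steps[-1][1]

# Invented calibration set: raw score -> wins out of 100 signals.
buckets = {0.30: 40, 0.50: 48, 0.70: 55, 0.90: 65}
scores, outcomes = [], []
for raw, wins in buckets.items():
    scores += [raw] * 100
    outcomes += [1] * wins + [0] * (100 - wins)

a, b = fit_platt(scores, outcomes)
steps = fit_isotonic(scores, outcomes)
print("Platt at 0.70:", round(platt_predict(0.70, a, b), 3))
print("Isotonic at 0.70:", isotonic_predict(0.70, steps))
```

Platt scaling learns only two numbers (a and b), while the isotonic fit keeps one value per block. That difference in flexibility is why the two methods behave differently on small calibration sets.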
Reducing Calibration Error
Before calibration, the average calibration error across all clusters was 8%. That means raw model probabilities differed from observed win rates by 8 percentage points on average. After Platt scaling, calibration error dropped to 2%. Isotonic regression produced even lower error (1.5%) but overfits on smaller datasets.
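The article does not say exactly how calibration error is computed. One common definition is sketched below: bin the predictions, then take the count-weighted average gap between mean predicted probability and observed win rate. The function name, the equal-width bins, and the toy numbers (chosen only to echo the 8% and 2% figures) are assumptions.

```python
def calibration_error(probs, outcomes, n_bins=10):
    """Count-weighted mean |avg predicted prob - observed win rate| over bins."""
    bins = [[0.0, 0.0] for _ in range(n_bins)]  # [sum of probs, sum of outcomes]
    for p, y in zip(probs, outcomes):
        i = min(int(p * n_bins), n_bins - 1)    # equal-width bin index
        bins[i][0] += p
        bins[i][1] += y
    # |sum_p - sum_y| / total equals |mean_p - win_rate| weighted by bin count.
    return sum(abs(sum_p - sum_y) for sum_p, sum_y in bins) / len(probs)

# Invented example: two groups of 100 signals with 62 and 47 wins.
outcomes = [1] * 62 + [0] * 38 + [1] * 47 + [0] * 53
raw = [0.70] * 100 + [0.55] * 100          # raw outputs, overconfident
calibrated = [0.64] * 100 + [0.49] * 100   # hypothetical calibrated outputs

print(round(calibration_error(raw, outcomes), 3))         # 0.08
print(round(calibration_error(calibrated, outcomes), 3))  # 0.02
```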
V7 uses Platt scaling as the primary calibration method because it has fewer parameters and generalizes better to new data. Isotonic regression is used for validation comparison only. The calibration is performed on a held-out calibration set that is separate from both training and test sets, preventing calibration overfitting.
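The three-way split can be sketched as follows. This assumes rows are ordered by time and split chronologically, which the article does not state. The 60/20/20 fractions and the function name are hypothetical.

```python
def three_way_split(rows, train_frac=0.6, calib_frac=0.2):
    """Split time-ordered rows into train, calibration, and test segments."""
    n = len(rows)
    train_end = int(n * train_frac)
    calib_end = train_end + int(n * calib_frac)
    return rows[:train_end], rows[train_end:calib_end], rows[calib_end:]

rows = list(range(1000))  # stand-in for time-ordered signals
train, calib, test = three_way_split(rows)

# Intended workflow:
#   1. fit the model on `train`
#   2. fit the calibrator on the model's outputs for `calib`
#   3. report calibration error on `test`, which neither step has seen
print(len(train), len(calib), len(test))  # 600 200 200
```

Because the calibrator never sees the training rows, it corrects the model's behavior on unseen data rather than on examples the model has already memorized.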
Why Calibration Enables Everything Else
S37 is foundational for several other modules. S14 (logit accuracy weighting) relies on calibrated probabilities for confidence ranking. S05 (dynamic position sizing) uses calibrated confidence for Kelly-derived sizing. S28 (multi-timeframe confluence) adjusts calibrated confidence based on alignment. Without S37, all of these modules would be operating on unreliable inputs.
Reducing calibration error from 8% to 2% does not just improve one module. It improves the entire chain of modules that depend on probability estimates. That cascading effect makes S37 one of the highest-leverage modules in the system despite having no direct impact on trade entry or exit.
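To see how a miscalibrated probability propagates into sizing, consider the textbook Kelly fraction for a binary bet, f = p - (1 - p) / b, where p is the win probability and b is the payoff ratio. This is an illustration only. S05's actual sizing formula is not given here, and the 1:1 payoff ratio is an assumption.

```python
def kelly_fraction(p_win, payoff_ratio=1.0):
    """Textbook Kelly fraction for a binary bet, floored at zero."""
    return max(0.0, p_win - (1.0 - p_win) / payoff_ratio)

# Raw model output of 0.70 versus a true win rate of 0.55.
print(round(kelly_fraction(0.70), 2))  # 0.4
print(round(kelly_fraction(0.55), 2))  # 0.1
```

Under these assumptions, sizing from the raw 0.70 output would stake four times the fraction that the true 55% win rate justifies.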