ML Architecture · 8 min read · December 26, 2024

Teaching a PPO Agent When to Exit: Reinforcement Learning for Trade Management

A Proximal Policy Optimization (PPO) agent that learns hold-or-exit decisions for open trades, improving average exit R from 0.78 to 0.92.

Reinforcement Learning · PPO · Exit Timing

Why RL for Exits

Supervised learning trains a model to predict a label: will this trade be profitable? Reinforcement learning trains an agent to make sequential decisions: should I hold or exit right now? Exit management is inherently sequential. The optimal action depends on current unrealized P&L, market conditions, and how long you have been in the trade. RL handles this naturally.

S13 uses Proximal Policy Optimization (PPO), an RL algorithm whose clipped objective limits how far each update can move the policy, which keeps training stable. The agent observes 30 features per timestep, including trade-specific metrics (bars_held, current_r, mfe_r, mae_r) alongside market features.
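To make the setup concrete, here is a minimal sketch of how an exit-management environment could be framed for a PPO agent. The article does not publish S13's code, so everything below is an assumption except the feature names bars_held, current_r, mfe_r, and mae_r: the class name, the 26 placeholder market features, and the reward shaping are all hypothetical.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

HOLD, EXIT = 0, 1

class ExitManagementEnv(gym.Env):
    """Hypothetical env: one episode = one open trade, one decision per bar."""

    def __init__(self, trade_paths):
        # trade_paths: list of arrays of shape (n_bars, 27), each row holding
        # the bar's unrealized R followed by 26 market features.
        self.trade_paths = trade_paths
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(30,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0 = hold, 1 = exit

    def _obs(self):
        path = self.trade_paths[self.idx]
        current_r = path[self.t, 0]
        market = path[self.t, 1:]                       # 26 market features
        self.mfe_r = max(self.mfe_r, current_r)         # max favorable excursion
        self.mae_r = min(self.mae_r, current_r)         # max adverse excursion
        trade_feats = np.array([self.t, current_r, self.mfe_r, self.mae_r],
                               dtype=np.float32)        # bars_held, current_r, mfe_r, mae_r
        return np.concatenate([trade_feats, market]).astype(np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.idx = int(self.np_random.integers(len(self.trade_paths)))
        self.t, self.mfe_r, self.mae_r = 0, -np.inf, np.inf
        return self._obs(), {}

    def step(self, action):
        path = self.trade_paths[self.idx]
        current_r = float(path[self.t, 0])
        last_bar = self.t >= len(path) - 1
        if action == EXIT or last_bar:
            # Terminal reward: realized R at the chosen (or forced) exit bar.
            return self._obs(), current_r, True, False, {}
        self.t += 1
        return self._obs(), 0.0, False, False, {}
```

With an environment like this, training could use any off-the-shelf PPO implementation (for example stable-baselines3's PPO with its default clipped objective); the article does not say which implementation S13 uses.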

From 0.78R to 0.92R Average

The headline improvement is average exit R going from 0.78 to 0.92. That 0.14R improvement per trade across 4,505 trades is +631R total. But this overstates the PPO agent's contribution because other exit modules (Hurst-adaptive giveback, LSTM exit prediction) also improved during development.

The PPO agent's isolated contribution, tested by A/B comparison with rule-based exits while keeping other modules fixed, was approximately +0.06R per trade. Still meaningful at +270R total, but honest accounting requires separating the RL contribution from the full exit pipeline improvement.
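The arithmetic behind those totals is just the per-trade delta times the trade count; the numbers below come from the article, and the breakdown itself is only an illustration of the attribution.

```python
trades = 4505

full_pipeline_delta = 0.92 - 0.78   # all exit modules combined, R per trade
ppo_isolated_delta = 0.06           # from the A/B test with other modules fixed

print(f"Full exit pipeline: {full_pipeline_delta * trades:+.0f}R")  # about +631R
print(f"PPO agent alone:    {ppo_isolated_delta * trades:+.0f}R")   # about +270R
```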

What RL Struggles With

The PPO agent performs well in trending and mean-reverting regimes where the reward signal is clear. It struggles in transitional periods where the optimal action changes rapidly. During regime transitions, the agent tends to exit too early because its training data does not contain enough transition examples to learn appropriate hold behavior.

This is why S13 works as part of an ensemble. The LSTM exit model (L3) provides directional predictions. The PPO agent provides hold-or-exit decisions. The Hurst-adaptive giveback provides hard limits. No single component handles all market states well. The ensemble covers each component's blind spots, producing the overall 59.2% win rate and 1.49% max DD that define V7 performance.
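The article names the three components but not how their signals are combined, so the following is a hedged sketch of one plausible combiner: the hard giveback limit overrides everything, and either learned model can trigger an exit. All names, thresholds, and the priority ordering are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ExitDecision:
    exit_now: bool
    reason: str

def ensemble_exit(lstm_exit_prob: float,
                  ppo_action: int,              # 0 = hold, 1 = exit, from the PPO policy
                  current_r: float,
                  mfe_r: float,
                  giveback_limit_r: float,      # Hurst-adaptive threshold (hypothetical input)
                  lstm_threshold: float = 0.7) -> ExitDecision:
    """Hypothetical combiner for the three exit components described above."""
    if mfe_r - current_r >= giveback_limit_r:
        # Hard limit: the trade has given back too much of its open profit.
        return ExitDecision(True, "hurst_giveback")
    if ppo_action == 1:
        return ExitDecision(True, "ppo_exit")
    if lstm_exit_prob >= lstm_threshold:
        return ExitDecision(True, "lstm_exit")
    return ExitDecision(False, "hold")
```

A priority ordering like this mirrors the roles the article assigns to each module: hard limits cap the damage in states neither learned model handles well, while the PPO and LSTM components handle the routine hold-or-exit calls.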