Definition
A policy optimization algorithm that constrains the size of each policy update, giving (approximately) monotonic improvement in policy performance.
Detailed Explanation
Trust Region Policy Optimization (TRPO) stabilizes learning by limiting how far each update can move the policy, measured by KL divergence. Each iteration solves a constrained optimization problem: maximize the surrogate objective E[ π_new(a|s)/π_old(a|s) · A(s,a) ] subject to the trust-region constraint E[ KL(π_old(·|s) ‖ π_new(·|s)) ] ≤ δ. The algorithm therefore takes the largest improvement step it can while keeping the new policy close to the old one. In practice the constrained problem is solved approximately (for example, with a conjugate-gradient step followed by a backtracking line search), which makes learning more stable than with standard, unconstrained policy gradient methods.
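The sketch below is a minimal, illustrative NumPy version of the trust-region acceptance test described above, not a full TRPO implementation: a proposed update is shrunk by backtracking line search until it both improves the surrogate objective and keeps the mean KL divergence under δ. The toy categorical policy, the random step direction, and the value delta = 0.01 are assumptions made for illustration; a complete implementation would also compute the step direction via a conjugate-gradient solve against the Fisher information matrix.

import numpy as np

def softmax(logits):
    # Numerically stable softmax over the action dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def surrogate(new_probs, old_probs, actions, advantages):
    # Importance-sampled surrogate objective: E[ ratio * advantage ].
    idx = np.arange(len(actions))
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    return np.mean(ratio * advantages)

def mean_kl(old_probs, new_probs):
    # Mean KL(old || new) over the sampled states.
    return np.mean(np.sum(old_probs * np.log(old_probs / new_probs), axis=-1))

def trust_region_step(logits, full_step, actions, advantages, delta=0.01):
    # Backtracking line search: shrink the proposed update until it stays
    # inside the trust region and improves the surrogate objective.
    old_probs = softmax(logits)
    old_obj = surrogate(old_probs, old_probs, actions, advantages)
    for frac in 0.5 ** np.arange(10):
        new_logits = logits + frac * full_step
        new_probs = softmax(new_logits)
        if (mean_kl(old_probs, new_probs) <= delta and
                surrogate(new_probs, old_probs, actions, advantages) > old_obj):
            return new_logits
    return logits  # No acceptable step found; keep the old policy.

# Toy batch: 4 sampled states, 3 actions, random proposed update direction.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
full_step = rng.normal(size=(4, 3))
actions = rng.integers(0, 3, size=4)
advantages = rng.normal(size=4)
print(trust_region_step(logits, full_step, actions, advantages))

The key design choice this illustrates is that the step size is chosen adaptively per update by the constraint, rather than fixed in advance by a learning rate.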
Use Cases
Robot locomotion, complex game AI, autonomous control systems