Definition
A classic reinforcement learning problem in which an agent repeatedly chooses among several actions, each with an unknown reward distribution.
Detailed Explanation
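In the multi-armed bandit problem, an agent repeatedly selects one action (arm) from a fixed set, and each arm pays out according to its own unknown reward distribution. The goal is to maximize cumulative reward by balancing exploration of arms whose rewards are still uncertain with exploitation of arms already known to give good rewards. Because each choice affects only the immediate reward and not any future state, it is a simplified reinforcement learning setting with no state transitions.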
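The trade-off between exploration and exploitation can be illustrated with a short simulation. The sketch below assumes an epsilon-greedy strategy, which is one common approach and is not prescribed by this entry, and it assumes Bernoulli arms that pay 1 or 0. The function name, parameters, and arm probabilities are all hypothetical values chosen for illustration.

```python
import random


def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms (illustrative sketch).

    true_means holds each arm's success probability. The agent never
    reads it directly; it is used only to generate rewards.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # number of pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Draw a 0/1 reward from the chosen arm.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0

        # Incremental update of the sample mean for that arm.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, counts, total_reward


# Hypothetical arm probabilities, for demonstration only.
estimates, counts, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
print("estimated means:", [round(e, 2) for e in estimates])
print("pulls per arm:", counts)
print("total reward:", total_reward)
```

With a small epsilon, most pulls typically end up on the arm with the highest estimated mean, while the occasional random pulls keep the estimates for the other arms from going stale. There is no state variable anywhere in the loop, which reflects the absence of state transitions in the bandit setting.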
Use Cases
Online advertising, clinical trials, website optimization, content recommendation
