Definition
An optimization algorithm that adapts the learning rate to each parameter, performing smaller updates for parameters tied to frequently occurring features and larger updates for those tied to infrequent ones.
Detailed Explanation
AdaGrad (Adaptive Gradient Algorithm) adapts the learning rate for each parameter individually, dividing the global learning rate by the square root of the accumulated sum of that parameter's squared gradients (plus a small constant for numerical stability). Parameters associated with sparse, infrequent features therefore receive larger effective updates, which is why the method performs well on sparse data. However, because the accumulator only grows, the effective learning rate shrinks monotonically and can become vanishingly small over time, effectively stopping learning.
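A minimal sketch of one AdaGrad update in NumPy; the function and variable names (adagrad_step, grad_sq_sum, eps) are illustrative, not part of any particular library:

```python
import numpy as np

def adagrad_step(params, grads, grad_sq_sum, lr=0.01, eps=1e-8):
    """One AdaGrad update: scale each parameter's step by the inverse
    square root of its accumulated squared gradients."""
    grad_sq_sum += grads ** 2  # accumulator only ever grows
    # Per-parameter effective learning rate: lr / (sqrt(sum of squared grads) + eps)
    params -= lr * grads / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum

# Toy usage: the first parameter sees large gradients every step, so its
# effective learning rate shrinks fastest; the rarely-updated ones keep
# larger effective steps.
params = np.zeros(3)
grad_sq_sum = np.zeros(3)
for grads in [np.array([1.0, 0.1, 0.0]), np.array([1.0, 0.0, 0.1])]:
    params, grad_sq_sum = adagrad_step(params, grads, grad_sq_sum)
```

Because grad_sq_sum never decreases, repeated large gradients drive the denominator up and the effective step size down, which is the "learning stops" behavior described above.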
Use Cases
Natural language processing tasks, Sparse data optimization, Online convex optimization