Definition
An extension of AdaGrad that adapts per-parameter learning rates using a decaying window of past gradient updates, preventing the effective learning rate from shrinking monotonically toward zero.
Detailed Explanation
AdaDelta adapts its step sizes over time by restricting the effective window of accumulated past gradients to a fixed size rather than summing over the entire history as AdaGrad does. It maintains exponentially decaying averages of both squared gradients and squared parameter updates, and scales each update by the ratio of their root-mean-square values, which removes the need to hand-tune a global learning rate. This keeps the step size from monotonically decreasing the way it does in AdaGrad.
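
As a rough illustration, the sketch below applies one AdaDelta step in NumPy following the update rule from Zeiler's 2012 paper. The function name adadelta_update, the state dictionary, and the defaults rho=0.95 and eps=1e-6 are illustrative choices, not part of this entry.

import numpy as np

def adadelta_update(x, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta step; `state` holds the two running averages."""
    # Decaying average of squared gradients: E[g^2] <- rho*E[g^2] + (1-rho)*g^2
    state["avg_sq_grad"] = rho * state["avg_sq_grad"] + (1 - rho) * grad**2
    # Scale the gradient by the ratio of RMS values; no global learning rate needed
    rms_update = np.sqrt(state["avg_sq_update"] + eps)
    rms_grad = np.sqrt(state["avg_sq_grad"] + eps)
    delta = -(rms_update / rms_grad) * grad
    # Decaying average of squared parameter updates
    state["avg_sq_update"] = rho * state["avg_sq_update"] + (1 - rho) * delta**2
    return x + delta, state

# Hypothetical usage: minimize f(x) = sum(x^2), whose gradient is 2x
x = np.array([1.0, -2.0])
state = {"avg_sq_grad": np.zeros_like(x), "avg_sq_update": np.zeros_like(x)}
for _ in range(200):
    x, state = adadelta_update(x, 2 * x, state)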
Use Cases
Deep learning model training, Online machine learning, Robust optimization tasks
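
For deep learning training, AdaDelta is usually used through a framework's built-in optimizer rather than implemented by hand; for example, PyTorch provides torch.optim.Adadelta. A minimal usage sketch follows, where the linear model and random batch are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.9, eps=1e-6)

x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()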
