By Yinggan XU Dibbla
The tutorial video can be found here. This notebook will only cover the basic optimizers and their ideas; that said, optimizers for DL remain a very interesting topic.
Background Knowledge
$\mu$-strong Convexity
We can refer to this note. A function $f$ is $\mu$-strongly convex if:
$$f(y)\ge f(x)+\nabla f(x)^T(y-x) + \frac{\mu}{2}||y-x||^2\newline \text{for a given $\mu>0$ and for all $x,y$}$$
Note that strong convexity does not require the function to be differentiable; when the function is non-smooth, the gradient is replaced by a sub-gradient.
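As a quick sanity check (a worked example added here, not from the referenced note): in one dimension, $f(x)=\frac{1}{2}x^2$ satisfies the inequality with $\mu=1$, since $$f(y)-f(x)-f'(x)(y-x)=\tfrac{1}{2}y^2-\tfrac{1}{2}x^2-x(y-x)=\tfrac{1}{2}(y-x)^2=\tfrac{\mu}{2}|y-x|^2.$$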
Lipschitz Continuity
We can refer to this note. A function $f:S\subseteq \mathcal{R}^n \rightarrow \mathcal{R}^m$ is Lipschitz continuous if there is a constant $C$ such that: $$||f(y)-f(x)||\leq C||y-x||$$
Note that: differentiable at $x$ implies Lipschitz continuous at $x$, which in turn implies continuous at $x$, BUT neither implication can be reversed.
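A standard counterexample (added for illustration): $f(x)=|x|$ is Lipschitz continuous with $C=1$ by the reverse triangle inequality, $$\big||y|-|x|\big|\leq|y-x|\quad\text{for all }x,y,$$ yet it is not differentiable at $x=0$, so the first implication cannot be reversed.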
Bregman Proximal Inequality
Actually, this background knowledge is not strictly needed, at least for understanding this note. But I believe it may be useful when encountering some optimizer-related problems.
Optimizers
Notations:
- $\theta_t$: model parameter at timestep $t$
- $\nabla L(\theta_t)$ or $g_t$: gradient of loss function at $\theta_t$, used to compute $\theta_{t+1}$
- $m_{t+1}$: momentum accumulated from timestep 0 to $t$, which is used to compute $\theta_{t+1}$
Concepts
The goal of an optimizer is to search for the lowest point of the loss surface (w.r.t. the data).
We have online vs. offline training:
- online: uses one data/label pair per update
- offline: uses all data/label pairs per update
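A minimal sketch of the difference, using a toy linear model and squared-error loss (the data, model, and loss here are my own illustration, not from the note):

```python
import numpy as np

# Toy linear model: prediction = theta * x, loss = 0.5 * (theta * x - y)^2
X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])

def grad(theta, x, y):
    # d/dtheta of 0.5 * (theta * x - y)^2
    return (theta * x - y) * x

theta = 0.0

# Online: gradient from a single (data, label) pair
g_online = grad(theta, X[0], Y[0])

# Offline (batch): gradient averaged over all pairs
g_offline = np.mean([grad(theta, x, y) for x, y in zip(X, Y)])

print(g_online, g_offline)
```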
SGD-Stochastic Gradient Descent
- Start from $\theta^0$
- Compute gradient
- Move to $\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$
- Keep going
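A minimal NumPy sketch of this update rule, assuming a toy quadratic loss $L(\theta)=\frac{1}{2}||\theta||^2$ (the loss, learning rate, and step count are illustrative choices, not from the note):

```python
import numpy as np

def grad_L(theta):
    # Gradient of the toy loss L(theta) = 0.5 * ||theta||^2
    return theta

eta = 0.1                      # learning rate
theta = np.array([1.0, -2.0])  # theta^0

for t in range(100):
    theta = theta - eta * grad_L(theta)  # theta^{t+1} = theta^t - eta * grad

print(theta)  # approaches the minimizer at 0
```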
SGDM-SGD with Momentum
- Start from $\theta^0$
- Movement $v^0 = 0$
- Compute gradient
- Movement $v^1 = \lambda v^0 - \eta \nabla L(\theta^0)$
- Move to $\theta^1 = \theta^0 + v^1$
- Movement $v^2 = \lambda v^1 - \eta \nabla L(\theta^1)$
- Keep going
$v^i$ is actually a weighted sum of all previous gradients: $v^t=-\eta\sum_{i=0}^{t-1}\lambda^{t-1-i}\nabla L(\theta^i)$.
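The SGD sketch above, extended with the momentum term (same toy loss and illustrative hyperparameters):

```python
import numpy as np

def grad_L(theta):
    # Same toy loss L(theta) = 0.5 * ||theta||^2 as above
    return theta

eta, lam = 0.1, 0.9            # learning rate and momentum factor lambda
theta = np.array([1.0, -2.0])  # theta^0
v = np.zeros_like(theta)       # v^0 = 0

for t in range(100):
    v = lam * v - eta * grad_L(theta)  # v^{t+1} = lambda * v^t - eta * grad
    theta = theta + v                  # theta^{t+1} = theta^t + v^{t+1}

print(theta)
```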
Adagrad
$$\theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{\sum_{i=0}^{t-1}(g_i)^2}}g_{t-1}$$
Adagrad introduces an adaptive learning rate: large gradients in earlier steps shrink the effective step size of all later steps.
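A sketch of the Adagrad update under the same toy setup (the small $\epsilon$ added to the denominator is a common implementation detail, not part of the formula above):

```python
import numpy as np

def grad_L(theta):
    # Toy loss gradient, as in the SGD sketch above
    return theta

eta, eps = 0.1, 1e-8           # eps avoids division by zero on the first step
theta = np.array([1.0, -2.0])
accum = np.zeros_like(theta)   # running sum of squared gradients

for t in range(100):
    g = grad_L(theta)
    accum += g ** 2
    theta = theta - eta / (np.sqrt(accum) + eps) * g

print(theta)
```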
RMSProp
$$\theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{v_t}}g_{t-1}$$ $$v_1=g_0^2$$ $$v_t=\alpha v_{t-1}+(1-\alpha)(g_{t-1})^2$$
Unlike Adagrad's accumulated sum, the exponential moving average (EMA) of squared gradients is not monotonically increasing, so the effective step size can recover after a run of large gradients.
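A sketch of RMSProp under the same toy setup (hyperparameter values are illustrative; $\epsilon$ is again an implementation detail):

```python
import numpy as np

def grad_L(theta):
    # Toy loss gradient, as in the sketches above
    return theta

eta, alpha, eps = 0.01, 0.9, 1e-8
theta = np.array([1.0, -2.0])
v = None                       # v_1 = g_0^2 on the first step

for t in range(200):
    g = grad_L(theta)
    v = g ** 2 if v is None else alpha * v + (1 - alpha) * g ** 2
    theta = theta - eta / (np.sqrt(v) + eps) * g

print(theta)
```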
Adam
Adam combines SGDM with RMSProp: $$\theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t$$ $$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_{t-1} \qquad v_t=\beta_2 v_{t-1}+(1-\beta_2)(g_{t-1})^2$$ $$\hat{m}_t=\frac{m_t}{1-\beta_1^t} \qquad \hat{v}_t=\frac{v_t}{1-\beta_2^t}$$
Here $\beta_1$, $\beta_2$, and $\epsilon$ are all hyperparameters (commonly $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$).
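A sketch of the Adam update under the same toy setup, including the moment updates and bias correction:

```python
import numpy as np

def grad_L(theta):
    # Toy loss gradient, as in the sketches above
    return theta

eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)  # first moment (momentum term)
v = np.zeros_like(theta)  # second moment (EMA of squared gradients)

for t in range(1, 201):
    g = grad_L(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)
```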