By Yinggan XU Dibbla

The tutorial video can be found here. This notebook only covers the basic optimizers and the ideas behind them; optimizers for deep learning remain a very interesting topic in their own right.

Background Knowledge

$\mu$-strong Convexity

We can refer to this note. A function $f$ is $\mu$-strongly convex if:

$$f(y)\ge f(x)+\nabla f(x)^T(y-x) + \frac{\mu}{2}||y-x||^2 \quad \text{for some } \mu>0 \text{ and for all } x,y$$

Note that strong convexity does not require the function to be differentiable; when the function is non-smooth, the gradient is replaced by a subgradient.
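For example, the quadratic $f(x)=\frac{1}{2}||x||^2$ satisfies the inequality with equality for $\mu=1$:

$$\frac{1}{2}||y||^2=\frac{1}{2}||x||^2+x^T(y-x)+\frac{1}{2}||y-x||^2$$

so it is $1$-strongly convex.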

Lipschitz Continuity

We can refer to this note. A function $f:S\subseteq \mathbb{R}^n \rightarrow \mathbb{R}^m$ is Lipschitz continuous if there is a constant $C$ such that: $$||f(y)-f(x)||\leq C||y-x|| \quad \text{for all } x,y\in S$$

Note that differentiability at $x$ implies Lipschitz continuity at $x$, which implies continuity at $x$, but none of these implications can be reversed.
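For example, $f(x)=|x|$ is Lipschitz continuous with $C=1$ (by the triangle inequality, $\big||y|-|x|\big|\le|y-x|$) but not differentiable at $0$, while $f(x)=\sqrt{|x|}$ is continuous but not Lipschitz continuous on any neighborhood of $0$.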

Bregman Proximal Inequality

This background knowledge is not strictly needed to understand this note, but I believe it may be useful when working through some optimizer problems.

Optimizers

Notations:

  • $\theta_t$: model parameter at timestep $t$
  • $\nabla L(\theta_t)$ or $g_t$: gradient of loss function at $\theta_t$, used to compute $\theta_{t+1}$
  • $m_{t+1}$: momentum accumulated from timestep 0 to $t$, which is used to compute $\theta_{t+1}$

Concepts

The goal of an optimizer is to reach the lowest point of the loss surface (w.r.t. the data).

We have online vs. offline training (a minimal sketch contrasting the two follows the list):

  • online: use one data/label pair per parameter update
  • offline: use all data/label pairs for each update
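A minimal NumPy sketch of the contrast (the toy least-squares loss, random data, and learning rate below are made up for illustration):

```python
import numpy as np

# Toy least-squares loss: L(theta) = mean((X @ theta - y)**2) / 2
X, y = np.random.randn(8, 3), np.random.randn(8)
theta, lr = np.zeros(3), 0.1

# Offline (full-batch): one update uses every data/label pair.
grad_full = X.T @ (X @ theta - y) / len(y)
theta = theta - lr * grad_full

# Online (stochastic): one update per single data/label pair.
for x_i, y_i in zip(X, y):
    theta = theta - lr * (x_i @ theta - y_i) * x_i
```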

SGD-Stochastic Gradient Descent

  1. Start from $\theta^0$
  2. Compute gradient
  3. Move to $\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$
  4. Keep going
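A minimal NumPy sketch of this update rule (the toy quadratic loss, the step count, and the name `sgd_step` are illustrative, not from the note):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """Vanilla SGD: theta <- theta - eta * grad."""
    return theta - lr * grad

# Minimize L(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, grad=theta, lr=0.1)
print(theta)  # approaches the minimizer [0, 0]
```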

SGDM-SGD with Momentum

  1. Start from $\theta^0$
  2. Movement $v^0 = 0$
  3. Compute gradient
  4. Movement $v^1 = \lambda v^0 - \eta \nabla L(\theta^0)$
  5. Move to $\theta^1 = \theta^0 + v^1$
  6. Movement $v^2 = \lambda v^1 - \eta \nabla L(\theta^1)$
  7. Keep going

$v^i$ is effectively a weighted sum of all the previous gradients.
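A sketch of the same loop with momentum (again on a toy quadratic loss; `sgdm_step` and the momentum coefficient 0.9 are illustrative):

```python
import numpy as np

def sgdm_step(theta, v, grad, lr=0.1, momentum=0.9):
    """SGD with momentum: v <- lambda*v - eta*grad, then theta <- theta + v."""
    v = momentum * v - lr * grad
    return theta + v, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, v = sgdm_step(theta, v, grad=theta)  # grad of ||theta||^2 / 2 is theta
```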

Adagrad

$$\theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{\sum_{i=0}^{t-1}(g_i)^2}}g_{t-1}$$

Adagrad introduces an adaptive learning rate: large past gradients (especially extremely large ones) shrink the effective step size of all later updates.
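A sketch of the Adagrad update (the small `eps` added to the denominator for numerical stability is a common addition, not part of the formula above):

```python
import numpy as np

def adagrad_step(theta, sum_sq, grad, lr=0.1, eps=1e-8):
    """Adagrad: scale the step by the root of the accumulated squared gradients."""
    sum_sq = sum_sq + grad ** 2                          # grows monotonically
    theta = theta - lr / (np.sqrt(sum_sq) + eps) * grad
    return theta, sum_sq

theta, sum_sq = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, sum_sq = adagrad_step(theta, sum_sq, grad=theta)
```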

RMSProp

$$\theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{v_t}}g_{t-1}$$ $$v_1=g_0^2$$ $$v_t=\alpha v_{t-1}+(1-\alpha)(g_{t-1})^2$$

Unlike Adagrad's running sum, the exponential moving average (EMA) of squared gradients is not monotonically increasing, so the effective learning rate can recover after a burst of large gradients.
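A sketch of RMSProp under the same toy setup (`alpha` plays the role of the decay factor above; `eps` is again an assumed stability term):

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=0.01, alpha=0.9, eps=1e-8):
    """RMSProp: EMA of squared gradients instead of Adagrad's running sum."""
    v = alpha * v + (1 - alpha) * grad ** 2   # can shrink again, unlike Adagrad
    theta = theta - lr / (np.sqrt(v) + eps) * grad
    return theta, v

theta, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    theta, v = rmsprop_step(theta, v, grad=theta)
```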

Adam

Adam combines SGDM with RMSProp: $$\theta_t=\theta_{t-1}-\frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t$$ $$m_t=\beta_1 m_{t-1}+(1-\beta_1)g_{t-1}$$ $$v_t=\beta_2 v_{t-1}+(1-\beta_2)(g_{t-1})^2$$ $$\hat{m}_t=\frac{m_t}{1-\beta_1^t}$$ $$\hat{v}_t=\frac{v_t}{1-\beta_2^t}$$

Here $\beta_1$, $\beta_2$, and $\epsilon$ are hyperparameters; $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected momentum and second-moment estimates.
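A sketch of one Adam step with bias correction (the defaults 0.9, 0.999, and 1e-8 are the commonly used values, assumed here rather than taken from the note):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based timestep used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # EMA of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 101):
    theta, m, v = adam_step(theta, m, v, grad=theta, t=t)
```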