By Yinggan XU Dibbla

The notebook is basically a summary of what is covered in Hung-Yi Lee’s ML course 2022. People sometimes hold conflicting views, e.g. on whether PPO is an off-policy algorithm (and there may well be grammar mistakes in the notebook), but in this notebook I align everything with Hung-Yi Lee’s presentation.

Lecture 1 - Intro & Basic Idea of ML

The basic idea of ML, according to Lee, is to find a function. The function can take various types of input and produce various types of output.

For example:

graph LR
id1[Image]
id2[Function]
id3[Voice Sequence]
id4[Text]
id1-->id2;
id3-->id2;
id4-->id2;
id2-->id5(Image Classification)
id2-->id6(Text Corresponding to Voice)
id2-->id7(Text's Emotion)

So how do we find a function? The idea is to first write down a function with unknown parameters and then determine those parameters.

And the pipeline is usually:

graph TD
id1[Define the function with unknown parameters]
id2[Define the loss function]
id3[Optimize the unknown parameters according to the loss]
id1-->id2-->id3;
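As a toy illustration of these three steps, here is a minimal sketch (the linear model, data, and learning rate below are made up for illustration; they are not from the lecture):

```python
import numpy as np

# Step 1: define the function with unknown parameters w, b (a toy linear model)
def f(x, w, b):
    return w * x + b

# Step 2: define the loss function (mean squared error over the dataset)
def loss(w, b, xs, ys):
    return np.mean((f(xs, w, b) - ys) ** 2)

# Step 3: optimize the unknown parameters according to the loss (gradient descent)
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([3.1, 4.9, 7.2, 8.8])   # roughly y = 2x + 1
w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = f(xs, w, b) - ys
    w -= lr * np.mean(2 * err * xs)   # dL/dw
    b -= lr * np.mean(2 * err)        # dL/db
print(w, b, loss(w, b, xs, ys))
```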

Some observations

  • More parameters, easier to overfit

Lecture 2 - Pokémon-Digimon Classifier & Basic Theory

In this lecture, starting from a Pokémon-Digimon classifier, Lee introduces a practical case and then several concepts.

Here we use P to indicate Pokémon, D to indicate Digimon.

Pokémon-Digimon Classifier

Observation: a Pokémon is usually less complex than a Digimon. Looking at the edges of P&D, we see that P’s edge consists of fewer points. So we propose the function: $$ f(x) = \begin{cases} \text{Digimon}, & e(x)\geq h \newline \text{Pokémon}, & e(x)\lt h \end{cases} $$ $x$ is the input image and $e(x)$ represents how many points are needed to draw the edge. $$\mathcal{H} = \{1,2,3,\dots,10{,}000\}\newline |\mathcal{H}|: \text{the number of candidate functions (the model’s complexity)} $$
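As a sketch in Python (assuming a hypothetical helper `edge_points` that computes $e(x)$ from the image, e.g. via some edge-detection preprocessing; it is not defined in the lecture):

```python
def classify(x, h, edge_points):
    """Return the class of image x given threshold h; edge_points implements e(x)."""
    return "Digimon" if edge_points(x) >= h else "Pokémon"
```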

Loss function: since we have proposed a function with an unknown parameter (the threshold $h$), we now design a loss function.

  • Given a dataset $D$: $$D = \{(x^1,\hat{y}^1), (x^2,\hat{y}^2), \dots\}$$
  • The loss of a threshold $h$ is $$L(h,D) = \frac{1}{N}\sum_{n=1}^{N}l(h,x^n,\hat{y}^n)$$ Here, $l$ is an indicator (0-1) function: it outputs 1 if the prediction is NOT correct, and 0 otherwise.

cross-entropy can be used as well
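A minimal sketch of this 0-1 loss, reusing the hypothetical `edge_points` helper from above:

```python
def l(h, x, y_hat, edge_points):
    # 0-1 loss: 1 if the prediction is NOT correct, 0 otherwise
    pred = "Digimon" if edge_points(x) >= h else "Pokémon"
    return 0 if pred == y_hat else 1

def L(h, D, edge_points):
    # Average 0-1 loss over the dataset D = [(x, y_hat), ...]
    return sum(l(h, x, y_hat, edge_points) for x, y_hat in D) / len(D)
```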

Training: we are able to collect all P&D, so we can find an $h^{all}$ such that: $$h^{all} = \arg \min_{h} L(h,D_{all})$$ But we only collect some of the data from $D_{all}$ for training, so in practice we get $h^{train}$ through training: $$h^{train} = \arg \min_{h} L(h,D_{train})$$ where $D_{train}$ is sampled i.i.d. from $D_{all}$.
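Since $\mathcal{H} = \{1,\dots,10{,}000\}$ is finite, the $\arg\min$ can be found by exhaustive search; a sketch building on the `L` above:

```python
def find_h(D, edge_points, H=range(1, 10001)):
    # h* = argmin_h L(h, D): try every candidate threshold in H
    return min(H, key=lambda h: L(h, D, edge_points))
```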

How close are $h^{train}$ and $h^{all}$?

We of course wish to obtain the same threshold (or, more precisely, for the losses of the two $h$’s over $D_{all}$ to be close). But in practice this is not always the case; it depends on how good the training set is.

What do we do? We want: $$L(h^{train},D_{all}) - L(h^{all},D_{all}) \leq \delta$$ What kind of training set fulfills this? $$\forall h \in \mathcal{H}, |L(h,D_{train}) - L(h,D_{all})| \leq \frac{1}{2}\delta$$ This is because:

$$ \begin{aligned} \text{Once the condition holds, }& L(h^{train},D_{all})\newline \leq & L(h^{train},D_{train})+\frac{1}{2}\delta \newline \leq & L(h^{all},D_{train}) +\frac{1}{2}\delta \newline \leq & L(h^{all},D_{all}) + \frac{1}{2}\delta + \frac{1}{2}\delta \end{aligned} $$

The idea is: i) in the middle step, $h^{train}$ has the lowest loss on $D_{train}$, so $L(h^{train},D_{train}) \leq L(h^{all},D_{train})$; ii) the first and last steps re-use the condition.

What is the probability of sampling a bad dataset?

So, for a bad training set, there is at least one $h$ that makes the difference $|L(h,D_{train}) - L(h,D_{all})|$ larger than the allowed threshold (called $\epsilon$ in the discussion below; above it was $\frac{1}{2}\delta$).

The following discussion is very general:

  • model-agnostic
  • no assumption about data distribution
  • any loss function

Due to this at-least-one property, we can now say: $$ \begin{aligned} P(\text{$D_{train}$ is bad}) & = P\Big(\bigcup_{h\in\mathcal{H}} \text{$D_{train}$ is bad due to $h$}\Big)\newline & \leq \sum_{h \in \mathcal{H}}P(\text{$D_{train}$ is bad due to $h$}) \end{aligned} $$ By itself this is of little use, because the summation can exceed 1.

But we have Hoeffding’s Inequality: $$ P(\text{$D_{train}$ is bad due to $h$})\leq 2\exp(-2N\epsilon^2)\newline \text{where “bad due to $h$” means $|L(h,D_{train})-L(h,D_{all})|\gt \epsilon$, and $N$ is the size of $D_{train}$}\newline \text{(the loss function must take values in $[0,1]$)} $$ So, $$ \begin{aligned} P(\text{$D_{train}$ is bad}) & = P\Big(\bigcup_{h\in\mathcal{H}} \text{$D_{train}$ is bad due to $h$}\Big)\newline & \leq \sum_{h \in \mathcal{H}}P(\text{$D_{train}$ is bad due to $h$})\newline & \leq \sum_{h \in \mathcal{H}}2\exp(-2N\epsilon^2)\newline & = 2|\mathcal{H}|\exp(-2N\epsilon^2) \end{aligned} $$
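A quick numeric check of the final bound (the values of $N$, $\epsilon$, and $|\mathcal{H}|$ are arbitrary examples):

```python
import math

def bad_prob_bound(N, eps, H_size):
    # P(D_train is bad) <= 2 * |H| * exp(-2 * N * eps^2)
    return 2 * H_size * math.exp(-2 * N * eps ** 2)

print(bad_prob_bound(N=10000, eps=0.1, H_size=10000))   # ~2.8e-83
```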

What do we learn from the inequality?

  1. We can use a larger $N$ (see the sketch after this list)
  2. We can use a smaller $|\mathcal{H}|$
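For point 1, we can solve the bound for $N$: requiring $2|\mathcal{H}|\exp(-2N\epsilon^2)\leq\delta$ gives $N \geq \frac{\ln(2|\mathcal{H}|/\delta)}{2\epsilon^2}$. A sketch (example values are arbitrary):

```python
import math

def required_N(eps, delta, H_size):
    # Smallest N such that 2|H| * exp(-2 N eps^2) <= delta
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

print(required_N(eps=0.1, delta=0.01, H_size=10000))    # 726
```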

Model Complexity

What if $|\mathcal{H}|$ is infinite? (e.g. when the parameter is continuous)

  1. Everything in a computer is discrete
  2. VC-dimension

Recap: $$P(\text{$D_{train}$ is bad}) \leq 2|\mathcal{H}|\exp(-2N\epsilon^2)$$ So why don’t we just use a small $|\mathcal{H}|$? Recall that when we say a dataset is good, we mean $h^{train}$ and $h^{all}$ are close in loss. If we make $|\mathcal{H}|$ very small, $h^{all}$ itself may be bad (its loss $L(h^{all},D_{all})$ may be large). Then we are merely close to a BAD ideal case.

So how do we balance the trade-off? Yes, DEEP LEARNING