By Yinggan XU Dibbla
This notebook is basically a summary of what is covered in Hung-Yi Lee's ML course 2022. People sometimes hold contradictory views, such as whether PPO is an off-policy algorithm (and there may be grammar mistakes in the notebook), but here I will keep things aligned with Hung-Yi Lee's view.
Lecture 1 - Intro & Basic Idea of ML
The basic idea of ML, according to Lee, is to find a function. The function can take various types of inputs and produce various types of outputs as well.
For example: speech recognition maps an audio clip to a transcript, image recognition maps a picture to a class label, and playing Go maps a board position to the next move.
So how do we find a function? The idea is that we first find a function with unknown parameters and then decide the parameters.
And the pipeline is usually:
```mermaid
graph TD
id1[Define the function with unknown parameters]
id2[Define the loss function]
id3[Optimize the unknown parameters according to the loss]
id1-->id2-->id3;
```
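As a feel for the three steps, here is a minimal sketch in Python: a toy linear model $y = wx + b$, an MSE loss, and plain gradient descent. The data and learning rate are made up for illustration; this is a sketch in the spirit of the lecture's linear-model example, not code from the course.

```python
import numpy as np

# Step 1: define the function with unknown parameters (here y = w*x + b)
def model(x, w, b):
    return w * x + b

# Step 2: define the loss function (mean squared error on a dataset)
def loss(w, b, xs, ys):
    return np.mean((model(xs, w, b) - ys) ** 2)

# Toy data from y = 2x + 1 plus noise (made up for this sketch)
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, 100)
ys = 2 * xs + 1 + 0.1 * rng.standard_normal(100)

# Step 3: optimize the unknown parameters with gradient descent
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = model(xs, w, b) - ys
    w -= lr * np.mean(2 * err * xs)  # dL/dw
    b -= lr * np.mean(2 * err)       # dL/db

print(w, b)  # approaches w = 2, b = 1
```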
Some observations
- More parameters, easier to overfit (a quick numerical demo follows below)
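A hypothetical demo of that bullet: fit the same noisy data with polynomials of growing degree. The "true" function, the noise level, and the degrees are all assumptions chosen just for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)  # the "true" function, chosen just for this demo
x_train = rng.uniform(-1, 1, 20)
y_train = f(x_train) + 0.2 * rng.standard_normal(20)
x_test = rng.uniform(-1, 1, 20)
y_test = f(x_test) + 0.2 * rng.standard_normal(20)

for degree in (1, 3, 9):  # more coefficients = more parameters
    coef = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 4), round(test_mse, 4))
# Training error shrinks as the degree grows; test error eventually
# rises: the bigger model starts fitting the noise.
```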
Lecture 2 - Pokémon-Digimon Classifier & Basic Theory
In this lecture, starting from the Pokémon-Digimon classifier, Lee introduces a practical case and then several theoretical concepts.
Here we use P to indicate Pokémon and D to indicate Digimon.
Pokémon-Digimon Classifier
Observation: a Pokémon is usually less complex than a Digimon. Comparing the edges of P and D, we see that P's edge consists of fewer points. So we propose the function: $$ f(x) = \begin{cases} \text{Digimon}, & e(x)\geq h \newline \text{Pokémon}, & e(x)\lt h \end{cases} $$ where $x$ is the input image and $e(x)$ represents how many points are needed to draw the edge. The candidate thresholds form $$\mathcal{H} = \{1,2,3,\dots,10{,}000\}\newline |\mathcal{H}|: \text{the number of candidate functions (the model's complexity)} $$
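A rough sketch of $f(x)$ in code. The helper `count_edge_points` is a hypothetical stand-in for $e(x)$ (the lecture uses real image processing to measure edge complexity), so treat it as illustrative only.

```python
import numpy as np

def count_edge_points(image: np.ndarray) -> int:
    """A crude stand-in for e(x): count boundary pixels in a
    grayscale image by looking for value changes between neighbours."""
    binary = image > 127                          # threshold to foreground/background
    horizontal = binary[:, 1:] != binary[:, :-1]  # changes along rows
    vertical = binary[1:, :] != binary[:-1, :]    # changes along columns
    return int(horizontal.sum() + vertical.sum())

def classify(image: np.ndarray, h: int) -> str:
    """f(x): Digimon if e(x) >= h, Pokemon otherwise."""
    return "Digimon" if count_edge_points(image) >= h else "Pokemon"
```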
Loss function: Since we have proposed a function with an unknown parameter (the threshold $h$), we now design a loss function.
- Given a dataset $D$: $$D = \{(x^1,\hat{y}^1), (x^2,\hat{y}^2), \dots\}$$
- Loss of a threshold $h$ is represented by $$L(h,D) = \frac{1}{N}\sum_{n=1}^{N}l(h,x^n,\hat{y}^n)$$ Here, $l$ is an indicator (0-1 loss) function: it outputs 1 if the prediction is NOT correct, and 0 otherwise.
Cross-entropy can be used as well.
Training: If we could collect all P&D for training, we would find an $h^{all}$ such that: $$h^{all} = \arg \min_{h} L(h,D_{all})$$ But in practice we only collect part of $D_{all}$ for training. So we get $h^{train}$ through training: $$h^{train} = \arg \min_{h} L(h,D_{train})$$ where $D_{train}$ is sampled i.i.d. from $D_{all}$.
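A minimal sketch of the 0-1 loss and the brute-force training step, assuming the dataset stores precomputed edge counts $e(x)$ with labels; the numbers in `D_train` are invented for illustration.

```python
def zero_one_loss(h: int, e_x: int, label: str) -> int:
    """l(h, x, y_hat): 1 if f(x) is wrong, 0 otherwise.
    e_x is the precomputed edge-point count e(x)."""
    prediction = "Digimon" if e_x >= h else "Pokemon"
    return 0 if prediction == label else 1

def dataset_loss(h: int, dataset) -> float:
    """L(h, D): average 0-1 loss; dataset holds (e(x), label) pairs."""
    return sum(zero_one_loss(h, e_x, y) for e_x, y in dataset) / len(dataset)

def train(dataset, candidates=range(1, 10_001)):
    """h_train = argmin_h L(h, D_train), found by brute force over H."""
    return min(candidates, key=lambda h: dataset_loss(h, dataset))

# Tiny made-up D_train of (edge points, label) pairs
D_train = [(1200, "Pokemon"), (4100, "Digimon"), (900, "Pokemon"), (5200, "Digimon")]
print(train(D_train))  # prints 1201, the first h with zero training loss
```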
How close are $h^{train}$ and $h^{all}$?
We of course wish they were the same threshold (or, put differently, that the losses of the two $h$'s over $D_{all}$ are close). But in practice this is not always the case; it depends on how good the training set is.
What do we do? We want: $$L(h^{train},D_{all}) - L(h^{all},D_{all}) \leq \delta$$ What kind of training set fulfils this? $$\forall h \in \mathcal{H}, |L(h,D_{train}) - L(h,D_{all})| \leq \frac{1}{2}\delta$$ This is because:
$$ \begin{aligned} \text{Once the condition holds, }& L(h^{train},D_{all})\newline \leq & L(h^{train},D_{train})+\frac{1}{2}\delta \newline \leq & L(h^{all},D_{train}) +\frac{1}{2}\delta \newline \leq & L(h^{all},D_{all}) + \frac{1}{2}\delta + \frac{1}{2}\delta \end{aligned} $$
The idea is: i) $h^{train}$ has the lowest loss on $D_{train}$, so $L(h^{train},D_{train}) \leq L(h^{all},D_{train})$ in the middle step; ii) the condition $|L(h,D_{train}) - L(h,D_{all})| \leq \frac{1}{2}\delta$ is used once in the first step and once more in the last step.
What is the probability of sampling a bad dataset?
So, for a bad training set, there is at least one $h$ that makes the difference $|L(h,D_{train}) - L(h,D_{all})|$ greater than $\frac{1}{2}\delta$ (written $\epsilon$ below).
The following discussion is very general:
- model-agnostic
- no assumption about data distribution
- any loss function
Due to the at-least-one property, we can now say that: $$ \begin{aligned} P(\text{$D_{train}$ is bad}) & = P\Big(\bigcup_{h\in\mathcal{H}} \text{$D_{train}$ is bad due to $h$}\Big)\newline & \leq \sum_{h \in \mathcal{H}}P(\text{$D_{train}$ is bad due to $h$}) \end{aligned} $$ By itself this union bound is of little use, because the summation can exceed 1.
But we have Hoeffding's Inequality: $$ P(\text{$D_{train}$ is bad due to $h$})\leq 2\exp(-2N\epsilon^2)\newline \text{where "bad due to $h$" means $|L(h,D_{train})-L(h,D_{all})|\gt \epsilon$, and $N$ is the size of $D_{train}$}\newline \text{(the loss function should take values in $[0,1]$)} $$ So, $$ \begin{aligned} P(\text{$D_{train}$ is bad}) & = P\Big(\bigcup_{h\in\mathcal{H}} \text{$D_{train}$ is bad due to $h$}\Big)\newline & \leq \sum_{h \in \mathcal{H}}P(\text{$D_{train}$ is bad due to $h$})\newline & \leq \sum_{h \in \mathcal{H}}2\exp(-2N\epsilon^2)\newline & = 2|\mathcal{H}|\exp(-2N\epsilon^2) \end{aligned} $$
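To get a feel for the bound, a quick sketch plugging illustrative numbers into $2|\mathcal{H}|\exp(-2N\epsilon^2)$; the choices of $|\mathcal{H}|$ and $\epsilon$ are assumptions for the demo.

```python
import math

def p_bad_upper_bound(H_size: int, N: int, eps: float) -> float:
    """Union bound + Hoeffding: P(D_train is bad) <= 2|H| exp(-2 N eps^2)."""
    return 2 * H_size * math.exp(-2 * N * eps**2)

# Illustrative numbers: |H| = 10,000 candidate thresholds, eps = 0.1
for N in (100, 500, 1000, 5000):
    print(N, p_bad_upper_bound(10_000, N, 0.1))
# The bound only drops below 1 once N is large enough; a bigger N or
# a smaller |H| both shrink it, which is exactly the lesson below.
```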
What do we learn from the inequality?
- We can use a larger $N$
- We can use a smaller $|\mathcal{H}|$ (a sketch of the required $N$ follows this list)
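Rearranging the bound gives the sample size needed for a target failure probability: $N \geq \frac{\ln(2|\mathcal{H}|/\delta_{fail})}{2\epsilon^2}$. A small sketch with made-up numbers:

```python
import math

def samples_needed(H_size: int, eps: float, fail_prob: float) -> int:
    """Smallest N with 2|H| exp(-2 N eps^2) <= fail_prob, i.e.
    N >= ln(2|H| / fail_prob) / (2 eps^2)."""
    return math.ceil(math.log(2 * H_size / fail_prob) / (2 * eps**2))

print(samples_needed(10_000, 0.1, 0.01))  # |H| = 10,000 needs N = 726
print(samples_needed(100, 0.1, 0.01))     # a smaller |H| needs only N = 496
```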
Model Complexity
What if $|\mathcal{H}|$ is infinite? (e.g., when the parameter is continuous)
- Everything in a computer is discrete
- VC-dimension
Recap: $$P(\text{$D_{train}$ is bad}) \leq 2|\mathcal{H}|\exp(-2N\epsilon^2)$$ Why don't we just use a small $|\mathcal{H}|$? Remember that when we say a training set is good, we mean $h^{train}$ and $h^{all}$ are close. But when we make $|\mathcal{H}|$ very small, $h^{all}$ itself may be bad, because there are few candidates to choose from. Then we are merely close to a BAD ideal case.
So how do we balance the trade-off? Yes, DEEP LEARNING