By Yinggan XU Dibbla
In this lecture, Lee introduces the idea of selecting the best model through validation-set performance. Lee also explains why a deep NN outperforms a fat (wide) NN.
Validation set
The CORE question we want to figure out is: why do we still overfit even though we used a validation set?
graph LR
id1[model 1 with para space H1]
id2[model 2 with para space H2]
id3[model 3 with para space H3]
id4[Validation Set]
id1-->id4
id2-->id4
id3-->id4
id5[Validation-Loss-1=0.4]
id6[Validation-Loss-2=0.3]
id7[Validation-Loss-3=0.2]
id4-->id5
id4-->id6
id4-->id7
Then we are going to choose model 3, i.e. $h_3^*\in\mathcal{H_3}$.
Validation can also be viewed as a process of training a model. The "hypothesis space" used during validation is the set of already-trained candidates: $$\mathcal{H_{val}} = \{h_1^*,h_2^*,h_3^*\}$$ We are going to choose an $h^*$ such that: $$h^* = \arg \min_{h\in\mathcal{H_{val}}}L(h,D_{val})$$ So using the validation set is "training with $D_{val}$", where your model is $\mathcal{H_{val}}=\{h_1^*,h_2^*,h_3^*\}$. Recall the inequality we obtained in Lecture 1&2: $$P(D_{train}\ is\ bad)\leq 2|\mathcal{H}|\exp(-2N\epsilon^2)$$ Applied to validation, $\mathcal{H}$ becomes $\mathcal{H_{val}}$ and $N$ becomes the size of $D_{val}$. Here $|\mathcal{H_{val}}|$ is small (only three candidates), so the chance of selecting a bad model is small. BUT if $\mathcal{H_{val}}$ is large enough (you compare many models or hyperparameter settings on the same validation set), $P(D_{val}\ is\ bad)$ becomes large again, and that is why you can still overfit even with a validation set.
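To make this concrete, here is a minimal sketch of "training with $D_{val}$" (not from the lecture; the toy data, polynomial candidates, and split sizes are all made up for illustration): three already-trained models are compared on the validation set, the one with the smallest validation loss is picked, and then illustrative numbers are plugged into the $2|\mathcal{H}|\exp(-2N\epsilon^2)$ bound to show how it degrades as the candidate set grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = 2x + noise, split into D_train and D_val.
x = rng.uniform(-1, 1, size=100)
y = 2 * x + rng.normal(scale=0.3, size=100)
x_tr, y_tr = x[:70], y[:70]
x_val, y_val = x[70:], y[70:]

# Three candidates standing in for model 1/2/3: polynomials of different degree,
# each trained on D_train only.
candidates = {deg: np.polyfit(x_tr, y_tr, deg) for deg in (1, 3, 9)}

# "Training with D_val": L(h, D_val) is the validation MSE of each trained model,
# and we pick h* = argmin_{h in H_val} L(h, D_val).
val_loss = {deg: np.mean((np.polyval(c, x_val) - y_val) ** 2)
            for deg, c in candidates.items()}
best = min(val_loss, key=val_loss.get)
print(val_loss, "-> pick degree", best)

# Plugging illustrative numbers into P(D_val is bad) <= 2|H_val| exp(-2 N eps^2):
# with 3 candidates the bound is meaningful; with a huge candidate set it is vacuous.
N, eps = 1000, 0.05
for h_size in (3, 10**6):
    print(f"|H_val| = {h_size}: bound = {2 * h_size * np.exp(-2 * N * eps**2):.4f}")
```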
Why deep?
As mentioned in Lecture 1&2, we can generate (approximate) any function we want. Any target function can be approximated by a piecewise-linear function, and each piece can be generated by a ReLU or a Sigmoid (the softer version of the hard step). And then, for any input:
graph LR
id1[x1]
id2[x2]
id3[x3]
id4(+ weight and bias)
id5(+ weight and bias)
id6(+ weight and bias)
id7[Sigmoid]
id8[Sigmoid]
id9[Sigmoid]
id1-->id4-->id7
id2-->id4
id3-->id4
id1-->id5-->id8
id2-->id5
id3-->id5
id1-->id6-->id9
id2-->id6
id3-->id6
id7-->id10[sum + bias]
id8-->id10
id9-->id10
id10-->id11[y: Goal function]
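As a concrete (hypothetical) instance of this construction, the sketch below builds a 1-D piecewise-linear target exactly from one hidden layer, using ReLU units rather than the soft sigmoids in the diagram so the fit is exact; the target function, breakpoints, and slopes are made up for illustration, and each output-layer weight is simply the change in slope at the corresponding breakpoint.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical piecewise-linear target on [0, 3]:
# slope 1 on [0,1], slope -1 on [1,2], slope 3 on [2,3], with f(0) = 0.
def target(x):
    return np.piecewise(
        x,
        [x < 1, (x >= 1) & (x < 2), x >= 2],
        [lambda t: t, lambda t: 2 - t, lambda t: 3 * t - 6],
    )

# One hidden layer: each unit is relu(x - breakpoint), and the output-layer
# weight of that unit is the *change in slope* at its breakpoint.
breakpoints = np.array([0.0, 1.0, 2.0])
slope_changes = np.array([1.0, -2.0, 4.0])   # 1-0, -1-1, 3-(-1)
output_bias = 0.0                            # value of the target at x = 0

x = np.linspace(0.0, 3.0, 301)
hidden = relu(x[:, None] - breakpoints[None, :])   # hidden-layer activations
y_hat = hidden @ slope_changes + output_bias       # weighted sum + bias

print("max abs error:", np.abs(y_hat - target(x)).max())   # ~0: the fit is exact
```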
Why Deep not Fat?
- A single hidden layer can already represent any function
- But deep is more efficient than fat
- A deep network can use a smaller $|\mathcal{H}|$ to represent the same function than a fat network needs; the advantage can be exponential over shallow ones (see the sketch below)
Fewer candidates, still small loss: exactly what makes the $2|\mathcal{H}|\exp(-2N\epsilon^2)$ bound small.
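One standard way to see the exponential gap (this construction is not from the notes; it is the classic "sawtooth" argument, sketched here with made-up sampling choices): stacking the same 2-unit ReLU "fold" $k$ times produces a triangle wave with about $2^k$ linear pieces using only about $2k$ units, whereas a single hidden layer of $N$ ReLU units can produce at most $N+1$ pieces.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fold(x):
    """One 2-unit ReLU layer on [0, 1]: maps x to 2x on [0, 0.5] and 2 - 2x on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(y, x):
    """Count linear pieces of a sampled 1-D function by counting slope changes."""
    slopes = np.diff(y) / np.diff(x)
    return 1 + np.sum(~np.isclose(np.diff(slopes), 0.0, atol=1e-6))

x = np.linspace(0.0, 1.0, 100001)
y = x
for depth in range(1, 6):
    y = fold(y)                 # stack another identical 2-unit layer
    print(f"depth {depth}: ~{2 * depth} ReLU units, "
          f"{count_linear_pieces(y, x)} linear pieces")

# The number of pieces doubles with every layer (2, 4, 8, ...), while a single
# hidden layer of N ReLU units can only produce up to N + 1 linear pieces.
```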