By Yinggan XU Dibbla

In this lecture, Lee introduces the idea of selecting the best model based on validation set performance. Lee also explains why a deep NN outperforms a fat (wide) NN.

Validation set

The CORE question we want to figure out is: why do we still overfit even though we use a validation set?

graph LR
id1[model 1 with hypothesis space H1]
id2[model 2 with hypothesis space H2]
id3[model 3 with hypothesis space H3]

id4[Validation Set]
id1-->id4
id2-->id4
id3-->id4

id5[Validation-Loss-1=0.4]
id6[Validation-Loss-2=0.3]
id7[Validation-Loss-3=0.2]

id4-->id5
id4-->id6
id4-->id7

Then we choose model 3, i.e. the hypothesis $h_3^*\in\mathcal{H}_3$, since it has the smallest validation loss.

Validation can also be viewed as a training process. The "model" being trained is the set of the three already-trained candidates: $$\mathcal{H}_{val} = \{h_1^*, h_2^*, h_3^*\}$$ We then choose $$h_{val} = \arg\min_{h\in\mathcal{H}_{val}} L(h, D_{val})$$ So using the validation set is "training with $D_{val}$", with hypothesis space $\mathcal{H}_{val}$. Recall the inequality from Lecture 1&2: $$P(D_{train}\ \text{is bad}) \leq 2|\mathcal{H}|\exp(-2N\epsilon^2)$$ Applied to validation, $|\mathcal{H}_{val}|$ is small (only three candidates here), so the chance of selecting a bad model is small. BUT if $\mathcal{H}_{val}$ is large enough, $P(D_{val}\ \text{is bad})$ is still high, which is why you can overfit even though you use a validation set.
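A minimal sketch of this view in Python (the validation losses, $N_{val}=1000$, and $\epsilon=0.05$ are hypothetical numbers, only to show the trend): selecting among the candidates is itself a "training" step on $D_{val}$, and the Hoeffding-style bound grows linearly with the number of candidates you compare.

```python
import numpy as np

def p_val_is_bad(num_candidates: int, n_val: int, eps: float) -> float:
    """Hoeffding-style bound from Lecture 1&2, applied to the validation step:
       P(D_val is bad) <= 2 * |H_val| * exp(-2 * N_val * eps^2)."""
    return 2.0 * num_candidates * np.exp(-2.0 * n_val * eps**2)

# Hypothetical validation losses for the three candidates above.
val_losses = {"model 1": 0.4, "model 2": 0.3, "model 3": 0.2}

# "Training on D_val": keep the hypothesis with the smallest validation loss.
best = min(val_losses, key=val_losses.get)
print("selected:", best)  # -> model 3

# The bound grows linearly with |H_val|: picking among 3 models is fairly safe,
# but sweeping thousands of models can make the bound vacuous (>= 1),
# i.e. the validation set itself can be overfitted.
for k in (3, 100, 10_000):
    print(f"|H_val| = {k:>6}: bound = {p_val_is_bad(k, n_val=1_000, eps=0.05):.3f}")
```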

Why deep?

As mentioned in Lecture 1&2, a network can approximate any function we want. For any target function, we can approximate it by a piecewise-linear function and then use a different ReLU/Sigmoid (the softer version of a step) unit to generate each piece. Then, for any input:

graph LR
id1[x1]
id2[x2]
id3[x3]

id4(+ weight and bias)
id5(+ weight and bias)
id6(+ weight and bias)

id7[Sigmoid]
id8[Sigmoid]
id9[Sigmoid]

id1-->id4-->id7
id2-->id4
id3-->id4

id1-->id5-->id8
id2-->id5
id3-->id5

id1-->id6-->id9
id2-->id6
id3-->id6

id7-->id10[sum + bias]
id8-->id10
id9-->id10
id10-->id11[y: Goal function]
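A small illustrative sketch of this construction (the target $y=|x|$, the number of hidden units, and the least-squares fit used in place of gradient-descent training are my assumptions, not taken from the lecture): each steep sigmoid unit contributes one soft "step", and summing them with an output bias reproduces the piecewise-linear target.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative target: y = |x| on [-1, 1] (any piecewise-linear target works).
x = np.linspace(-1.0, 1.0, 400)
target = np.abs(x)

# One hidden layer: unit i computes sigmoid(w_i * x + b_i), a soft step at x = -b_i / w_i.
n_hidden = 16
w = np.full(n_hidden, 30.0)                    # steep slope -> step-like
b = -w * np.linspace(-1.0, 1.0, n_hidden)      # spread the steps over [-1, 1]
hidden = sigmoid(np.outer(x, w) + b)           # shape (400, n_hidden)

# Solve for the output weights + bias by least squares (a stand-in for training).
A = np.hstack([hidden, np.ones((x.size, 1))])
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
y = A @ coef

print("max abs error:", np.max(np.abs(y - target)))  # error shrinks as n_hidden grows
```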

Why Deep not Fat?

  • A single hidden layer can represent any function
  • But deep is more efficient than fat
  • A deep network can use a smaller $|\mathcal{H}|$ to represent the same function compared with a fat network (it is exponentially better than shallow ones); see the sketch at the end of this section

Fewer candidates, yet still a small loss.
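A sketch of why depth buys an exponential advantage, using the standard tent-map construction (this specific example is an illustration, not necessarily the lecture's): each ReLU layer of width 2 doubles the number of linear pieces, so depth $L$ yields $2^L$ pieces, while a single-hidden-layer (fat) ReLU net needs on the order of $2^L$ units to produce the same function.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def deep_tent(x, depth):
    """Deep ReLU net: each layer has just 2 hidden units computing the 'tent' map
       f(y) = 2*relu(y) - 4*relu(y - 0.5), i.e. 2y for y < 0.5 and 2 - 2y above.
       Returns the output and the activation pattern (side of the kink) per layer."""
    y = x
    patterns = []
    for _ in range(depth):
        patterns.append(y > 0.5)              # which side of the kink each input is on
        y = 2.0 * relu(y) - 4.0 * relu(y - 0.5)
    return y, np.stack(patterns, axis=1)      # shape (num_points, depth)

# Sample [0, 1] densely; distinct activation patterns correspond to linear pieces.
x = np.linspace(0.0, 1.0, 100_001)
for depth in (2, 4, 6, 8):
    _, pat = deep_tent(x, depth)
    pieces = len(np.unique(pat, axis=0))
    print(f"depth {depth} (width 2): {pieces} linear pieces; "
          f"a fat net with one hidden layer needs ~{2**depth} units")
```

The deep network here uses only $2L$ ReLU units in total, but realizes $2^L$ linear pieces, which is the sense in which its $|\mathcal{H}|$ can stay small while matching the function.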