These are scribe notes for CS294-082, taught by Prof. Gerald Friedland at UC Berkeley.
The Idea of the Function Counting Theorem
We start with the Function Counting Theorem (Cover's Theorem; Thomas M. Cover, 1965). As an example, take 4 points in a 2-dimensional space. There are multiple ways to linearly separate these points:
Look at $l_5$: it separates $x_1$, $x_4$ on the left side from $x_2$, $x_3$ on the right side. From this single line we can construct 2 labelings:
- $x_1$, $x_4$ are Class-1 and $x_2$, $x_3$ are Class-2;
- $x_1$, $x_4$ are Class-2 and $x_2$, $x_3$ are Class-1.
Searching over all possible labelings, one finds that 14 of them can be achieved by a linear separator. Cover's theorem generalizes this observation by defining $C(m,d)$, the number of linearly separable binary labelings of $m$ points in general position, where $m$ is the number of points and $d$ is the number of free parameters of the separator (for a line in the plane: two weights plus a bias, so $d=3$). Cover proved the recursion:
$$C(m,d)=C(m-1,d)+C(m-1,d-1)$$
with base cases: $$C(m,1) = 2,\ C(1,d)=2$$
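As a quick aid (a minimal sketch, not from the lecture), the recursion and base cases above translate directly into code; the helper name `C` and the memoization are my own choices:

```python
# Function Counting Theorem: C(m, d) = C(m-1, d) + C(m-1, d-1),
# with base cases C(m, 1) = 2 and C(1, d) = 2.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def C(m: int, d: int) -> int:
    """Number of linearly separable binary labelings of m points in general
    position, with d free parameters in the separator."""
    if m == 1 or d == 1:
        return 2
    return C(m - 1, d) + C(m - 1, d - 1)

assert C(4, 3) == 14                       # the 4-point example below
# Cover's closed form: C(m, d) = 2 * sum_{k=0}^{d-1} binom(m-1, k)
assert all(C(m, d) == 2 * sum(comb(m - 1, k) for k in range(d))
           for m in range(1, 8) for d in range(1, 8))
```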
Cover's Theorem assumes the points are in general position: $m$ points in $d$-dimensional space are in general position if no subset of $d+1$ of them lies on a $(d-1)$-dimensional hyperplane (in the plane: no 3 points are collinear).
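A minimal sketch of this check for the planar case ($d=2$); the point coordinates are just an illustrative choice:

```python
# General position in the plane: no 3 of the points are collinear.
from itertools import combinations

def in_general_position_2d(points):
    for (ax, ay), (bx, by), (cx, cy) in combinations(points, 3):
        # the three points are collinear iff the triangle they span has zero area
        if (bx - ax) * (cy - ay) - (by - ay) * (cx - ax) == 0:
            return False
    return True

assert in_general_position_2d([(0, 0), (0, 1), (1, 0), (1, 1)])       # square corners
assert not in_general_position_2d([(0, 0), (1, 1), (2, 2), (0, 1)])   # 3 on one line
```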
Function Counting Theorem and MEC
Now consider our situation: 4 points $x_1,x_2,x_3,x_4$ in 2-dimensional space, separated by a line with 3 parameters. Cover's Theorem gives $C(4,3)=14$ achievable labelings. However, for 4 points there are $2^4=16$ possible binary labelings in total. Which 2 of the 16 labelings cannot be achieved by a linear separator? For this small case we can find them by tedious analysis: they are XOR and XNOR. For larger point sets, however, such a search quickly becomes infeasible.
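The tedious analysis can also be automated for small cases. Below is a minimal sketch (not from the lecture) that places the 4 points at the corners of the unit square, tries all 16 labelings, and tests each for linear separability with a small linear program; it assumes `numpy` and `scipy` are available, and any other exact separability test would work just as well:

```python
from itertools import product

import numpy as np
from scipy.optimize import linprog

points = np.array([(0, 0), (0, 1), (1, 0), (1, 1)], dtype=float)

def separable(labels):
    """True iff some (w1, w2, b) satisfies y_i * (w . x_i + b) >= 1 for all i."""
    y = np.array([1.0 if v else -1.0 for v in labels])
    A_ub = -y[:, None] * np.hstack([points, np.ones((4, 1))])  # rows: -y_i * [x_i, 1]
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(4),
                  bounds=[(None, None)] * 3)   # pure feasibility problem
    return res.success

unachievable = [lab for lab in product([0, 1], repeat=4) if not separable(lab)]
print(len(unachievable), unachievable)
# -> 2 [(0, 1, 1, 0), (1, 0, 0, 1)]  i.e. exactly XOR and XNOR
```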
This is where we bring in information theory!
We saw that the information content of a table can be upper-bounded by measuring the information content of the target function. (Gerald Friedland, 2023)
In this case, for example:
| $x_1$ | $x_2$ | XOR |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
The upper bound for the target function is its Shannon entropy summed over the rows: $-4\times(\frac{1}{2}\log_2{\frac{1}{2}}+\frac{1}{2}\log_2{\frac{1}{2}})=4$ bits. This is what we would have to memorize to reproduce the table. To compare against this number and check whether we are able to memorize this information, we must compute the amount of information that can be memorized by our linear separator. In other words, what is the memory capacity of our neuron? As our linear separator we take a single neuron, shown in the diagram below.
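First, a quick sanity check of that 4-bit figure (a minimal sketch in plain Python, not part of the original notes):

```python
from math import log2

xor_column = [0, 1, 1, 0]                       # the XOR column of the table
p1 = sum(xor_column) / len(xor_column)          # P(label = 1) = 1/2
h_per_row = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))   # 1 bit per row
print(len(xor_column) * h_per_row)              # 4.0 bits to memorize
```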
```mermaid
graph LR;
x1--w1-->Neuron
x2--w2-->Neuron
Neuron-->Threshold-->id[Fire/Not-Fire - Binary Label]
```
Here the neuron contains a threshold $b$ (the bias) inside.
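A minimal sketch of this unit (my own illustration; it assumes the neuron fires when the weighted sum plus the bias exceeds zero):

```python
def neuron(x1, x2, w1, w2, b):
    """Single linear-threshold neuron: fire (1) or not fire (0)."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# Suitable weights realize e.g. AND ...
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
assert [neuron(x1, x2, w1=1, w2=1, b=-1.5) for x1, x2 in inputs] == [0, 0, 0, 1]
# ... but no choice of (w1, w2, b) yields XOR, as the capacity argument below shows.
```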
We define the memory-equivalent capacity (MEC) as follows:
A model’s intellectual capacity is memory equivalent to $n$ bits when the model is able to represent all $2^n$ binary labeling functions of $n$ points. (Gerald Friedland, 2023)
According to Gerald Friedland, the fractal dimension of a single neuron with 3 parameters (two weights and one bias) is $D = \log_2 C(4, 3) = \log_2 14 \approx 3.807$ bits, which is less than the 4 bits of complexity of XOR and XNOR. This is where the Function Counting Theorem (FCT) and MEC come together: the neuron's capacity falls short of what the XOR table requires, so a single neuron cannot represent it.
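In code (a minimal sketch, reusing nothing but the numbers above):

```python
from math import log2

neuron_capacity = log2(14)        # log2 C(4, 3) ≈ 3.807 bits
xor_table_bits = 4                # Shannon entropy of the XOR (or XNOR) table
print(neuron_capacity, neuron_capacity < xor_table_bits)   # 3.807..., True
```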
We also obtain the following corollary from this analysis:
The MEC required for memorizing a $v$-variable Boolean function is $\log_2 2^{2^v}=2^v$ bits. (Gerald Friedland, 2023)
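To see where the $2^v$ comes from (a minimal sketch, not from the lecture): there are $2^{2^v}$ distinct Boolean functions of $v$ variables, so representing an arbitrary one requires $\log_2 2^{2^v} = 2^v$ bits:

```python
from math import log2

for v in range(1, 5):
    num_functions = 2 ** (2 ** v)   # all v-variable Boolean functions
    print(v, log2(num_functions))   # v=1 -> 2.0, v=2 -> 4.0 (the XOR case), v=3 -> 8.0, v=4 -> 16.0
```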