This is an additional but useful note.

First, recall derivatives for scalars. For example, if $y = x^n$, then $\frac{dy}{dx} = nx^{n-1}$. We also know the usual rules for different kinds of functions and for composed functions.

Note that the derivative does not always exist. For example, $y = |x|$ has no derivative at $x = 0$.

When we generalize derivatives to gradients, we are generalizing from scalars to vectors. In this case, the shapes matter.

| | scalar $x$ | vector $\textbf{x}$ |
|---|---|---|
| scalar $y$ | $\frac{\partial y}{\partial x}$ | $\frac{\partial y}{\partial \textbf{x}}$ |
| vector $\textbf{y}$ | $\frac{\partial \textbf{y}}{\partial x}$ | $\frac{\partial \textbf{y}}{\partial \textbf{x}}$ |

Case 1: y is scalar, x is vector

$$\textbf{x} = [x_1,x_2,x_3,\cdots,x_n]^T$$

$$\frac{\partial y}{\partial \textbf{x}}=\left[\frac{\partial y}{\partial x_1},\frac{\partial y}{\partial x_2},\cdots,\frac{\partial y}{\partial x_n}\right]$$

Note that $\textbf{x}$ is a column vector and the result is a row vector. Example:

$$\frac{\partial}{\partial\textbf{x}}\left(x_1^2+2x_2^2\right)=[2x_1,4x_2]$$

This is like finding the slope of a scalar field: the gradient points in the direction in which the value increases fastest.
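As a quick sanity check of the example above, here is a minimal sketch that approximates the gradient with central differences and compares it with $[2x_1, 4x_2]$. The helper name `numerical_gradient`, the step size `h`, and the test point are assumptions made for illustration only.

```python
def f(x):
    """Scalar field y = x1^2 + 2*x2^2 from the example above."""
    return x[0] ** 2 + 2 * x[1] ** 2


def numerical_gradient(func, x, h=1e-6):
    """Approximate [dy/dx_1, ..., dy/dx_n] with central differences.

    Hypothetical helper: h is an assumed step size, not a tuned value.
    """
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h   # nudge only the i-th coordinate up
        backward[i] -= h  # and down
        grad.append((func(forward) - func(backward)) / (2 * h))
    return grad


point = [3.0, -1.0]                   # arbitrary test point
approx = numerical_gradient(f, point)
exact = [2 * point[0], 4 * point[1]]  # analytic result [2*x1, 4*x2]
print(approx, exact)
```

The result has one entry per component of $\textbf{x}$, which matches the row-vector shape described above.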

For composed functions, the rules are basically the same as in the scalar case. For an inner product $y=\langle\textbf{u},\textbf{v}\rangle$, however, we have $\frac{\partial y}{\partial\textbf{x}} = \textbf{u}^T\frac{\partial \textbf{v}}{\partial\textbf{x}}+\textbf{v}^T\frac{\partial \textbf{u}}{\partial\textbf{x}}$.
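The inner-product rule can be checked on a small case. The sketch below uses two hypothetical vector functions, $\textbf{u}(\textbf{x}) = [x_1x_2, x_2^2]^T$ and $\textbf{v}(\textbf{x}) = [x_1, x_1+x_2]^T$, chosen only for illustration, with their Jacobians written out by hand. It compares $\textbf{u}^T\frac{\partial \textbf{v}}{\partial\textbf{x}}+\textbf{v}^T\frac{\partial \textbf{u}}{\partial\textbf{x}}$ with the gradient obtained by expanding the inner product and differentiating directly.

```python
def u(x):
    """Hypothetical u(x) = [x1*x2, x2^2]."""
    return [x[0] * x[1], x[1] ** 2]


def v(x):
    """Hypothetical v(x) = [x1, x1 + x2]."""
    return [x[0], x[0] + x[1]]


def jac_u(x):
    """Hand-derived Jacobian of u: row i holds d u_i / d x."""
    return [[x[1], x[0]], [0.0, 2 * x[1]]]


def jac_v(x):
    """Hand-derived Jacobian of v: row i holds d v_i / d x."""
    return [[1.0, 0.0], [1.0, 1.0]]


def row_times_matrix(row, matrix):
    """Compute row^T @ matrix, which yields a row vector."""
    cols = len(matrix[0])
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(cols)]


def inner_product_gradient(x):
    """u^T (dv/dx) + v^T (du/dx)."""
    a = row_times_matrix(u(x), jac_v(x))
    b = row_times_matrix(v(x), jac_u(x))
    return [p + q for p, q in zip(a, b)]


def direct_gradient(x):
    """Gradient of <u, v> = x1^2*x2 + x1*x2^2 + x2^3, differentiated by hand."""
    return [2 * x[0] * x[1] + x[1] ** 2,
            x[0] ** 2 + 2 * x[0] * x[1] + 3 * x[1] ** 2]


x0 = [2.0, 3.0]  # arbitrary test point
rule_grad = inner_product_gradient(x0)
hand_grad = direct_gradient(x0)
print(rule_grad, hand_grad)
```

Both routes give a row vector with one entry per component of $\textbf{x}$, as in Case 1.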

Case 2: y is a vector, x is a scalar

$$\textbf{y}=[y_1,y_2,y_3,\cdots,y_n]^T$$

$$\frac{\partial\textbf{y}}{\partial x}=\left[\frac{\partial y_1}{\partial x},\frac{\partial y_2}{\partial x},\frac{\partial y_3}{\partial x},\cdots,\frac{\partial y_n}{\partial x}\right]^T$$

The result remains a column vector, with the same shape as $\textbf{y}$.
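To illustrate Case 2, the sketch below differentiates a hypothetical vector-valued function $\textbf{y}(x) = [x, x^2, \sin x]^T$ component by component, using central differences. The function, the step size `h`, and the test point are assumptions chosen for illustration.

```python
import math


def y(x):
    """Hypothetical vector-valued function of a scalar x."""
    return [x, x ** 2, math.sin(x)]


def numerical_derivative(func, x, h=1e-6):
    """Approximate dy/dx component-wise with central differences."""
    forward = func(x + h)
    backward = func(x - h)
    return [(f - b) / (2 * h) for f, b in zip(forward, backward)]


x0 = 0.5  # arbitrary test point
approx = numerical_derivative(y, x0)
exact = [1.0, 2 * x0, math.cos(x0)]  # derivative of each component
print(approx, exact)
```

The output has exactly as many entries as $\textbf{y}$, one derivative per component.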

Case 3: y is a vector, x is a vector

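Suppose $\textbf{x}$ has length $n$ and $\textbf{y}$ has length $m$. Each component $y_i$ is a scalar, so each $\frac{\partial y_i}{\partial\textbf{x}}$ is a row vector of length $n$ as in Case 1, and we stack them:

$$\frac{\partial \textbf{y}}{\partial \textbf{x}}=\left[\frac{\partial y_1}{\partial\textbf{x}},\frac{\partial y_2}{\partial\textbf{x}},\frac{\partial y_3}{\partial\textbf{x}},\cdots,\frac{\partial y_m}{\partial\textbf{x}}\right]^T$$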

And here we go: