[latexpage]
The gradient of a differentiable function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is the vector of its first partial derivatives with respect to each variable. It is used to form the linear (first-order) approximation of the function near a point.
Definition
The gradient of $f$ at $x_0$, denoted $\nabla f(x_0)$, is the vector in $\mathbb{R}^n$ given by
\[ \nabla f\left(x_0\right) = \left(\begin{array}{c} \dfrac{\partial f}{\partial x_1}(x_0) \\[0.5em] \vdots \\[0.5em] \dfrac{\partial f}{\partial x_n}(x_0) \end{array}\right). \]
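As an illustration of the definition, the following is a minimal Python sketch that approximates each partial derivative by a central finite difference. The names `numerical_gradient` and `example_f`, the step size `h`, and the test function itself are assumptions made for this example; they are not part of the text above.

```python
from typing import Callable, List, Sequence


def numerical_gradient(
    f: Callable[[Sequence[float]], float],
    x0: Sequence[float],
    h: float = 1e-6,
) -> List[float]:
    """Approximate the gradient of f at x0, one partial derivative at a time."""
    grad = []
    for i in range(len(x0)):
        forward = list(x0)
        backward = list(x0)
        forward[i] += h   # perturb only the i-th variable
        backward[i] -= h
        # Central difference approximation of the i-th partial derivative.
        grad.append((f(forward) - f(backward)) / (2.0 * h))
    return grad


def example_f(x: Sequence[float]) -> float:
    # Hypothetical test function f(x) = x_1^2 + 3 x_2, whose gradient is (2 x_1, 3).
    return x[0] ** 2 + 3.0 * x[1]


approx = numerical_gradient(example_f, [1.0, 2.0])
print(approx)  # close to the exact gradient (2, 3), up to rounding error
```

A finite-difference check of this kind is only approximate, but it is a convenient way to validate a gradient formula derived by hand.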
Examples:
● Distance function: The distance function from a point $p \in \mathbb{R}^2$ to another point $x \in \mathbb{R}^2$ is defined as
$$
\rho(x)=\|x-p\|_2=\sqrt{\left(x_1-p_1\right)^2+\left(x_2-p_2\right)^2} .
$$
The function is differentiable provided $x \neq p$, which we assume. Then
$$
\nabla \rho(x)=\frac{1}{\sqrt{\left(x_1-p_1\right)^2+\left(x_2-p_2\right)^2}}\left(\begin{array}{l}
x_1-p_1 \\
x_2-p_2
\end{array}\right) .
$$
● Log-sum-exp function: Consider the "log-sum-exp" function $\operatorname{lse}: \mathbb{R}^2 \rightarrow \mathbb{R}$, with values
$$
\operatorname{lse}(x):=\log \left(e^{x_1}+e^{x_2}\right) .
$$
The gradient of $\operatorname{lse}$ at $x$ is
$$
\nabla \operatorname{lse}(x)=\frac{1}{z_1+z_2}\left(\begin{array}{c}
z_1 \\
z_2
\end{array}\right),
$$
where $z_i:=e^{x_i}$, $i=1,2$. More generally, the gradient of the function $\operatorname{lse}: \mathbb{R}^n \rightarrow \mathbb{R}$ with values
$$
\operatorname{lse}(x)=\log \left(\sum_{i=1}^n e^{x_i}\right)
$$
is given by
$$
\nabla \operatorname{lse}(x)=\frac{1}{\sum_{i=1}^n e^{x_i}}\left(\begin{array}{c}
e^{x_1} \\
\vdots \\
e^{x_n}
\end{array}\right)=\frac{1}{Z} z,
$$
where $z=\left(e^{x_1}, \ldots, e^{x_n}\right)$, and $Z=\sum_{i=1}^n z_i$.
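Both example gradients can be evaluated directly from their closed forms. The sketch below is one possible plain-Python rendering; the function names are hypothetical. Subtracting the largest entry before exponentiating is a standard numerical safeguard added here as an assumption; it is not part of the formulas above, and it leaves the gradient unchanged because the common factor cancels in the ratio $z / Z$.

```python
import math
from typing import List, Sequence


def distance_gradient(x: Sequence[float], p: Sequence[float]) -> List[float]:
    """Gradient of rho(x) = ||x - p||_2 in R^2, valid only for x != p."""
    r = math.hypot(x[0] - p[0], x[1] - p[1])
    if r == 0.0:
        raise ValueError("the distance function is not differentiable at x = p")
    return [(x[0] - p[0]) / r, (x[1] - p[1]) / r]


def lse(x: Sequence[float]) -> float:
    """log(sum_i exp(x_i)), computed with a shift by max(x) to avoid overflow."""
    m = max(x)
    return m + math.log(sum(math.exp(xi - m) for xi in x))


def lse_gradient(x: Sequence[float]) -> List[float]:
    """Gradient z / Z of lse at x, where z_i = exp(x_i) and Z = sum_i z_i."""
    m = max(x)
    z = [math.exp(xi - m) for xi in x]  # shifted z_i; the factor exp(-m) cancels below
    Z = sum(z)
    return [zi / Z for zi in z]


# Distance example: x - p = (3, 4), so the distance is 5.
d_grad = distance_gradient([4.0, 6.0], [1.0, 2.0])

# Log-sum-exp example: z is proportional to (1, 1, 2).
l_grad = lse_gradient([0.0, 0.0, math.log(2.0)])
```

Two properties are visible in the formulas and can be checked on the results: the gradient of the distance function always has unit Euclidean norm, and the entries of the log-sum-exp gradient are positive and sum to one.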
Composition rule with an affine function
If $A \in \mathbb{R}^{m \times n}$ is a matrix, $b \in \mathbb{R}^m$ is a vector, and $f: \mathbb{R}^m \rightarrow \mathbb{R}$ is differentiable, the function $g: \mathbb{R}^n \rightarrow \mathbb{R}$ with values
$$
g(x)=f(A x+b)
$$
is called the composition of the affine map $x \mapsto A x+b$ with $f$. By the chain rule, its gradient is given by
$$
\nabla g(x)=A^T \nabla f(A x+b) .
$$
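The composition rule can be sketched in a few lines of plain Python. The choice $f(y) = \sum_i y_i^2$, with gradient $2y$, and all helper names below are assumptions made for illustration only; any differentiable $f$ with a known gradient could be substituted.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def affine(A: Matrix, x: Sequence[float], b: Sequence[float]) -> List[float]:
    """Return A x + b, with A stored as a list of m rows of length n."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(A, b)
    ]


def transpose_times(A: Matrix, v: Sequence[float]) -> List[float]:
    """Return A^T v, a vector of length n."""
    n = len(A[0])
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(n)]


def grad_f(y: Sequence[float]) -> List[float]:
    # Hypothetical outer function f(y) = sum_i y_i^2, whose gradient is 2 y.
    return [2.0 * yi for yi in y]


def grad_g(A: Matrix, b: Sequence[float], x: Sequence[float]) -> List[float]:
    """Gradient of g(x) = f(A x + b), computed as A^T grad_f(A x + b)."""
    return transpose_times(A, grad_f(affine(A, x, b)))


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # m = 3, n = 2
b = [1.0, 0.0, -1.0]
x = [1.0, -1.0]
result = grad_g(A, b, x)
print(result)  # [-26.0, -32.0]
```

Here $Ax + b = (0, -1, -2)$, so $\nabla f(Ax+b) = (0, -2, -4)$, and multiplying by $A^T$ maps this vector in $\mathbb{R}^3$ back to a gradient in $\mathbb{R}^2$, the domain of $g$.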
Geometric interpretation
Geometrically, the gradient can be read off a plot of the level sets of the function. Specifically, at any point $x$ where the gradient is nonzero, it is perpendicular to the level set through $x$ and points outwards from the sub-level set (that is, towards higher values of the function).
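This property can be checked on the distance function from the examples, whose level sets are circles centered at $p$. The sketch below, with hypothetical names and arbitrarily chosen points, verifies that the gradient at $x$ is orthogonal to the tangent of the circle through $x$, and that a small step along the gradient increases the function value.

```python
import math
from typing import List, Sequence


def rho(x: Sequence[float], p: Sequence[float]) -> float:
    """Euclidean distance from p to x in R^2."""
    return math.hypot(x[0] - p[0], x[1] - p[1])


def grad_rho(x: Sequence[float], p: Sequence[float]) -> List[float]:
    """Gradient of rho at x, assuming x != p."""
    r = rho(x, p)
    return [(x[0] - p[0]) / r, (x[1] - p[1]) / r]


p = [1.0, 2.0]
x = [4.0, 6.0]
g = grad_rho(x, p)

# Tangent direction at x to the level set {y : rho(y) = rho(x)}, a circle
# centered at p: rotate the radius vector x - p by 90 degrees.
tangent = [-(x[1] - p[1]), x[0] - p[0]]
dot = g[0] * tangent[0] + g[1] * tangent[1]  # zero up to rounding error

# A small step along the gradient leaves the sub-level set {y : rho(y) <= rho(x)}.
step = 0.1
x_up = [x[0] + step * g[0], x[1] + step * g[1]]
increase = rho(x_up, p) - rho(x, p)  # positive
```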