Calculus¶

Euler’s method¶

An iterative method for solving differential equations (ie integration).

Hessian matrix¶

Let $f:\mathbb{R}^n \rightarrow \mathbb{R}$ be a function mapping vectors onto real numbers. Then the Hessian is defined as the matrix of second order partial derivatives:

$H_{ij} = \frac{\partial^2 f}{\partial x_i x_j}$

Applied to neural networks¶

In the context of neural networks, $f$ is usually the loss function and $x$ is the parameter vector so we have:

$H_{ij} = \frac{\partial^2 L}{\partial \theta_i \theta_j}$

The size and therefore cost to compute of the Hessian is quadratic in the number of parameters. This makes it infeasible to compute for most problems.

However, it is of theoretical interest as its properties can tell us a lot about the nature of the loss function we are trying to optimize.

If the Hessian at a point on the loss surface has no negative eigenvalues the point is a local minimum.

Condition number of the Hessian¶

If the Hessian is ill-conditioned, the loss function may be hard to optimize with gradient descent.

Recall that the condition number of a matrix is the ratio of the highest and lowest singular values and that in an ill-conditioned matrix this ratio is high. Large singular values of the Hessian indicate a large change in the gradient in some direction but small ones indicate very little change. Having both of these means the loss function may have ‘ravines’ which cause many first-order gradient descent methods to zigzag, resulting in slow convergence.

Relationship to generalization¶

Keskar et al. (2016) argue that when the Hessian evaluated at the solution has many large eigenvalues the corresponding network is likely to generalize less well. Large eigenvalues in the Hessian make the minimum likely to be sharp, which in turn generalize less well since those optima are more sensitive to small changes in the parameters.

Further reading
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, Sagun et al. (2017)
Sharp Minima Can Generalize For Deep Nets, Dinh et al. (2017)
On Large Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Keskar et al. (2016)

Jacobian matrix¶

Let $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$ be a function. Then the Jacobian of $f$ can be defined as the matrix of partial derivatives:

$J_{ij} = \frac{\partial f_i}{\partial x_j}$

Applied to neural networks¶

It is common in machine learning to compute the Jacobian of the loss function of a network with respect to its parameters. Then $m$ in $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$ is 1 and the Jacobian is a vector representing the gradients of the network:

$J_i = \frac{\partial L}{\partial \theta_i}$

Partial derivative¶

The derivative of a function of many variables with respect to one of those variables.

The notation for the partial derivative of y with respect to x is $\frac{\partial y}{\partial x}$

Rules of differentiation¶

Sum rule¶

$(f+g)' = f' + g'$

Product rule¶

$(fg)' = fg' + f'g$

Quotient rule¶

$(f/g)' = (f'g + fg')/g^2$

Reciprocal rule¶

$(1/f)' = -f'/f^2$

Power rule¶

$(x^a)' = ax^{a-1}$

Exponentials¶

$(a^{bx})' = a^{bx} \cdot b\log(a)$

Logarithms¶

$(\log_a x)' = 1/(x \ln a)$

$(\ln(x))' = 1/x$

Chain rule¶

$\frac{dy}{dx} = \frac{dy}{dz} \cdot \frac{dz}{dx}$

Multivariate chain rule¶

Used to calculate total derivatives.

$\frac{dy}{dx} = \frac{dy}{da} \cdot \frac{da}{dx} + \frac{dy}{db} \cdot \frac{db}{dx}$

The derivative of a function wrt a function¶

Can be done using the chain rule. For example, $\partial x^6/\partial x^2$ can be found by setting $y=x^6$ and $z=x^2$ .

Then do $\partial y/\partial z$

$= \partial y/\partial x \cdot \partial x/\partial z$

$= 6x^5 \cdot 1/{2x}$

$= 3x^4$

Inverse relationship¶

In general $dy/dx$ is the inverse of $dx/dy$ .

Matrix differentiation¶

$\frac{dX}{dX} = I$

$\frac{d X^T Y}{dX} = Y$

$\frac{d YX}{dX} = Y^T$

Total derivative¶

The derivative of a function of many arguments with respect to one of those arguments, taking into account any indirect effects via the other arguments.

The total derivative of $z(x,y)$ with respect to $x$ is:

$\frac{dz}{dx} = \frac{\partial z}{\partial x} + \frac{\partial z}{\partial y} \frac{dy}{dx}$