# Calculus¶

## Euler’s method¶

An iterative method for solving differential equations (ie integration).

## Hessian matrix¶

Let be a function mapping vectors onto real numbers. Then the Hessian is defined as the matrix of second order partial derivatives:

### Applied to neural networks¶

In the context of neural networks, is usually the loss function and is the parameter vector so we have:

The size and therefore cost to compute of the Hessian is quadratic in the number of parameters. This makes it infeasible to compute for most problems.

However, it is of theoretical interest as its properties can tell us a lot about the nature of the loss function we are trying to optimize.

If the Hessian at a point on the loss surface has no negative eigenvalues the point is a local minimum.

### Condition number of the Hessian¶

If the Hessian is ill-conditioned, the loss function may be hard to optimize with gradient descent.

Recall that the condition number of a matrix is the ratio of the highest and lowest singular values and that in an ill-conditioned matrix this ratio is high. Large singular values of the Hessian indicate a large change in the gradient in some direction but small ones indicate very little change. Having both of these means the loss function may have ‘ravines’ which cause many first-order gradient descent methods to zigzag, resulting in slow convergence.

### Relationship to generalization¶

Keskar et al. (2016) argue that when the Hessian evaluated at the solution has many large eigenvalues the corresponding network is likely to generalize less well. Large eigenvalues in the Hessian make the minimum likely to be sharp, which in turn generalize less well since those optima are more sensitive to small changes in the parameters.

## Jacobian matrix¶

Let be a function. Then the Jacobian of can be defined as the matrix of partial derivatives:

### Applied to neural networks¶

It is common in machine learning to compute the Jacobian of the loss function of a network with respect to its parameters. Then and the Jacobian is a vector representing the gradients of the network:

## Partial derivative¶

The derivative of a function of many variables with respect to one of those variables.

The notation for the partial derivative of y with respect to x is

## Rules of differentiation¶

### Multivariate chain rule¶

Used to calculate total derivatives.

### The derivative of a function wrt a function¶

Can be done using the chain rule. For example, can be found by setting and .

Then do

### Inverse relationship¶

In general is the inverse of .

## Total derivative¶

The derivative of a function of many arguments with respect to one of those arguments, taking into account any indirect effects via the other arguments.

The total derivative of with respect to is: