SVMs

Support Vector Machines.

A binary classifier whose objective is to find the hyperplane that separates the two classes with the largest possible margin.

Hard margin

Can only be used when the data is linearly separable, i.e. when some hyperplane classifies every training example correctly.

The decision function for a linear hard-margin SVM is:

\hat{y}_i = \text{sgn}(w \cdot x_i - b)

All positive examples should satisfy w \cdot x_i - b \geq 1 and all negative examples should satisfy w \cdot x_i - b \leq -1; with labels y_i \in \{+1, -1\} this can be written jointly as y_i(w \cdot x_i - b) \geq 1.
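
As a minimal sketch (the weights, bias and toy examples below are assumptions chosen for illustration, not learnt values), the decision function and the margin constraints can be checked directly with NumPy:

import numpy as np

# Assumed parameters of a 2D hard-margin classifier (illustrative only)
w = np.array([2.0, -1.0])
b = 0.5

X = np.array([[1.5, 0.0],    # positive example
              [-1.0, 1.0]])  # negative example
y = np.array([1, -1])

scores = X @ w - b        # w . x_i - b for each example
y_hat = np.sign(scores)   # decision function: sgn(w . x_i - b)

# Hard-margin constraint, written jointly: y_i (w . x_i - b) >= 1
print(y_hat, y * scores >= 1)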

Soft-margin

Necessary when the data is not linearly separable: examples are allowed to violate the margin, but each violation is penalised in the loss.

The loss function for a linear soft-margin SVM is:

L(w,b) = \sum_i \max \{0, m - y_i(w \cdot x_i - b) \} + C||w||^2

Where w and b are parameters to be learnt, m is the margin (typically 1) and C is a regularisation hyperparameter controlling the trade-off between margin violations and the norm of w.
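
As a sketch of how the formula above translates to code (the function name and the toy data are assumptions), the loss can be evaluated with NumPy:

import numpy as np

def soft_margin_loss(w, b, X, y, C=0.1, m=1.0):
    """Hinge loss summed over the dataset plus an L2 penalty on w."""
    scores = X @ w - b
    hinge = np.maximum(0.0, m - y * scores).sum()
    return hinge + C * np.dot(w, w)

# Illustrative data: two examples per class
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(soft_margin_loss(np.array([1.0, 1.0]), 0.0, X, y))

Training amounts to minimising this loss with respect to w and b, for example by (sub)gradient descent or by solving the equivalent quadratic program.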

Dual form

Primal form

Training

Quadratic programming

Kernels

A kernel implicitly maps the data into a higher-dimensional space in which it is easier to separate linearly. Because the SVM only needs inner products between examples, the kernel computes those inner products directly without ever constructing the high-dimensional representation. This is known as the kernel trick.
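
A small numerical check of this idea (the feature map and example vectors below are assumptions chosen for illustration): a degree-2 polynomial kernel on 2D inputs gives exactly the inner product of explicit quadratic feature maps, without ever constructing them.

import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (illustrative)."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = (x @ y + 1.0) ** 2   # polynomial kernel with a = 1, b = 1, d = 2
explicit_value = phi(x) @ phi(y)    # same value via the explicit 6D feature map
print(kernel_value, explicit_value)  # both print 4.0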

Linear

k(x,y) = x \cdot y

Polynomial

k(x,y) = (a x \cdot y + b)^d

Sigmoid

k(x,y) = \tanh(a x \cdot y + b)

RBF

k(x,y) = \exp(-||x-y||^2 / (2 \sigma^2))
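
As a sketch (the function names are mine, not from any library), the four kernels above can be written as small NumPy functions:

import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, a=1.0, b=1.0, d=3):
    return (a * (x @ y) + b) ** d

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    return np.tanh(a * (x @ y) + b)

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y),
      sigmoid_kernel(x, y), rbf_kernel(x, y))

These correspond to the 'linear', 'poly', 'sigmoid' and 'rbf' kernel options found in common SVM implementations such as scikit-learn's SVC.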

Advantages

  • The optimisation problem is convex so local optima are not a problem.

Disadvantages

  • Cannot naturally learn multiclass classification problems. Applying an SVM to an n-class problem requires reformulating it as a series of binary classification tasks, either n one-vs-rest tasks or n(n-1)/2 one-vs-one tasks (as sketched below). Learning these separately is inefficient and can hurt generalisation.
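
For example, scikit-learn offers both reductions: SVC trains one-vs-one classifiers internally for multiclass data, and OneVsRestClassifier wraps any binary classifier in a one-vs-rest scheme. A minimal sketch (the Iris dataset is just a convenient stand-in):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes: 3 one-vs-rest or 3 one-vs-one tasks

ovo = SVC(kernel="rbf")                       # multiclass handled one-vs-one internally
ovr = OneVsRestClassifier(SVC(kernel="rbf"))  # explicit one-vs-rest reduction

print(ovo.fit(X, y).score(X, y), ovr.fit(X, y).score(X, y))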

One-Class SVMs

See here