Support Vector Machines

A binary classifier whose objective is to find the hyperplane that optimally separates the two classes, i.e. maximises the margin between them.

Hard margin

Can be used when the data is linearly separable.

The decision function for a linear hard-margin SVM is:

\hat{y}_i = \text{sgn}(wx_i - b)

All positive examples should have wx_i - b \geq 1 and all negative examples should have wx_i - b \leq -1.
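As a minimal sketch of the decision function above, assuming hand-picked (not learnt) values for w and b:

```python
import numpy as np

# Toy parameters, chosen for illustration rather than learnt from data.
w = np.array([2.0, -1.0])
b = 0.5

def predict(x):
    # sgn(w . x - b), returning +1 or -1 (ties broken towards +1)
    return 1 if np.dot(w, x) - b >= 0 else -1
```

A point with w·x - b ≥ 1 is classified as positive; one with w·x - b ≤ -1 as negative.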


Soft margin

Necessary when the data is not linearly separable.

The loss function for a linear soft-margin SVM is:

L(x;w,b) = \sum_i \max \{0, 1 - y_i(wx_i - b) \} + C||w||^2

Where w and b are parameters to be learnt and C is a hyperparameter controlling the strength of the regularisation.
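The objective above can be computed directly; a minimal sketch, assuming X is an (n, d) matrix of examples and y a vector of labels in {-1, +1}:

```python
import numpy as np

# Soft-margin SVM objective: hinge loss plus C-weighted L2 penalty.
def svm_loss(w, b, X, y, C):
    margins = y * (X @ w - b)                    # y_i (w . x_i - b) for each i
    hinge = np.maximum(0.0, 1.0 - margins).sum()  # sum_i max{0, 1 - margin_i}
    return hinge + C * np.dot(w, w)               # + C ||w||^2
```

Correctly classified points outside the margin contribute zero hinge loss, so only the regularisation term remains.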

Primal form

Minimise \frac{1}{2}||w||^2 subject to y_i(wx_i - b) \geq 1 for all i.

Dual form

Maximise \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) subject to \alpha_i \geq 0 and \sum_i \alpha_i y_i = 0. The data appears only through the dot products x_i \cdot x_j, which is what allows them to be replaced by a kernel.

Quadratic programming

Both forms have a quadratic objective with linear constraints, so they can be solved with standard quadratic programming methods.


Kernels

The kernel computes dot products as if the data had been mapped into a high-dimensional space in which it is easier to separate linearly, without constructing that mapping explicitly. This is known as the kernel trick.


Linear kernel:

k(x,y) = x \cdot y


Polynomial kernel:

k(x,y) = (a x \cdot y + b)^d


Sigmoid kernel:

k(x,y) = \tanh(a x \cdot y + b)


RBF (Gaussian) kernel:

k(x,y) = \exp (-||x-y||^2 / 2\sigma^2)
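The four kernels above can be written directly for a pair of vectors; the hyperparameter values a, b, d and sigma below are placeholder defaults, not recommendations:

```python
import numpy as np

def linear(x, y):
    return np.dot(x, y)

def polynomial(x, y, a=1.0, b=1.0, d=2):
    return (a * np.dot(x, y) + b) ** d

def sigmoid(x, y, a=1.0, b=0.0):
    return np.tanh(a * np.dot(x, y) + b)

def rbf(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
```

Note that rbf(x, x) = 1 for any x, since the squared distance is zero.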


Advantages

  • The optimisation problem is convex, so local optima are not a problem.


Disadvantages

  • Cannot naturally learn multiclass classification problems. Applying an SVM to an n-class problem requires reformulating it as a series of binary classification tasks: either n one-vs-all tasks or n(n-1)/2 one-vs-one tasks. Learning these separately is inefficient and generalises poorly.
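The one-vs-all scheme above can be sketched at prediction time, assuming the n binary classifiers have already been trained and their parameters stacked into a matrix W (one row per class) and vector b:

```python
import numpy as np

# One-vs-all prediction: score x with every binary classifier and
# return the class whose classifier is most confident.
def one_vs_all_predict(W, b, x):
    scores = W @ x - b        # decision value w_k . x - b_k for each class k
    return int(np.argmax(scores))
```

Each binary classifier was trained to separate its own class from all the others; the argmax resolves the case where several (or none) claim the point.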

One-Class SVMs

See here