SVMs

Support Vector Machines.

A binary classifier whose objective is to find the hyperplane that separates the two classes with the largest possible margin.

Hard margin

Can only be used when the data is linearly separable, i.e. when some hyperplane classifies every training example correctly.

The decision function for a linear hard-margin SVM is:

\hat{y}_i = \text{sgn}(w \cdot x_i - b)

All positive examples should satisfy w \cdot x_i - b \geq 1 and all negative examples should satisfy w \cdot x_i - b \leq -1; with labels y_i \in \{+1, -1\} this can be written jointly as y_i(w \cdot x_i - b) \geq 1.
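
As a minimal sketch (the weights, bias and toy examples below are assumptions chosen for illustration, not learnt values), the decision function and the margin constraints can be checked directly with NumPy:

import numpy as np

# Assumed parameters of a 2D hard-margin classifier (illustrative only)
w = np.array([2.0, -1.0])
b = 0.5

X = np.array([[1.5, 0.0],    # positive example
              [-1.0, 1.0]])  # negative example
y = np.array([1, -1])

scores = X @ w - b        # w . x_i - b for each example
y_hat = np.sign(scores)   # decision function: sgn(w . x_i - b)

# Hard-margin constraint, written jointly: y_i (w . x_i - b) >= 1
print(y_hat, y * scores >= 1)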

Soft-margin

Necessary when the data is not linearly separable: examples are allowed to violate the margin, but each violation is penalised in the loss.

The loss function for a linear soft-margin SVM is:

L(w,b) = \sum_i \max \{0, m - y_i(w \cdot x_i - b) \} + C||w||^2

Where w and b are parameters to be learnt, m is the margin (typically 1) and C is a regularisation hyperparameter controlling the trade-off between margin violations and the norm of w.
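
As a sketch of how the formula above translates to code (the function name and the toy data are assumptions), the loss can be evaluated with NumPy:

import numpy as np

def soft_margin_loss(w, b, X, y, C=0.1, m=1.0):
    """Hinge loss summed over the dataset plus an L2 penalty on w."""
    scores = X @ w - b
    hinge = np.maximum(0.0, m - y * scores).sum()
    return hinge + C * np.dot(w, w)

# Illustrative data: two examples per class
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
print(soft_margin_loss(np.array([1.0, 1.0]), 0.0, X, y))

Training amounts to minimising this loss with respect to w and b, for example by (sub)gradient descent or by solving the equivalent quadratic program.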

Dual form

Primal form

Training

Quadratic programming

Kernels

A kernel implicitly maps the data into a higher-dimensional space in which it is easier to separate linearly. Because the SVM only needs inner products between examples, the kernel computes those inner products directly without ever constructing the high-dimensional representation. This is known as the kernel trick.
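
A small numerical check of this idea (the feature map and example vectors below are assumptions chosen for illustration): a degree-2 polynomial kernel on 2D inputs gives exactly the inner product of explicit quadratic feature maps, without ever constructing them.

import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (illustrative)."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

kernel_value = (x @ y + 1.0) ** 2   # polynomial kernel with a = 1, b = 1, d = 2
explicit_value = phi(x) @ phi(y)    # same value via the explicit 6D feature map
print(kernel_value, explicit_value)  # both print 4.0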

Linear

k(x,y) = x \cdot y

Polynomial

k(x,y) = (a x \cdot y + b)^d

Sigmoid

k(x,y) = \tanh(a x \cdot y + b)

RBF

k(x,y) = \exp(-||x-y||^2 / (2 \sigma^2))
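
As a sketch (the function names are mine, not from any library), the four kernels above can be written as small NumPy functions:

import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, a=1.0, b=1.0, d=3):
    return (a * (x @ y) + b) ** d

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    return np.tanh(a * (x @ y) + b)

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y),
      sigmoid_kernel(x, y), rbf_kernel(x, y))

These correspond to the 'linear', 'poly', 'sigmoid' and 'rbf' kernel options found in common SVM implementations such as scikit-learn's SVC.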

Advantages

  • The optimisation problem is convex so local optima are not a problem.

Disadvantages

  • Cannot naturally learn multiclass classification problems. Applying an SVM to an n-class problem requires reformulating it as a series of binary classification tasks, either n one-vs-rest tasks or n(n-1)/2 one-vs-one tasks (as sketched below). Learning these separately is inefficient and can hurt generalisation.
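
For example, scikit-learn offers both reductions: SVC trains one-vs-one classifiers internally for multiclass data, and OneVsRestClassifier wraps any binary classifier in a one-vs-rest scheme. A minimal sketch (the Iris dataset is just a convenient stand-in):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes: 3 one-vs-rest or 3 one-vs-one tasks

ovo = SVC(kernel="rbf")                       # multiclass handled one-vs-one internally
ovr = OneVsRestClassifier(SVC(kernel="rbf"))  # explicit one-vs-rest reduction

print(ovo.fit(X, y).score(X, y), ovr.fit(X, y).score(X, y))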

One-Class SVMs

See here