Computer vision

Data augmentation

The images in the training set are randomly altered in order to improve the generalization of the network.

Cubuk et al. (2018), who evaluate a number of different data augmentation techniques, use the following transforms:

  • Blur - The entire image is blurred by a random amount.
  • Brightness
  • Color balance
  • Contrast
  • Cropping - The image is randomly cropped and the result is fed into the network instead.
  • Cutout - Mask a random square region of the image, replacing it with grey. It was used to obtain state-of-the-art results on the CIFAR-10, CIFAR-100 and SVHN datasets at the time of publication. Proposed in Improved Regularization of Convolutional Neural Networks with Cutout, DeVries and Taylor (2017).
  • Equalize - Perform histogram equalization on the image. This adjusts the contrast.
  • Flipping - The image is flipped with probability 0.5 and left unchanged otherwise. Normally only horizontal flipping is used, but vertical flipping can be used where it makes sense, for example with satellite imagery.
  • Posterize - Reduce the number of bits used to represent each pixel.
  • Rotation
  • Sample pairing - Combine two random images into a new synthetic image. See Data Augmentation by Pairing Samples for Images Classification, Inoue (2018).
  • Shearing
  • Solarize - Pixels with values above a randomly chosen threshold are inverted.
  • Translation
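
A few of the transforms above can be sketched in NumPy. This is a simplified illustration, not the implementation used by Cubuk et al. or DeVries and Taylor: it assumes 8-bit images stored as (height, width, channels) arrays, and the function names, patch size, and grey value of 128 are illustrative choices.

```python
import numpy as np

def random_flip(image, rng, p=0.5):
    """Flip the image horizontally with probability p, else return it unchanged."""
    if rng.random() < p:
        return image[:, ::-1]
    return image

def cutout(image, rng, size=8, fill=128):
    """Replace a random size x size square with a constant grey value.

    Simplification: the square is always placed fully inside the image.
    """
    h, w = image.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = image.copy()
    out[top:top + size, left:left + size] = fill
    return out

def solarize(image, threshold):
    """Invert every pixel value that is above the threshold."""
    return np.where(image > threshold, 255 - image, image)

def posterize(image, bits):
    """Keep only the `bits` most significant bits of each 8-bit value."""
    shift = 8 - bits
    return (image >> shift) << shift

rng = np.random.default_rng(0)
# Random stand-in for a 32x32 RGB training image.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = cutout(random_flip(image, rng), rng)
```

In training, transforms like these would typically be applied afresh each time an image is drawn, so the network rarely sees exactly the same input twice.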



COCO

Common Objects in COntext. A dataset for image recognition, segmentation and captioning.

Detection task - Notable results (mAP):


MNIST

70,000 28x28 pixel grayscale images of handwritten digits (10 classes), with 60,000 in the training set and 10,000 in the test set.


SVHN

Street View House Numbers.

Face recognition

The general task of recognizing people from images of their faces. Includes face identification and face verification.

The normal face recognition pipeline is:

  • Face detection - Identifying the area of the photo that corresponds to the face.
  • Face alignment - Often done by detecting facial landmarks like the nose, eyes and mouth.
  • Feature extraction and similarity calculation


Challenges include:

  • Photos being taken at different angles.
  • Different lighting conditions.
  • Changes in facial hair.
  • Glasses.
  • People aging over time.


Datasets include:

  • LFW
  • YouTube-Faces
  • CASIA-Webface
  • CelebA

Face identification

Multiclass classification problem. Given an image of a face, determine the identity of the person.

Face verification

Binary classification problem. Given two images of faces, assess whether they are from the same person or not.

Commonly used architectures for solving this problem include Siamese and Triplet networks.
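
As a rough sketch of how verification with such networks can work: each face is mapped to an embedding vector, a triplet loss pulls embeddings of the same person together during training, and two faces are judged to match if their embeddings are close enough. The embeddings below are hand-written stand-ins for the output of a hypothetical network, and the margin of 0.2 and threshold of 1.0 are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Scale embeddings to unit length so distances are comparable."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the anchor is closer to the positive
    (same person) than to the negative (different person) by the margin."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + margin, 0.0)

def same_person(emb_a, emb_b, threshold=1.0):
    """Verification decision: squared distance below a chosen threshold."""
    return np.sum((emb_a - emb_b) ** 2, axis=-1) < threshold

# Stand-ins for embeddings produced by a (hypothetical) trained network.
anchor = l2_normalize(np.array([1.0, 0.0]))
positive = l2_normalize(np.array([0.9, 0.1]))
negative = l2_normalize(np.array([0.0, 1.0]))
loss = triplet_loss(anchor, positive, negative)
```

In practice the threshold would be tuned on held-out pairs to trade off false accepts against false rejects.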

Image segmentation

Partitions an image into meaningful regions with associated labels. May also be referred to as per-pixel classification.

Instance segmentation

Unlike semantic segmentation, different instances of the same object type have to be labelled as separate objects (e.g. person 1, person 2). Harder than semantic segmentation.

Semantic segmentation

Unlike instance segmentation, in semantic segmentation it is only necessary to predict what class each pixel belongs to, not separate out different instances of the same class.
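
The difference between the two label formats can be illustrated on a toy example. The class scores below are made up, standing in for the per-pixel output of a hypothetical network, and the instance mask is written by hand in the style of a ground-truth annotation.

```python
import numpy as np

# Hypothetical class scores of shape (num_classes, height, width)
# for the classes 0 = background and 1 = person.
scores = np.zeros((2, 4, 6))
scores[0] = 0.5
scores[1, 1:3, 0:2] = 0.9   # pixels covering the first person
scores[1, 1:3, 4:6] = 0.8   # pixels covering the second person

# Semantic segmentation: one class label per pixel.
# Both people receive the same label, 1.
semantic_mask = scores.argmax(axis=0)

# Instance segmentation additionally gives each object its own identity.
instance_mask = np.zeros_like(semantic_mask)
instance_mask[1:3, 0:2] = 1  # person 1
instance_mask[1:3, 4:6] = 2  # person 2
```

Here the semantic mask contains only the values 0 and 1, while the instance mask distinguishes the two people with the values 1 and 2.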

Weakly-supervised segmentation

Learning to segment from only image-level labels. The labels will describe the classes that exist within the image but not what the class is for every pixel.

The results from weak supervision are generally poorer than those from full (per-pixel) supervision, but the datasets tend to be much cheaper to acquire.

When the dataset is only weakly labelled, it can be very hard to correctly separate highly correlated objects that are usually only seen together, such as a train and rails.

Image-to-image translation


Transforming an image from one domain into a corresponding image in another. Examples include:

  • Daytime to nighttime
  • Greyscale to colour
  • Streetmap to satellite view

See Image-to-Image Translation with Conditional Adversarial Networks, Isola et al. (2016).

Object recognition

Region of interest

See ‘region proposal’.

Region proposal

A region in an image (usually defined by a rectangle) identified as containing an object of interest with high probability, relative to the background.

Two-stage detector

The first stage proposes regions that may contain objects of interest. The second stage classifies these regions as either background or one of the classes.

There is often a significant class-imbalance problem since background regions greatly outnumber the other classes.

Contrast with one-stage detectors.
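
The two stages can be sketched as follows. This is a toy outline under stated assumptions, not any particular detector: the objectness scores and class scores are hand-written stand-ins for the outputs of a hypothetical proposal network and a hypothetical classifier, and the function names are illustrative.

```python
import numpy as np

BACKGROUND = 0  # class index reserved for "no object"

def stage_one(objectness, boxes, keep=3):
    """Propose the `keep` boxes with the highest objectness scores."""
    order = np.argsort(objectness)[::-1][:keep]
    return boxes[order]

def stage_two(class_scores):
    """Assign each proposed region to background or one of the classes."""
    return class_scores.argmax(axis=1)

# Candidate boxes as (x1, y1, x2, y2) with made-up objectness scores.
boxes = np.array([[0, 0, 10, 10], [5, 5, 20, 20], [30, 30, 40, 40], [1, 1, 3, 3]])
objectness = np.array([0.9, 0.2, 0.8, 0.6])
proposals = stage_one(objectness, boxes)

# Made-up second-stage scores per proposal: [background, class 1, class 2].
class_scores = np.array([[0.1, 0.8, 0.1],
                         [0.2, 0.1, 0.7],
                         [0.9, 0.05, 0.05]])
labels = stage_two(class_scores)
detections = proposals[labels != BACKGROUND]
```

Even in this small example one of the three proposals turns out to be background; in real images the proportion is usually far higher, which is the source of the class imbalance.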

Saliency map

A heatmap over an image which shows each pixel’s importance for the classification.
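
One simple way to compute such a heatmap is occlusion sensitivity: mask a small patch of the image, record how much the class score drops, and attribute that drop to the masked pixels. This is only one of several approaches (gradient-based methods are also common). In the sketch below, `score_fn` is a toy stand-in for a classifier's score for the class of interest, and the patch size and fill value are illustrative assumptions.

```python
import numpy as np

def occlusion_saliency(image, score_fn, patch=2, fill=0.0):
    """Per-pixel importance: the average drop in the class score over
    all occluding patches that cover the pixel."""
    h, w = image.shape
    base = score_fn(image)
    heat = np.zeros((h, w))
    counts = np.zeros((h, w))
    for top in range(h - patch + 1):
        for left in range(w - patch + 1):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            drop = base - score_fn(occluded)
            heat[top:top + patch, left:left + patch] += drop
            counts[top:top + patch, left:left + patch] += 1
    return heat / counts

def score_fn(img):
    """Toy classifier score: total brightness of the central 2x2 region."""
    return img[2:4, 2:4].sum()

image = np.ones((6, 6))
saliency = occlusion_saliency(image, score_fn)
```

Because the toy score depends only on the centre of the image, the resulting heatmap is highest there and zero in the corners.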