Computer vision

Tasks which have an image or video as their input. Standard challenges in computer vision include:


  • Parts of the object may be obscured.
  • Photos can be taken at different angles.
  • Different lighting conditions. Both the direction and amount of light may differ, as well as the number of light sources.
  • Objects belonging to one class can come in a variety of forms.

Data augmentation

The images in the training set are randomly altered in order to improve the generalization of the network.

Cubuk et al. (2018), who evaluate a number of different data augmentation techniques, use the following transforms:

  • Blur - The entire image is blurred by a random amount.
  • Brightness
  • Color balance
  • Contrast
  • Cropping - The image is randomly cropped and the result is fed into the network instead.
  • Cutout - Mask a random square region of the image, replacing it with grey. It was used to obtain state of the art results on the CIFAR-10, CIFAR-100 and SVHN datasets. Proposed in Improved Regularization of Convolutional Neural Networks with Cutout, DeVries and Taylor (2017).
  • Equalize - Perform histogram equalization on the image. This adjusts the contrast.
  • Flipping - The image is flipped with probability 0.5 and left as it is otherwise. Normally only horizontal flipping is used but vertical flipping can be used where it makes sense - satellite imagery for example.
  • Posterize - Decrease the bits per pixel
  • Rotation
  • Sample pairing - Combine two random images into a new synthetic image. See Data Augmentation by Pairing Samples for Images Classification, Inoue (2018).
  • Shearing
  • Solarize - Pixels above a random value are inverted.
  • Translation
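
A few of the transforms above can be sketched with NumPy. This is a minimal illustration, not the implementation used by Cubuk et al. It assumes 8-bit images stored as (height, width, channels) arrays. The function names, the default patch size and the grey value of 128 are choices made for this example, and the cutout square is kept fully inside the image for simplicity.

```python
import numpy as np


def random_horizontal_flip(image, rng, p=0.5):
    """Flip left-to-right with probability p, otherwise return the image unchanged."""
    if rng.random() < p:
        return image[:, ::-1]
    return image


def cutout(image, rng, size=8, fill=128):
    """Replace a random size x size square with a constant grey value."""
    height, width = image.shape[:2]
    top = int(rng.integers(0, height - size + 1))
    left = int(rng.integers(0, width - size + 1))
    result = image.copy()
    result[top:top + size, left:left + size] = fill
    return result


def solarize(image, threshold):
    """Invert every pixel whose value is above the threshold."""
    return np.where(image > threshold, 255 - image, image).astype(np.uint8)


def posterize(image, bits):
    """Keep only the `bits` most significant bits of each 8-bit value."""
    shift = 8 - bits
    return (image >> shift) << shift


# Example: a random CIFAR-sized image, flipped with probability 0.5 and then masked.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = cutout(random_horizontal_flip(image, rng), rng)
```

In training, transforms like these would typically be applied afresh to each image every time it is drawn, so the network rarely sees exactly the same input twice.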



CIFAR-10 is a dataset of 60000 32x32 colour images in 10 classes, with 6000 images per class. CIFAR-100 has 100 classes, with only 600 images per class. Each dataset comprises 50000 images in the training set and 10000 in the test set.



MNIST consists of 70000 28x28 pixel grayscale images of handwritten digits (10 classes), with 60000 in the training set and 10000 in the test set.


Street View House Numbers (SVHN). Contains over 600,000 images of the digits 0-9 (10 classes). Harder than MNIST since the images come from natural scenes.

Face recognition

The general topic that covers both face identification and face verification.

The normal face recognition pipeline is:

  • Face detection - Identifying the area of the photo that corresponds to the face.
  • Face alignment - Often done by detecting facial landmarks like the nose, eyes and mouth.
  • Feature extraction and similarity calculation


In addition to the standard challenges in computer vision, face recognition also encounters the following problems:

  • Changes in facial hair.
  • Glasses, which may not always be worn.
  • People aging over time.


Datasets used for face recognition include:

  • LFW
  • YouTube-Faces
  • CASIA-Webface
  • CelebA

Face identification

Multiclass classification problem. Given an image of a face, determine the identity of the person.

Face verification

Binary classification problem. Given two images of faces, assess whether they are from the same person or not.

Commonly used architectures for solving this problem include Siamese and Triplet networks.
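
As a rough sketch of how such networks are used, assume a trained network has already mapped each face image to an embedding vector. Verification then reduces to thresholding the distance between two embeddings, and a triplet network is trained so that an anchor is closer to a positive (same person) than to a negative (different person) by some margin. The function names, the threshold and the margin below are illustrative assumptions, and Euclidean distance is only one common choice.

```python
import numpy as np


def embedding_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def same_person(embedding_a, embedding_b, threshold=1.0):
    """Binary decision: the two faces match if their embeddings are close enough."""
    return embedding_distance(embedding_a, embedding_b) < threshold


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    gap = embedding_distance(anchor, positive) - embedding_distance(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-d embeddings; real embeddings would have many more dimensions.
anchor = [0.0, 0.0]
positive = [0.3, 0.4]   # distance 0.5 from the anchor
negative = [3.0, 4.0]   # distance 5.0 from the anchor

match = same_person(anchor, positive)
mismatch = same_person(anchor, negative)
loss = triplet_loss(anchor, positive, negative)
```

The threshold would normally be tuned on a validation set of matching and non-matching pairs.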

Image segmentation

Partitions an image into meaningful parts with associated labels. May also be referred to as per-pixel classification.

Instance segmentation

Unlike semantic segmentation, different instances of the same object type have to be labelled as separate objects (e.g. person 1, person 2). Harder than semantic segmentation.

Semantic segmentation

Unlike instance segmentation, in semantic segmentation it is only necessary to predict what class each pixel belongs to, not separate out different instances of the same class.
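
The difference between the two outputs can be illustrated with a toy example. The arrays below are made up for illustration: `scores` stands in for the per-pixel class scores a segmentation network might produce, and `instance_map` for a hypothetical instance segmentation of the same scene, which contains two people.

```python
import numpy as np

CLASS_NAMES = ["background", "person"]

# Hypothetical per-pixel class scores with shape (num_classes, height, width).
scores = np.zeros((2, 4, 6))
scores[0] = 1.0               # weak background score everywhere
scores[1, 1:3, 0:2] = 5.0     # first person
scores[1, 1:3, 4:6] = 5.0     # second person

# Semantic segmentation: one class index per pixel.
semantic_map = scores.argmax(axis=0)

# Instance segmentation: each object gets its own id (0 is background).
instance_map = np.zeros((4, 6), dtype=int)
instance_map[1:3, 0:2] = 1    # person 1
instance_map[1:3, 4:6] = 2    # person 2

# Collapsing the instance ids recovers the semantic map, but not vice versa.
semantic_from_instances = (instance_map > 0).astype(int)
num_people = len(np.unique(instance_map)) - 1
```

Both people share class index 1 in `semantic_map`, so the number of people cannot be read off from it directly.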

Weakly-supervised segmentation

Learning to segment from only image-level labels. The labels will describe the classes that exist within the image but not what the class is for every pixel.

Results from weak supervision are generally poorer than those from full supervision, but the datasets tend to be much cheaper to acquire.

When the dataset is only weakly-supervised it can be very hard to correctly label highly-correlated objects that are usually only seen together, such as a train and rails.
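
To make the loss of information concrete, the sketch below collapses a fully labelled image to the image-level labels a weakly-supervised dataset would provide. The class list and the tiny label map are invented for this example.

```python
import numpy as np

CLASS_NAMES = ["background", "train", "rails"]


def image_level_labels(pixel_labels):
    """Collapse a per-pixel label map to the set of class names present."""
    return {CLASS_NAMES[index] for index in np.unique(pixel_labels)}


# Full supervision: a class index for every pixel.
pixel_labels = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
])

# Weak supervision: only the set of classes present in the image.
weak_labels = image_level_labels(pixel_labels)
```

If every training image that contains a train also contains rails, the weak labels alone give no signal about which pixels belong to which of the two classes.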

Image-to-image translation


Transforming an image from one domain into the corresponding image in another. Examples include:

  • Daytime to nighttime
  • Greyscale to colour
  • Streetmap to satellite view

Object detection

The task of finding objects of interest in a scene and determining what they are.

Object detection algorithms can generally be divided into two categories:

  • One-stage detectors
  • Two-stage detectors

Region of interest

See ‘region proposal’.

Region proposal

A region in an image (usually defined by a rectangle) identified as containing an object of interest with high probability, relative to the background.

Two-stage detector

A type of object detection algorithm.

The first stage proposes regions that may contain objects of interest. The second stage classifies these regions as either background or one of the classes.

There is often a significant class-imbalance problem since background regions greatly outnumber the other classes.

Contrast with one-stage detectors.
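
The control flow of a two-stage detector can be sketched as follows. Both stages here are toy stand-ins written for this example: a real first stage would be learned and would propose only regions likely to contain objects, and a real second stage would be a trained classifier. Boxes are assumed to be (top, left, height, width) tuples.

```python
import numpy as np


def two_stage_detect(image, propose_regions, classify_region):
    """Stage one proposes boxes, stage two labels them; background is discarded."""
    detections = []
    for box in propose_regions(image):
        label = classify_region(image, box)
        if label != "background":
            detections.append((box, label))
    return detections


def propose_windows(image, size=4):
    """Toy first stage: every non-overlapping size x size window is a proposal."""
    height, width = image.shape
    return [(top, left, size, size)
            for top in range(0, height - size + 1, size)
            for left in range(0, width - size + 1, size)]


def classify_by_brightness(image, box):
    """Toy second stage: a bright patch counts as an object."""
    top, left, height, width = box
    patch = image[top:top + height, left:left + width]
    return "object" if patch.mean() > 0.5 else "background"


# An 8x8 greyscale image with one bright 4x4 object in the top-left corner.
image = np.zeros((8, 8))
image[0:4, 0:4] = 1.0

proposals = propose_windows(image)
detections = two_stage_detect(image, propose_windows, classify_by_brightness)
```

Even in this tiny example three of the four proposals are background, which hints at the class imbalance mentioned above.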

Saliency map

A heatmap over an image which shows each pixel’s importance for the predicted classification. This makes saliency maps very useful for explaining the decisions of image classifiers.
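
There are several ways to compute such a heatmap. One simple option, sketched below, is occlusion: replace each pixel in turn with a baseline value and record how much the classifier's score drops. The scoring function here is a toy stand-in for a real classifier, and the names and baseline value are assumptions made for this example. Gradient-based methods are a common alternative.

```python
import numpy as np


def occlusion_saliency(image, score_fn, baseline=0.0):
    """Score drop when each pixel is individually replaced with `baseline`."""
    reference = score_fn(image)
    saliency = np.zeros(image.shape, dtype=float)
    for index in np.ndindex(image.shape):
        occluded = image.copy()
        occluded[index] = baseline
        saliency[index] = reference - score_fn(occluded)
    return saliency


def toy_score(image):
    """Toy classifier score that depends only on the centre 2x2 patch."""
    return float(image[1:3, 1:3].sum())


image = np.ones((4, 4))
heatmap = occlusion_saliency(image, toy_score)
```

Here only the four centre pixels receive a non-zero saliency value, matching the region the toy score actually depends on. Occluding one pixel at a time requires one forward pass per pixel, so practical implementations usually occlude patches instead.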