Computer vision¶

Tasks that take an image or video as their input. These include:

Image captioning
Image classification
Image segmentation
Image-to-image translation
Object detection

Challenges¶

Parts of the object may be obscured.
Photos can be taken at different angles.
Lighting conditions can differ. Both the direction and amount of light may vary, as well as the number of light sources.
Objects belonging to one class can come in a variety of forms.

Data augmentation¶

The images in the training set are randomly altered in order to improve the generalization of the network.

Cubuk et al. (2018), who evaluate a number of different data augmentation techniques, use the following transforms:

Blur - The entire image is blurred by a random amount.
Brightness
Color balance
Contrast
Cropping - The image is randomly cropped and the result is fed into the network instead.
Cutout - Mask a random square region of the image, replacing it with grey. It was used to obtain state-of-the-art results on the CIFAR-10, CIFAR-100 and SVHN datasets. Proposed in Improved Regularization of Convolutional Neural Networks with Cutout, DeVries and Taylor (2017).
Equalize - Perform histogram equalization on the image. This adjusts the contrast.
Flipping - The image is flipped with probability 0.5 and left as it is otherwise. Normally only horizontal flipping is used, but vertical flipping can be used where it makes sense, for example with satellite imagery.
Posterize - Decrease the number of bits per pixel.
Rotation
Sample pairing - Combine two random images into a new synthetic image. See Data Augmentation by Pairing Samples for Images Classification, Inoue (2018).
Shearing
Solarize - Pixels above a random value are inverted.
Translation
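
A few of the transforms above can be sketched with NumPy. This is a minimal illustration rather than the implementation used by Cubuk et al.: the function names, the fixed grey value of 128, and the choice to keep the cutout square entirely inside the image are all assumptions made for brevity. Images are assumed to be 8-bit arrays of shape (height, width, channels).

```python
import numpy as np

def random_flip(img, rng, p=0.5):
    """Flip horizontally with probability p, otherwise return the image unchanged."""
    return img[:, ::-1] if rng.random() < p else img

def random_crop(img, rng, size):
    """Return a randomly positioned size x size crop of the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def cutout(img, rng, size, fill=128):
    """Replace a random size x size square with grey (assumed value: 128)."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = img.copy()
    out[top:top + size, left:left + size] = fill
    return out

def solarize(img, threshold):
    """Invert every pixel whose value is above the threshold."""
    return np.where(img > threshold, 255 - img, img).astype(img.dtype)

def posterize(img, bits):
    """Keep only the `bits` most significant bits of each 8-bit pixel."""
    shift = 8 - bits
    return (img >> shift) << shift

# Usage on a random stand-in for a CIFAR-sized image.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = cutout(random_crop(random_flip(image, rng), rng, size=28), rng, size=8)
```

In training, the transforms would typically be re-sampled every time an image is drawn, so the network rarely sees exactly the same input twice.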

Datasets¶

CIFAR-10/100¶

CIFAR-10 is a dataset of 60000 32x32 colour images in 10 classes, with 6000 images per class. CIFAR-100 has 100 classes, with only 600 images per class. Each dataset is split into 50000 training images and 10000 test images.

Notable results - CIFAR-10

Notable results - CIFAR-100

https://keras.io/datasets/#cifar10-small-image-classification

COCO¶

Common Objects in COntext. A dataset for image recognition, segmentation and captioning.

Detection task - Notable results (mAP):

ImageNet (ILSVRC)¶

ILSVRC stands for ImageNet Large Scale Visual Recognition Challenge. A popular image classification task in which the algorithm must use a dataset of ~1.4m images to learn to classify images into 1000 classes.

Notable results (top-1 accuracy):

NB: Xie et al. (2019) also use unlabeled data.

MNIST¶

70000 28x28 pixel grayscale images of handwritten digits (10 classes), 60000 in the training set and 10000 in the test set.

http://yann.lecun.com/exdb/mnist/

https://keras.io/datasets/#mnist-database-of-handwritten-digits

Pascal VOC¶

PASCAL Visual Object Classes Homepage

SVHN¶

Street View House Numbers. Contains over 600,000 images of the digits 0-9 (10 classes). Harder than MNIST since the images come from natural scenes.

http://ufldl.stanford.edu/housenumbers/

Face recognition¶

The general topic that covers both face identification and face verification.

The normal face recognition pipeline is:

Face detection - Identifying the area of the photo that corresponds to the face.
Face alignment - Often done by detecting facial landmarks like the nose, eyes and mouth.
Feature extraction and similarity calculation

Challenges¶

In addition to the standard challenges in computer vision, face recognition also encounters the following problems:

Changes in facial hair.
Glasses, which may not always be worn.
People aging over time.

Datasets¶

LFW
YouTube-Faces
CASIA-Webface
CelebA

Face identification¶

Multiclass classification problem. Given an image of a face, determine the identity of the person.

Face verification¶

Binary classification problem. Given two images of faces, assess whether they are from the same person or not.

Commonly used architectures for solving this problem include Siamese and Triplet networks.
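
As a rough sketch of how verification works once a network has mapped each face to an embedding vector: two faces are declared the same person if their embeddings are similar enough, and a triplet network is trained so that this holds. The embeddings below are hand-written stand-ins for network outputs, and the threshold (0.8) and margin (0.2) are hypothetical values that would normally be tuned on validation data. The loss shown is one common form of the triplet loss, using squared Euclidean distance.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(emb_a, emb_b, threshold=0.8):
    """Binary verification decision (threshold is an assumed value)."""
    return cosine_similarity(emb_a, emb_b) >= threshold

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(d_pos - d_neg + margin, 0.0))

# Stand-in embeddings: anchor and positive are the same person, negative is not.
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([0.0, 1.0])

match = same_person(anchor, positive)
mismatch = same_person(anchor, negative)
loss = triplet_loss(anchor, positive, negative)
```

In a Siamese setup the same network, with shared weights, produces both embeddings being compared.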

Image segmentation¶

Partitions an image into meaningful regions with associated labels. May also be referred to as per-pixel classification.

Further reading

U-Net: Convolutional Networks for Biomedical Image Segmentation, Ronneberger et al. (2015)

Instance segmentation¶

Unlike semantic segmentation, different instances of the same object type have to be labelled as separate objects (e.g. person 1, person 2). Harder than semantic segmentation.

Semantic segmentation¶

Unlike instance segmentation, in semantic segmentation it is only necessary to predict what class each pixel belongs to, not separate out different instances of the same class.
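
The difference between the two output formats can be illustrated on a toy example. A semantic mask assigns a class index to every pixel, here by taking the argmax over per-class scores. An instance labelling additionally gives each object its own id. The connected-component pass below is only a stand-in used to show what instance labels look like: it assumes instances never touch, which real instance segmentation models do not rely on. The function names are hypothetical.

```python
import numpy as np

def semantic_mask(scores):
    """scores has shape (num_classes, H, W); return the best class per pixel."""
    return scores.argmax(axis=0)

def instance_ids(mask, cls):
    """Give each 4-connected blob of class `cls` its own id (1, 2, ...); 0 elsewhere."""
    ids = np.zeros(mask.shape, dtype=int)
    next_id = 0
    for start in zip(*np.nonzero(mask == cls)):
        if ids[start]:
            continue  # already part of an earlier blob
        next_id += 1
        stack = [start]
        while stack:  # flood fill from the starting pixel
            r, c = stack.pop()
            if not (0 <= r < mask.shape[0] and 0 <= c < mask.shape[1]):
                continue
            if mask[r, c] != cls or ids[r, c]:
                continue
            ids[r, c] = next_id
            stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return ids

# Two separate objects of class 1 on a background of class 0.
truth = np.array([[1, 1, 0, 0, 1],
                  [1, 1, 0, 0, 1],
                  [0, 0, 0, 0, 0]])
scores = np.stack([1 - truth, truth]).astype(float)  # fake per-class scores

semantic = semantic_mask(scores)       # both objects share the label 1
instances = instance_ids(semantic, 1)  # left object gets id 1, right object id 2
```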

Weakly-supervised segmentation¶

Learning to segment from only image-level labels. The labels will describe the classes that exist within the image but not what the class is for every pixel.

The results from weak supervision are generally poorer than those from full per-pixel supervision, but the datasets tend to be much cheaper to acquire.

When the dataset is only weakly-supervised it can be very hard to correctly label highly-correlated objects that are usually only seen together, such as a train and rails.

Image-to-image translation¶

Examples:

Daytime to nighttime
Greyscale to colour
Streetmap to satellite view

Example papers

Image-to-Image Translation with Conditional Adversarial Networks, Isola et al. (2016)

Object detection¶

The task of finding objects of interest in a scene and determining what they are.

Object detection algorithms can generally be divided into two categories:

One-stage detectors
Two-stage detectors

One-stage detector¶

A class of object detection algorithm. Contrast with two-stage detectors.

Example papers
Focal Loss for Dense Object Detection, Lin et al. (2017)
YOLO9000: Better, Faster, Stronger, Redmon and Farhadi (2016)
You Only Look Once: Unified, Real-Time Object Detection, Redmon et al. (2015)
SSD: Single Shot MultiBox Detector, Liu et al. (2015)

Region of interest¶

See ‘region proposal’.

Region proposal¶

A region in an image (usually defined by a rectangle) identified as containing an object of interest with high probability, relative to the background.

Two-stage detector¶

A type of object detection algorithm.

The first stage proposes regions that may contain objects of interest. The second stage classifies these regions as either background or one of the classes.

There is often a significant class-imbalance problem since background regions greatly outnumber the other classes.

Contrast with one-stage detectors.
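
The two stages, and the class imbalance they produce, can be sketched with a toy example. Both stages here are deliberately crude stand-ins: the first proposes every sliding window (real proposal methods score and filter candidates), and the second labels a region by its mean brightness instead of using a learned classifier. The function names, window size, stride and threshold are all assumptions for illustration.

```python
import numpy as np

def propose_regions(image, size, stride):
    """Stage 1 stand-in: every sliding window is a proposal (x, y, w, h)."""
    h, w = image.shape
    return [(x, y, size, size)
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]

def classify_region(image, box, threshold=0.5):
    """Stage 2 stand-in: label a region by its mean brightness."""
    x, y, w, h = box
    region = image[y:y + h, x:x + w]
    return "object" if region.mean() > threshold else "background"

# A dark 32x32 scene containing a single bright 8x8 object.
image = np.zeros((32, 32))
image[8:16, 8:16] = 1.0

proposals = propose_regions(image, size=8, stride=4)
labels = [classify_region(image, box) for box in proposals]
num_object = labels.count("object")
num_background = labels.count("background")
```

Even in this tiny scene, 48 of the 49 proposals are background, which is the imbalance the second stage has to cope with.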

Example papers for the first stage
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Ren et al. (2015)
Edge Boxes: Locating Object Proposals from Edges, Zitnick and Dollár (2014)
Selective Search for Object Recognition, Uijlings et al. (2012)

Example papers for the second stage
Mask R-CNN, He et al. (2017)
Fast R-CNN, Girshick (2015)

Saliency map¶

A heatmap over an image which shows each pixel’s importance for the predicted classification. This makes saliency maps very useful for making image classifiers explainable.
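
One simple way to compute such a heatmap is by occlusion: replace each pixel in turn with a baseline value and record how much the classifier's score changes. This is only one of several approaches (gradient-based methods are another), and the sketch below uses a toy linear scorer in place of a real classifier. The names `occlusion_saliency` and `score_fn` and the baseline of 0 are assumptions for illustration.

```python
import numpy as np

def occlusion_saliency(image, score_fn, baseline=0.0):
    """Importance of a pixel = change in score when that pixel is set to `baseline`."""
    base_score = score_fn(image)
    saliency = np.zeros(image.shape, dtype=float)
    for idx in np.ndindex(image.shape):
        occluded = image.copy()
        occluded[idx] = baseline
        saliency[idx] = abs(base_score - score_fn(occluded))
    return saliency

# Toy "classifier": the score for the predicted class is a weighted sum of pixels.
weights = np.array([[0.0, 2.0],
                    [-1.0, 0.0]])

def score_fn(img):
    return float((weights * img).sum())

image = np.ones((2, 2))
heatmap = occlusion_saliency(image, score_fn)
```

For this linear scorer the heatmap is simply the absolute value of each weight times its pixel, so the top-right pixel is the most important. Occluding one pixel at a time needs one forward pass per pixel, so practical versions usually occlude patches instead.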