Multimodal learning

Multimodal learning concerns tasks that require making sense of multiple types of input, such as images and text, at the same time.

Datasets

CLEVR

A synthetic dataset for visual question answering. Given a scene of objects of different shapes, colours and sizes, the algorithm must answer questions such as “What color is the cube to the right of the yellow sphere?” and “How many cylinders are in front of the small thing and on the left side of the green object?”.
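To make the compositional nature of such questions concrete, here is a minimal sketch that answers the first example question by chaining small filtering steps over a toy scene. The `SceneObject` class, the helper functions, and the single `x` coordinate are illustrative assumptions, not the dataset's actual annotation format, and the scene itself is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    # Hypothetical, simplified object description (not the real CLEVR schema).
    shape: str
    color: str
    size: str
    x: float  # assumed convention: larger x means further to the right


def filter_objects(objects, **attrs):
    """Keep only the objects whose attributes match all given values."""
    return [o for o in objects if all(getattr(o, k) == v for k, v in attrs.items())]


def right_of(objects, anchor):
    """Keep only the objects located to the right of the anchor object."""
    return [o for o in objects if o.x > anchor.x]


def unique(objects):
    """Return the single matching object, or fail if the reference is ambiguous."""
    if len(objects) != 1:
        raise ValueError(f"expected exactly one object, found {len(objects)}")
    return objects[0]


def color_of_cube_right_of_yellow_sphere(scene):
    # "What color is the cube to the right of the yellow sphere?"
    sphere = unique(filter_objects(scene, shape="sphere", color="yellow"))
    cube = unique(filter_objects(right_of(scene, sphere), shape="cube"))
    return cube.color


# A made-up scene for illustration only.
scene = [
    SceneObject(shape="sphere", color="yellow", size="large", x=-1.0),
    SceneObject(shape="cube", color="red", size="small", x=0.5),
    SceneObject(shape="cylinder", color="green", size="small", x=-2.0),
]

answer = color_of_cube_right_of_yellow_sphere(scene)
print(answer)
```

A real system does not get the symbolic scene for free: it has to recover the objects and their relations from pixels, which is what makes the task a test of visual reasoning.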

Notable results:

Introduced in

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, Johnson et al. (2016)

Image Captioning

TODO

Visual Question Answering (VQA)

Tasks in which the algorithm must answer questions about a piece of visual input, such as an image.
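As a rough sketch of the task interface, a model can be viewed as a function from an (image, question) pair to an answer string, scored against reference answers. Everything below is a hypothetical illustration: `VQAExample`, `always_yes`, and the toy examples are made up, and plain exact-match accuracy is an assumed metric (individual benchmarks may define their own scoring).

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class VQAExample:
    image: Any      # placeholder for pixel data; unused by the toy baseline below
    question: str
    answer: str     # reference answer


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    model: Callable[[Any, str], str], examples: Sequence[VQAExample]
) -> float:
    """Fraction of examples where the prediction equals the reference answer."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        normalize(model(ex.image, ex.question)) == normalize(ex.answer)
        for ex in examples
    )
    return correct / len(examples)


def always_yes(image: Any, question: str) -> str:
    """Trivial baseline that ignores both the image and the question."""
    return "yes"


# Made-up examples for illustration only.
examples = [
    VQAExample(image=None, question="Is there a cube?", answer="yes"),
    VQAExample(image=None, question="What color is the sphere?", answer="yellow"),
    VQAExample(image=None, question="Is the cylinder small?", answer="Yes"),
    VQAExample(image=None, question="How many cubes are there?", answer="2"),
]

accuracy = exact_match_accuracy(always_yes, examples)
print(accuracy)
```

Baselines like `always_yes` are useful as a sanity check: a model that does not beat them is likely exploiting answer priors rather than looking at the image.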

Datasets:

CLEVR
VQA