Multimodal learning

Multimodal learning covers tasks that require a model to make sense of several types of input, such as images and text, at the same time.

Datasets

CLEVR

CLEVR is a synthetic dataset for visual question answering. Given a scene containing objects of different shapes, colours and sizes, the algorithm must answer questions such as “What color is the cube to the right of the yellow sphere?” and “How many cylinders are in front of the small thing and on the left side of the green object?”.
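To make the kind of reasoning these questions demand concrete, here is a minimal sketch that answers the two example questions over a hand-written symbolic scene. Everything in it is an illustrative assumption rather than the dataset's actual format: the `Obj` record, the coordinate convention (larger `x` is further right, larger `y` is closer to the camera), the helper names, and the toy scene itself. A real system would also have to recover the objects from pixels first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    """Hypothetical object record; not the real CLEVR annotation schema."""
    shape: str
    color: str
    size: str
    x: float  # assumed convention: larger x is further right
    y: float  # assumed convention: larger y is closer to the camera


def select(objs, **attrs):
    """Keep the objects whose attributes match all given values."""
    return [o for o in objs if all(getattr(o, k) == v for k, v in attrs.items())]


def unique(objs):
    """Return the single object in objs, or fail if the reference is ambiguous."""
    if len(objs) != 1:
        raise ValueError(f"expected exactly one object, found {len(objs)}")
    return objs[0]


def right_of(objs, ref):
    return [o for o in objs if o is not ref and o.x > ref.x]


def left_of(objs, ref):
    return [o for o in objs if o is not ref and o.x < ref.x]


def in_front_of(objs, ref):
    return [o for o in objs if o is not ref and o.y > ref.y]


def color_of_cube_right_of_yellow_sphere(scene):
    """'What color is the cube to the right of the yellow sphere?'"""
    sphere = unique(select(scene, shape="sphere", color="yellow"))
    return unique(select(right_of(scene, sphere), shape="cube")).color


def count_cylinders_front_of_small_left_of_green(scene):
    """'How many cylinders are in front of the small thing and on the
    left side of the green object?'"""
    small = unique(select(scene, size="small"))
    green = unique(select(scene, color="green"))
    candidates = select(scene, shape="cylinder")
    # Apply both spatial constraints in sequence, then count what remains.
    return len(left_of(in_front_of(candidates, small), green))


# Toy scene, invented for illustration only.
scene = [
    Obj("sphere", "yellow", "large", x=0.0, y=0.0),
    Obj("cube", "red", "large", x=2.0, y=1.0),
    Obj("sphere", "green", "large", x=3.0, y=0.0),
    Obj("cube", "purple", "small", x=-2.0, y=-1.0),
    Obj("cylinder", "blue", "large", x=1.0, y=0.5),
    Obj("cylinder", "cyan", "large", x=4.0, y=2.0),
    Obj("cylinder", "brown", "large", x=-1.0, y=-2.0),
]

print(color_of_cube_right_of_yellow_sphere(scene))
print(count_cylinders_front_of_small_left_of_green(scene))
```

Each question breaks down into a short chain of steps: resolve a unique reference object, apply a spatial relation, filter by attribute, then query or count. This compositional structure is what makes such questions hard to answer from surface correlations alone.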

Notable results:

Image Captioning

TODO

Visual Question Answering (VQA)

Tasks in which the algorithm must answer questions about some piece of visual information, such as an image.

Datasets: