Multimodal learning

Multimodal learning concerns tasks that require a model to make sense of multiple types of input, such as images and text, at the same time.

Datasets

CLEVR

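A synthetic dataset for visual question answering. Each example pairs a rendered scene of simple 3D shapes with a question that tests compositional reasoning, such as counting, comparing attributes, or judging spatial relations.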
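
As a rough illustration of what a visual question answering example looks like as data, the sketch below groups question/answer pairs by image. It assumes a questions file shaped like the CLEVR release, with a top-level "questions" list whose entries carry "image_filename", "question", and "answer" fields; the sample records are made up for illustration, and group_by_image is a hypothetical helper, not part of any CLEVR tooling.

```python
import json
from collections import defaultdict

# Illustrative sample only: the field names are assumed to follow the CLEVR
# questions JSON, and the questions and answers below are invented.
SAMPLE_JSON = """
{
  "questions": [
    {"image_filename": "CLEVR_train_000000.png",
     "question": "How many cubes are there?",
     "answer": "2"},
    {"image_filename": "CLEVR_train_000000.png",
     "question": "Is the red sphere left of the large cylinder?",
     "answer": "yes"},
    {"image_filename": "CLEVR_train_000001.png",
     "question": "What color is the small cylinder?",
     "answer": "blue"}
  ]
}
"""


def group_by_image(questions):
    """Map each image filename to its list of (question, answer) pairs."""
    grouped = defaultdict(list)
    for q in questions:
        grouped[q["image_filename"]].append((q["question"], q["answer"]))
    return dict(grouped)


data = json.loads(SAMPLE_JSON)
pairs_by_image = group_by_image(data["questions"])

# Print how many questions refer to each image.
for image, pairs in pairs_by_image.items():
    print(image, len(pairs))
```

Grouping by image reflects the multimodal setup: one image is encoded once and then combined with each of its questions. With a real questions file, the inline sample would be replaced by json.load on the opened file.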

Image Captioning

TODO

Visual Question Answering (VQA)

TODO