Natural language processing (NLP)

Datasets

Labelled

Other datasets

  • bAbI - Dataset for question answering
  • GLUE - Stands for General Language Understanding Evaluation. Assesses performance across 11 different tasks including sentiment analysis, question answering and entailment, more details of which can be found on their website. Leaderboard here.
  • IMDB - Dataset of movie reviews, used for sentiment classification. Each review is labelled as either positive or negative.
  • RACE - Reading comprehension dataset. Leaderboard here.
  • RACE: Large-scale ReAding Comprehension Dataset From Examinations, Lai et al. (2017)
  • SQuAD - Stanford Question Answering Dataset
  • SuperGLUE - Harder successor to the GLUE dataset. Assesses performance across 10 different tasks (more details here). Leaderboard here.
  • TIMIT - Speech corpus

Unlabelled

A list of some of the most frequently used unlabelled datasets and text corpora, suitable for tasks like language modelling and learning word embeddings.

Entailment

The task of deciding whether one piece of text (the hypothesis) follows logically from another (the premise). For example, ‘A man is sleeping’ follows from ‘A man is asleep on a bench’.

Entity linking

The task of identifying the specific entity that a word or phrase refers to. Not to be confused with Named Entity Recognition.

FastText

A simple baseline method for text classification.

The architecture is as follows:

  1. The inputs are n-gram features from the original input sequence. Using n-grams preserves some word-order information without the large increase in computational complexity characteristic of recurrent networks.
  2. An embedding layer.
  3. A mean-pooling layer averages the embeddings over the length of the inputs.
  4. A softmax layer gives the class probabilities.

The model is trained with the cross-entropy loss as normal.
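
A minimal sketch of this architecture in PyTorch (the class name, the toy dimensions and the assumption that word and n-gram ids have already been hashed into a fixed vocabulary are illustrative, not the official fastText implementation):

    import torch
    import torch.nn as nn

    class FastTextClassifier(nn.Module):
        """Bag-of-n-grams text classifier: embed, mean-pool, linear, softmax."""

        def __init__(self, vocab_size, embed_dim, num_classes):
            super().__init__()
            # One embedding per word/n-gram id; mode="mean" performs the mean-pooling step.
            self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
            self.fc = nn.Linear(embed_dim, num_classes)

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) integer ids of words and n-grams.
            pooled = self.embedding(token_ids)   # (batch, embed_dim)
            return self.fc(pooled)               # logits; softmax is applied inside the loss

    model = FastTextClassifier(vocab_size=100_000, embed_dim=100, num_classes=2)
    logits = model(torch.randint(0, 100_000, (32, 50)))              # a random toy batch
    loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (32,)))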

Latent Dirichlet Allocation (LDA)

Topic modelling algorithm.

Each item/document is a finite mixture over the set of topics. Each topic is a distribution over words. The parameters can be estimated with expectation maximisation. Unlike a simple clustering approach, LDA allows a document to be associated with multiple topics.

Latent Dirichlet Allocation, Blei et al. (2003)
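
As a rough illustration, scikit-learn’s LatentDirichletAllocation can be fit on a bag-of-words matrix; the toy corpus and the choice of two topics below are made up for the example:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the cat sat on the mat",
        "dogs and cats make good pets",
        "the stock market fell sharply today",
        "investors worry about market volatility",
    ]

    # LDA operates on word counts (bag of words), not raw text.
    counts = CountVectorizer(stop_words="english").fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)  # (n_docs, n_topics): each document's topic mixture
    topic_words = lda.components_           # (n_topics, n_words): each topic's word weights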

Morpheme

A word or part of a word that carries meaning and cannot be divided into smaller meaningful units. For example, ‘ing’, ‘un’, ‘dog’ or ‘cat’.

Named Entity Recognition (NER)

Labelling words and word sequences with the type of entity they represent, such as person, place or time.

Not to be confused with entity linking, which finds the specific entity (e.g. the city of London) rather than only the type (place).
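
A quick illustration with spaCy, assuming the small English model en_core_web_sm has been installed (the example sentence is made up):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Ada Lovelace was born in London in 1815.")

    for ent in doc.ents:
        print(ent.text, ent.label_)  # entity types such as PERSON, GPE (place), DATE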

Part of speech tagging (POS tagging)

Labelling each word with its part of speech, such as ADV, ADJ or PREP. The correct label depends on context: ‘bananas’ can be a noun or an adjective.
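
A quick illustration with NLTK’s off-the-shelf tagger, which uses the Penn Treebank tag set rather than the ADV/ADJ/PREP labels above (the example sentences are made up, and the download calls in the comments depend on your local setup):

    import nltk

    # The tokeniser and tagger data may need downloading first, e.g.
    # nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
    for sentence in ["I ate two bananas", "That idea is bananas"]:
        tokens = nltk.word_tokenize(sentence)
        print(nltk.pos_tag(tokens))  # list of (token, Penn Treebank tag) pairs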

Phoneme

A unit of sound in a language, shorter than a syllable. English has 44 phonemes. For example, the long ‘a’ sound in ‘train’ and ‘sleigh’ and the ‘t’ sound in ‘bottle’ and ‘sit’.

Polysemy

The existence of multiple meanings for a word.

Stemming

Reducing a word to its basic form. This often involves removing suffixes like ‘ed’, ‘ing’ or ‘s’.
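
For example, with NLTK’s Porter stemmer (the word list is illustrative):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["jumped", "jumping", "jumps", "studies"]:
        print(word, "->", stemmer.stem(word))
    # 'jumped', 'jumping' and 'jumps' all reduce to 'jump'; note the stem
    # need not be a dictionary word ('studies' becomes 'studi').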