Natural language processing (NLP)¶
Parallel corpora for translation, aligned at the sentence level.
Notable results on WMT’14 English-German in BLEU (higher is better):
- 28.4 - Attention is All You Need, Vaswani et al. (2017)
- 24.7 - Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Wu et al. (2016)
- 23.8 - Neural Machine Translation in Linear Time, Kalchbrenner et al. (2016)
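BLEU scores a hypothesis translation by its n-gram overlap with reference translations, penalising hypotheses that are too short. A minimal from-scratch sketch (single reference, up to 4-grams; real evaluations use tools like sacreBLEU with standardised tokenisation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=4):
    # Modified n-gram precision: hypothesis counts are clipped by reference counts.
    precisions = []
    for n in range(1, max_n + 1):
        hyp = Counter(ngrams(hypothesis, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        precisions.append(overlap / max(sum(hyp.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hypothesis) >= len(reference) else math.exp(1 - len(reference) / len(hypothesis))
    return bp * geo_mean

ref = "the cat sat on the mat".split()
print(round(100 * bleu(ref, ref), 1))  # 100.0 for an exact match
```

Scores are conventionally reported scaled to 0-100, as in the list above.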
- bAbI - Dataset for question answering
- GLUE - Stands for General Language Understanding Evaluation. Assesses performance across nine tasks including sentiment analysis, question answering and entailment; more details and a leaderboard are available on the GLUE website.
- IMDB - Dataset of movie reviews, used for sentiment classification. Each review is labelled as either positive or negative.
- RACE - Reading comprehension dataset built from English exams; a public leaderboard is available.
- RACE: Large-scale ReAding Comprehension Dataset From Examinations, Lai et al. (2017)
- SQuAD - Stands for Stanford Question Answering Dataset. Reading comprehension dataset in which the answer to each question is a span of the corresponding Wikipedia passage.
- SuperGLUE - Harder successor to the GLUE benchmark. Assesses performance across eight tasks; more details and a leaderboard are available on the SuperGLUE website.
- TIMIT - Corpus of read American English speech, commonly used for speech recognition.
A list of some of the most frequently used unlabelled datasets and text corpora, suitable for tasks like language modelling and learning word embeddings.
PTB¶
Stands for ‘Penn Treebank’. Notable results, given in word-level perplexity (lower is better):
- 35.8 - Language Models are Unsupervised Multitask Learners, Radford et al. (2019)
- 47.7 - Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, Yang et al. (2017)
- 55.8 - Efficient Neural Architecture Search via Parameter Sharing, Pham et al. (2018)
- 62.4 - Neural Architecture Search with Reinforcement Learning, Zoph and Le (2016)
- 68.7 - Recurrent Neural Network Regularization, Zaremba et al. (2014)
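Word-level perplexity is the exponential of the average negative log-likelihood the model assigns to each word in the test set. A minimal sketch:

```python
import math

def perplexity(log_probs):
    """log_probs: the model's natural-log probability of each word in the test set."""
    mean_nll = -sum(log_probs) / len(log_probs)
    return math.exp(mean_nll)

# Sanity check: a uniform model over a 10,000-word vocabulary
# assigns probability 1/10,000 to every word, giving perplexity 10,000.
uniform = [math.log(1 / 10_000)] * 50
print(round(perplexity(uniform)))  # 10000
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words.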
Entailment¶
The task of deciding whether one piece of text (the hypothesis) follows logically from another (the premise). For example, ‘a man is playing a guitar’ entails ‘a man is playing an instrument’.
fastText¶
A simple baseline method for text classification.
The architecture is as follows:
- The inputs are n-gram features from the original input sequence. Using n-grams preserves some of the word-order information without the large increase in computational cost characteristic of recurrent networks.
- An embedding layer.
- A mean-pooling layer averages the embeddings over the length of the inputs.
- A softmax layer gives the class probabilities.
The model is trained with the cross-entropy loss as normal.
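The forward pass described above can be sketched as follows. This is an illustrative NumPy implementation with made-up vocabulary, sizes and untrained random weights, not the original implementation; training via cross-entropy is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bigram vocabulary (illustrative only).
vocab = {"good_movie": 0, "bad_movie": 1, "movie_was": 2, "was_good": 3}
n_classes, dim = 2, 8

E = rng.normal(size=(len(vocab), dim))   # embedding layer
W = rng.normal(size=(dim, n_classes))    # output layer

def forward(ngram_ids):
    pooled = E[ngram_ids].mean(axis=0)   # mean-pool embeddings over the sequence
    logits = pooled @ W
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()               # class probabilities

probs = forward([vocab["movie_was"], vocab["was_good"]])
print(probs.sum())  # probabilities sum to 1
```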
Latent Dirichlet Allocation (LDA)¶
Topic modelling algorithm.
Each item/document is a finite mixture over the set of topics. Each topic is a distribution over words. The parameters can be estimated with expectation maximisation. Unlike a simple clustering approach, LDA allows a document to be associated with multiple topics.
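A small sketch using scikit-learn's implementation (assumed available; the four documents and the choice of two topics are illustrative):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock market fell as investors sold shares",
    "the team won the match with a late goal",
    "shares rose after the market rallied",
    "the keeper saved the match for the team",
]

# LDA operates on word counts, not raw text.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row is a document's distribution over the two topics (rows sum to 1),
# reflecting that a document can mix multiple topics.
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2)
```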
Morpheme¶
The smallest unit of a word that conveys meaning. For example, ‘ing’, ‘un’, ‘dog’ or ‘cat’.
Named Entity Recognition (NER)¶
Labelling words and word sequences with the type of entity they represent, such as person, place or time.
Not to be confused with entity linking, which identifies the specific entity (e.g. the city of London) rather than only the type (place).
Part of speech tagging (POS tagging)¶
Labelling words with their part of speech: ADV, ADJ, PREP etc. The correct label depends on context - ‘bananas’ can be a noun or an adjective.
Phoneme¶
A unit of sound in a language, shorter than a syllable. English has 44 phonemes. For example, the long ‘a’ sound in ‘train’ and ‘sleigh’ and the ‘t’ sound in ‘bottle’ and ‘sit’.
Polysemy¶
The existence of multiple meanings for a word. For example, ‘bank’ can mean a financial institution or the side of a river.
Stemming¶
Reducing a word to its basic form. This often involves removing suffixes like ‘ed’, ‘ing’ or ‘s’.
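A naive suffix-stripping stemmer along these lines can be sketched as below. This is illustrative only; practical stemmers such as the Porter algorithm handle many more cases and exceptions.

```python
def stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["jumping", "jumped", "jumps", "jump"]])
# ['jump', 'jump', 'jump', 'jump']
```

Note the minimum-stem-length check stops short words like ‘sing’ or ‘red’ from being mangled into ‘s’ or ‘r’.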