Overview
ABSTRACT
This article introduces the notion of unsupervised statistical machine learning, then describes the techniques currently available to perform statistical learning from unlabeled data: partitioning (or clustering), dimensionality reduction, density estimation and finally generative models. It covers the oldest classical algorithms (principal component analysis, k-means) as well as the most recent techniques using deep learning (word representations, autoregressive models, auto-encoders, generative adversarial networks).
AUTHOR
Bruno SAUVALLE: Chief Mining Engineer - Centre de Robotique, MINES ParisTech, Paris, France
INTRODUCTION
The aim of this article is to present methods and techniques for unsupervised statistical learning, i.e. using data that has not been labeled beforehand.
The notion of unsupervised statistical learning may seem difficult to grasp when compared with that of supervised statistical learning, which simply consists of learning a function f such that y = f(x) from a very large number of example pairs (x_i, y_i), where x_i is the input data and y_i is the output result, or label.
However, obtaining a labeled database is difficult and costly, as human intervention is generally required to obtain the labels y_i corresponding to the available data x_i. The creation of the ImageNet database, which currently contains over 14 million images and is the source of the spectacular successes observed in image analysis in recent years, required many years and the intervention of several tens of thousands of "annotators" tasked with viewing images downloaded from the Internet and identifying the objects or animals present in them.
At the same time, the ever-decreasing cost of capturing, communicating, storing and processing data is naturally leading to the availability of much larger databases, whose exhaustive analysis by humans is clearly impossible.
In this context, unsupervised learning is currently being developed along two lines.
A first way of exploiting a data set statistically without human intervention is to try to learn the distribution of the data. For example, language models are programs, often based on neural networks, which seek to assign a probability, or likelihood value, to each sentence or group of sentences in a given language. Among other things, this makes it possible to improve speech recognition or translation software by avoiding sentences that would be considered too unlikely in the language and context in question, for example because they are grammatically incorrect. The data used to build these language models are text corpora freely available on the Internet, and therefore require no particular annotation effort.
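To make the idea of assigning a probability to a sentence concrete, here is a minimal sketch of a bigram language model with add-one smoothing. It is a deliberately simple count-based stand-in, not the neural network models mentioned above. The toy corpus and the function names (train_bigram_model, log_probability) are assumptions made for illustration only.

```python
import math
from collections import Counter


def train_bigram_model(corpus):
    """Count contexts and bigrams over whitespace-tokenized, unlabeled sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens[1:])       # tokens that can be predicted
        contexts.update(tokens[:-1])   # tokens that can precede another token
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, vocab


def log_probability(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    v = len(vocab)
    logp = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        logp += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + v))
    return logp


# Hypothetical toy corpus; a real model would be trained on a large text corpus.
corpus = ["the cat sleeps", "the dog sleeps", "the cat eats", "a dog eats"]
model = train_bigram_model(corpus)

grammatical = log_probability("the cat eats", model)
scrambled = log_probability("eats cat the", model)
print(grammatical, scrambled)
```

With this toy corpus, the word order seen during training receives a higher score than the scrambled one, which is the property a speech recognition or translation system can use to rank candidate sentences.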
A second way of exploiting a large dataset is to use it to build a representation of this type of data, optimized for one or more classes of use. If the aim is simply to visualize data in the form of vectors comprising a large number of coordinates, a reduction in dimensionality to two or three dimensions would seem to be the obvious choice. If you are...
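The reduction to two dimensions mentioned above can be sketched with principal component analysis, one of the classical algorithms named in the abstract. The following is a minimal illustration, assuming a small synthetic data set and using power iteration with deflation in place of a full eigendecomposition. All function names and the data are hypothetical; a numerical library would normally be used in practice.

```python
import math
import random


def center(data):
    """Subtract the per-coordinate mean from every point."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    d = len(cov)
    rng = random.Random(0)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return eigenvalue, v


def pca(data, k=2):
    """Project the data onto its k directions of largest variance."""
    centered = center(data)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        eigenvalue, v = top_component(cov)
        components.append(v)
        # Deflation: remove the variance already explained by this direction.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(r[j] * c[j] for j in range(d)) for c in components]
                 for r in centered]
    return projected, components


# Synthetic 5-dimensional points that vary mostly along two hidden factors.
rng = random.Random(42)
data = []
for _ in range(50):
    a, b = rng.gauss(0, 3), rng.gauss(0, 1)
    data.append([a + rng.gauss(0, 0.1), a + rng.gauss(0, 0.1),
                 b + rng.gauss(0, 0.1), b + rng.gauss(0, 0.1),
                 rng.gauss(0, 0.1)])

projected, components = pca(data, k=2)
print(projected[:3])
```

Each 5-dimensional point is replaced by two coordinates that can be plotted directly, and no label is used at any step.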
KEYWORDS
clustering | dimensionality reduction | generative model
Unsupervised statistical learning