This lecture introduces clustering with the k-means algorithm. The primary reference is ISLR sections 10.1, 10.3.1, and 10.3.3.

Borrowing from Wikipedia,

Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from “unlabeled” data.

This is in contrast with supervised learning:

Supervised learning is the machine learning task of inferring a function from labeled training data.

Regression and classification are two examples of supervised learning; there is prespecified \(X\) and \(Y\) data and the goal is to understand the relationship between \(X\) and \(Y\). In linear regression \(Y\) is numerical (e.g. stock price, life expectancy). In classification \(Y\) is categorical (e.g. yes/no, walking/running).

In unsupervised learning there is no prespecified, special \(Y\) variable; there are \(X\) variables and the goal is to find some kind of “meaningful pattern”. The phrase “meaningful pattern” can mean a lot of things, but a very common example is clustering.

library(mvtnorm)
library(tidyverse)

source('synthetic_distributions.R') # helpers to generate synthetic data for the examples
source('k_means.R') # contains code for the basic k-means algorithm
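
The contents of k_means.R are not shown here. For reference, the following is a minimal sketch of what a basic k-means implementation (Lloyd's algorithm) might look like in R; the function name kmeans_basic, its arguments, and its defaults are hypothetical and may differ from the actual file.

# Minimal sketch of the basic k-means (Lloyd's) algorithm.
# X: numeric matrix of observations (rows) by features (columns)
# K: number of clusters
kmeans_basic <- function(X, K, n_iter = 20) {
    n <- nrow(X)
    p <- ncol(X)

    # start by randomly assigning each observation to one of the K clusters
    assignments <- sample(1:K, size = n, replace = TRUE)

    for (i in 1:n_iter) {
        # compute the centroid (feature-wise mean) of each cluster
        centroids <- t(sapply(1:K, function(k) {
            colMeans(X[assignments == k, , drop = FALSE])
        }))

        # reassign each point to the nearest centroid (squared Euclidean distance)
        dists <- sapply(1:K, function(k) {
            rowSums((X - matrix(centroids[k, ], n, p, byrow = TRUE))^2)
        })
        assignments <- apply(dists, 1, which.min)
    }

    list(assignments = assignments, centroids = centroids)
}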

Clustering

A basic clustering task attempts to group points together that appear similar. Borrowing from ISLR:

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.

The two-dimensional example below illustrates a simple case. Just by eyeballing the data we can see two apparent subgroups.
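
The lecture generates this example with helpers from synthetic_distributions.R, which are not shown here. The sketch below produces a comparable two-cluster dataset directly with rmvnorm from the mvtnorm package and plots it with ggplot2; the seed, means, and sample sizes are made-up values for illustration.

# Hypothetical stand-in for the sourced synthetic data helpers:
# simulate two well-separated Gaussian clusters in two dimensions.
set.seed(342)
cluster1 <- rmvnorm(n = 100, mean = c(-2, 0), sigma = diag(2))
cluster2 <- rmvnorm(n = 100, mean = c(2, 0), sigma = diag(2))

points <- rbind(cluster1, cluster2)
data <- tibble(x1 = points[, 1], x2 = points[, 2])

# scatter plot of the raw data: two subgroups are visible by eye
ggplot(data) +
    geom_point(aes(x = x1, y = x2))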