This lecture introduces clustering with the k-means algorithm. The primary reference is ISLR, sections 10.1, 10.3.1, and 10.3.3.

Borrowing from Wikipedia,

Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from “unlabeled” data.

This is in contrast with supervised learning:

Supervised learning is the machine learning task of inferring a function from labeled training data.

Regression and classification are two examples of supervised learning; there is prespecified \(X\) and \(Y\) data and the goal is to understand the relationship between \(X\) and \(Y\). In linear regression \(Y\) is numerical (e.g. stock price, life expectancy). In classification \(Y\) is categorical (e.g. yes/no, walking/running).

In unsupervised learning there is no prespecified, special \(Y\) variable; there are \(X\) variables and the goal is to find some kind of “meaningful pattern”. The phrase “meaningful pattern” can mean a lot of things, but a very common example is clustering.

library(mvtnorm)
library(tidyverse)

source('synthetic_distributions.R')
source('k_means.R') # contains code for the basic k-means algorithm

Clustering

A basic clustering task attempts to group points together that appear similar. Borrowing from ISLR:

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.

The two-dimensional example below illustrates a simple case. Just by eyeballing the data we can see two apparent subgroups.

It would be fairly straightforward for you to hand-code clusters, i.e. assign the points on the left to one cluster and the points on the right to another. You might come up with the following cluster assignment:

A clustering algorithm is a way of automatically generating a cluster assignment.

The above example may seem trivial, but it illustrates the idea. If all data came in 2 or 3 dimensions then we could visually inspect the data and clustering algorithms would be far less important (though still useful). Most data comes with more than 3 variables, and most clustering tasks are far less obvious than the pictures above.

K-means

K-means is perhaps the simplest clustering algorithm (it should remind you a bit of the nearest centroid classifier). K-means groups data points into K subgroups, where K is specified by you, the user, ahead of time. You decide ahead of time how many clusters you want K-means to find! See below for a longer discussion about selecting K, but you should hear subjectivity bells going off in your head.

K-means works as follows. The user inputs the value of K and then runs the following iterative procedure (this is Algorithm 10.1 from ISLR):

  1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.

  2. Iterate until the cluster assignments stop changing:

    1. For each of the K clusters, compute the cluster centroid.

    2. Assign each observation to the cluster whose centroid is closest.

I’ve implemented a minimal example of k-means which you can see here.
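A rough sketch of that two-step loop might look something like the following (an illustrative sketch with a made-up function name, not necessarily the code in k_means.R):

# illustrative sketch of the k-means iteration
# (made-up name; empty clusters and other edge cases are not handled)
simple_kmeans <- function(X, K, max_iter=100){
    X <- as.matrix(X)
    n <- nrow(X)

    # 1. randomly assign each observation to one of the K clusters
    clusters <- sample(1:K, size=n, replace=TRUE)

    for(iter in 1:max_iter){
        # 2a. compute each cluster centroid (one row per cluster)
        centroids <- t(sapply(1:K, function(k) colMeans(X[clusters == k, , drop=FALSE])))

        # 2b. reassign each observation to its closest centroid
        dists <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
        new_clusters <- apply(dists, 1, which.min)

        # stop when the assignments no longer change
        if(all(new_clusters == clusters)) break
        clusters <- new_clusters
    }

    list(cluster=clusters, centers=centroids)
}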

Let’s see k-means in action on a data set. The first plot shows the raw data. The successive plots show the cluster assignments after some number of iterations and the current cluster centroids.

We initialize K-means with random cluster assignments (colors/shapes) and then compute the mean of each cluster (the colored Xs).

Next we reassign points to clusters based on which centroid they are closest to.

And iterate again.

After a bunch of iterations the cluster assignments stop changing and we are left with the final clustering.

What is K-means doing?

K-means is based on the following intuition: points in a cluster should be close to their cluster mean.

Suppose we have \(n\) observations \(x_1, \dots, x_n\) that we want to cluster into \(K\) subgroups. Let \(S_1, \dots, S_K\) be sets containing the indices of the observations in each cluster. E.g. if \(n=5\) and \(K=2\) then we might have \(S_1 = \{1, 3, 5\}\) and \(S_2 = \{2, 4\}\). Let \(\mu_1, \dots, \mu_K\) be the means of the clusters.

We can measure how close the points in cluster \(k\) are to their cluster mean by summing the squared distances from each point to the cluster center, i.e. \(\sum_{i \in S_k} ||x_i - \mu_k||^2\). Then we can define the within cluster sum of squares (WCSS) as follows

\[WCSS = \sum_{k=1}^K \sum_{i \in S_k} ||x_i - \mu_k||^2\]

Note that WCSS depends only on the cluster assignments \(S_1, \dots, S_K\) (the cluster means are determined by the assignments). WCSS is a measure of how well the points cluster, which suggests we select the cluster assignments that minimize WCSS.
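To make this concrete, here is a short sketch of how WCSS could be computed in R for a data matrix and a vector of cluster assignments (the function name is made up; the kmeans function used below reports this same quantity as tot.withinss):

# illustrative sketch: within cluster sum of squares for a given clustering
wcss <- function(X, clusters){
    X <- as.matrix(X)
    sum(sapply(unique(clusters), function(k){
        X_k <- X[clusters == k, , drop=FALSE]
        mu_k <- colMeans(X_k)                # cluster mean
        sum(rowSums(sweep(X_k, 2, mu_k)^2))  # squared distances to the mean
    }))
}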

It turns out that finding the cluster assignments \(S_1, \dots, S_K\) that minimize WCSS is NP-hard, meaning we have no hope of actually solving it exactly for anything but small problems. K-means is a way of approximately solving the WCSS minimization problem.

In the paragraphs above we motivated minimizing WCSS through a heuristic argument. There is actually a statistical interpretation of minimizing WCSS. Suppose each data point is generated independently from one of K Gaussian distributions, where each of the K Gaussian distributions has a different mean but the identity covariance matrix. It then turns out that finding the maximum likelihood solution to this problem is exactly minimizing the WCSS. This is a special case of the more general Gaussian mixture model.
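To sketch why: if each observation \(x_i\) in cluster \(k\) is drawn from a Gaussian with mean \(\mu_k\) and identity covariance, then the log-likelihood of a cluster assignment \(S_1, \dots, S_K\) (with each \(\mu_k\) set to its cluster mean) is, up to an additive constant,

\[\log L = -\frac{1}{2} \sum_{k=1}^K \sum_{i \in S_k} ||x_i - \mu_k||^2,\]

so maximizing the likelihood over cluster assignments is the same as minimizing WCSS.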

What happens when you change K?

Check out this shiny app to see what happens when we change \(K\).

Code

You can see a simple implementation of K-means here. There are a number of bells and whistles that you might want to add to the K-means algorithm (e.g. see the R documentation). K-means is implemented well by the kmeans function in the stats package, which comes with R.

K-means works on a numerical data matrix:

data
## # A tibble: 400 x 2
##         x1     x2
##      <dbl>  <dbl>
##  1  2.15   -0.479
##  2 -0.0879  1.62 
##  3  1.77   -0.151
##  4  0.619   1.26 
##  5  1.08   -0.546
##  6  0.857   0.386
##  7  1.76    0.381
##  8  2.21   -2.47 
##  9  0.964   0.387
## 10 -0.609   1.73 
## # ... with 390 more rows
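This data set was generated by code in synthetic_distributions.R (not shown). For illustration only, something along the following lines would produce a similar two-dimensional data set (the name, means, and sample sizes here are made up):

# illustrative sketch: sample two overlapping Gaussian clusters in 2 dimensions
# (the parameters actually used to make `data` may differ)
fake_data <- rbind(rmvnorm(n=200, mean=c(1, 0), sigma=diag(2)),
                   rmvnorm(n=200, mean=c(-1, 0), sigma=diag(2)))
colnames(fake_data) <- c('x1', 'x2')
fake_data <- as_tibble(fake_data)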

You need to decide on K and then run kmeans:

# number of desired clusters
K <- 3

# run Kmeans algorithm
km_fitted <- kmeans(x=data, centers=K)

km_fitted
## K-means clustering with 3 clusters of sizes 112, 142, 146
## 
## Cluster means:
##           x1         x2
## 1 -0.3464934 -0.9347975
## 2 -1.2055820  0.6037372
## 3  1.3881640  0.1776493
## 
## Clustering vector:
##   [1] 3 2 3 3 3 3 3 3 3 2 3 1 3 3 3 2 3 2 3 1 2 3 1 3 2 3 3 3 3 1 3 3 3 3 2
##  [36] 2 3 3 1 3 3 1 3 3 3 3 3 1 1 2 3 3 1 3 3 3 1 3 3 1 3 3 3 3 3 3 1 2 3 1
##  [71] 1 3 3 1 2 3 3 3 3 3 3 3 3 3 3 2 1 3 3 3 1 3 3 1 3 3 1 2 1 3 1 1 2 3 3
## [106] 3 1 1 3 1 3 3 1 3 3 3 3 3 2 3 3 3 3 3 3 1 3 1 3 2 3 2 2 3 3 1 3 3 3 1
## [141] 3 3 1 1 3 3 3 1 3 1 1 3 1 1 3 3 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 1 3
## [176] 1 3 3 3 3 2 3 1 1 3 3 1 3 1 3 3 3 3 1 3 3 3 2 1 3 2 2 2 2 2 2 2 1 2 2
## [211] 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 1 2 1 2 3 3 1 2 2 2 2 2 1 2 2 3 2 2 1 2
## [246] 1 1 1 2 2 2 1 2 1 2 2 1 2 3 2 2 2 1 2 2 3 1 1 3 2 1 2 1 2 3 2 2 2 1 1
## [281] 1 1 2 2 2 2 1 3 1 3 2 1 2 1 2 3 2 1 1 2 2 2 2 2 1 2 1 2 1 2 2 2 1 2 2
## [316] 2 1 1 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 3 2 1 2 1 2 2 1
## [351] 1 2 1 1 2 1 1 2 2 2 1 1 3 1 2 1 1 2 2 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2
## [386] 2 1 2 2 1 3 1 3 3 2 2 1 1 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 101.6155 181.8375 197.6010
##  (between_SS / total_SS =  57.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

The kmeans function returns a bunch of information, including the cluster assignments:

# first 5 entries of vector with cluster assignments
km_fitted$cluster[1:5]
## [1] 3 2 3 3 3
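The other components listed above under "Available components" are accessed the same way, for example:

# cluster centroids (one row per cluster)
km_fitted$centers

# total within cluster sum of squares (the WCSS discussed above)
km_fitted$tot.withinss

# number of observations assigned to each cluster
km_fitted$size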

Since the data are two dimensional we can visualize the cluster assignments

# add a column for cluster assignments
data <- data %>% 
    add_column(cl=factor(km_fitted$cluster))


ggplot(data) +
    geom_point(aes(x=x1, y=x2, color=cl, shape = cl)) +
    theme(panel.background = element_blank())
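We can also overlay the fitted centroids (the Xs from the earlier plots); a sketch:

# overlay the fitted cluster centroids as Xs
centroids <- as.data.frame(km_fitted$centers)

ggplot(data) +
    geom_point(aes(x=x1, y=x2, color=cl, shape = cl)) +
    geom_point(data=centroids, aes(x=x1, y=x2), shape=4, size=4) +
    theme(panel.background = element_blank())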

Relation to classification

Classification is about going from labels to patterns, while clustering is about going from patterns to labels. This lecture hopefully reminded you of the classification lecture. In classification we had labeled data (classes) and we tried to find patterns that distinguished the different classes (e.g. one class is on the left of a separating hyperplane and the other is on the right). In clustering we come up with the class labels ourselves based on patterns in the data.

One way to interpret clustering is as a missing data problem. We might imagine the X data we were given originally had a \(y\) class label column, but this label column was removed. In other words, the class labels are latent variables (variables that exist, but we don’t observe them).

How to select K

For K-means (and most clustering algorithms) you have to pre-specify \(K\), the number of clusters you want to find. This is a hard problem. Occasionally you have outside or domain-specific information that tells you how many clusters you want to find. In most cases you would like to use the data to select the number of clusters. There is not a well-established method to select \(K\).
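One common, informal heuristic (a sketch, not a definitive method) is to run K-means for a range of values of \(K\) and plot the total within cluster sum of squares against \(K\); a sharp "elbow" in the curve is sometimes used to pick \(K\), though the curve always decreases as \(K\) grows and the elbow is often ambiguous.

# run kmeans for a range of K and record the total WCSS
# (drop the cluster assignment column added to `data` above)
wcss_by_K <- map_dbl(1:10, function(K){
    kmeans(x=select(data, x1, x2), centers=K, nstart=20)$tot.withinss
})

tibble(K=1:10, wcss=wcss_by_K) %>%
    ggplot(aes(x=K, y=wcss)) +
    geom_point() +
    geom_line()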

Most of the methods to select \(K\) are cluster validation methods. You first perform clustering to find \(K\) clusters, then perform some kind of hypothesis test to check whether the clusters you have found are purely the result of noise. Cluster validation is very important for reasons discussed below.

WARNING: clustering algorithms always give you clusters!

Clustering algorithms are partisan zealots that will always find clusters even if there is nothing there. If you run K-means with K = 5, it will always find 5 clusters. Maybe there were 4 clusters, maybe there were 20, or maybe there were 0 clusters. K-means doesn’t care – it will stick with 5.

The following example should scare you. We will generate 200 data points independently and identically distributed from a two-dimensional standard normal distribution. There is no signal in the data; there should only be one cluster. Next we run K-means with K = 5 and plot the results.

# two dimensional standard normal
X <- rmvnorm(n=200, mean=c(0, 0), sigma=diag(2))

# run kmeans
km_fitted <- kmeans(x=X, centers = 5)
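Plotting the resulting cluster assignments (a sketch, analogous to the plot above) shows K-means happily splitting pure noise into clusters.

# plot the clusters k-means "finds" in pure noise
colnames(X) <- c('x1', 'x2')

as_tibble(X) %>%
    add_column(cl=factor(km_fitted$cluster)) %>%
    ggplot() +
    geom_point(aes(x=x1, y=x2, color=cl, shape=cl)) +
    theme(panel.background = element_blank())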

Additional warnings

Before applying a clustering algorithm to real data you should read ISLR section 10.3.3 and the references therein. K-means (and other clustering algorithms) are very sensitive to the many choices you make about the data, such as centering/scaling, removing/adding columns, the value of K, etc. In general you should re-run your clustering procedure several times with various different parameter settings.
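For example, you might compare the assignments you get with and without scaling the columns (a sketch; neither choice is automatically "correct", it depends on the application):

# compare cluster assignments with and without scaling the columns
km_raw    <- kmeans(x=select(data, x1, x2), centers=3, nstart=20)
km_scaled <- kmeans(x=scale(select(data, x1, x2)), centers=3, nstart=20)

# cross-tabulate the two sets of assignments; cluster labels are arbitrary,
# so look at whether rows and columns line up rather than exact label matches
table(raw=km_raw$cluster, scaled=km_scaled$cluster)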