Clustering with K-means

STOR 390

Outline

Unsupervised learning

Unsupervised learning is the machine learning task of inferring a function that describes hidden structure in “unlabeled” data.

Supervised learning

Supervised learning is the machine learning task of inferring a function from labeled training data.

Examples

Unsupervised learning: find a meaningful pattern

Clustering

Two dimensional example (raw data)

Two dimensional example (after clustering)

K-means

K-means algorithm

  1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.

  2. Iterate until the cluster assignments stop changing:

    1. For each of the K clusters, compute the cluster centroid.

    2. Assign each observation to the cluster whose centroid is closest (where closest is measured in Euclidean distance); a from-scratch sketch of these two steps appears below.
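
The two steps can be made concrete with a short from-scratch sketch in R (a teaching sketch under illustrative assumptions: the name kmeans_simple is made up, the iteration cap is arbitrary, and empty clusters are not handled; use the built-in kmeans() shown later in practice):

# illustrative from-scratch K-means
kmeans_simple <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)

  # step 1: random initial cluster assignments
  assignments <- sample(1:K, size = nrow(X), replace = TRUE)

  for (iter in 1:max_iter) {
    # step 2a: compute each cluster's centroid (one column per cluster)
    centroids <- sapply(1:K, function(k) colMeans(X[assignments == k, , drop = FALSE]))

    # step 2b: reassign each observation to its nearest centroid (Euclidean distance)
    new_assignments <- apply(X, 1, function(x) which.min(colSums((centroids - x)^2)))

    # stop once the assignments no longer change
    if (all(new_assignments == assignments)) break
    assignments <- new_assignments
  }

  assignments
}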

Raw data

Random initialization

Assign points to nearest cluster

Iterate again

Final iteration

Raw data

data
## # A tibble: 400 × 2
##             x1         x2
##          <dbl>      <dbl>
## 1   2.15074931 -0.4792060
## 2  -0.08786829  1.6166332
## 3   1.76506913 -0.1514896
## 4   0.61865671  1.2622023
## 5   1.07803836 -0.5458768
## 6   0.85736662  0.3856852
## 7   1.75866283  0.3809878
## 8   2.21017318 -2.4745634
## 9   0.96406375  0.3872157
## 10 -0.60946315  1.7316120
## # ... with 390 more rows

run K-means

# number of desired clusters
K <- 3

# run the K-means algorithm
km_fitted <- kmeans(x=data, centers=K)

km_fitted
## K-means clustering with 3 clusters of sizes 112, 142, 146
## 
## Cluster means:
##           x1         x2
## 1 -0.3464934 -0.9347975
## 2 -1.2055820  0.6037372
## 3  1.3881640  0.1776493
## 
## Clustering vector:
##   [1] 3 2 3 3 3 3 3 3 3 2 3 1 3 3 3 2 3 2 3 1 2 3 1 3 2 3 3 3 3 1 3 3 3 3 2
##  [36] 2 3 3 1 3 3 1 3 3 3 3 3 1 1 2 3 3 1 3 3 3 1 3 3 1 3 3 3 3 3 3 1 2 3 1
##  [71] 1 3 3 1 2 3 3 3 3 3 3 3 3 3 3 2 1 3 3 3 1 3 3 1 3 3 1 2 1 3 1 1 2 3 3
## [106] 3 1 1 3 1 3 3 1 3 3 3 3 3 2 3 3 3 3 3 3 1 3 1 3 2 3 2 2 3 3 1 3 3 3 1
## [141] 3 3 1 1 3 3 3 1 3 1 1 3 1 1 3 3 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 1 3
## [176] 1 3 3 3 3 2 3 1 1 3 3 1 3 1 3 3 3 3 1 3 3 3 2 1 3 2 2 2 2 2 2 2 1 2 2
## [211] 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 1 2 1 2 3 3 1 2 2 2 2 2 1 2 2 3 2 2 1 2
## [246] 1 1 1 2 2 2 1 2 1 2 2 1 2 3 2 2 2 1 2 2 3 1 1 3 2 1 2 1 2 3 2 2 2 1 1
## [281] 1 1 2 2 2 2 1 3 1 3 2 1 2 1 2 3 2 1 1 2 2 2 2 2 1 2 1 2 1 2 2 2 1 2 2
## [316] 2 1 1 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 3 2 1 2 1 2 2 1
## [351] 1 2 1 1 2 1 1 2 2 2 1 1 3 1 2 1 1 2 2 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2
## [386] 2 1 2 2 1 3 1 3 3 2 2 1 1 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 101.6155 181.8375 197.6010
##  (between_SS / total_SS =  57.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Cluster assignments

# first 5 entries of vector with cluster assignments
km_fitted$cluster[1:5]
## [1] 3 2 3 3 3

Visualize results
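
One way to draw this kind of plot, sketched with dplyr and ggplot2 (assuming the tibble data and the fit km_fitted from above):

library(dplyr)
library(ggplot2)

# attach the fitted cluster labels and color points by cluster
data %>%
    mutate(cluster = factor(km_fitted$cluster)) %>%
    ggplot(aes(x=x1, y=x2, color=cluster)) +
    geom_point()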

Changing K

https://shiny.rstudio.com/gallery/kmeans-example.html

K-means intuition

Within-cluster sum of squares (WCSS)
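
In standard notation (writing C_k for the set of observations assigned to cluster k and \mu_k for its centroid), the quantity K-means tries to make small is

WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

i.e. the total squared Euclidean distance from each observation to its own cluster's centroid.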

Clustering vs. classification

How to pick K
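
One common heuristic, the “elbow” method (an illustration here, not necessarily how this course picks K): fit K-means for a range of K values, plot the total WCSS against K, and look for the K where the curve stops dropping sharply.

# fit K-means for K = 1, ..., 10 and record the total WCSS for each
wcss <- sapply(1:10, function(k) kmeans(x=data, centers=k, nstart=20)$tot.withinss)

# look for the "elbow" where adding clusters stops helping much
plot(1:10, wcss, type="b", xlab="K", ylab="total WCSS")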

WARNING: clustering algorithms always give you clusters!

Terrifying example

Sample data

# two dimensional standard normal (rmvnorm comes from the mvtnorm package)
library(mvtnorm)
X <- rmvnorm(n=200, mean=c(0, 0), sigma=diag(2))

Raw data

Fit K-means with K = 4

# run kmeans
km_fitted <- kmeans(x=X, centers = 4)

Results
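
A sketch of how the plot on this slide can be reproduced (base graphics, coloring points by the fitted labels):

# pure noise, yet the plot shows 4 clean-looking clusters
plot(X, col=km_fitted$cluster, xlab="x1", ylab="x2")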

K-means is very sensitive to data choices
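
One concrete instance of this sensitivity (an illustration; K-means is built on Euclidean distance, so how you scale the variables is itself a choice that matters): rescaling a single column typically changes the resulting partition.

# multiply the second coordinate by 100; distances, and hence clusters, change
X_rescaled <- X
X_rescaled[, 2] <- 100 * X_rescaled[, 2]
km_rescaled <- kmeans(x=X_rescaled, centers=4)

# cross-tabulate the two partitions; they typically disagree
table(original=km_fitted$cluster, rescaled=km_rescaled$cluster)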