Clustering with K-means

STOR 390

Outline

unsupervised vs. supervised learning
clustering
K-means

Unsupervised learning

Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from “unlabeled” data.

Supervised learning

Supervised learning is the machine learning task of inferring a function from labeled training data.

Examples

supervised
- regression (e.g. linear model)
- classification (e.g. KNN)
unsupervised
- clustering (e.g. K-means)
- dimensionality reduction (e.g. PCA)

Unsupervised learning: find a meaningful pattern

what does “meaningful” mean?

Clustering

Goal: automatically find meaningful subgroups in the data

Two dimensional example (raw data)

Two dimensional example (after clustering)

K-means

simple clustering algorithm
- similar go nearest centroid
user picks K

K-means algorithm

Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
Iterate until the cluster assignments stop changing:
1. For each of the K clusters, compute the cluster centroid.
2. Assign each observation to the cluster whose centroid is closest.

Random initialization

Assign points to nearest cluster

Raw data

data

## # A tibble: 400 × 2
##             x1         x2
##          <dbl>      <dbl>
## 1   2.15074931 -0.4792060
## 2  -0.08786829  1.6166332
## 3   1.76506913 -0.1514896
## 4   0.61865671  1.2622023
## 5   1.07803836 -0.5458768
## 6   0.85736662  0.3856852
## 7   1.75866283  0.3809878
## 8   2.21017318 -2.4745634
## 9   0.96406375  0.3872157
## 10 -0.60946315  1.7316120
## # ... with 390 more rows

run K-means

# number of desired clusters
K <- 3

# run Kmeans algorithm
km_fitted <- kmeans(x=data, centers=K)

km_fitted

## K-means clustering with 3 clusters of sizes 112, 142, 146
## 
## Cluster means:
##           x1         x2
## 1 -0.3464934 -0.9347975
## 2 -1.2055820  0.6037372
## 3  1.3881640  0.1776493
## 
## Clustering vector:
##   [1] 3 2 3 3 3 3 3 3 3 2 3 1 3 3 3 2 3 2 3 1 2 3 1 3 2 3 3 3 3 1 3 3 3 3 2
##  [36] 2 3 3 1 3 3 1 3 3 3 3 3 1 1 2 3 3 1 3 3 3 1 3 3 1 3 3 3 3 3 3 1 2 3 1
##  [71] 1 3 3 1 2 3 3 3 3 3 3 3 3 3 3 2 1 3 3 3 1 3 3 1 3 3 1 2 1 3 1 1 2 3 3
## [106] 3 1 1 3 1 3 3 1 3 3 3 3 3 2 3 3 3 3 3 3 1 3 1 3 2 3 2 2 3 3 1 3 3 3 1
## [141] 3 3 1 1 3 3 3 1 3 1 1 3 1 1 3 3 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 1 3
## [176] 1 3 3 3 3 2 3 1 1 3 3 1 3 1 3 3 3 3 1 3 3 3 2 1 3 2 2 2 2 2 2 2 1 2 2
## [211] 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 1 2 1 2 3 3 1 2 2 2 2 2 1 2 2 3 2 2 1 2
## [246] 1 1 1 2 2 2 1 2 1 2 2 1 2 3 2 2 2 1 2 2 3 1 1 3 2 1 2 1 2 3 2 2 2 1 1
## [281] 1 1 2 2 2 2 1 3 1 3 2 1 2 1 2 3 2 1 1 2 2 2 2 2 1 2 1 2 1 2 2 2 1 2 2
## [316] 2 1 1 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 3 2 1 2 1 2 2 1
## [351] 1 2 1 1 2 1 1 2 2 2 1 1 3 1 2 1 1 2 2 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2
## [386] 2 1 2 2 1 3 1 3 3 2 2 1 1 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 101.6155 181.8375 197.6010
##  (between_SS / total_SS =  57.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Cluster assignments

# first 5 entries of vector with cluster assignments
km_fitted$cluster[1:5]

## [1] 3 2 3 3 3

K-means intuition

points in a cluster should be close to their cluster mean.

Within class sum of square (WCSS)

Clustering vs. classification

classification: labels -> patterns
clustering: patterns -> labels
missing data/latent variables

How to pick K

some what of an art
cluster validation

WARNING: clustering algorithms always give you clusters!

K-means always returns K clusters
even if there is nothing there

Terrifying example

200 IID data points
there is no signal in the data
only be one cluseter
run K-means with K = 5

Sample data

# two dimensional standard normal
X <- rmvnorm(n=200, mean=c(0, 0), sigma=diag(2))

Fit kmeans with K = 4

# run kmeans
km_fitted <- kmeans(x=X, centers = 4)

K-means is very sensitive to data choices

center/scale
different variables
different K

Clustering with K-means

Outline

Unsupervised learning

Supervised learning

Examples

Unsupervised learning: find a meaningful pattern

Clustering

Two dimensional example (raw data)

Two dimensional example (after clustering)

K-means

K-means algorithm

Raw data

Random initialization

Assign points to nearest cluster

Iterate again

Final iteration

Raw data

run K-means

Cluster assignments

Visualize results

Changing K

K-means intuition

Within class sum of square (WCSS)

Clustering vs. classification

How to pick K

WARNING: clustering algorithms always give you clusters!

Terrifying example

Sample data

Raw data

Fit kmeans with K = 4

Results

K-means is very sensitive to data choices