STOR 390
Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from “unlabeled” data.
Supervised learning is the machine learning task of inferring a function from labeled training data.
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
   a. For each of the K clusters, compute the cluster centroid (the vector of variable means over the observations currently in that cluster).
   b. Assign each observation to the cluster whose centroid is closest in Euclidean distance.
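The steps above can be sketched directly in R. This is a minimal illustration, not the routine behind the built-in kmeans() function: the name simple_kmeans, the max_iter cap, the balanced random start, and the rule of keeping the previous centroid when a cluster becomes empty are all assumptions made for the sketch.

```r
# Minimal K-means sketch (hypothetical helper, not stats::kmeans).
simple_kmeans <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Step 1: random initial assignment of a label 1..K to each observation.
  # Assumption: labels are a shuffled balanced sequence so no cluster starts empty.
  cluster <- sample(rep_len(1:K, n))
  centers <- matrix(NA_real_, nrow = K, ncol = ncol(X))
  for (iter in seq_len(max_iter)) {
    # Step 2a: centroid = column means of the observations in each cluster.
    for (k in 1:K) {
      members <- cluster == k
      # Assumption: an empty cluster keeps its previous centroid.
      if (any(members)) centers[k, ] <- colMeans(X[members, , drop = FALSE])
    }
    # Step 2b: squared Euclidean distance from each observation to each centroid,
    # stored as an n x K matrix; assign each row to the nearest centroid.
    d2 <- vapply(1:K, function(k) rowSums(sweep(X, 2, centers[k, ])^2), numeric(n))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Stop once the assignments no longer change.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

# Usage on synthetic data (assumed for illustration): two shifted groups of 50 points.
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
fit <- simple_kmeans(toy, K = 2)
table(fit$cluster)
fit$centers
```

Because the starting assignment is random, different seeds can give different label numberings and, on less well-separated data, different final clusters.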
data
## # A tibble: 400 × 2
## x1 x2
## <dbl> <dbl>
## 1 2.15074931 -0.4792060
## 2 -0.08786829 1.6166332
## 3 1.76506913 -0.1514896
## 4 0.61865671 1.2622023
## 5 1.07803836 -0.5458768
## 6 0.85736662 0.3856852
## 7 1.75866283 0.3809878
## 8 2.21017318 -2.4745634
## 9 0.96406375 0.3872157
## 10 -0.60946315 1.7316120
## # ... with 390 more rows
# number of desired clusters
K <- 3
# run the K-means algorithm
km_fitted <- kmeans(x=data, centers=K)
km_fitted
## K-means clustering with 3 clusters of sizes 112, 142, 146
##
## Cluster means:
## x1 x2
## 1 -0.3464934 -0.9347975
## 2 -1.2055820 0.6037372
## 3 1.3881640 0.1776493
##
## Clustering vector:
## [1] 3 2 3 3 3 3 3 3 3 2 3 1 3 3 3 2 3 2 3 1 2 3 1 3 2 3 3 3 3 1 3 3 3 3 2
## [36] 2 3 3 1 3 3 1 3 3 3 3 3 1 1 2 3 3 1 3 3 3 1 3 3 1 3 3 3 3 3 3 1 2 3 1
## [71] 1 3 3 1 2 3 3 3 3 3 3 3 3 3 3 2 1 3 3 3 1 3 3 1 3 3 1 2 1 3 1 1 2 3 3
## [106] 3 1 1 3 1 3 3 1 3 3 3 3 3 2 3 3 3 3 3 3 1 3 1 3 2 3 2 2 3 3 1 3 3 3 1
## [141] 3 3 1 1 3 3 3 1 3 1 1 3 1 1 3 3 3 3 3 3 3 2 2 2 3 3 3 3 3 3 3 3 2 1 3
## [176] 1 3 3 3 3 2 3 1 1 3 3 1 3 1 3 3 3 3 1 3 3 3 2 1 3 2 2 2 2 2 2 2 1 2 2
## [211] 2 1 2 2 1 2 2 1 2 2 2 1 2 1 2 1 2 1 2 3 3 1 2 2 2 2 2 1 2 2 3 2 2 1 2
## [246] 1 1 1 2 2 2 1 2 1 2 2 1 2 3 2 2 2 1 2 2 3 1 1 3 2 1 2 1 2 3 2 2 2 1 1
## [281] 1 1 2 2 2 2 1 3 1 3 2 1 2 1 2 3 2 1 1 2 2 2 2 2 1 2 1 2 1 2 2 2 1 2 2
## [316] 2 1 1 2 2 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 3 2 1 2 1 2 2 1
## [351] 1 2 1 1 2 1 1 2 2 2 1 1 3 1 2 1 1 2 2 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2
## [386] 2 1 2 2 1 3 1 3 3 2 2 1 1 2 2
##
## Within cluster sum of squares by cluster:
## [1] 101.6155 181.8375 197.6010
## (between_SS / total_SS = 57.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
# first 5 entries of vector with cluster assignments
km_fitted$cluster[1:5]
## [1] 3 2 3 3 3
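The other components listed in the printout can be read off the fitted object in the same way. The sketch below uses a small synthetic matrix (toy, an assumption standing in for the data tibble above) and checks how the sum-of-squares components relate to each other.

```r
# Assumption: synthetic stand-in data, not the notes' `data` tibble.
set.seed(390)
toy <- matrix(rnorm(200), ncol = 2)
fit <- kmeans(x = toy, centers = 3)

fit$centers         # one row per cluster: the cluster centroids
fit$size            # number of observations in each cluster
table(fit$cluster)  # the same counts, tallied from the assignment vector

# The total sum of squares splits into within- and between-cluster parts.
all.equal(fit$tot.withinss, sum(fit$withinss))
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# The percentage reported in the printout is this ratio.
fit$betweenss / fit$totss
```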
# rmvnorm() comes from the mvtnorm package
library(mvtnorm)
# 200 draws from a two-dimensional standard normal
X <- rmvnorm(n=200, mean=c(0, 0), sigma=diag(2))
# run kmeans
km_fitted <- kmeans(x=X, centers = 4)
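One thing this example can be used to show is that kmeans() returns as many clusters as requested, whether or not the data contain any real groups. The sketch below is a self-contained variant: it assumes base-R rnorm() in place of rmvnorm() (with identity covariance the two coordinates are independent standard normals) and fixes a seed purely for repeatability.

```r
# Assumption: rnorm() stands in for mvtnorm::rmvnorm(); the seed is arbitrary.
set.seed(1)
X <- matrix(rnorm(2 * 200), ncol = 2)
km_fitted <- kmeans(x = X, centers = 4)

km_fitted$size     # four cluster sizes, although the data form a single blob
km_fitted$centers  # one centroid per requested cluster
```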