Classification

STOR 390

# package to sample from the multivariate gaussian distribution
library(mvtnorm)

# calculate distances between points in a data frame
library(flexclust)

# for knn
library(class)

library(tidyverse)

# some helper functions I wrote for this script
# you can find this file in the same folder as the .Rmd document
source('helper_functions.R')

Linear regression

Classification

Classification examples

Build a map

Code to get started with

Math prerequisites

Notation

Some toy examples

Gaussian point clouds

separable point clouds

skewed point clouds

heteroscedastic point clouds

Gaussian mixture model point clouds

Boston Cream

Nearest Centroid Classifier

advantages of NC

disadvantages

Sufficiency

Train the NC classifier

compute the class means

compute distance to test point

Given a new test point \(\mathbf{\tilde{x}}\), compute the distance between \(\mathbf{\tilde{x}}\) and each class mean.

classify test point

test point

# test point
x_test <- c(1, 1)

compute the class means

# compute the observed class means
obs_means <- data_gauss %>% 
    group_by(y) %>% 
    summarise_all(mean)

obs_means
## # A tibble: 2 × 3
##        y        x1          x2
##   <fctr>     <dbl>       <dbl>
## 1     -1 -1.151583  0.12338380
## 2      1  1.147330 -0.09823948

training class means

compute distance to each class

# grab each class mean
mean_pos <- select(filter(obs_means, y==1), -y)
mean_neg <- select(filter(obs_means, y==-1), -y)

# compute the euclidean distance from the class mean to the test point
dist_pos <- sqrt(sum((x_test - mean_pos)^2))
dist_neg <- sqrt(sum((x_test - mean_neg)^2))
dist_pos
## [1] 1.108078
dist_neg
## [1] 2.323309

Distance to class means

test points

Make a test grid

# make a grid of test points
test_grid <- expand.grid(x1 = seq(-4, 4, length = 100),
                         x2 = seq(-4, 4, length = 100)) %>% 
            as_tibble()
test_grid
## # A tibble: 10,000 × 2
##           x1    x2
##        <dbl> <dbl>
## 1  -4.000000    -4
## 2  -3.919192    -4
## 3  -3.838384    -4
## 4  -3.757576    -4
## 5  -3.676768    -4
## 6  -3.595960    -4
## 7  -3.515152    -4
## 8  -3.434343    -4
## 9  -3.353535    -4
## 10 -3.272727    -4
## # ... with 9,990 more rows

NC predictions

# compute the distance from each test point to the two class means
# note the use of the apply function (we could have used a for loop)
dist_pos <- apply(test_grid, 1, function(x) sqrt(sum((x - mean_pos)^2)))
dist_neg <- apply(test_grid, 1, function(x) sqrt(sum((x - mean_neg)^2)))

NC predictions

# add distance columns to the test grid data frame
test_grid <- test_grid %>% 
    add_column(dist_pos = dist_pos,
               dist_neg = dist_neg)

# decide which class mean each test point is closest to
test_grid <- test_grid %>% 
             mutate(y_pred = ifelse(dist_pos < dist_neg, 1, -1)) %>% 
             mutate(y_pred=factor(y_pred))
test_grid
## # A tibble: 10,000 × 5
##           x1    x2 dist_pos dist_neg y_pred
##        <dbl> <dbl>    <dbl>    <dbl> <fctr>
## 1  -4.000000    -4 6.459005 5.011564     -1
## 2  -3.919192    -4 6.394793 4.966080     -1
## 3  -3.838384    -4 6.330962 4.921503     -1
## 4  -3.757576    -4 6.267522 4.877857     -1
## 5  -3.676768    -4 6.204487 4.835168     -1
## 6  -3.595960    -4 6.141867 4.793461     -1
## 7  -3.515152    -4 6.079677 4.752762     -1
## 8  -3.434343    -4 6.017929 4.713098     -1
## 9  -3.353535    -4 5.956637 4.674493     -1
## 10 -3.272727    -4 5.895816 4.636976     -1
## # ... with 9,990 more rows

NC predictions

NC is a linear classifier

higher dimensions

normal vector, intercept

A hyperplane is given by a normal vector \(\mathbf{w} \in \mathbb{R}^d\) and an intercept \(b \in \mathbb{R}\).

The hyperplane consists of all points \(\mathbf{x}\) in \(\mathbb{R}^d\) that satisfy \(\mathbf{x}^T\mathbf{w} + b = 0\), i.e. \[H = \{\mathbf{x} \in \mathbb{R}^d | \mathbf{x}^T\mathbf{w} + b = 0\}\]
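
For intuition, here is a tiny sketch in two dimensions (the particular \(\mathbf{w}\), \(b\), and test point below are made up purely for illustration):

# an arbitrary hyperplane (a line) in R^2, for illustration only
w <- c(1, -2)  # normal vector
b <- 0.5       # intercept

# a point x lies on the hyperplane exactly when x^T w + b = 0;
# otherwise the sign of x^T w + b says which side of H the point is on
x <- c(2, 1)
sum(x * w) + b # 0.5 here, so x is not on H (it lies on the positive side)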

Normal vector, separating hyperplane

Mean difference

Mean difference

The normal vector for NC is given by the difference of the two class means \[\mathbf{w} = \mathbf{m}_{+} - \mathbf{m}_{-}\] and the NC intercept is given by \[b = - \frac{1}{2}\left(||\mathbf{m}_{+}||_2^2 - ||\mathbf{m}_{-}||_2^2 \right)\]

Linear classifiers

Given a new test point \(\mathbf{\tilde{x}}\):

  1. Compute the discriminant \(f = \mathbf{w}^T \mathbf{\tilde{x}} + b\).
  2. Compute the sign of the discriminant \(\tilde{y}=\text{sign}(f)\).

Classify \(\mathbf{\tilde{x}}\) to the positive class if \(\mathbf{w}^T \mathbf{\tilde{x}} + b > 0\), and to the negative class otherwise.
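
As a quick sketch of these two steps for NC, reusing mean_pos, mean_neg, and x_test from the code above (and assuming the \(\mathbf{w}\) and \(b\) formulas just given):

# class means as plain numeric vectors
m_pos <- unlist(mean_pos)
m_neg <- unlist(mean_neg)

# NC normal vector and intercept
w <- m_pos - m_neg
b <- -(1/2) * (sum(m_pos^2) - sum(m_neg^2))

# discriminant and its sign for the test point
f <- sum(w * x_test) + b
sign(f) # 1, the same prediction the distance comparison gave above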

Toy examples

Gaussian point clouds

Training error rate for point clouds: 0.1125.
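
The code that computed this error rate isn't shown in this excerpt; a rough sketch of how the NC training error could be computed, reusing data_gauss, mean_pos, and mean_neg from above:

# distance from each training point to each observed class mean
d_pos <- apply(select(data_gauss, -y), 1, function(x) sqrt(sum((x - mean_pos)^2)))
d_neg <- apply(select(data_gauss, -y), 1, function(x) sqrt(sum((x - mean_neg)^2)))

# predict the closer class for every training point
y_pred_train <- ifelse(d_pos < d_neg, 1, -1)

# training error rate: proportion of misclassified training points
mean(y_pred_train != data_gauss$y)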

separable point clouds

Training error rate for separable point clouds: 0.

skewed point clouds

Training error rate for skewed point clouds: 0.065.

heteroscedastic point clouds

Training error rate for heteroscedastic point clouds: 0.15.

GMM

Training error rate for GMM: 0.41.

Boston Cream

Training error rate for Boston cream: 0.475.

K-nearest-neighbors

Differences between KNN and NC

  1. KNN is not a linear classifier
  2. KNN has a tuning parameter (k) that needs to be set by the user

No free lunch

More flexibility means lower bias but higher variance.

Bias-variance tradeoff!

Computing KNN (math)

  1. Find distance from test point to each training point
  2. Sort these distances
  3. K nearest neighbors vote on test point’s label

Compute distances

For a new test point \(\tilde{\mathbf{x}}\), first compute the distance between \(\tilde{\mathbf{x}}\) and each training point, i.e. \[d_i = ||\tilde{\mathbf{x}} - \mathbf{x}_i||_2 \text{ for } i = 1, \dots, n\]

Sort points

Next sort these distances and find the \(k\) smallest distances (i.e. let \(d_{i_1}, \dots, d_{i_k}\) be the \(k\) smallest distances).

Vote

Now look at the corresponding labels for these \(k\) closest points \(y_{i_1}, \dots, y_{i_k}\) and have these labels vote (if there is a tie break it randomly). Assign the predicted \(\tilde{y}\) to the winner of this vote.

Computing KNN (code)

k <- 5 # number of neighbors to use
x_test <- c(0, 1) # test point

Compute distances

# grab the training xs and compute the distances to the test point
distances <- train_data %>%
         select(-y) %>%
        dist2(x_test) %>% # compute distances
        c() # flatten the distance matrix into a plain vector

# print first 5 entries
distances[1:5]
## [1] 1.17677223 3.17954493 0.09772573 2.28397117 1.87472240

Sort

# add a new column to the data frame and sort
train_data_sorted <- train_data %>% 
        add_column(dist2tst = distances) %>% # add the distances to the test point as a new column
        arrange(dist2tst) # sort data points by the distance to the test point
train_data_sorted
## # A tibble: 400 × 4
##             x1        x2      y   dist2tst
##          <dbl>     <dbl> <fctr>      <dbl>
## 1   0.06540233 0.9702020     -1 0.07187061
## 2  -0.02383795 1.0947738      1 0.09772573
## 3  -0.14322246 1.1610164     -1 0.21549703
## 4   0.07268305 1.2348212      1 0.24581256
## 5   0.09913570 0.7579408      1 0.26157322
## 6   0.12200396 0.7609176      1 0.26841264
## 7   0.27122239 0.8741936      1 0.29897967
## 8  -0.25188510 0.7747962      1 0.33787996
## 9  -0.30499744 0.8273767      1 0.35046004
## 10  0.35008260 1.1037569     -1 0.36513465
## # ... with 390 more rows

Find K nearest neighbors

# select the k closest training points
nearest_neighbors <- slice(train_data_sorted, 1:k) # data are sorted so this picks the top k rows
nearest_neighbors
## # A tibble: 5 × 4
##            x1        x2      y   dist2tst
##         <dbl>     <dbl> <fctr>      <dbl>
## 1  0.06540233 0.9702020     -1 0.07187061
## 2 -0.02383795 1.0947738      1 0.09772573
## 3 -0.14322246 1.1610164     -1 0.21549703
## 4  0.07268305 1.2348212      1 0.24581256
## 5  0.09913570 0.7579408      1 0.26157322

Find nearest neighbors

Vote

# count the number of nearest neighbors in each class
votes <- nearest_neighbors %>% 
         group_by(y) %>% 
         summarise(votes=n())

votes
## # A tibble: 2 × 2
##        y votes
##   <fctr> <int>
## 1     -1     2
## 2      1     3
 # the [1] is in case of a tie -- then just pick the first class that appears
y_pred <- filter(votes, votes == max(votes))$y[1]
y_pred
## [1] 1
## Levels: -1 1
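
The same prediction can also be obtained with the knn() function from the class package loaded at the top of this script; a brief sketch using train_data, x_test, and k as defined above:

# knn() takes the training predictors, the test case(s), the training labels, and k
knn(train = select(train_data, -y),
    test  = matrix(x_test, nrow = 1),
    cl    = train_data$y,
    k     = k)
# with k = 5 this should agree with the manual vote above (class 1)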

Vote

KNN Toy examples

Gaussian point clouds

Training error rate for point clouds: 0.1025.

separable clouds

Training error rate for separable point clouds: 0.

Skewed clouds

Training error rate for skewed point clouds: 0.0075.

Heteroscedastic clouds

Training error rate for heteroscedastic point clouds: 0.0675.

GMM

Training error rate for GMM: 0.185.

Boston Cream

Training error rate for Boston Cream: 0.005.