Classification

STOR 390

# package to sample from the multivariate gaussian distribution
library(mvtnorm)

# calculate distances between points in a data frame
library(flexclust)

# for knn
library(class)

library(tidyverse)

# some helper functions I wrote for this script
# you can find this file in the same folder as the .Rmd document
source('helper_functions.R')

Linear regression

Classification

Classification examples

Build a map

Code to get started with

Math prerequisites

Notation

Some toy examples

Gaussian point clouds

separable point clouds

skewed point clouds

heteroscedastic point clouds

Gaussian mixture model point clouds

Boston Cream

Nearest Centroid Classifier

advantages of NC

disadvantages

Sufficiency

Train the NC classifier

compute the class means

compute distance to test point

Given a new test point \(\mathbf{\tilde{x}}\), compute the distance between \(\mathbf{\tilde{x}}\) and each class mean.

classify test point

test point

# test point
x_test <- c(1, 1)

compute the class means

# compute the observed class means
obs_means <- data_gauss %>% 
    group_by(y) %>% 
    summarise_all(mean)

obs_means
## # A tibble: 2 × 3
##        y        x1          x2
##   <fctr>     <dbl>       <dbl>
## 1     -1 -1.151583  0.12338380
## 2      1  1.147330 -0.09823948

training class means

compute distance to each class

# grab each class mean
mean_pos <- select(filter(obs_means, y==1), -y)
mean_neg <- select(filter(obs_means, y==-1), -y)

# compute the euclidean distance from the class mean to the test point
dist_pos <- sqrt(sum((x_test - mean_pos)^2))
dist_neg <- sqrt(sum((x_test - mean_neg)^2))
dist_pos
## [1] 1.108078
dist_neg
## [1] 2.323309

Distance to class means

test points

Make a test grid

# make a grid of test points
test_grid <- expand.grid(x1 = seq(-4, 4, length = 100),
                         x2 = seq(-4, 4, length = 100)) %>% 
            as_tibble()
test_grid
## # A tibble: 10,000 × 2
##           x1    x2
##        <dbl> <dbl>
## 1  -4.000000    -4
## 2  -3.919192    -4
## 3  -3.838384    -4
## 4  -3.757576    -4
## 5  -3.676768    -4
## 6  -3.595960    -4
## 7  -3.515152    -4
## 8  -3.434343    -4
## 9  -3.353535    -4
## 10 -3.272727    -4
## # ... with 9,990 more rows

NC predictions

# compute the distance from each test point to the two class means
# note the use of the apply function (we could have used a for loop)
dist_pos <- apply(test_grid, 1, function(x) sqrt(sum((x - mean_pos)^2)))
dist_neg <- apply(test_grid, 1, function(x) sqrt(sum((x - mean_neg)^2)))

NC predictions

# add distance columns to the test grid data frame
test_grid <- test_grid %>% 
    add_column(dist_pos = dist_pos,
               dist_neg = dist_neg)

# decide which class mean each test point is closest to
test_grid <- test_grid %>% 
             mutate(y_pred = ifelse(dist_pos < dist_neg, 1, -1)) %>% 
             mutate(y_pred=factor(y_pred))
test_grid
## # A tibble: 10,000 × 5
##           x1    x2 dist_pos dist_neg y_pred
##        <dbl> <dbl>    <dbl>    <dbl> <fctr>
## 1  -4.000000    -4 6.459005 5.011564     -1
## 2  -3.919192    -4 6.394793 4.966080     -1
## 3  -3.838384    -4 6.330962 4.921503     -1
## 4  -3.757576    -4 6.267522 4.877857     -1
## 5  -3.676768    -4 6.204487 4.835168     -1
## 6  -3.595960    -4 6.141867 4.793461     -1
## 7  -3.515152    -4 6.079677 4.752762     -1
## 8  -3.434343    -4 6.017929 4.713098     -1
## 9  -3.353535    -4 5.956637 4.674493     -1
## 10 -3.272727    -4 5.895816 4.636976     -1
## # ... with 9,990 more rows

NC predictions

NC is a linear classifier

higher dimensions

normal vector, intercept

A hyperplane is given by a normal vector \(\mathbf{w} \in \mathbb{R}^d\) and an intercept \(b \in \mathbb{R}\).

The hyperplane consists of all points \(\mathbf{x}\) in \(\mathbb{R}^d\) that satisfy \(\mathbf{x}^T\mathbf{w} + b = 0\), i.e. \[H = \{\mathbf{x} \in \mathbb{R}^d | \mathbf{x}^T\mathbf{w} + b = 0\}\]
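
For intuition, here is a tiny sketch in two dimensions (the particular \(\mathbf{w}\), \(b\), and test point below are made up purely for illustration):

# an arbitrary hyperplane (a line) in R^2, for illustration only
w <- c(1, -2)  # normal vector
b <- 0.5       # intercept

# a point x lies on the hyperplane exactly when x^T w + b = 0;
# otherwise the sign of x^T w + b says which side of H the point is on
x <- c(2, 1)
sum(x * w) + b # 0.5 here, so x is not on H (it lies on the positive side)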

Normal vector, separating hyperplane

Mean difference

Mean difference

The normal vector for NC is given by the difference of the two class means \[\mathbf{w} = \mathbf{m}_{+} - \mathbf{m}_{-}\] and the NC intercept is given by \[b = - \frac{1}{2}\left(||\mathbf{m}_{+}||_2^2 - ||\mathbf{m}_{-}||_2^2 \right)\]

Linear classifiers

Given a new test point \(\mathbf{\tilde{x}}\):

  1. Compute the discriminant \(f = \mathbf{w}^T \mathbf{\tilde{x}} + b\).
  2. Compute the sign of the discriminant \(\tilde{y}=\text{sign}(f)\).

Classify \(\mathbf{\tilde{x}}\) to the positive class if \(\mathbf{w}^T \mathbf{\tilde{x}} + b > 0\), and to the negative class otherwise.
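
As a quick sketch of these two steps for NC, reusing mean_pos, mean_neg, and x_test from the code above (and assuming the \(\mathbf{w}\) and \(b\) formulas just given):

# class means as plain numeric vectors
m_pos <- unlist(mean_pos)
m_neg <- unlist(mean_neg)

# NC normal vector and intercept
w <- m_pos - m_neg
b <- -(1/2) * (sum(m_pos^2) - sum(m_neg^2))

# discriminant and its sign for the test point
f <- sum(w * x_test) + b
sign(f) # 1, the same prediction the distance comparison gave above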

Toy examples

Gaussian point clouds

Training error rate for point clouds: 0.1125.
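
The code that computed this error rate isn't shown in this excerpt; a rough sketch of how the NC training error could be computed, reusing data_gauss, mean_pos, and mean_neg from above:

# distance from each training point to each observed class mean
d_pos <- apply(select(data_gauss, -y), 1, function(x) sqrt(sum((x - mean_pos)^2)))
d_neg <- apply(select(data_gauss, -y), 1, function(x) sqrt(sum((x - mean_neg)^2)))

# predict the closer class for every training point
y_pred_train <- ifelse(d_pos < d_neg, 1, -1)

# training error rate: proportion of misclassified training points
mean(y_pred_train != data_gauss$y)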

separable point clouds

Training error rate for separable point clouds: 0.

skewed point clouds

Training error rate for skewed point clouds: 0.065.

heteroscedastic point clouds

Training error rate for heteroscedastic point clouds: 0.15.

GMM

Training error rate for GMM: 0.41.

Boston Cream

Training error rate for Boston cream: 0.475.

K-nearest-neighbors

Differences between KNN and NC

  1. KNN is not a linear classifier
  2. KNN has a tuning parameter (k) that needs to be set by the user

No free lunch

More flexibility means lower bias but higher variance.

Bias-variance tradeoff!

Computing KNN (math)

  1. Find distance from test point to each training point
  2. Sort these distances
  3. K nearest neighbors vote on test point’s label

Compute distances

For a new test point \(\tilde{\mathbf{x}}\), first compute the distance between \(\tilde{\mathbf{x}}\) and each training point, i.e. \[d_i = ||\tilde{\mathbf{x}} - \mathbf{x}_i||_2 \text{ for } i = 1, \dots, n\]

Sort points

Next sort these distances and find the \(k\) smallest distances (i.e. let \(d_{i_1}, \dots, d_{i_k}\) be the \(k\) smallest distances).

Vote

Now look at the corresponding labels for these \(k\) closest points \(y_{i_1}, \dots, y_{i_k}\) and have these labels vote (if there is a tie break it randomly). Assign the predicted \(\tilde{y}\) to the winner of this vote.

Computing KNN (code)

k <- 5 # number of neighbors to use
x_test <- c(0, 1) # test point

Compute distances

# grab the training xs and compute the distances to the test point
distances <- train_data %>%
         select(-y) %>%
        dist2(x_test) %>% # compute distances
        c() # flatten the distance matrix into a plain vector

# print first 5 entries
distances[1:5]
## [1] 1.17677223 3.17954493 0.09772573 2.28397117 1.87472240

Sort

# add a new column to the data frame and sort
train_data_sorted <- train_data %>% 
        add_column(dist2tst = distances) %>% # add the distances to the test point as a new column
        arrange(dist2tst) # sort data points by the distance to the test point
train_data_sorted
## # A tibble: 400 × 4
##             x1        x2      y   dist2tst
##          <dbl>     <dbl> <fctr>      <dbl>
## 1   0.06540233 0.9702020     -1 0.07187061
## 2  -0.02383795 1.0947738      1 0.09772573
## 3  -0.14322246 1.1610164     -1 0.21549703
## 4   0.07268305 1.2348212      1 0.24581256
## 5   0.09913570 0.7579408      1 0.26157322
## 6   0.12200396 0.7609176      1 0.26841264
## 7   0.27122239 0.8741936      1 0.29897967
## 8  -0.25188510 0.7747962      1 0.33787996
## 9  -0.30499744 0.8273767      1 0.35046004
## 10  0.35008260 1.1037569     -1 0.36513465
## # ... with 390 more rows

Find K nearest neighbors

# select the k closest training points
nearest_neighbors <- slice(train_data_sorted, 1:k) # data are sorted so this picks the top k rows
nearest_neighbors
## # A tibble: 5 × 4
##            x1        x2      y   dist2tst
##         <dbl>     <dbl> <fctr>      <dbl>
## 1  0.06540233 0.9702020     -1 0.07187061
## 2 -0.02383795 1.0947738      1 0.09772573
## 3 -0.14322246 1.1610164     -1 0.21549703
## 4  0.07268305 1.2348212      1 0.24581256
## 5  0.09913570 0.7579408      1 0.26157322

Find nearest neighbors

Vote

# count the number of nearest neighbors in each class
votes <- nearest_neighbors %>% 
         group_by(y) %>% 
         summarise(votes=n())

votes
## # A tibble: 2 × 2
##        y votes
##   <fctr> <int>
## 1     -1     2
## 2      1     3
 # the [1] is in case of a tie -- then just pick the first class that appears
y_pred <- filter(votes, votes == max(votes))$y[1]
y_pred
## [1] 1
## Levels: -1 1
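
The same prediction can also be obtained with the knn() function from the class package loaded at the top of this script; a brief sketch using train_data, x_test, and k as defined above:

# knn() takes the training predictors, the test case(s), the training labels, and k
knn(train = select(train_data, -y),
    test  = matrix(x_test, nrow = 1),
    cl    = train_data$y,
    k     = k)
# with k = 5 this should agree with the manual vote above (class 1)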

Vote

KNN Toy examples

Gaussian point clouds

Training error rate for point clouds: 0.1025.

separable clouds

Training error rate for separable point clouds: 0.

Skewed clouds

Training error rate for skewed point clouds: 0.0075.

Heteroscedastic clouds

Training error rate for heteroscedastic point clouds: 0.0675.

GMM

Training error rate for GMM: 0.185.

Boston Cream

Training error rate for Boston Cream: 0.005.