These notes cover cross-validation. The primary reference is ISLR section 5.1.
For predictive modeling we collect a bunch of labeled data, build a model using that data, then deploy the model on some future data. We primarily care about how the model performs on this future data. For example, to build a spam filter you might collect a bunch of emails, manually label them as spam/not spam, train a classifier on this data, then deploy the classifier on your gmail account. The upshot is that when you are building your model you don't have access to the data you really care about, i.e. the future test data.
For this lecture I am assuming you are familiar with
The beginning of this lecture introduces some stats/programming concepts. The rest of the lecture focuses on selecting k for k-nearest-neighbors: first with a validation set, then with cross-validation. Finally, we discuss using KNN to automatically recognize human activities from data collected by an iPhone.
A common assumption we make in statistics is that our data are independent and identically distributed (iid) random variables. I will assume the reader is somewhat familiar with what iid means. This assumption is rarely perfectly true (e.g. because of sampling biases), but in many cases it is true enough. The iid assumption breaks down in a big way for data such as time series. The upshot is that we are usually comfortable with the iid assumption.
The iid assumption is what allows us to feel comfortable that we can gather some data and build a model that will then work well on some future, unseen data. A lot of the time when things go wrong with statistical models, it is because something has strongly violated the iid assumption (e.g. because of the way you gathered your training data, your test data looks different from the training data).
For real data we rarely know the "true" random distribution that generated the data (is there even such a thing?). For the purpose of studying statistical models it can be useful to generate synthetic data from a known distribution and see what happens. For example, to study classical hypothesis testing we might generate data from two normal distributions with different means and see how far apart the empirical means are.
For classification we might generate two classes of data from Gaussian distributions with different means then see how well a classifier is able to do on this synthetic data.
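As a concrete illustration, here is a minimal sketch (base R, separate from the helper scripts used later in this lecture) that simulates two classes from 2-d Gaussian distributions with different means and checks the empirical class means:

# minimal sketch: simulate two classes from 2-d Gaussians with different means
n <- 100
class_pos <- cbind(x1 = rnorm(n, mean = 1), x2 = rnorm(n, mean = 0))
class_neg <- cbind(x1 = rnorm(n, mean = -1), x2 = rnorm(n, mean = 0))

# the empirical means should be close to (1, 0) and (-1, 0)
colMeans(class_pos)
colMeans(class_neg)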
In the first part of this lecture we will generate random data from some known distribution. We first generate training data which our model is allowed to see during training. In this lecture we will use this training data set to select the k for k-nearest-neighbors. This will involve breaking the training set up into smaller data sets (discussed below).
We also generate some test data which our model is not allowed to know about during training. The test data set is the data we will really care about.
We are using the computer to generate random data (ok, it is really pseudorandom, but we will pretend it's truly random). This means the code in this lecture is not deterministic, i.e. it will give you different numbers every time you run it. This is not good for teaching.
We can use the set.seed function to set the random seed. All this means is that the computer will now generate the same random numbers every time you run the code. For example,
# sample 5 numbers from 1-100000
sample(1:100000, 5)
## [1] 63475 80030 40023 11953 10250
sample(1:100000, 5)
## [1] 52256 90819 21343 59640 17814
set.seed(3443)
sample(1:100000, 5)
## [1] 2218 97756 27628 79129 70252
set.seed(3443)
sample(1:100000, 5)
## [1] 2218 97756 27628 79129 70252
Recall that in the previous lecture we discussed k-nearest-neighbors (KNN).
# package to sample from the multivariate gaussian distribution
library(mvtnorm)
library(flexclust)
library(class)
library(tidyverse)
library(stringr)
# some helper functions I wrote for this script
# you can find this file in the same folder as the .Rmd document
source('knn_functions.R')
source('synthetic_distributions.R')
Notice the source function. I wrote some helper functions in separate R scripts – if you want to run the code in this lecture you'll need to download these scripts as well (see github). I wrote these functions for a couple of reasons; in particular, the get_knn_error_rates() function gets used a lot.

I am going to use the words train and test to describe several different data sets. The training data is the data we use to fit a model. For linear regression this is the data we use to find the \(\beta\) coefficients by minimizing the sum of squared residuals. The test data is the data we use to evaluate a model. For KNN the train data is the data that gets used to vote on the class label of a new data point (KNN doesn't really involve any training).
Most of this lecture involves using different training/test data sets to evaluate a model in different settings. This may be a little confusing at first, but you will get used to it.
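To make the fit-on-train / evaluate-on-test pattern concrete, here is a minimal, self-contained sketch using linear regression on made-up data (the data and variable names here are purely illustrative):

# sketch of the train/test pattern: fit on the training data, evaluate on the test data
sim_data <- function(n) {
    x <- rnorm(n)
    data.frame(x = x, y = 2 * x + rnorm(n))
}
train <- sim_data(100)
test <- sim_data(1000)

fit <- lm(y ~ x, data = train)  # training: find the beta coefficients

mean(residuals(fit)^2)                 # training error (mean squared error)
mean((test$y - predict(fit, test))^2)  # test error (mean squared error)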
Let's generate some synthetic training and test data. For the purpose of this lecture what the true distribution is doesn't really matter, but you can see the details in the synthetic_distributions.R script. I encourage you to play around with the synthetic distribution and re-run the code in this lecture (e.g. change the parameters, try different distributions).
# the mixture means should be the same for both training and test sets
mean_seed <- 238
# draw train and test data
data <- gmm_distribution2d(n_neg=200, n_pos=201, mean_seed=mean_seed, data_seed=1232)
test_data <- gmm_distribution2d(n_neg=1000, n_pos=1000, mean_seed=mean_seed, data_seed=52345)
You could uncomment this code to get a different synthetic distribution and see what happens to the figures below.
# data <- two_class_guasssian_meatballs(n_pos=200, n_neg=200,
# mu_pos=c(1,0), mu_neg=c(-1,0),
# sigma_pos=diag(2), sigma_neg=diag(2),
# seed=100)
#
# test_data <- two_class_guasssian_meatballs(n_pos=1000, n_neg=1000,
# mu_pos=c(1,0), mu_neg=c(-1,0),
# sigma_pos=diag(2), sigma_neg=diag(2),
# seed=3240)
The training data are shown below.
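A sketch of how this plot could be made is given below; it assumes the data frame returned by gmm_distribution2d() has columns x1, x2 and a class label y, so adjust the names to match the helper function's actual output.

# sketch of plotting the training data, colored by class
# (column names x1, x2, y are assumptions about the helper function's output)
ggplot(data, aes(x = x1, y = x2, color = factor(y))) +
    geom_point()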
Now let’s fit KNN with k = 5 (just like the previous lecture).
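A sketch of how this could be done directly with class::knn is shown below (presumably similar in spirit to what the get_knn_error_rates() helper does); the column names x1, x2, y are again assumptions.

# sketch: fit KNN with k = 5 and compute the training and test error rates
# (column names are assumptions about the synthetic data)
train_x <- data %>% select(x1, x2)
train_y <- factor(data$y)
test_x <- test_data %>% select(x1, x2)
test_y <- factor(test_data$y)

# class::knn makes predictions directly from the training data
train_pred <- knn(train = train_x, test = train_x, cl = train_y, k = 5)
test_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# error rate = proportion of misclassified points
mean(train_pred != train_y)
mean(test_pred != test_y)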
The training error rate is 0.0349127 and the test error rate is 0.0695. Notice the training error is better than the test error. It's almost always true that a statistical algorithm will perform better on the data it was trained on than on an independent test set (hence the problem of overfitting).
Now let’s look at the predictions resulting from KNN for different values of K. First we show what the predictions will be at every point in the plane (ok really every point in our test grid).
k_values <- c(1, 3, 5, 9, 17, 33, 65, 399, 401)
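A sketch of computing the grid predictions for each k is shown below; the grid resolution and the column names x1, x2, y are assumptions.

# sketch: KNN predictions over a grid of points, for each value of k
test_grid <- expand.grid(
    x1 = seq(min(data$x1), max(data$x1), length.out = 100),
    x2 = seq(min(data$x2), max(data$x2), length.out = 100)
)

grid_predictions <- lapply(k_values, function(k) {
    knn(train = data %>% select(x1, x2),
        test = test_grid,
        cl = factor(data$y),
        k = k)
})
names(grid_predictions) <- k_values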