This assignment is due Saturday, 4/1/17 at 8:48 pm ET.

Recall the human activity recognition data set we discussed in class. You can find details about the data on the UCI repository. The point of this data set is to teach a smartphone to recognize what activity the user is doing based only on the accelerometer and gyroscope. This kind of system is actually deployed in the iPhone Health app when it tracks the number of miles you have walked/run each day.

This assignment will give you practice with cross-validation, K-nearest neighbors, the nearest centroid classifier, and support vector machines.

library(tidyverse)

# you might have to install some of these libraries
library(class) # KNN
library(kernlab) # kernel SVM
library(caret) # tuning
library(stringr)

train <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/human_activity_train.csv')

test <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/human_activity_test.csv')

# only consider walking upstairs vs downstairs
train <- train %>% 
    filter(activity == 2 | activity == 3)

test <- test %>% 
    filter(activity == 2 | activity == 3)

Hint: You might want to save these two data files on your computer since downloading them can be slow. Many of these questions can be answered by taking the lecture code and modifying it.

Final deliverable: Please write your solutions in R Markdown and separate each question with a new section heading. For the questions that ask you to write helper functions, put each helper function in its own chunk. Submit the .html file and the .Rmd file.

Subsampling

This homework involves a lot of computation, which means your code might take a while to run. That can make problem solving hard if you have to wait every time you want to run some code.

You might try writing the code using a smaller data set. When you get the code working you can then run it on the larger, original data set. The code below randomly samples 200 observations (out of the 1320 original observations).

# subsample the data
set.seed(8599)
train <- train[sample(nrow(train), size=200), ]

Use the subsampled data to get the code working (you can also subsample the test data; an example is below). Once the code is working, re-run it on the entire data set (which might take a while, but if your code works you should be able to just chill while it runs).
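For example, subsampling the test data works the same way (the size of 100 here is arbitrary):

# optional: subsample the test data too
test <- test[sample(nrow(test), size=100), ]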

If the knn() function really takes a long time on your computer (e.g. a minute for one call to knn()), then you can hand in the assignment using the subsampled data.

KNN cross-validation

Recall from the lecture notes that, using cross-validation, we found that K = 1 is the “best” value for KNN on the Human Activity Recognition dataset. But is this truly the best value of K?

Let’s call the plot of error vs. K the tuning error curve (e.g. the plot in this section from the notes).

# use these k values for KNN
k_values <- seq(from=1, to=41, by=2)

Q1: KNN Test set error

The mean CV error is meant to be a proxy for the test error. This question explores how well this approximation holds for the HAR data.

1a. Compute the test set error (using the independent test data) for each value of K (the number of neighbors for KNN). Use the knn() function from the class package (a sketch is given after 1d below).

1b. Plot the test set error together with the cross-validation tuning error curve from the lecture (i.e. the mean cross-validation error rate for 10-fold cross-validation).

1c. Which value of K is actually best according to the test error, and which according to CV?

1d. Describe how well the cross-validation error approximates the true test set error.
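Here is a minimal sketch for 1a and 1b, assuming the data were loaded and filtered as above; the mean CV error curve from the lecture code is not reproduced here, so you would add it as a second layer in the same way:

# test set error for each value of K
test_error <- sapply(k_values, function(k){
    pred <- knn(train=select(train, -activity),
                test=select(test, -activity),
                cl=train$activity,
                k=k)
    mean(pred != test$activity)
})

# plot the test error curve; add your mean CV error curve as a second geom_line
ggplot(tibble(k=k_values, error=test_error)) +
    geom_line(aes(x=k, y=error))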

Q2: What happens when we change the number of folds

The CV error curves can change dramatically depending on the number of folds used. If you are doing M-fold CV then each fold should contain (approximately) 1/M of the training points; for example, with all 1320 training observations and M = 10 folds, each fold has about 132 points.

This question explores how the CV curve changes as the number of folds changes. Question 2 is an extension of question 1.

2a. Compute the training error for each value of K (= number of neighbors for KNN).

2b. Add the training error to the tuning error curves you made in question 1b (there should now be three tuning error curves).

2c. Write a function that makes this tuning error plot with the three error curves (see below for the function skeleton). This question is asking you to take the code you wrote for questions 1a, 1b, 2a, and 2b and turn it into a function so you can look at how these plots change as the number of CV folds changes (one possible filled-in version is sketched after the hint below).

2d. Make the tuning error plots for 5, 10, and 20 folds (try 50 if it doesn’t take too long to run).

knn_tuning_error_plot <- function(train, test, k_cv, k_values, cv_seed=NA){
    # Returns the tuning error plot for KNN with the three tuning error curves:
    # train, CV, and test error
    # train and test: the train and test data
    #     both are data frames with the same column names
    #     one column is named y and contains the class labels
    # k_cv: the number of cross-validation folds
    # k_values: the sequence of K values to try for KNN
    # cv_seed: the seed for the cross-validation folds
    # returns a ggplot object
    
    # set the seed if it is given
    if(!is.na(cv_seed)){
        set.seed(cv_seed)
    }
    
    
    # p <- ggplot() + ...
    # return(p)
}

Hint: 2c should be a matter of putting the code you have thus far into a function.
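Here is one possible (unofficial) way to fill in the skeleton, assuming the class label column is named y as the comments describe; with the HAR data you would first rename, e.g. train <- rename(train, y=activity). Your own version built from 1a, 1b, 2a, and 2b may look different:

knn_tuning_error_plot <- function(train, test, k_cv, k_values, cv_seed=NA){
    
    # set the seed if it is given
    if(!is.na(cv_seed)){
        set.seed(cv_seed)
    }
    
    # randomly assign each training observation to one of k_cv folds
    folds <- sample(rep(1:k_cv, length.out=nrow(train)))
    
    errors <- map_df(k_values, function(k){
        
        # mean CV error: hold out each fold in turn
        cv_error <- sapply(1:k_cv, function(m){
            cv_tr <- train[folds != m, ]
            cv_te <- train[folds == m, ]
            pred <- knn(train=select(cv_tr, -y), test=select(cv_te, -y),
                        cl=cv_tr$y, k=k)
            mean(pred != cv_te$y)
        }) %>% mean()
        
        # training and test error for this value of K
        tr_pred <- knn(train=select(train, -y), test=select(train, -y),
                       cl=train$y, k=k)
        te_pred <- knn(train=select(train, -y), test=select(test, -y),
                       cl=train$y, k=k)
        
        tibble(k=k, cv=cv_error,
               train=mean(tr_pred != train$y),
               test=mean(te_pred != test$y))
    })
    
    # one curve per error type
    p <- errors %>%
        gather(key='type', value='error', cv, train, test) %>%
        ggplot() +
        geom_line(aes(x=k, y=error, color=type))
    
    return(p)
}

# example call (after renaming activity to y in both data sets)
# knn_tuning_error_plot(train, test, k_cv=10, k_values=k_values, cv_seed=345)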

Q3: Nearest Centroid

Recall that the nearest centroid classifier is a simple linear classifier.

3a. Write a function that implements the nearest centroid classifier. The function should take the training and test data as input and return the predictions on the test data set.

train_x <- train %>% select(-activity)
train_y <- train$activity


test_x <- test %>% select(-activity)
test_y <- test$activity

nearest_centroid <- function(train_x, train_y, test_x){
    # returns the predictions for nearest centroid on a test set
    # train_x and test_x: the train/test x data
    #     assume these are both numeric matrices with the same number of columns
    # train_y: a vector of class labels for the training data
    # returns a vector of predicted class labels for the test data
    
    
    # find number of test points
    n_test <- 
    
    # find unique class labels (hint: the unique() function)
    # you will designate one class the "positive" class and the
    # other class the "negative" class
    class_labels <- 
    
    # separate x training data by classes
    train_x_pos <- 
    train_x_neg <-
    
    # compute each class mean
    mean_pos <-
    mean_neg <-
    
    
    # you can EITHER use a for loop or the apply() function to compute the predictions
    # WARNING: on my computer a for loop took 45 seconds while apply() took .01 seconds
    ######## compute predictions with a for loop #########
    # initialize the test prediction vector
    test_y_pred <-
    
    # get the prediction for each test point
    for(i in 1:n_test){
        
        # grab one test point
        x <- 
        
        # compute the euclidean distance between the test point and each class mean
        dist_pos <-
        dist_neg <- 
        
        # decide which mean the test point is closer to and classify the test point to that class
        # you can either use an if/else statement or the ifelse() function
        test_y_pred[i] <-
    }
    
    # if you don't want to use apply() ignore this code
    ########### compute predictions with the apply function #############
    
    # use apply to compute the distance from each test point to the positive class mean
    # dist_pos <-
    
    # similarly for the negative class mean
    # dist_neg <- 
    
    # use the ifelse() function to compute the test predictions
    # test_y_pred <-
    
    ##############################################################################
        
    return(test_y_pred)
}


nearest_centroid(train_x, train_y, test_x)

You can write this function many different ways – if you write it a different way than the skeleton above, that’s fine (as long as it works!).

You will get 3 bonus points on this assignment if you use the apply function. For example code using the apply() function see this section of the notes and/or read this tutorial.
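For instance, here is one hedged sketch of the apply() route (this is not the official solution; it assumes exactly two classes and that the rows of test_x are the test points):

nearest_centroid <- function(train_x, train_y, test_x){
    # find the two class labels; call the first one "positive"
    class_labels <- unique(train_y)
    
    # class means (centroids)
    mean_pos <- colMeans(train_x[train_y == class_labels[1], ])
    mean_neg <- colMeans(train_x[train_y == class_labels[2], ])
    
    # euclidean distance from each test point to each centroid
    dist_pos <- apply(test_x, 1, function(x) sqrt(sum((x - mean_pos)^2)))
    dist_neg <- apply(test_x, 1, function(x) sqrt(sum((x - mean_neg)^2)))
    
    # classify each test point to the closer centroid
    ifelse(dist_pos < dist_neg, class_labels[1], class_labels[2])
}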

3b. Using the nearest_centroid function compute the training and test set error for the Human Activity Recognition dataset.
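A quick way to get both errors from your function, assuming the train_x/test_x objects defined above:

# training error: predict on the training x data
train_pred <- nearest_centroid(train_x, train_y, train_x)
mean(train_pred != train_y)

# test error
test_pred <- nearest_centroid(train_x, train_y, test_x)
mean(test_pred != test_y)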

Support Vector Machine

This section examines how the “optimal” parameter values can change depending on how you do cross-validation, and it also compares linear SVM to radial SVM.

Let’s see how SVM does on the human activity recognition data: try linear SVM and kernel SVM with a radial kernel. Use the caret package to fit the SVM models for this section.

Hint: The code should be pretty similar to the lecture code; you mostly have to play around with the arguments to train().

Q4: Linear SVM

Use method = "svmLinear" for the caret package to get linear SVM from the kernlab package.

C_values <- tibble(C=10^seq(from=-5, to=5, by=1))

Use cross-validation to select the optimal value of \(C\) for linear SVM. Using the SVM model with this optimal \(C\) value, compute both the training and test error.

  • Try two different numbers of folds (e.g. 5 and 10).
  • Try two different classification metrics (e.g. Accuracy and Kappa).

You now have possibly 4 different models (some of these procedures might pick the same \(C\) value).

Warning: You need to set train_y to be a factor for this function, i.e. y=factor(train_y).
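A minimal sketch of one of the four fits, assuming 10-fold CV and the Accuracy metric (swap the number and metric arguments for the other combinations):

svm_linear <- train(x=train_x,
                    y=factor(train_y),
                    method='svmLinear',
                    trControl=trainControl(method='cv', number=10),
                    metric='Accuracy',
                    tuneGrid=C_values)

svm_linear$bestTune  # the selected C

# training and test error for the selected model
mean(predict(svm_linear, newdata=train_x) != factor(train_y))
mean(predict(svm_linear, newdata=test_x) != factor(test_y))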

Q5: Radial Kernel SVM

Use method = "svmRadial" for the caret package to get radial kernel SVM from the kernlab package (it’s called ksvm() in the kernlab package). Type ?ksvm to see the documentation for this function.

# sigma values to use
sigma_values <- 10^seq(from=-5, to=5, by=1)

# hint: you need to use expand.grid to make a grid of sigma and C values.

Use cross-validation to find the “optimal” values of \(C\) and \(\sigma\) for SVM with a radial kernel. Report both the training error and the test set error for the model you chose.

Hint: See the notes for an example of how to use caret to fit SVM with a polynomial kernel.
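One possible sketch, assuming 10-fold CV; note that caret’s svmRadial method expects the grid columns to be named sigma and C:

tune_grid <- expand.grid(sigma=sigma_values,
                         C=10^seq(from=-5, to=5, by=1))

svm_radial <- train(x=train_x,
                    y=factor(train_y),
                    method='svmRadial',
                    trControl=trainControl(method='cv', number=10),
                    tuneGrid=tune_grid)

svm_radial$bestTune  # the selected sigma and C

# training and test error for the selected model
mean(predict(svm_radial, newdata=train_x) != factor(train_y))
mean(predict(svm_radial, newdata=test_x) != factor(test_y))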

For questions 4 and 5 you need to make the tuning parameter grid. For linear SVM there is one parameter, C. For the radial basis kernel there are two parameters: the original C for SVM and another parameter called sigma for the kernel. See this post for some more example code using the caret package. Chapter 7 of the caret book has details about the tuning parameter names.