This tutorial will give you practice with

ISLR is a good reference for all of these topics (Sections 4.3, 5.1, 6.2, 9.1-9.3, 10.2).

Prerequisites: this tutorial assumes you are familiar with the above models (or are willing to learn about them). The questions are fairly independent so you can skip parts you don’t like/understand. This tutorial also assumes you are familiar with R.

Hints: I suggest writing all of the code in an R script then transferring the answers to a .Rmd file.

The data

The data include 651 randomly selected movies scraped from the IMDb and Rotten Tomatoes websites. The data were generously provided by Mine Cetinkaya-Rundel and you can find the original data set on her website.

# if you don't have these packages then install them
# install.packages('tidyverse')
# install.packages('GGally')
library(tidyverse)
library(GGally)

movies <- read_csv('')

# fix a missing value
movies$runtime[movies$title == 'The End of America'] <- 73

Take a first look at the data

head(movies)

## # A tibble: 6 × 32
##                  title   title_type       genre runtime mpaa_rating
##                  <chr>        <chr>       <chr>   <dbl>       <chr>
## 1          Filly Brown Feature Film       Drama      80           R
## 2             The Dish Feature Film       Drama     101       PG-13
## 3  Waiting for Guffman Feature Film      Comedy      84           R
## 4 The Age of Innocence Feature Film       Drama     139          PG
## 5          Malevolence Feature Film      Horror      90           R
## 6          Old Partner  Documentary Documentary      78     Unrated
## # ... with 27 more variables: studio <chr>, thtr_rel_year <dbl>,
## #   thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <chr>, critics_score <dbl>,
## #   audience_rating <chr>, audience_score <dbl>, best_pic_nom <chr>,
## #   best_pic_win <chr>, best_actor_win <chr>, best_actress_win <chr>,
## #   best_dir_win <chr>, top200_box <chr>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## #   imdb_url <chr>, rt_url <chr>

Remove some columns to make life a little easier

movies <- movies %>%
    select(title, runtime, genre, mpaa_rating, thtr_rel_year,
           imdb_rating, imdb_num_votes, critics_score, audience_score,
           best_pic_win, best_actor_win)


The GGally::ggpairs() function gives nice two-dimensional slice visualizations.

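# the column list is truncated in the source; runtime and imdb_rating are a guess
movies %>%
    select(runtime, imdb_rating,
           critics_score, audience_score) %>%
    ggpairs()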

Play around with the ggpairs function to explore pairwise relationships between some of the variables.

Linear regression

Use ggplot to plot the linear regression line for imdb_rating ~ critics_score.
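
One way to approach this is geom_smooth with method = "lm", which overlays the least-squares line on a scatter plot. The sketch below uses a small synthetic data frame (called toy, a made-up stand-in with invented numbers) so that it runs on its own; with the real data you would pass movies instead.

```r
library(ggplot2)

# Synthetic stand-in for the movies data (assumption: made-up numbers, not the real file)
set.seed(1)
toy <- data.frame(critics_score = runif(100, 0, 100))
toy$imdb_rating <- 4 + 0.03 * toy$critics_score + rnorm(100, sd = 0.5)

# Scatter plot with the fitted least-squares line drawn on top
p <- ggplot(toy, aes(x = critics_score, y = imdb_rating)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p)
```

Setting se = TRUE would add a confidence band around the line.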


Use the lm function to fit the following linear regression: imdb_rating ~ critics_score + audience_score + imdb_num_votes
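
A minimal sketch of the lm call, again on synthetic data so that it is self-contained. The data frame toy and its coefficients are invented for illustration; only the formula comes from the exercise.

```r
# Synthetic stand-in with the three predictors named in the exercise (made-up values)
set.seed(2)
n <- 200
toy <- data.frame(
    critics_score  = runif(n, 0, 100),
    audience_score = runif(n, 0, 100),
    imdb_num_votes = rpois(n, 5000)
)
toy$imdb_rating <- 3 + 0.02 * toy$critics_score +
    0.03 * toy$audience_score + rnorm(n, sd = 0.4)

# Fit the multiple linear regression
fit <- lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes, data = toy)

summary(fit)  # coefficient table, standard errors, R-squared
coef(fit)     # just the estimated coefficients
```

On the real data, replace toy with movies; summary(fit) is usually the first thing to inspect.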

Prediction task: what’s the best model to predict whether or not the best actor will win an Oscar?

The point of this section is to fit several models and compare them on a test set. Specifically we are going to fit the following models

First, turn the categorical variables into dummy variables. Hint: model.matrix (the base R counterpart of pandas' pd.get_dummies). Your data frame should only have numbers in it now.
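
As a rough illustration of the dummy-variable step, model.matrix expands every character or factor column into 0/1 indicator columns. The four-row data frame below is a hand-made stand-in, and the "yes"/"no" coding of best_actor_win is an assumption about the real data.

```r
# Tiny hand-made stand-in for a few columns of the movies data
toy <- data.frame(
    runtime        = c(80, 101, 84, 139),
    genre          = c("Drama", "Drama", "Comedy", "Horror"),
    mpaa_rating    = c("R", "PG-13", "R", "PG"),
    best_actor_win = c("no", "yes", "no", "no")
)

# model.matrix turns each categorical column into indicator columns;
# [, -1] drops the intercept column that it adds by default
X <- model.matrix(~ ., data = toy)[, -1]
dummies <- as.data.frame(X)
str(dummies)
```

Note that model.matrix drops the first level of each factor as a reference level, so a three-level variable such as genre here becomes two columns rather than three.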


Split the data into a training and test set (80% train, 20% test). Since the classes are fairly unbalanced you should use stratified sampling. Make two new data frames (called train and test). Fit all of the following models with the training data.
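
One possible way to do the stratified split in base R is to sample 80% of the row indices within each class separately. The data frame toy and its 30/170 class split are made up for illustration.

```r
# Synthetic unbalanced data: 30 "yes" and 170 "no" (made-up numbers)
set.seed(3)
toy <- data.frame(
    x = rnorm(200),
    best_actor_win = rep(c("yes", "no"), times = c(30, 170))
)

# Within each class, draw 80% of that class's row indices
train_idx <- unlist(lapply(
    split(seq_len(nrow(toy)), toy$best_actor_win),
    function(idx) idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Both sets keep roughly the same class proportions
table(train$best_actor_win)
table(test$best_actor_win)
```

caret::createDataPartition(y, p = 0.8, list = FALSE) performs the same kind of class-stratified sampling in a single call, if you prefer not to write it by hand.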


Train models

Each of the models listed above comes with one or more tuning parameters. First, hand-code a double for loop that performs cross-validation for L2-regularized logistic regression.
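
The double loop could look roughly like the sketch below: the outer loop runs over candidate penalty values and the inner loop over folds. It assumes the glmnet package for the L2-penalized fit (alpha = 0) and uses synthetic data; the lambda grid, the number of folds and all variable names are illustrative choices, not prescribed by the tutorial.

```r
library(glmnet)

# Synthetic binary classification data (made-up, for illustration only)
set.seed(4)
n <- 300
X <- matrix(rnorm(n * 5), n, 5)
y <- rbinom(n, 1, plogis(X[, 1] - 0.5 * X[, 2]))

lambdas <- 10^seq(-3, 1, length.out = 10)        # candidate penalty strengths
K <- 5                                           # number of folds
fold <- sample(rep(seq_len(K), length.out = n))  # random fold label per row

cv_err <- matrix(NA_real_, nrow = length(lambdas), ncol = K)
for (i in seq_along(lambdas)) {
    for (k in seq_len(K)) {
        held_out <- fold == k
        # alpha = 0 selects the ridge (L2) penalty
        fit <- glmnet(X[!held_out, ], y[!held_out], family = "binomial",
                      alpha = 0, lambda = lambdas[i])
        p_hat <- as.vector(predict(fit, newx = X[held_out, ], type = "response"))
        # misclassification rate on the held-out fold
        cv_err[i, k] <- mean((p_hat > 0.5) != y[held_out])
    }
}

mean_err <- rowMeans(cv_err)   # average error per lambda across folds
best_lambda <- lambdas[which.min(mean_err)]
best_lambda
```

The folds here are not stratified by class; with unbalanced classes you may want to assign fold labels within each class. cv.glmnet can serve as a check on the hand-coded result.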


For the rest of the models use the caret package to tune each model with cross-validation. Here is a list of models supported by caret
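
For orientation, a caret tuning call generally follows the pattern sketched here. The k-nearest-neighbours method is used only because it ships with caret and has a single tuning parameter; it is not necessarily one of the models this tutorial asks for, and the data are synthetic.

```r
library(caret)

# Synthetic two-class data (made-up, for illustration only)
set.seed(5)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$best_actor_win <- factor(ifelse(toy$x1 + rnorm(n) > 1, "yes", "no"))

# 5-fold cross-validation over a small grid of k values
ctrl <- trainControl(method = "cv", number = 5)
knn_fit <- train(best_actor_win ~ ., data = toy, method = "knn",
                 trControl = ctrl,
                 tuneGrid = data.frame(k = c(3, 5, 7, 9)))

knn_fit$bestTune   # tuning parameter value with the best CV accuracy
knn_fit$results    # CV performance for every value in the grid
```

Swapping in another model mostly means changing method and the columns of tuneGrid to match that model's tuning parameters.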


Test set

Compute the test set error for each classifier.
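
The test set error is simply the fraction of test observations that a classifier labels incorrectly. A minimal sketch with plain logistic regression on synthetic data (the data, the 200/50 split and the 0.5 threshold are all illustrative assumptions):

```r
# Synthetic data with a binary response (made-up, for illustration only)
set.seed(6)
n <- 250
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x1))

train <- toy[1:200, ]
test  <- toy[201:250, ]

# Fit on the training set only, then predict on the held-out test set
fit <- glm(y ~ x1 + x2, data = train, family = binomial)
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)

test_error <- mean(pred != test$y)   # misclassification rate
table(predicted = pred, actual = test$y)
test_error
```

Because the classes are unbalanced, it helps to compare each classifier's error with the error of always predicting the majority class.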