This tutorial will give you practice with

ISLR is a good reference for all of these topics (sections 4.3, 5.1, 6.2, 9.1-9.3, 10.2).

Prerequisites: this tutorial assumes you are familiar with the above models (or are willing to learn about them). The questions are fairly independent so you can skip parts you don’t like/understand. This tutorial also assumes you are familiar with R.

Hints: I suggest writing all of the code in an R script, then transferring the answers to a .Rmd file.

The data

The data include 651 randomly selected movies scraped from the IMDb and Rotten Tomatoes websites. The data were generously provided by Mine Cetinkaya-Rundel and you can find the original data set on her website.

# if you don't have these packages then install them
# install.packages('tidyverse')
# install.packages('GGally')

library(tidyverse)
library(GGally)
movies <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')

# fix a missing value: fill in the runtime (in minutes) for one movie
movies$runtime[movies$title == 'The End of America'] <- 73

Take a first look at the data

head(movies)
## # A tibble: 6 × 32
##                  title   title_type       genre runtime mpaa_rating
##                  <chr>        <chr>       <chr>   <dbl>       <chr>
## 1          Filly Brown Feature Film       Drama      80           R
## 2             The Dish Feature Film       Drama     101       PG-13
## 3  Waiting for Guffman Feature Film      Comedy      84           R
## 4 The Age of Innocence Feature Film       Drama     139          PG
## 5          Malevolence Feature Film      Horror      90           R
## 6          Old Partner  Documentary Documentary      78     Unrated
## # ... with 27 more variables: studio <chr>, thtr_rel_year <dbl>,
## #   thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <chr>, critics_score <dbl>,
## #   audience_rating <chr>, audience_score <dbl>, best_pic_nom <chr>,
## #   best_pic_win <chr>, best_actor_win <chr>, best_actress_win <chr>,
## #   best_dir_win <chr>, top200_box <chr>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## #   imdb_url <chr>, rt_url <chr>

Remove some columns to make life a little easier

movies <- movies %>%
    select(title, runtime, genre, mpaa_rating, thtr_rel_year,
           imdb_rating, imdb_num_votes, critics_score, audience_score,
           best_pic_win, best_actor_win)

Visualization

The GGally::ggpairs() function draws a grid of plots, one for each pair of variables, which is a nice way to look at two-dimensional slices of the data.

movies %>%
    select(imdb_rating,imdb_num_votes,
           critics_score, audience_score) %>% 
    ggpairs()

Play around with the ggpairs function to explore pairwise relationships between some of the variables.

Linear regression

Use ggplot to plot the linear regression line for imdb_rating ~ critics_score.

#
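One possible approach to the plot above, as a sketch only: geom_smooth(method = 'lm') overlays the least-squares line on a scatter plot. The toy data frame below is simulated (an assumption, so the snippet runs on its own); with the real data, replace toy with movies.

```r
library(ggplot2)

# Simulated stand-in for the movies data (assumption: NOT the real data)
set.seed(1)
toy <- data.frame(critics_score = runif(100, 0, 100))
toy$imdb_rating <- 4 + 0.03 * toy$critics_score + rnorm(100, sd = 0.5)

# Scatter plot with the fitted least-squares line overlaid
p <- ggplot(toy, aes(x = critics_score, y = imdb_rating)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = 'lm', formula = y ~ x, se = FALSE)
print(p)
```

Setting se = TRUE instead would add a confidence band around the line.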

Use the lm function to fit the following linear regression: imdb_rating ~ critics_score + audience_score + imdb_num_votes
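A minimal sketch of such a fit, again on simulated data (the toy data frame and its coefficients are assumptions made up for illustration; swap in movies for the real exercise):

```r
# Simulated stand-in for the movies data (assumption: NOT the real data)
set.seed(2)
n <- 200
toy <- data.frame(
    critics_score  = runif(n, 0, 100),
    audience_score = runif(n, 0, 100),
    imdb_num_votes = rpois(n, 5000)
)
toy$imdb_rating <- 3 + 0.02 * toy$critics_score +
    0.03 * toy$audience_score + rnorm(n, sd = 0.4)

# Fit the multiple linear regression by least squares
fit <- lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes,
          data = toy)

summary(fit)  # coefficient table, standard errors, R-squared
coef(fit)     # just the estimated coefficients
```

summary(fit) is usually the first thing to look at; predict(fit, newdata = ...) gives fitted values for new rows.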

Prediction task: what’s the best model to predict whether or not the best actor will win an Oscar?

The point of this section is to fit several models and compare them on a test set. Specifically we are going to fit the following models

First, turn the categorical variables into dummy variables. Hint: in R, model.matrix() plays the role of pandas' pd.get_dummies. Your data frame should only have numbers in it now.

#
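As a sketch of the dummy-variable step, base R's model.matrix() expands character or factor columns into 0/1 indicator columns. The four-row toy data frame is made up for illustration (an assumption, not the real data).

```r
# Tiny made-up data frame with one numeric and two categorical columns
toy <- data.frame(
    runtime        = c(80, 101, 84, 139),
    genre          = c('Drama', 'Drama', 'Comedy', 'Horror'),
    best_actor_win = c('no', 'yes', 'no', 'no')
)

# model.matrix() builds the design matrix: categorical columns become
# 0/1 indicators, with the first level (alphabetically) as the reference.
# The first column is the intercept, which we drop with [, -1].
X <- model.matrix(~ ., data = toy)[, -1]
toy_num <- as.data.frame(X)
toy_num
```

Here Comedy and "no" are the reference levels, so they get no column of their own. For the movies data you would drop the title column first, since it identifies each row rather than describing it.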

Split the data into a training and test set (80% train, 20% test). Since the classes are fairly unbalanced you should use stratified sampling. Make two new data frames (called train and test). Fit all of the following models with the training data.

#
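One way to sketch a stratified 80/20 split in base R is to sample 80% of the row indices within each class separately, so both pieces keep roughly the same class balance. The toy data frame and its 85/15 class split are assumptions for illustration.

```r
# Simulated, unbalanced two-class data (assumption: NOT the real data)
set.seed(3)
toy <- data.frame(x = rnorm(100),
                  best_actor_win = rep(c(0, 1), times = c(85, 15)))

# Group row indices by class, then sample 80% of each group
idx_by_class <- split(seq_len(nrow(toy)), toy$best_actor_win)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
    sample(idx, size = round(0.8 * length(idx)))
}))

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

table(train$best_actor_win)  # class counts in the training set
table(test$best_actor_win)   # class counts in the test set
```

caret::createDataPartition(y, p = 0.8, list = FALSE) is an alternative that also samples within the classes of y.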

Train models

Each of the models listed above comes with one or more tuning parameters. First, hand-code a double for loop that performs cross-validation for L2-regularized logistic regression.

#
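The double loop can be sketched as follows: the outer loop runs over candidate penalty values and the inner loop over folds. This is an illustration under assumptions, not the intended solution: the data are simulated, and fit_ridge_logit and predict_ridge_logit are hypothetical helpers written here with base R's optim() so the snippet has no package dependencies.

```r
# Simulated two-feature classification data (assumption: NOT the real data)
set.seed(4)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
y <- rbinom(n, 1, plogis(1.5 * X[, 1] - X[, 2]))

# Hypothetical helper: minimize mean negative log-likelihood plus
# lambda * sum(beta^2); the intercept b[1] is not penalized
fit_ridge_logit <- function(X, y, lambda) {
    Xb <- cbind(1, X)
    objective <- function(b) {
        eta <- drop(Xb %*% b)
        mean(log1p(exp(eta)) - y * eta) + lambda * sum(b[-1]^2)
    }
    optim(rep(0, ncol(Xb)), objective, method = 'BFGS')$par
}

# Hypothetical helper: predict class 1 when the linear predictor is positive
predict_ridge_logit <- function(b, X) {
    as.integer(drop(cbind(1, X) %*% b) > 0)
}

lambdas <- c(0, 0.001, 0.01, 0.1, 1)
K <- 5
folds <- sample(rep(1:K, length.out = n))  # random fold label per row
cv_err <- matrix(NA, nrow = length(lambdas), ncol = K)

for (i in seq_along(lambdas)) {   # outer loop: tuning parameter values
    for (k in 1:K) {              # inner loop: cross-validation folds
        held_out <- folds == k
        b <- fit_ridge_logit(X[!held_out, , drop = FALSE],
                             y[!held_out], lambdas[i])
        pred <- predict_ridge_logit(b, X[held_out, , drop = FALSE])
        cv_err[i, k] <- mean(pred != y[held_out])
    }
}

mean_err <- rowMeans(cv_err)                 # CV error per lambda
best_lambda <- lambdas[which.min(mean_err)]  # smallest CV error wins
```

In practice the features should be centered and scaled before applying an L2 penalty, and a package such as glmnet (with alpha = 0) would replace the hand-written optimizer.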

For the rest of the models, use the caret package to tune each model with cross-validation. The caret documentation has a list of the models it supports.

#
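As a sketch of what tuning with caret looks like, the snippet below tunes the number of neighbors for k-nearest neighbors with 5-fold cross-validation. The choice of kNN, the grid of k values, and the simulated toy data are all assumptions for illustration; other models follow the same train() pattern with a different method string and tuning grid.

```r
library(caret)

# Simulated two-class data (assumption: NOT the real data)
set.seed(5)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$win <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, 'yes', 'no'))

# 5-fold cross-validation over a small grid of k values
ctrl <- trainControl(method = 'cv', number = 5)
knn_fit <- train(win ~ ., data = toy,
                 method = 'knn',
                 preProcess = c('center', 'scale'),
                 tuneGrid = data.frame(k = c(1, 5, 11, 21)),
                 trControl = ctrl)

knn_fit$bestTune  # the value of k with the best cross-validated accuracy
```

The outcome must be a factor for caret to treat the problem as classification. predict(knn_fit, newdata = ...) then uses the model refit on all of the training data with the selected k.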

Test set

Compute the test set error for each classifier.
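A minimal sketch of the comparison, assuming each classifier's test-set predictions are stored as a 0/1 vector. The labels and predictions below are made up for illustration, and model_a and model_b are hypothetical names.

```r
# Made-up test labels and predictions (assumption: NOT real results)
truth <- c(0, 0, 1, 0, 1, 0, 0, 1)
preds <- list(
    model_a = c(0, 0, 1, 1, 0, 0, 0, 1),
    model_b = c(0, 0, 0, 0, 0, 0, 0, 0)  # always predicts the majority class
)

# Misclassification rate: fraction of test rows predicted incorrectly
test_errors <- sapply(preds, function(p) mean(p != truth))
test_errors

# A confusion table shows which kind of mistake each model makes
table(predicted = preds$model_a, actual = truth)
```

Because the classes are unbalanced, a classifier that always predicts the majority class can still have a low error rate, so it is a useful baseline to compare against.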