This tutorial will give you practice with
ISLR is a good reference for all of these topics (sections 4.3, 5.1, 6.2, 9.1-9.3, and 10.2).
Prerequisites: this tutorial assumes you are familiar with the above models (or are willing to learn about them). The questions are fairly independent so you can skip parts you don’t like/understand. This tutorial also assumes you are familiar with R.
Hint: I suggest writing all of the code in an R script first, then transferring the answers to a .Rmd file.
The data include 651 randomly selected movies scraped from the IMDb and Rotten Tomatoes websites. The data were generously provided by Mine Cetinkaya-Rundel and you can find the original data set on her website.
# if you don't have these packages then install them
# install.packages('tidyverse')
# install.packages('GGally')
library(tidyverse)
library(GGally)
movies <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')
# fix a missing value
movies$runtime[movies$title == 'The End of America'] <- 73
Take a first look at the data
head(movies)
## # A tibble: 6 × 32
## title title_type genre runtime mpaa_rating
## <chr> <chr> <chr> <dbl> <chr>
## 1 Filly Brown Feature Film Drama 80 R
## 2 The Dish Feature Film Drama 101 PG-13
## 3 Waiting for Guffman Feature Film Comedy 84 R
## 4 The Age of Innocence Feature Film Drama 139 PG
## 5 Malevolence Feature Film Horror 90 R
## 6 Old Partner Documentary Documentary 78 Unrated
## # ... with 27 more variables: studio <chr>, thtr_rel_year <dbl>,
## # thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## # dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## # imdb_num_votes <int>, critics_rating <chr>, critics_score <dbl>,
## # audience_rating <chr>, audience_score <dbl>, best_pic_nom <chr>,
## # best_pic_win <chr>, best_actor_win <chr>, best_actress_win <chr>,
## # best_dir_win <chr>, top200_box <chr>, director <chr>, actor1 <chr>,
## # actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## # imdb_url <chr>, rt_url <chr>
Remove some columns to make life a little easier
movies <- movies %>%
  select(title, runtime, genre, mpaa_rating, thtr_rel_year,
         imdb_rating, imdb_num_votes, critics_score, audience_score,
         best_pic_win, best_actor_win)
The GGally::ggpairs() function draws a matrix of pairwise (two-dimensional) plots, one panel for each pair of variables.
movies %>%
  select(imdb_rating, imdb_num_votes,
         critics_score, audience_score) %>%
  ggpairs()
Play around with the ggpairs function to explore pairwise relationships between some of the variables.
Use ggplot to plot the linear regression line for imdb_rating ~ critics_score.
#
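One possible way to approach the plot is sketched below: a scatter plot with a least-squares line added by geom_smooth(method = "lm"). The helper name plot_lm_line and its arguments are made up for illustration and are not part of the tutorial.

```r
library(ggplot2)

# Hypothetical helper: scatter plot of y_var against x_var with the
# least-squares regression line overlaid (no confidence band).
plot_lm_line <- function(df, x_var, y_var) {
  ggplot(df, aes(x = .data[[x_var]], y = .data[[y_var]])) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    labs(x = x_var, y = y_var)
}

# With the tutorial's data frame:
# plot_lm_line(movies, "critics_score", "imdb_rating")
```

Setting se = TRUE instead would also draw the pointwise confidence band around the line.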
Use the lm function to fit the following linear regression: imdb_rating ~ critics_score + audience_score + imdb_num_votes
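A minimal sketch of the fit, assuming the data frame has the three predictor columns named in the formula. The wrapper fit_rating_lm is a hypothetical name used only so the call can be reused on any data frame.

```r
# Hypothetical wrapper around lm() for the regression described above.
fit_rating_lm <- function(df) {
  lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes, data = df)
}

# With the tutorial's data frame:
# fit <- fit_rating_lm(movies)
# summary(fit)   # coefficient table, standard errors, R-squared
# coef(fit)      # just the estimated coefficients
```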
The point of this section is to fit several models and compare them on a test set. Specifically we are going to fit the following models
First, turn the categorical variables into dummy variables. Hint: in R, model.matrix() or caret::dummyVars() does what pd.get_dummies does in pandas. Your data frame should only have numbers in it now.
#
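A base R sketch of the dummy-variable step using model.matrix(). The helper name make_dummies is hypothetical. Note that this version drops one reference level per categorical variable (treatment contrasts) and that model.matrix() silently drops rows containing missing values.

```r
# Hypothetical helper: replace every character/factor column with 0/1
# dummy columns; numeric columns pass through unchanged.
make_dummies <- function(df) {
  df <- as.data.frame(df)
  is_cat <- vapply(df, function(col) is.character(col) || is.factor(col),
                   logical(1))
  df[is_cat] <- lapply(df[is_cat], factor)
  mm <- model.matrix(~ ., data = df)     # first column is the intercept
  as.data.frame(mm[, -1, drop = FALSE])  # drop the intercept column
}

# With the tutorial's data (drop the free-text title column first):
# movies_num <- make_dummies(movies[, setdiff(names(movies), "title")])
```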
Split the data into a training and test set (80% train, 20% test). Since the classes are fairly unbalanced you should use stratified sampling. Make two new data frames (called train and test). Fit all of the following models with the training data.
#
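Stratified sampling means drawing the same fraction from each class, so the rare class is not lost from either set by chance. The base R sketch below shows the idea; the helper name stratified_split and its arguments are made up for illustration. caret::createDataPartition() is a ready-made alternative.

```r
# Hypothetical helper: split df into train/test, sampling a fraction p
# of the rows separately within each level of the class column y.
stratified_split <- function(df, y, p = 0.8, seed = 1) {
  set.seed(seed)
  rows_by_class <- split(seq_len(nrow(df)), df[[y]])
  train_idx <- unname(unlist(lapply(rows_by_class, function(idx) {
    n_train <- max(1, round(p * length(idx)))
    idx[sample.int(length(idx), n_train)]  # safe even when length(idx) == 1
  })))
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Usage, where "response" stands in for whichever class column you use:
# parts <- stratified_split(movies_num, "response")
# train <- parts$train
# test  <- parts$test
```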
Each of the models listed above comes with one or more tuning parameters. First, hand-code a double for loop that performs cross-validation for L2 regularized logistic regression.
#
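The double loop can be sketched as follows: the outer loop runs over candidate penalty values and the inner loop over held-out folds. This sketch assumes the glmnet package is used for the L2 (ridge) penalized fit, with alpha = 0 selecting the pure L2 penalty; the tutorial does not say which package to use. The function name cv_ridge_logistic and its arguments are hypothetical.

```r
library(glmnet)

# x: numeric predictor matrix, y: 0/1 response vector,
# lambdas: candidate L2 penalty values, k: number of folds.
cv_ridge_logistic <- function(x, y, lambdas, k = 5, seed = 1) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(x)))  # random fold labels
  err <- matrix(NA_real_, nrow = length(lambdas), ncol = k)
  for (i in seq_along(lambdas)) {    # outer loop: tuning parameter
    for (j in seq_len(k)) {          # inner loop: held-out fold
      held <- fold == j
      fit <- glmnet(x[!held, , drop = FALSE], y[!held],
                    family = "binomial", alpha = 0, lambda = lambdas[i])
      prob <- predict(fit, newx = x[held, , drop = FALSE], type = "response")
      err[i, j] <- mean((prob > 0.5) != y[held])  # fold misclassification rate
    }
  }
  cv_err <- rowMeans(err)
  list(cv_error = cv_err, best_lambda = lambdas[which.min(cv_err)])
}
```

With very unbalanced classes, purely random folds can leave a training fold with few or no examples of the rare class, so assigning folds within each class may be a safer variation.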
For the rest of the models use the caret package to tune each model with cross-validation. Here is a list of models supported by caret.
#
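A sketch of the caret workflow: trainControl() describes the resampling scheme and train() fits the model over a grid of tuning values. The wrapper tune_with_caret, the object names train_x, train_y, test_x, and the choice of method are assumptions for illustration; any method string from caret's model list can be substituted, provided its backing package is installed.

```r
library(caret)

# Hypothetical wrapper: tune a caret-supported classifier with k-fold CV.
# x: data frame of numeric predictors, y: class labels.
tune_with_caret <- function(x, y, method, tune_length = 5, folds = 5, seed = 1) {
  set.seed(seed)
  ctrl <- trainControl(method = "cv", number = folds)
  train(x = x, y = factor(y), method = method,
        trControl = ctrl, tuneLength = tune_length)
}

# For example (svmRadial needs the kernlab package):
# fit <- tune_with_caret(train_x, train_y, method = "svmRadial")
# fit$bestTune                        # selected tuning parameter values
# pred <- predict(fit, newdata = test_x)
```

tuneLength lets caret choose the candidate values; passing an explicit tuneGrid to train() instead gives full control over the grid.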
Compute the test set error for each classifier.
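Test set error here can be taken to mean the misclassification rate: the fraction of test observations whose predicted class differs from the true class. A minimal sketch, with a hypothetical helper name:

```r
# Hypothetical helper: fraction of predictions that disagree with the truth.
test_error <- function(predicted, truth) {
  stopifnot(length(predicted) == length(truth))
  mean(as.character(predicted) != as.character(truth))
}

# Usage, where pred holds a model's predictions on the test set and
# "response" stands in for the class column:
# test_error(pred, test$response)
# table(predicted = pred, truth = test$response)  # confusion matrix
```

Because the classes are unbalanced, it helps to compare each error against the rate obtained by always predicting the majority class.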