This tutorial will give you practice with statistical modeling, specifically logistic regression, cross-validation, regularized (L1/L2) logistic regression, support vector machines, and PCA.
ISLR is a good reference for all of these topics (sections 4.3, 5.1, 6.2, 9.1-9.3, and 10.2).
Prerequisites: this tutorial assumes you are familiar with the above models (or are willing to learn about them). The questions are fairly independent so you can skip parts you don't like/understand. This tutorial also assumes you are familiar with pandas and numpy.
This tutorial uses the following Python packages (you should install them at the beginning); the modeling sections also use statsmodels and sklearn.
Warning: we are working with pandas data frames but most of the modeling packages (statsmodels and sklearn) do not take pandas data frames as input. You will have to figure out the right input format (usually a numpy array).
# these packages come with Anaconda
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# this makes figures from matplotlib display in the notebook
%matplotlib inline
The data include 651 randomly selected movies scraped from the IMDb and Rotten Tomatoes websites. The data were generously provided by Mine Cetinkaya-Rundel and you can find the original data set on her website.
# read in the data set from Iain's github
movies = pd.read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')
# index by movie title
movies = movies.set_index('title')
# fill in a missing runtime value
movies.loc['The End of America', 'runtime'] = 73
movies.head()
# Let's focus on a few columns
movies = movies[['runtime', 'genre', 'mpaa_rating',
                 'thtr_rel_year', 'imdb_rating', 'imdb_num_votes',
                 'critics_score', 'audience_score', 'best_actor_win']]
We can't visualize more than two or three dimensions at a time, but data typically come in much higher dimensions. One way to take a peek at high dimensional data is to make lots of one and two dimensional slices.
Make the plot described below for the following three variables: audience_score, critics_score, and imdb_rating. (See the ggpairs function from GGally for a similar example).
Make a 3 x 3 grid of plots. The diagonal should be a histogram (or KDE) of each variable. The off diagonal should be a scatter plot of the ith variable vs. the jth variable. You can make this grid "upper diagonal" (i.e. you don't need to plot i vs. j if you already have j vs. i).
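If you want a starting point, here is a minimal sketch of one way to build the grid with matplotlib (the column list, bin count, and figure size are suggestions, not part of the assignment):

# sketch of a 3 x 3 grid plot for three of the columns
cols = ['audience_score', 'critics_score', 'imdb_rating']
fig, axes = plt.subplots(len(cols), len(cols), figsize=(9, 9))
for i, var_i in enumerate(cols):
    for j, var_j in enumerate(cols):
        ax = axes[i, j]
        if i == j:
            # diagonal: histogram of the variable
            ax.hist(movies[var_i], bins=30)
            ax.set_title(var_i)
        elif i < j:
            # upper triangle: scatter plot of variable i (y) vs. variable j (x)
            ax.scatter(movies[var_j], movies[var_i], s=5)
            ax.set_xlabel(var_j)
            ax.set_ylabel(var_i)
        else:
            # leave the lower triangle blank for the "upper diagonal" version
            ax.axis('off')
plt.tight_layout()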
Make the principal component (PC) score plots. Get the data matrix X for the following five variables: runtime, thtr_rel_year, imdb_num_votes, critics_score, and audience_score.
Compute the PCA of X. First center the columns of X, then standardize each column by its standard deviation. Then let $X = U D V^T$ be the singular value decomposition of X (use numpy's SVD function). Let $U_k$ be the first k columns of the scores matrix $U D$ (recall the columns are ordered by decreasing singular values).
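Here is a sketch of the SVD computation under the assumptions above (the names pca_cols, X, U, D, Vt, scores, and U_k are placeholders):

# center and scale the numeric columns, then take the SVD
pca_cols = ['runtime', 'thtr_rel_year', 'imdb_num_votes',
            'critics_score', 'audience_score']
X = movies[pca_cols].values.astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)
U, D, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U diag(D) Vt
scores = U * D  # multiply each column of U by the corresponding singular value
U_k = scores[:, :3]  # first k = 3 columns of the scores matrix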
Let k = 3. Make the above grid plot for the scores (i.e. the 3 columns of $U_k$ are now the variables).
Fit a simple logistic regression model best_actor_win ~ imdb_rating. Use the logistic regression function from statsmodels.
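One way to set this up is with statsmodels' formula interface; a sketch is below. The 0/1 response column (best_actor_win01) is just a placeholder name you create, since the response has to be numeric.

import statsmodels.formula.api as smf

# recode the yes/no response as 0/1, then fit the simple logistic regression
movies['best_actor_win01'] = (movies['best_actor_win'] == 'yes').astype(int)
logit_simple = smf.logit('best_actor_win01 ~ imdb_rating', data=movies).fit()
print(logit_simple.summary())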
What is the interpretation of the imdb_rating coefficient?
(Bonus): make one of these plots with imdb_rating on the x axis and the predicted probability on the y axis (including the data points).
Again fit a logistic regression model, but this time include all of the covariates. Hint: You need to deal with categorical variables -- change these into dummy variables.
Discuss the interpretation of these coefficients. Can we draw any conclusions about causal relationships? If not, how would you try to get at causation (i.e. study design)?
The point of this section is to fit several models and compare them on a test set. Specifically, we are going to fit the following models
Hint: pd.get_dummies. Your data frame should only have numbers in it now.
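For example, one possible encoding (the movies_num name is a placeholder):

# binarize the response and turn the remaining categorical columns into dummies
movies_num = movies.copy()
movies_num['best_actor_win'] = (movies_num['best_actor_win'] == 'yes').astype(int)
movies_num = pd.get_dummies(movies_num, columns=['genre', 'mpaa_rating'], drop_first=True)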
Split the data into a training and test set (80% train, 20% test). Since the classes are fairly unbalanced you should use stratified sampling. Make two new data frames (called train and test).
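A sketch using sklearn's train_test_split (assuming the all-numeric data frame from the previous step is called movies_num; the random_state is arbitrary):

from sklearn.model_selection import train_test_split

# stratify on the response so train and test keep the same class proportions
train, test = train_test_split(movies_num, test_size=0.2,
                               stratify=movies_num['best_actor_win'],
                               random_state=3224)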
Fit all of the following models with the training data.
Use the LogisticRegression object from sklearn.
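For reference, a minimal fit might look like this (the X_train/y_train names are assumptions; C is the inverse regularization strength you will tune below):

from sklearn.linear_model import LogisticRegression

X_train = train.drop(columns='best_actor_win').values
y_train = train['best_actor_win'].values
logreg = LogisticRegression(C=1.0, solver='liblinear')
logreg.fit(X_train, y_train)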
# parameters for cross validation
K = 5
C_values = [10 ** m for m in range(-3, 3, 1)]
cv_error_matrix = np.zeros((K, len(C_values)))
Fill in the following code for cross-validation.
for k in range(K):
    # split into train/test folds
    # train_cv =
    # test_cv =
    for i in range(len(C_values)):
        c = C_values[i]
        # fit logistic regression on train_cv with cost value c
        # logreg_cv =
        # use logreg_cv to get the test set error on test_cv
        # cv_err =
        # cv_error_matrix[k, i] = cv_err
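If you want a starting point for the fold split, one option (not the only one) is sklearn's KFold; the fitting and error computation are left for you:

from sklearn.model_selection import KFold

# one possible way to generate the train/test folds within the training data
folds = KFold(n_splits=K, shuffle=True, random_state=342)
for k, (cv_train_idx, cv_test_idx) in enumerate(folds.split(train)):
    train_cv = train.iloc[cv_train_idx]
    test_cv = train.iloc[cv_test_idx]
    # ... loop over C_values, fit on train_cv, and record the error on test_cv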
Plot the cv error vs. C. Show a scatter plot of the individual fold errors and the mean cv error for each value of C.
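A sketch of one way to make this plot, assuming cv_error_matrix has been filled in:

# individual fold errors as grey points, mean cv error as a red line
for i, c in enumerate(C_values):
    plt.scatter([c] * K, cv_error_matrix[:, i], color='grey', s=10)
plt.plot(C_values, cv_error_matrix.mean(axis=0), color='red', label='mean cv error')
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('cv error')
plt.legend()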
Decide on the best value of C. Now train the model on the full training data with the value of C that you settled on.
This should be mostly a matter of copying the above code.
Compare the coefficients $\beta$ of L1 vs. L2 regularized logistic regression. What do you notice?
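One way to line the two sets of coefficients up side by side (a sketch; it reuses the X_train/y_train arrays from above and an arbitrary C = 1):

from sklearn.linear_model import LogisticRegression

# fit both penalties with the same C; the liblinear solver supports the L1 penalty
logreg_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear').fit(X_train, y_train)
logreg_l2 = LogisticRegression(penalty='l2', C=1.0, solver='liblinear').fit(X_train, y_train)
# look for coefficients that the L1 penalty sets exactly to zero
coef_compare = pd.DataFrame({'L1': logreg_l1.coef_[0], 'L2': logreg_l2.coef_[0]},
                            index=train.drop(columns='best_actor_win').columns)
coef_compare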
This time use the GridSearchCV function from sklearn. Here is some example code.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV # you might have to update sklearn
# small example data set
# take a copy so the assignment below does not trigger a SettingWithCopyWarning
data = movies[['runtime', 'best_actor_win', 'imdb_rating']].copy()
# binarize best_actor_win
data.loc[:, 'best_actor_win'] = movies.best_actor_win.eq("yes").astype(int)
# initialize the model
svc = SVC()
# cross validation parameters
params = {
    'C': [.01, .1, 1, 10, 100],
    'gamma': list(np.arange(.1, 1, .1)),  # start at .1: gamma must be positive for the RBF kernel
    'kernel': ['rbf']
}
# initialize the grid search object
gs_cv = GridSearchCV(estimator=svc, param_grid=params, cv=5, verbose=1)
# fit the grid search. Notice the formats of each input!
gs_results = gs_cv.fit(data[['runtime', 'imdb_rating']].values,
                       data['best_actor_win'].values)
# now use the gs_results object to find the optimal values of the tuning parameters
gs_results
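For example, gs_results.best_params_ holds the parameter combination that won the grid search and gs_results.best_score_ its mean cross-validated score; gs_results.cv_results_ has the full table of results.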