This tutorial will give you practice with statistical modeling, specifically logistic regression, cross-validation, regularized (L1/L2) logistic regression, support vector machines, and PCA.
ISLR is a good reference for all of these topics (sections 4.3, 5.1, 6.2, 9.1-9.3, and 10.2).
Prerequisites: this tutorial assumes you are familiar with the above models (or are willing to learn about them). The questions are fairly independent so you can skip parts you don't like/understand. This tutorial also assumes you are familiar with pandas and numpy.
This tutorial uses the following Python packages (you should install them at the beginning); the modeling sections also use statsmodels and sklearn.
Warning: we are working with pandas data frames but most of the modeling packages (statsmodels and sklearn) do not take pandas data frames as input. You will have to figure out the right input format (usually a numpy array).
# these packages come with Anaconda
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# this makes figures from matplotlib display in the notebook
%matplotlib inline
The data include 651 randomly selected movies scraped from the IMDb and Rotten Tomatoes websites. The data were generously provided by Mine Cetinkaya-Rundel and you can find the original data set on her website.
# read in the data set from Iain's github
movies = pd.read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')
# index by movie title
movies = movies.set_index('title')
# fill in a missing runtime value
movies.loc['The End of America', 'runtime'] = 73
movies.head()
# Let's focus on a few columns
movies = movies[['runtime', 'genre', 'mpaa_rating',
                 'thtr_rel_year', 'imdb_rating', 'imdb_num_votes',
                 'critics_score', 'audience_score', 'best_actor_win']]
We can't visualize more than two or three dimensions at a time, but data typically come in much higher dimensions. One way to take a peek at high dimensional data is to make lots of one and two dimensional slices.
Make the plot described below for the following three variables: audience_score, critics_score, and imdb_rating. (See the ggpairs function from GGally for a similar example).
Make a 3 x 3 grid of plots. The diagonal should be a histogram (or KDE) of each variable. The off diagonal should be a scatter plot of the ith variable vs. the jth variable. You can make this grid "upper diagonal" (i.e. you don't need to plot i vs. j if you already have j vs. i).
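If you want a starting point, here is a minimal sketch of one way to build the grid with matplotlib (the column list, bin count, and figure size are suggestions, not part of the assignment):

# sketch of a 3 x 3 grid plot for three of the columns
cols = ['audience_score', 'critics_score', 'imdb_rating']
fig, axes = plt.subplots(len(cols), len(cols), figsize=(9, 9))
for i, var_i in enumerate(cols):
    for j, var_j in enumerate(cols):
        ax = axes[i, j]
        if i == j:
            # diagonal: histogram of the variable
            ax.hist(movies[var_i], bins=30)
            ax.set_title(var_i)
        elif i < j:
            # upper triangle: scatter plot of variable i (y) vs. variable j (x)
            ax.scatter(movies[var_j], movies[var_i], s=5)
            ax.set_xlabel(var_j)
            ax.set_ylabel(var_i)
        else:
            # leave the lower triangle blank for the "upper diagonal" version
            ax.axis('off')
plt.tight_layout()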
Make the principal component (PC) score plots. Get the data matrix X for the following five variables: runtime, thtr_rel_year, imdb_num_votes, critics_score, and audience_score.
Compute the PCA of X. First center the columns of X, then standardize each column by its standard deviation. Then let $X = U D V^T$ be the singular value decomposition of X (use numpy's SVD function). Let $U_k$ be the first k columns of the scores matrix $U D$ (recall the columns are ordered by decreasing singular values).
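Here is a sketch of the SVD computation under the assumptions above (the names pca_cols, X, U, D, Vt, scores, and U_k are placeholders):

# center and scale the numeric columns, then take the SVD
pca_cols = ['runtime', 'thtr_rel_year', 'imdb_num_votes',
            'critics_score', 'audience_score']
X = movies[pca_cols].values.astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)
U, D, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: X = U diag(D) Vt
scores = U * D  # multiply each column of U by the corresponding singular value
U_k = scores[:, :3]  # first k = 3 columns of the scores matrix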
Let k = 3. Make the above grid plot for the scores (i.e. the 3 columns of $U_k$ are now the variables).
Fit a simple logistic regression model best_actor_win ~ imdb_rating. Use the logistic regression function from statsmodels.
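One way to set this up is with statsmodels' formula interface; a sketch is below. The 0/1 response column (best_actor_win01) is just a placeholder name you create, since the response has to be numeric.

import statsmodels.formula.api as smf

# recode the yes/no response as 0/1, then fit the simple logistic regression
movies['best_actor_win01'] = (movies['best_actor_win'] == 'yes').astype(int)
logit_simple = smf.logit('best_actor_win01 ~ imdb_rating', data=movies).fit()
print(logit_simple.summary())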
What is the interpretation of the imdb_rating coefficient?
(Bonus): make one of these plots with imdb_rating on the x axis and the predicted probability on the y axis (including the data points).
Again fit a logistic regression model, but this time include all of the covariates. Hint: You need to deal with categorical variables -- change these into dummy variables.
Discuss the interpretation of these coefficients. Can we draw any conclusions about causal relationships? If not, how would you try to get at causation (i.e. study design)?
The point of this section is to fit several models and compare them on a test set. Specifically, we are going to fit the following models
Hint: pd.get_dummies. Your data frame should only have numbers in it now.
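For example, one possible encoding (the movies_num name is a placeholder):

# binarize the response and turn the remaining categorical columns into dummies
movies_num = movies.copy()
movies_num['best_actor_win'] = (movies_num['best_actor_win'] == 'yes').astype(int)
movies_num = pd.get_dummies(movies_num, columns=['genre', 'mpaa_rating'], drop_first=True)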
Split the data into a training and test set (80% train, 20% test). Since the classes are fairly unbalanced you should use stratified sampling. Make two new data frames (called train and test).
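A sketch using sklearn's train_test_split (assuming the all-numeric data frame from the previous step is called movies_num; the random_state is arbitrary):

from sklearn.model_selection import train_test_split

# stratify on the response so train and test keep the same class proportions
train, test = train_test_split(movies_num, test_size=0.2,
                               stratify=movies_num['best_actor_win'],
                               random_state=3224)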
Fit all of the following models with the training data.
Use the LogisticRegression object from sklearn.
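For reference, a minimal fit might look like this (the X_train/y_train names are assumptions; C is the inverse regularization strength you will tune below):

from sklearn.linear_model import LogisticRegression

X_train = train.drop(columns='best_actor_win').values
y_train = train['best_actor_win'].values
logreg = LogisticRegression(C=1.0, solver='liblinear')
logreg.fit(X_train, y_train)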
# parameters for cross validation
K = 5
C_values = [10 ** m for m in range(-3, 3, 1)]
cv_error_matrix = np.zeros((K, len(C_values)))
Fill in the following code for cross-validation.
for k in range(K):
    # split into train/test folds
    # train_cv =
    # test_cv =
    for i in range(len(C_values)):
        c = C_values[i]
        # fit logistic regression on train_cv with cost value c
        # logreg_cv =
        # use logreg_cv to get the test set error on test_cv
        # cv_err =
        # cv_error_matrix[k, i] = cv_err
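If you want a starting point for the fold split, one option (not the only one) is sklearn's KFold; the fitting and error computation are left for you:

from sklearn.model_selection import KFold

# one possible way to generate the train/test folds within the training data
folds = KFold(n_splits=K, shuffle=True, random_state=342)
for k, (cv_train_idx, cv_test_idx) in enumerate(folds.split(train)):
    train_cv = train.iloc[cv_train_idx]
    test_cv = train.iloc[cv_test_idx]
    # ... loop over C_values, fit on train_cv, and record the error on test_cv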
Plot the cv error vs. C. Show a scatter plot of the individual fold errors and the mean cv error for each value of C.
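A sketch of one way to make this plot, assuming cv_error_matrix has been filled in:

# individual fold errors as grey points, mean cv error as a red line
for i, c in enumerate(C_values):
    plt.scatter([c] * K, cv_error_matrix[:, i], color='grey', s=10)
plt.plot(C_values, cv_error_matrix.mean(axis=0), color='red', label='mean cv error')
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('cv error')
plt.legend()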
Decide on the best value of C. Now train the model on the full training data with the value of C that you settled on.
This should be mostly a matter of copying the above code.
Compare the coefficients $\beta$ of L1 vs. L2 regularized logistic regression. What do you notice?
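One way to line the two sets of coefficients up side by side (a sketch; it reuses the X_train/y_train arrays from above and an arbitrary C = 1):

from sklearn.linear_model import LogisticRegression

# fit both penalties with the same C; the liblinear solver supports the L1 penalty
logreg_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear').fit(X_train, y_train)
logreg_l2 = LogisticRegression(penalty='l2', C=1.0, solver='liblinear').fit(X_train, y_train)
# look for coefficients that the L1 penalty sets exactly to zero
coef_compare = pd.DataFrame({'L1': logreg_l1.coef_[0], 'L2': logreg_l2.coef_[0]},
                            index=train.drop(columns='best_actor_win').columns)
coef_compare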
This time use the GridSearchCV function from sklearn. Here is some example code.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV # you might have to update sklearn
# small example data set
# take a copy so the assignment below does not trigger a SettingWithCopyWarning
data = movies[['runtime', 'best_actor_win', 'imdb_rating']].copy()
# binarize best_actor_win
data.loc[:, 'best_actor_win'] = movies.best_actor_win.eq("yes").astype(int)
# initialize the model
svc = SVC()
# cross validation parameters
params = {
    'C': [.01, .1, 1, 10, 100],
    'gamma': list(np.arange(.1, 1, .1)),  # start at .1: gamma must be positive for the RBF kernel
    'kernel': ['rbf']
}
# initialize the grid search object
gs_cv = GridSearchCV(estimator=svc, param_grid=params, cv=5, verbose=1)
# fit the grid search. Notice the formats of each input!
gs_results = gs_cv.fit(data[['runtime', 'imdb_rating']].values,
                       data['best_actor_win'].values)
# now use the gs_results object to find the optimal values of the tuning parameters
gs_results
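For example, gs_results.best_params_ holds the parameter combination that won the grid search and gs_results.best_score_ its mean cross-validated score; gs_results.cv_results_ has the full table of results.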