Modeling and visualization in Python

This tutorial will give you practice with statistical modeling. Specifically, it covers:

  • test/training set
  • cross-validation
  • logistic regression
  • L1/L2 regularized logistic regression
  • Support Vector Machine
  • Kernel SVM
  • PCA

ISLR is a good reference for all of these topics (sections 4.3, 5.1, 6.2, 9.1-9.3, and 10.2).

Prerequisites: this tutorial assumes you are familiar with the above models (or are willing to learn about them). The questions are fairly independent so you can skip parts you don't like/understand. This tutorial also assumes you are familiar with pandas and numpy.

This tutorial uses the following Python packages (you should install them at the beginning).

Warning: we are working with pandas data frames but most of the modeling packages (statsmodels and sklearn) do not take pandas data frames as input. You will have to figure out the right input format (usually a numpy array).
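For example, a minimal sketch of the usual conversion (df, col1, col2 and label here are placeholder names, not columns of our data set):

# a pandas data frame (or column) converts to a numpy array via .values
X = df[['col1', 'col2']].values  # 2d numpy array (design matrix)
y = df['label'].values           # 1d numpy array (response)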

In [1]:
# these packages come with Anaconda
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# this makes figures from matplotlib display in the notebook
%matplotlib inline

The data

The data include 651 randomly selected movies scraped from the IMDb and Rotten Tomatoes websites. The data were generously provided by Mine Cetinkaya-Rundel and you can find the original data set on her website.

In [2]:
# read in the data set from Iain's github
movies = pd.read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')

# index by movie title
movies = movies.set_index('title')

# add in a missing value
movies.loc['The End of America', 'runtime'] = 73

movies.head()
Out[2]:
(first five rows of the movies data frame: Filly Brown, The Dish, Waiting for Guffman, The Age of Innocence, Malevolence; 5 rows × 31 columns including title_type, genre, runtime, mpaa_rating, studio, release dates, cast, director, and the IMDb/Rotten Tomatoes URLs)

In [3]:
# Let's focus on a few columns
movies = movies[['runtime','genre', 'mpaa_rating',
                 'thtr_rel_year', 'imdb_rating', 'imdb_num_votes',
                 'critics_score', 'audience_score', 'best_actor_win']]

Visualizing high dimensional data

We can't visualize more than two or three dimensions at a time, but data typically come in much higher dimensions. One way to take a peek at high dimensional data is to make lots of one and two dimensional slices.

Make the plot described below for the following three variables: audience_score, critics_score, and imdb_rating. (See the ggpairs function from the R package GGally for a similar example.)

Make a 3 x 3 grid of plots. The diagonal should be a histogram (or KDE) of each variable. The off diagonal should be a scatter plot of the ith variable vs. the jth variable. You can make this grid "upper diagonal" (i.e. you don't need to plot i vs. j if you already have j vs. i).
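One possible way to build this grid with plain matplotlib (a sketch, not the only approach):

cols = ['audience_score', 'critics_score', 'imdb_rating']
fig, axes = plt.subplots(len(cols), len(cols), figsize=(10, 10))
for i in range(len(cols)):
    for j in range(len(cols)):
        ax = axes[i, j]
        if i == j:
            # diagonal: histogram of the i-th variable
            ax.hist(movies[cols[i]].dropna(), bins=20)
        elif i < j:
            # upper diagonal: scatter plot of variable j vs. variable i
            ax.scatter(movies[cols[j]], movies[cols[i]], s=10, alpha=.5)
            ax.set_xlabel(cols[j])
            ax.set_ylabel(cols[i])
        else:
            # lower diagonal mirrors the upper, so leave it blank
            ax.set_visible(False)
plt.tight_layout()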

In [ ]:
 

Make the principal component (PC) score plots. Get the data matrix X for the following five variables: runtime, thtr_rel_year, imdb_num_votes, critics_score, audience_score.

Compute the PCA of X: first center the columns of X, then standardize them by their standard deviations. Let $X = U D V^T$ be the singular value decomposition of X (use numpy's SVD function), and let $U_k$ be the first k columns of the scores matrix $U D$ (recall the columns are ordered by decreasing singular values).

Let k = 3. Make the above grid plot for the scores (i.e. the 3 columns of $U_k$ are now the variables).
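A minimal sketch of the SVD computation, assuming the five variables listed above:

# center and scale X, then compute its SVD
var_names = ['runtime', 'thtr_rel_year', 'imdb_num_votes',
             'critics_score', 'audience_score']
X = movies[var_names].values.astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# numpy returns the singular values as a vector d, with X = U diag(d) V^T
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# scores matrix U D; keep the first k = 3 columns
k = 3
scores = U[:, :k] * d[:k]

The scores columns can then be fed into the same grid-plot code as before.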

In [ ]:
 

Inference task: how are movie ratings related to best actor winning an Oscar?

Fit a simple logistic regression model best_actor_win ~ imdb_rating. Use the logistic regression function from statsmodels.

What is the interpretation of the imdb_rating coefficient?

(Bonus): make a plot with imdb_rating on the x axis and the predicted probability on the y axis (including the data points).
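A minimal sketch using sm.Logit from statsmodels (other statsmodels interfaces, e.g. the formula API, work too):

import statsmodels.api as sm

# 0/1 response: did the best actor win an Oscar?
y = (movies['best_actor_win'] == 'yes').astype(int).values
X = sm.add_constant(movies['imdb_rating'].values)  # prepend an intercept column

logit_simple = sm.Logit(y, X).fit()
print(logit_simple.summary())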

In [ ]:
 

Again fit a logistic regression model, but this time include all of the covariates. Hint: You need to deal with categorical variables -- change these into dummy variables.
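One way to do this, continuing the sketch above (pd.get_dummies handles the categorical columns):

# dummy-encode the categorical covariates, then fit on everything
covariates = pd.get_dummies(movies.drop('best_actor_win', axis=1), drop_first=True)
X_all = sm.add_constant(covariates.values.astype(float))

logit_full = sm.Logit(y, X_all).fit()
print(logit_full.summary())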

In [ ]:
 

Discuss the interpretation of these coefficients. Can we draw any conclusions about causal relationships? If not, how would you try to get at causation (i.e. study design)?

Prediction task: what's the best model to predict whether or not the best actor will win an Oscar?

The point of this section is to fit several models and compare them on a test set. Specifically, we are going to fit the following models

  • L1 regularized logistic regression
  • L2 regularized logistic regression
  • Support Vector Machine
  • Kernel Support Vector Machine

Turn the categorical variables into dummy variables

Hint: pd.get_dummies. Your data frame should only have numbers in it now.
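For example (a sketch; drop_first avoids perfectly collinear dummy columns):

# dummy-encode every categorical column, including the response
movies_num = pd.get_dummies(movies, drop_first=True)
movies_num.head()  # best_actor_win is now the 0/1 column best_actor_win_yes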

In [ ]:
 

Make a test set

Split the data into a training and test set (80% train, 20% test). Since the classes are fairly unbalanced, you should use stratified sampling. Make two new data frames (called train and test).
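A sketch using sklearn's train_test_split, assuming the movies_num data frame from the previous step:

from sklearn.model_selection import train_test_split

# stratify on the response so both sets keep the same class proportions
train, test = train_test_split(movies_num, test_size=0.2,
                               stratify=movies_num['best_actor_win_yes'],
                               random_state=3242)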

Fit all of the following models with the training data.

In [ ]:
 

Fit logistic regression with L2 regularization using cross-validation

Use the LogisticRegression object from sklearn.

In [37]:
# parameters for cross validation
K = 5
C_values = [10 ** m for m in range(-3, 3, 1)]

cv_error_matrix = np.zeros((K, len(C_values)))

Fill in the following code for cross-validation (a completed sketch follows the skeleton).

In [39]:
for k in range(K):
    # split into train/test folds
    # train_cv = 
    # test_cv = 
    
    for i in range(len(C_values)):
        c = C_values[i]
        
        # fit logistic regression on train_cv with cost value c
        # logreg_cv = 
        
        # use logreg_cv to get the test set error on test_cv
        # cv_err = 
        
        # cv_error_matrix[k, i] = cv_err
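One possible completed version (a sketch assuming sklearn's StratifiedKFold and the train data frame from above):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X_train = train.drop('best_actor_win_yes', axis=1).values
y_train = train['best_actor_win_yes'].values

folds = StratifiedKFold(n_splits=K, shuffle=True, random_state=0)
for k, (tr, te) in enumerate(folds.split(X_train, y_train)):
    for i, c in enumerate(C_values):
        # fit logistic regression (sklearn's default penalty is L2) with cost c
        logreg_cv = LogisticRegression(C=c).fit(X_train[tr], y_train[tr])

        # misclassification error on the held-out fold
        cv_error_matrix[k, i] = 1 - logreg_cv.score(X_train[te], y_train[te])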

Plot the CV error vs. C. Show a scatter plot of the individual fold errors and the mean CV error for each value of C.
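For example, with matplotlib (a sketch; a log scale on the x axis makes the C grid readable):

for i, c in enumerate(C_values):
    plt.scatter([c] * K, cv_error_matrix[:, i], color='grey', s=15)
plt.plot(C_values, cv_error_matrix.mean(axis=0), marker='o', label='mean CV error')
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('CV error')
plt.legend()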

In [ ]:
 

Decide on the best value of C. Now train the model on the full training data with the value of C that you settled on.
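For example, picking the C with the smallest mean CV error:

C_best_l2 = C_values[int(np.argmin(cv_error_matrix.mean(axis=0)))]
logreg_l2 = LogisticRegression(C=C_best_l2).fit(X_train, y_train)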

In [ ]:
 

Use cross validation to fit logistic regression with L1 regularization

This should be mostly a matter of copying the above code.

Compare the coefficients, $\beta$ of L1 vs. L2 regularized logistic regression. What do you notice?
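A sketch of the L1 fit (C_best_l1 is assumed to come from rerunning the CV loop with penalty='l1'):

# the liblinear solver supports the L1 penalty
logreg_l1 = LogisticRegression(penalty='l1', solver='liblinear',
                               C=C_best_l1).fit(X_train, y_train)

# L1 tends to set coefficients exactly to zero, L2 only shrinks them
print(logreg_l1.coef_)
print(logreg_l2.coef_)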

In [ ]:
 

GridSearchCV example

This time use the GridSearchCV function from sklearn. Here is some example code.

In [36]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV # you might have to update sklearn

# small example data set (copy so we can modify it safely)
data = movies[['runtime', 'best_actor_win', 'imdb_rating']].copy()

# binarize best_actor_win
data.loc[:, 'best_actor_win'] = movies.best_actor_win.eq("yes").astype(int)

# initialize the model
svc = SVC()

# cross validation parameters
params = {
    'C': [.01, .1, 1, 10, 100],
    'gamma': list(np.arange(0.1, 1.1, 0.1)),  # gamma must be positive for the rbf kernel
    'kernel': ['rbf']
}

# initialize grid search object
gs_cv = GridSearchCV(estimator=svc, param_grid=params, cv=5, verbose=1)

# fit the grid search. Notice the formats of each input!
gs_results = gs_cv.fit(data[['runtime', 'imdb_rating']].values,
                       data['best_actor_win'].values)
Fitting 5 folds for each of 50 candidates, totalling 250 fits
[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    3.5s finished
In [37]:
# the best_params_ attribute of the fitted grid search holds the winning tuning parameters
gs_results.best_params_

Use GridSearchCV to tune linear SVM

Use LinearSVC from sklearn.
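A sketch along the lines of the example above (C is the only tuning parameter for the linear SVM):

from sklearn.svm import LinearSVC

gs_linsvc = GridSearchCV(estimator=LinearSVC(),
                         param_grid={'C': [.001, .01, .1, 1, 10, 100]},
                         cv=5)
gs_linsvc.fit(X_train, y_train)
gs_linsvc.best_params_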

In [ ]:
 

Use GridSearchCV to tune Kernel SVM

Use the SVC class from sklearn.

Try both a polynomial kernel ('poly') and a radial basis kernel ('rbf'). Use a range of C and gamma parameters.
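For example (a sketch; the grid below is one reasonable starting range):

# search over both kernels and a grid of C/gamma values
params = {'C': [.01, .1, 1, 10, 100],
          'gamma': list(np.arange(0.1, 1.1, 0.1)),
          'kernel': ['rbf', 'poly']}

gs_ksvc = GridSearchCV(estimator=SVC(), param_grid=params, cv=5)
gs_ksvc.fit(X_train, y_train)
gs_ksvc.best_params_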

In [ ]:
 
In [ ]:
 
In [ ]:
 

Compute the test error for each of the classifiers
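A sketch, assuming the fitted objects from the sketches above (logreg_l2, logreg_l1, gs_linsvc, gs_ksvc):

X_test = test.drop('best_actor_win_yes', axis=1).values
y_test = test['best_actor_win_yes'].values

models = {'logistic (L2)': logreg_l2,
          'logistic (L1)': logreg_l1,
          'linear SVM': gs_linsvc.best_estimator_,
          'kernel SVM': gs_ksvc.best_estimator_}

for name, model in models.items():
    # .score returns accuracy, so 1 - accuracy is the misclassification error
    print(name, 1 - model.score(X_test, y_test))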

In [ ]: