Data manipulation and visualization in Python

This tutorial will give you practice with

  • data visualization with matplotlib
  • data manipulation with pandas

Python for Data Analysis is a good Python reference. My favorite quick pandas resource is Chris Albon's website. This matplotlib tutorial is pretty decent. As always, Google is your best friend, but here are some other Python data science references.

Pandas can be a little weird at first because there are several ways to subset the data and every row comes with an index label.
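To make the subsetting options concrete, here is a minimal sketch on a toy data frame. The film names and values are made up for illustration and are not part of the movies data set.

```python
import pandas as pd

# toy data frame with hypothetical values, indexed by a made-up title
df = pd.DataFrame(
    {"rating": [7.1, 8.3, 6.0], "votes": [1200, 54000, 310]},
    index=["Film A", "Film B", "Film C"],
)

# select a single column by name (returns a Series)
ratings = df["rating"]

# select by index label and column name
by_label = df.loc["Film B", "votes"]

# select by integer position (first row, first column)
by_position = df.iloc[0, 0]

# select rows with a boolean mask
popular = df[df["votes"] > 1000]

print(by_label, by_position, list(popular.index))
```

Roughly speaking, .loc works with labels, .iloc works with integer positions, and square brackets with a boolean Series filter rows.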

If you are already comfortable with matplotlib you might try the seaborn package.

In [2]:
# these packages come with Anaconda
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# this makes figures from matplotlib display in the notebook
%matplotlib inline

The data

The data include 651 randomly selected movies scraped from the IMDb and Rotten Tomatoes websites. The data were generously provided by Mine Cetinkaya-Rundel and you can find the original data set on her website.

In [3]:
# read in the data set from Iain's github
movies = pd.read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')

# index by movie title
movies = movies.set_index('title')

movies.head()
In [ ]:
movies.columns

Visualization

Answer all of the following questions with matplotlib. For example:

In [8]:
plt.scatter(movies['imdb_num_votes'], movies['imdb_rating'])

# change axis labels
plt.xlabel('number of votes')
plt.ylabel('imdb rating')

# set axis limits
plt.xlim([0, max(movies['imdb_num_votes'])])
plt.ylim([0, max(movies['imdb_rating'])])

# add a title
plt.title('number of votes vs. imdb rating')
Out[8]:
<matplotlib.text.Text at 0x11b4ad450>

Make a histogram of imdb_rating.

In [ ]:
 

Make the above histogram with 100 bins.

In [ ]:
 

Make a scatter plot comparing Rotten Tomatoes critic score vs. imdb ratings. Change the x/y axis labels to something nicer and add a title.

In [ ]:
 

(BONUS) Make the same rt vs. imdb scatter plot as above but facet by mpaa_rating. (See http://stackoverflow.com/questions/34762280/how-to-plot-facet-gridggplot-in-r-in-python)

In [ ]:
 

Again make the same rt vs. imdb scatter plot but color the points by mpaa_rating.

In [ ]:
 

One last time make the rt vs. imdb scatter plot but now try including runtime as a third variable using point

  • color
  • size
  • alpha

Which one of these is “best”?

Data transformation

Subset the columns of movies to keep the following variables: runtime, genre, mpaa_rating, thtr_rel_year, imdb_rating, imdb_num_votes, critics_score, audience_score, and best_pic_win. Make sure to update the movies data frame.

In [ ]:
 

Compute the mean of each continuous variable. Hint: you'll have to deal with a missing value in runtime.
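As a hint on how missing values interact with means, here is a small sketch on toy data (the numbers are made up). It assumes the missing value is stored as NaN, which is how pandas usually represents missing numeric entries.

```python
import numpy as np
import pandas as pd

# toy data with one missing runtime (hypothetical values)
toy = pd.DataFrame({"runtime": [90.0, np.nan, 120.0], "score": [50, 70, 90]})

# pandas skips NaN by default when computing a mean
pandas_mean = toy["runtime"].mean()

# numpy's mean on the raw array propagates the NaN
numpy_mean = np.mean(toy["runtime"].to_numpy())

# numpy's NaN-aware mean ignores the missing value
nan_aware_mean = np.nanmean(toy["runtime"].to_numpy())

# rows where runtime is missing
missing_rows = toy[toy["runtime"].isnull()]

print(pandas_mean, numpy_mean, nan_aware_mean)
```

Whether you get NaN or a number therefore depends on which mean function you call and on what kind of object you call it.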

In [ ]:
 

Which movie is missing the runtime?

In [ ]:
 

Google this film and manually add the runtime.
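One way to overwrite a single cell is label-based assignment with .loc. The sketch below uses a toy data frame; the title "Film B" and the runtime of 101 are placeholders, not the real answer.

```python
import numpy as np
import pandas as pd

# toy data frame indexed by a made-up title, with one missing runtime
toy = pd.DataFrame({"runtime": [90.0, np.nan]}, index=["Film A", "Film B"])

# assign a value to one cell by row label and column name
toy.loc["Film B", "runtime"] = 101.0

print(toy)
```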

In [ ]:
 

Compute the mean imdb rating for movies by genre. Hint: use groupby.
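If groupby is new to you, this is roughly how it behaves on a toy data frame. The column names kind and value are made up for illustration.

```python
import pandas as pd

# toy data with a categorical column and a numeric column (hypothetical)
toy = pd.DataFrame(
    {"kind": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 6.0]}
)

# split the rows by kind, then average value within each group
group_means = toy.groupby("kind")["value"].mean()

print(group_means)
```

The result is a Series indexed by the group labels, so group_means["a"] gives the mean for group a.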

In [ ]:
 

Similarly, compute the mean number of imdb votes for each mpaa_rating category, then plot the means.

In [ ]:
 

Compute and compare the average imdb rating of movies longer than 100 minutes to that of movies shorter than 100 minutes. The resulting printed data frame should have only two columns.

In [ ]: