This tutorial will give you practice with ggplot2 and dplyr.
You can find everything you need to know in the (free) R for Data Science textbook by Hadley Wickham (primarily chapters 3 and 5). This tutorial assumes you have some familiarity with R, though that is not strictly necessary. If you are already good with base R, this tutorial is a good way to learn the tidyverse, which you should use.
Hints
%>% (also called chaining)
dplyr for subsetting
ggplot for plotting
tibble instead of data.frame (warning: I refer to a tibble as a data frame)
I’ve always wanted to compare IMDb and Rotten Tomatoes ratings. The data include 651 randomly selected movies scraped from the IMDb and Rotten Tomatoes websites. The data were generously provided by Mine Cetinkaya-Rundel and you can find the original data set on her website.
# install.packages('tidyverse')
library(tidyverse)
# read the data into R from Iain's github
movies <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/movies.csv')
The first thing you should do when you see a new data set is look at it.
head(movies)
## # A tibble: 6 × 32
## title title_type genre runtime mpaa_rating
## <chr> <chr> <chr> <dbl> <chr>
## 1 Filly Brown Feature Film Drama 80 R
## 2 The Dish Feature Film Drama 101 PG-13
## 3 Waiting for Guffman Feature Film Comedy 84 R
## 4 The Age of Innocence Feature Film Drama 139 PG
## 5 Malevolence Feature Film Horror 90 R
## 6 Old Partner Documentary Documentary 78 Unrated
## # ... with 27 more variables: studio <chr>, thtr_rel_year <dbl>,
## # thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## # dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## # imdb_num_votes <int>, critics_rating <chr>, critics_score <dbl>,
## # audience_rating <chr>, audience_score <dbl>, best_pic_nom <chr>,
## # best_pic_win <chr>, best_actor_win <chr>, best_actress_win <chr>,
## # best_dir_win <chr>, top200_box <chr>, director <chr>, actor1 <chr>,
## # actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## # imdb_url <chr>, rt_url <chr>
A couple of other functions are useful for a first look.
dim(movies)
colnames(movies)
str(movies)
summary(movies)
You can also check out RStudio’s spreadsheet view.
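As one more option for a first look, here is a small sketch using glimpse(), which dplyr provides. The `toy` tibble and its values are made up for illustration and only stand in for `movies`.

```r
library(dplyr)

# Toy tibble standing in for `movies` (values are invented for illustration)
toy <- tibble(
  title   = c("Film A", "Film B", "Film C"),
  runtime = c(80, 101, NA),
  genre   = c("Drama", "Drama", "Comedy")
)

# glimpse() prints one line per column: name, type, and the first few values
glimpse(toy)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# View(toy)  # opens the spreadsheet viewer (interactive sessions only)
```

Counting missing values early is worth the extra line, since NAs tend to cause surprises later on.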
Answer all the questions using ggplot. The ggplot syntax is a little weird, especially if you are used to base R. In order to use ggplot, your data has to be in a data frame object. You can read about ggplot in r4ds chapter 3, but here is a first example:
ggplot(data=movies) +
geom_point(aes(x=imdb_num_votes, y=imdb_rating))
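To make the pieces of that call easier to see, here is an annotated sketch of the same pattern on different data. It uses `mpg`, a data set that ships with ggplot2, purely as a stand-in; the axis labels and title are illustrative choices.

```r
library(ggplot2)

# `mpg` ships with ggplot2; it is used here only as stand-in data
p <- ggplot(data = mpg) +                    # 1. start a plot from a data frame
  geom_point(aes(x = displ, y = hwy)) +      # 2. add a layer; aes() maps columns to axes
  labs(x = "Engine displacement (litres)",   # 3. relabel the axes and add a title
       y = "Highway miles per gallon",
       title = "Fuel economy vs. engine size")

p  # printing the object draws the plot
```

Each `+` adds one more component to the plot, so you can build a figure up one line at a time and rerun it after each addition.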
Make a histogram of imdb_rating. Hint: geom_histogram.
#
Make the above histogram with 100 bins.
#
Make a scatter plot comparing Rotten Tomatoes critic score vs. imdb ratings. Hint: geom_point. Change the x/y axis labels to something nicer and add a title.
#
Make the same rt vs. imdb scatter plot as above but facet by mpaa_rating.
#
Again make the same rt vs. imdb scatter plot but color the points by mpaa_rating.
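Faceting and coloring are two ways of showing the same categorical variable. The sketch below contrasts them on the `mpg` data that ships with ggplot2 (a stand-in, not the movies data); the object names are arbitrary.

```r
library(ggplot2)

# Shared starting point: data plus the x/y mapping
base <- ggplot(data = mpg, aes(x = displ, y = hwy))

# Faceting: one panel per value of a categorical column
p_facet <- base + geom_point() + facet_wrap(~ drv)

# Coloring: one panel, with points colored by the same categorical column
p_color <- base + geom_point(aes(color = drv))

p_facet
p_color
```

Note that `color = drv` sits inside aes() because it maps a column to an aesthetic; outside aes() it would be read as a fixed color value.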
#
One last time make the rt vs. imdb scatter plot but now try including runtime as a third variable using point
Which one of these is “best”?
#
Use the dplyr package to answer the following questions. Use the pipe %>% operator for long strings of commands.
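If the pipe is new to you, this sketch shows the same two steps written nested and then piped. The `toy` tibble is made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  genre  = c("Drama", "Drama", "Comedy", "Horror"),
  rating = c(7.1, 6.4, 5.9, 4.8)
)

# Nested calls have to be read inside-out
nested <- arrange(filter(toy, rating > 5), desc(rating))

# Piped calls read top to bottom: x %>% f(y) means f(x, y)
piped <- toy %>%
  filter(rating > 5) %>%
  arrange(desc(rating))

# Both versions give the same rows in the same order
all.equal(nested, piped)
```

The piped version gets easier to read, relative to the nested one, with every step you add.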
Use the select function to keep the following variables: runtime, genre, mpaa_rating, thtr_rel_year, imdb_rating, imdb_num_votes, critics_score, audience_score, and best_pic_win. Make sure to update the movies data frame.
#
Use select then summarise_all to compute the mean of each continuous variable (what is the difference between summarise and summarise_all?)
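As a hint toward the question in parentheses, here is a small sketch on a made-up tibble with two numeric columns. The across() line is an addition not mentioned in this tutorial; it is the spelling newer dplyr versions (1.0 and later) use for the same idea.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

# summarise(): you name each summary and say which column it comes from
one <- toy %>% summarise(mean_a = mean(a))

# summarise_all(): one function is applied to every column
all_cols <- toy %>% summarise_all(mean)

# Equivalent in dplyr >= 1.0 using across()
all_cols2 <- toy %>% summarise(across(everything(), mean))

one
all_cols
all_cols2
```

Because summarise_all touches every column, it only makes sense once non-numeric columns have been dropped, which is why select comes first.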
#
Oops! The mean of runtime is NA because one of its values is NA. Looks like IMDb is missing one of the run times. Modify the above code to compute the mean ignoring missing values. Hint: use an anonymous function.
# you'll find this piece of code helpful
# summarise_all(function(x) mean(x, na.rm=T))
Which movie is missing the runtime? Hint: use filter and is.na.
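A minimal sketch of finding rows with a missing value, on made-up data. It relies on is.na(), which tests for missingness; `na.rm`, by contrast, is an argument that tells functions such as mean to drop missing values and does not help inside filter.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  title   = c("Film A", "Film B", "Film C"),
  runtime = c(80, NA, 139)
)

# Comparing with == NA always yields NA, so filter() keeps no rows
wrong <- toy %>% filter(runtime == NA)

# is.na() is TRUE exactly where the value is missing
missing_rows <- toy %>% filter(is.na(runtime))
missing_rows
```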
#
Google this film and manually add the runtime using base R.
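For reference, base R can overwrite a single entry with logical indexing. The sketch below uses a made-up data frame, and the value 95 is a placeholder rather than a real runtime.

```r
# Made-up data for illustration
toy <- data.frame(
  title   = c("Film A", "Film B", "Film C"),
  runtime = c(80, NA, 139)
)

# is.na() picks out the missing entry; the assignment overwrites it in place
toy$runtime[is.na(toy$runtime)] <- 95   # 95 is a placeholder value
toy
```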
#
Use group_by then summarise to compute the mean imdb rating for movies by genre.
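The group_by then summarise pattern can be sketched on made-up data as follows; the column names are illustrative.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  genre  = c("Drama", "Drama", "Comedy", "Comedy", "Horror"),
  rating = c(7, 6, 5, 8, 4)
)

# group_by() marks the groups; summarise() then returns one row per group
by_genre <- toy %>%
  group_by(genre) %>%
  summarise(mean_rating = mean(rating), n = n())

by_genre
```

Including `n()` alongside the mean shows how many rows each average is based on, which helps when some groups are tiny.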
#
Similarly, compute the mean number of imdb votes for each mpaa_rating category, then plot the means. Hint: you will need to use stat='identity'.
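By default geom_bar counts rows, so when the bar heights are already computed you have to tell it to use them as-is. A sketch on made-up summary values (geom_col is a ggplot2 shortcut for the same thing and is an addition not mentioned in this tutorial):

```r
library(ggplot2)

# Made-up, already-summarised values
means <- data.frame(
  group = c("a", "b", "c"),
  value = c(3.2, 5.1, 4.0)
)

# stat = "identity" uses the y values directly instead of counting rows
p_bar <- ggplot(means) +
  geom_bar(aes(x = group, y = value), stat = "identity")

# geom_col() draws the same chart without the stat argument
p_col <- ggplot(means) +
  geom_col(aes(x = group, y = value))

p_bar
```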
#
Compute and compare the average imdb rating of movies longer than 100 minutes to that of movies shorter than 100 minutes. The resulting printed data frame should have only two columns.
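One possible approach, sketched on made-up data, is to group by a logical condition computed on the fly. The names `is_long`, `length`, and `score` are hypothetical.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  length = c(80, 95, 120, 140),
  score  = c(6, 8, 5, 7)
)

# group_by() accepts an expression; here it creates a TRUE/FALSE grouping column
by_length <- toy %>%
  group_by(is_long = length > 100) %>%
  summarise(mean_score = mean(score))

by_length
```

With this condition a value of exactly 100 lands in the FALSE group, so decide which side the boundary belongs to before comparing.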
#