Stuff to install

R

Download the latest version of R from the Comprehensive R Archive Network (CRAN). R is a programming language built for statistical analysis.

R Studio

Download R Studio, an IDE built for R. While you can use R without R Studio, R Studio makes life much better.

Watch the video on this page for the basic parts of R Studio.

Updating R/R Studio

If you already have R and R Studio please update both of them. For instructions see this page.

Basic R commands

You downloaded base R from CRAN.

Run the following in your R console

1 + 1
## [1] 2
a <- 1
b <- 2
a + b
## [1] 3

If you are new to R, I suggest reading through an intro to R before we start.

Packages

The power of R comes from the many wonderful R packages people develop. R is an open source language, meaning anyone can develop a new R package.

You can install a package from CRAN like this

install.packages("tidyverse")

There are other sources of R packages such as Bioconductor and GitHub. Packages from CRAN and Bioconductor are vetted (though not perfectly). Packages on GitHub are not.

To use code from an R package you need to load it

# ignore the warnings for now
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

You need to load an R package every time you want to use it. You only need to install it once.
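One way to put the install-once, load-every-time rule into practice is to check whether a package is already installed before calling install.packages(). The sketch below uses requireNamespace() from base R; the helper name is_installed and the list of packages are made up for illustration.

```r
# Hypothetical helper: TRUE if the package can be found in your library.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R; swap in whatever packages you need.
wanted <- c("stats", "tidyverse")
to_install <- wanted[!vapply(wanted, is_installed, logical(1))]

# Install only what is missing (needs an internet connection).
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)
}

# pkg::fun() calls a function from an installed package without library().
stats::median(c(1, 5, 9))
## [1] 5
```

The install.packages() call is left commented out so that running the sketch never installs anything by accident.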

IMDB

Load the movies data set generously curated by Mine Cetinkaya-Rundel

# downloads data set and loads it into R
load(url('https://stat.duke.edu/~mc301/data/movies.Rdata'))
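If you are wondering where the name movies comes from: load() restores objects under the names they were saved with, and it returns those names. Here is a small offline sketch of the same mechanism using a temporary file; the object toy and the file are made up for illustration.

```r
# Save a small made-up data frame to a temporary .Rdata file
toy <- data.frame(x = 1:3)
rdata_file <- tempfile(fileext = ".Rdata")
save(toy, file = rdata_file)
rm(toy)                      # remove it from the workspace

# load() restores objects under their saved names and returns those names
loaded <- load(rdata_file)
loaded
## [1] "toy"
exists("toy")
## [1] TRUE
```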

The first thing you should do when you get a data set is look at it!

Numerical summaries

str() tells you about the structure of the data frame. The first things to note are the dimensions of the data frame (651 rows by 32 columns) and the column types.

str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
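If you only want the dimensions or the column names, dim(), nrow(), ncol(), and names() are quicker than scanning the str() output. This sketch uses mtcars, a small data set that ships with R, as a stand-in so it runs without downloading anything; the same calls should work on movies.

```r
# mtcars is built into R (32 cars, 11 measurements each)
dim(mtcars)          # rows, then columns
## [1] 32 11
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
names(mtcars)[1:3]   # first three column names
## [1] "mpg"  "cyl"  "disp"
```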

head() prints the first six rows of a data set (and as many columns as will fit on the screen)

head(movies)
## # A tibble: 6 × 32
##                  title   title_type       genre runtime mpaa_rating
##                  <chr>       <fctr>      <fctr>   <dbl>      <fctr>
## 1          Filly Brown Feature Film       Drama      80           R
## 2             The Dish Feature Film       Drama     101       PG-13
## 3  Waiting for Guffman Feature Film      Comedy      84           R
## 4 The Age of Innocence Feature Film       Drama     139          PG
## 5          Malevolence Feature Film      Horror      90           R
## 6          Old Partner  Documentary Documentary      78     Unrated
## # ... with 27 more variables: studio <fctr>, thtr_rel_year <dbl>,
## #   thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <fctr>, critics_score <dbl>,
## #   audience_rating <fctr>, audience_score <dbl>, best_pic_nom <fctr>,
## #   best_pic_win <fctr>, best_actor_win <fctr>, best_actress_win <fctr>,
## #   best_dir_win <fctr>, top200_box <fctr>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## #   imdb_url <chr>, rt_url <chr>
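Six rows is only the default. As a quick sketch (again using the built-in mtcars data as a stand-in), the n argument changes how many rows you get, and tail() does the same thing from the bottom of the data set.

```r
first_three <- head(mtcars, n = 3)   # first 3 rows instead of the default 6
last_two <- tail(mtcars, n = 2)      # tail() works from the bottom

nrow(first_three)
## [1] 3
rownames(last_two)
## [1] "Maserati Bora" "Volvo 142E"
```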

If you double click a data frame it will pull up R’s built-in spreadsheet viewer.

summary() prints out some descriptive statistics of each column

summary(movies)
##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58300  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

The $ sign after a data frame, followed by a column name, will return that column as a vector

movies$imdb_num_votes
##   [1]    899  12285  22381  35096   2386    333   5016   2272    880  12496
##  [11]  71979   9669 201779  25808   5544 240033  66489   6336  37769  21268
##  [21]  56201   3459  16717   9357   4541   1816 163490  19285   3688   3488
##  [31]  82851  56128  33101 259822   4908 290356  47297   2145 466400  69338
##  [41]   6228  75468  48519  13523  58668   9654  22079   5258    761  33839
##  [51]    183 225130  62241  19714 137126  10468  37770   1628   5587   6247
##  [61]   8646   8320   2362   1942 110238  21501  14359  12450  21704   4375
##  [71]  48324   4768   2944   6788  24783   3745  54726   1493   5616  10492
##  [81]   4451  12269  60335   5035 315051  82737   4944   8229   4516   7858
##  [91]  10938 279704  11192   3851   1838   6552    390  68429 182983   1147
## [101]  35635   2569  13980  11477   9853  14970   9001  35868  41767   6954
## [111] 893008   3467 117688   2817  32751   5704    318 123989  70209    541
## [121] 132215 127458  27769  52635  12819   6054   9367  17798  26010   4515
## [131]    535  71572  57933   1784   3883   2282 105745 104457 205065 562136
## [141]   1778   7628  17934   2502  13092   1010   4687  48137 164112  30921
## [151]  37640  95327 289825  15525  43268   3153   1887   3096 112216 154674
## [161]  12535   3363  40659  82378   3135   1489  41385  88523  12221   6304
## [171]   9565  15913 414687  33720    325  22245  17133  23821  13285 192052
## [181] 211129 146518   2239    486   7658   1268  64119  24678   6114   9399
## [191] 414650  15291    725  93331   9876   9525  44741    285   3673  79970
## [201]   1058  60220   5591    340  50340   2934   5564 753592   1141   3967
## [211] 287476  94983  56185   3138  25264   9424  16681   1308  87215 121245
## [221]  71112  72295   2289  84191 235529 168032  56329   2849  26360   5762
## [231]   8521 191935    872   8999 184656   2098 375820  19187   1406  16755
## [241]   9003   5014   2818  38076 137405  56361   7244   2701  18712  25054
## [251]   1510   8544   8561 151934  52449  11001  42613   3505  74294   1480
## [261]    703 103789  42842  26731  26628 149437   9291 157701  14901   1361
## [271]  47065  12606 246587  42208   9216 201787  19000   1799  12877   9025
## [281]   5985  30886   4970   2732   3336 252661  30641  22601   7076 161601
## [291] 285328   2598 108598  20655  12322  99192  44257  77762 368799   2830
## [301]   2960  46233   9980   8016  37938  16480  63219   4251  23697  53675
## [311]   9787   1978   9370  10380  73280   7656  11855 172765  23201   8604
## [321]  14949   1995  15806  30694  56888   4821   3145  19115  70994  66233
## [331]   2295   2698  34652    739  64489  49985   3649   3359  10020  78726
## [341] 246907  49374  40001  18670   5149  35577  38076  87652  17329  16511
## [351]   5002 749783  10599   3101   4180   9904   3428   2056 100447  37506
## [361] 109633  21443   4031  47692  47343  24084  13215 105982  34461  15714
## [371]   3358 275125  13280   2551   5863  73219   8319 265725   3859  53535
## [381]   4907 318019    679 806911   3342   3649  27417   1815   9939   4857
## [391]   1428  44248  18141 303529  86953 128361 297034   4143   3128  36909
## [401]   1816   1043  54829   9725 490295   7881   5136  10055   2408  24472
## [411] 115026    651   7710   5425   4550  16262   1571  10271 204042  16824
## [421]   6061  10651   1346   1886   4874   4121  40133   3416  73617 183747
## [431] 123588 124250  11103  33040  11236   9946   1680 122980  19603   1663
## [441]  71141  13790   5374  14589  11259   6472 100416   3866    872   2928
## [451]   3887   1607   4904  19937  17384  25683   3883  99582   2959 134031
## [461]  17190 135840   2897  10126  19383   8030   3461   3970 572236   4072
## [471] 126257  15491  51070   2530  30495  16955 797101   8059  60483   3602
## [481]  34802   3730  30085  32737  34307  66171   8685  54597   7862  68871
## [491] 582091  11156   6345   9990   3473  42295 329613  42408 137222  51366
## [501]  21623  39320   1915   1674   2096   1935  10522   2380  78862  83724
## [511]  34298    830   2869 134510 152216  54771  11838 110540   6343 309494
## [521]   6909   1890  72176   2271   6804 161101  10535 448434   2931  21009
## [531]   6811   1943  32338  19161  54871   2433  21924 128298  14559  34926
## [541]  27097   4021    764   6418  70737   9656 193702  59076 154148  11377
## [551]   3302    180  34253  17960    281  12498  86831  83424   1803  56919
## [561]  10250  18005  62773  15444  48756  14986  13525 246343 290958   3146
## [571]  10886  96471  17101    723 106171  88777 294683  51534   3487   5115
## [581]  15449   2181   9832 247105  13614  78297   4369   6765  16137 101850
## [591]    504   1935  15116 183717  64873  20738 123769  79866 160237  24595
## [601]  16883    390  19539  48718  26301  26943   4077  63511  27601 756602
## [611]   3998  19898  10786   2857   9675   7545   2113   3448   2441  12402
## [621]   3373   6322   9906  15025  30826 309896   7284  58907  57251   3790
## [631]   8818  11125 675907   2120 111132 103378  13682  63672   6946   3584
## [641]  54363  11197  96787  16366 134270  11657   8345  46794  10087  66054
## [651]  43574
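Printing all 651 values is rarely what you want. Here is a small sketch of a few ways to work with a single column, using a made-up three-row data frame (df, title, and votes are hypothetical names, not part of the movies data).

```r
# Small made-up data frame for illustration
df <- data.frame(title = c("A", "B", "C"), votes = c(10L, 250L, 40L))

df$votes               # the column as a vector
## [1]  10 250  40
df[["votes"]]          # same thing, handy when the name is stored in a variable
## [1]  10 250  40
df$votes[2]            # second element of that column
## [1] 250
head(df$votes, n = 2)  # just the first couple of values
## [1]  10 250
length(df$votes)       # one value per row
## [1] 3
```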

The mean() function computes the mean of a vector. There are also median(), var(), min(), and max() functions.

mean(movies$imdb_num_votes)
## [1] 57532.98
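One thing to watch out for: the summary() output above shows that runtime has a missing value (NA). By default these functions return NA if any value is missing, so you need na.rm = TRUE to drop the missing values first. A minimal sketch with made-up runtimes:

```r
# Made-up runtimes in minutes; NA marks a missing value
runtime <- c(80, 101, 84, NA, 90)

mean(runtime)                   # any NA makes the result NA
## [1] NA
mean(runtime, na.rm = TRUE)     # drop missing values first
## [1] 88.75
median(runtime, na.rm = TRUE)
## [1] 87
var(runtime, na.rm = TRUE)
## [1] 83.58333
min(runtime, na.rm = TRUE)
## [1] 80
max(runtime, na.rm = TRUE)
## [1] 101
```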

Visualization

You can only learn so much by looking at lists of numbers. Let’s make some plots.

There are two popular plotting systems in R. There is the base R system

plot(movies$imdb_rating, movies$critics_score)

and ggplot2.

# ggplot was loaded with tidyverse
ggplot(data = movies) + geom_point(mapping = aes(x = imdb_rating, y = critics_score))

We will use ggplot2 in this course (see readings below about ggplot2 vs base). ggplot2 can be a bit intimidating at first – especially if you are used to base plotting.

ggplot(data=movies, aes(x=audience_score)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
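That message is ggplot2 telling you it guessed the number of bins. Here is a sketch of how you might pick the bin width yourself; the scores data frame is made up to stand in for movies, and a width of 5 is just one reasonable choice for a 0 to 100 scale.

```r
library(ggplot2)

# Made-up scores on a 0-100 scale, standing in for movies$audience_score
scores <- data.frame(audience_score = c(27, 47, 66, 73, 76, 76, 81, 86, 89, 91))

# binwidth = 5 groups the scores into bins that are 5 points wide
p <- ggplot(data = scores, aes(x = audience_score)) +
  geom_histogram(binwidth = 5)
p
```

Try a few different values of binwidth; the shape of a histogram can change quite a bit depending on the choice.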

ggplot2 has a ton of functionality built in and you will learn to love it when you get used to it.

ggplot(data = movies) + geom_point(mapping = aes(x = imdb_rating, y = critics_score, color=mpaa_rating))
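A common stumbling block with plots like this one is whether color goes inside or outside aes(). The sketch below uses the built-in mtcars data as a stand-in: put color inside aes() when it should depend on a column, and outside when every point should get the same fixed color.

```r
library(ggplot2)

# Mapped: color depends on a column, so it goes inside aes() and gets a legend
p_mapped <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg, color = factor(cyl)))

# Set: every point is blue, so color goes outside aes()
p_set <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg), color = "blue")

p_mapped
p_set
```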

Getting help

The main textbook for this class is R for Data Science written by Hadley Wickham (it’s free online). I have put up a long list of alternative resources (textbooks, Coursera courses, etc).

Google and StackOverflow will become your best friends. If you have a question, chances are someone has already asked and answered it. If R gives you an error message you don’t understand google it – someone else has probably figured it out and posted it online.
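R also comes with built-in documentation, which is often worth a look before (or while) you search online. A few ways to get at it from the console, using mean() and median() as examples:

```r
?mean            # open the help page for mean()
help("mean")     # same thing, written as a function call
args(median)     # show a function's arguments in the console
example(mean)    # run the examples from the bottom of the help page
```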

The course staff is of course here to help you. If you get stuck on something, spend at least 5 minutes Googling/hacking at it. There is a good chance the first thing one of us will do is turn to Google. Don’t spend more than 20 minutes stuck on one thing – at this point you should ask for help.

Yak Shaving

Any apparently useless activity which, by allowing you to overcome intermediate difficulties, allows you to solve a larger problem.

I was doing a bit of yak shaving this morning, and it looks like it might have paid off.

Programming/data science requires a lot of yak shaving which can be quite frustrating. You will probably come across the following quote at some point

80 percent of data science is data cleaning

Literate programming and R Markdown

Literate Programming is a concept introduced by Donald Knuth saying you should write code that communicates primarily to humans, not computers. Here are some examples:

R Markdown allows you to easily write documents that contain R code, text, images, links, etc. It may sound bland at first, but R Markdown is pretty amazing. The lecture notes and course webpage were done with R Markdown.

Open a new R Markdown document and play around with it. We will use R Markdown quite a bit in the class. You can read more about R Markdown in r4ds. This document may be helpful to get started with R Markdown: http://stat545.com/block007_first-use-rmarkdown.html
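To give a feel for what an R Markdown file looks like, here is a sketch that writes a very small one to a temporary file from R. The title, text, and chunk contents are made up for illustration; in practice you would normally create the file from R Studio's File menu instead.

```r
fence <- strrep("`", 3)   # three backticks mark the start and end of a code chunk

rmd_lines <- c(
  "---",                                    # header with document settings
  'title: "My first R Markdown document"',
  "output: html_document",
  "---",
  "",
  "Plain text goes here, with **bold** and other Markdown formatting.",
  "",
  paste0(fence, "{r}"),                     # start of an R code chunk
  "a <- 1",
  "b <- 2",
  "a + b",
  fence                                     # end of the chunk
)

rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# To turn it into a document, open the file in R Studio and click Knit,
# or (assuming the rmarkdown package is installed) run:
# rmarkdown::render(rmd_file)
```

When the file is knit, the code in the chunk is run and its output (here, 3) is placed in the document right below the code.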

If you are using Python then you’ll find Jupyter notebooks are the best thing since sliced bread (there are now R notebooks too).

Additional references

  • ggplot2 vs. base plotting
  • R vs. Python
    • “better to be good at one than mediocre at both” (I’ve heard this from multiple sources but can’t seem to find a link to one…)
    • a priori doesn’t matter – there are cases when one is better than the other