Download the latest version of R from the Comprehensive R Archive Network (CRAN). R is a programming language built for statistical analysis.
Download RStudio, which is an IDE built for R. While you can use R without RStudio, RStudio makes life much better.
Watch the video on this page for the basic parts of RStudio.
If you already have R and RStudio, please update both of them. For instructions see this page.
You downloaded base R from CRAN.
Run the following in your R console
1 + 1
## [1] 2
a <- 1
b <- 2
a + b
## [1] 3
If you are new to R, I suggest reading through "Before we start" and "Intro to R".
The power of R comes from the many wonderful R packages people develop. R is an open source language meaning anyone can develop a new R package.
You can install a package from CRAN like this
install.packages("tidyverse")
There are other sources of R packages, such as Bioconductor and GitHub. Packages from CRAN and Bioconductor are vetted (though not perfectly). Packages on GitHub are not.
To use code from an R package you need to load it
# ignore the warnings for now
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
You need to load an R package every time you want to use it. You only need to install it once.
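Because installation happens once but loading happens every session, a common pattern is to install a package only when it is missing. The sketch below assumes a helper named install_if_missing(), which is hypothetical and not part of base R or the tidyverse; it relies only on requireNamespace() and install.packages(), both of which ship with R.

```r
# Hypothetical helper (not part of any package): install a package from CRAN
# only if it is not already available. Returns TRUE (invisibly) if it had to
# install the package and FALSE if the package was already there.
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  install.packages(pkg)
  invisible(TRUE)
}

# "stats" ships with every R installation, so this call installs nothing.
needed_install <- install_if_missing("stats")
```

At the top of a script you could call install_if_missing("tidyverse") and then library(tidyverse); the library() call is still needed in every new session.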
Load the movies data set generously curated by Mine Cetinkaya-Rundel
# downloads data set and loads it into R
load(url('https://stat.duke.edu/~mc301/data/movies.Rdata'))
The first thing you should do when you get a data set is look at it!
str() tells you about the data frame. The first things to note are the dimensions of the data frame (651 rows by 32 columns) and the column types.
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
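If you only want the dimensions or the column types rather than the full str() printout, base R has smaller helpers. The sketch below uses a made-up three-row data frame called toy (a hypothetical stand-in, not the movies data) so that it runs without downloading anything.

```r
# Hypothetical toy data frame standing in for movies (values are made up).
toy <- data.frame(
  title   = c("Movie A", "Movie B", "Movie C"),
  runtime = c(80, 101, 84),
  stringsAsFactors = FALSE
)

dim(toy)            # number of rows, then number of columns
nrow(toy)           # rows only
ncol(toy)           # columns only
names(toy)          # column names
sapply(toy, class)  # class of each column
```

The same calls work on movies once it is loaded, for example dim(movies).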
head() prints the first six rows of a data set (and as many columns as will fit on the screen).
head(movies)
## # A tibble: 6 × 32
## title title_type genre runtime mpaa_rating
## <chr> <fctr> <fctr> <dbl> <fctr>
## 1 Filly Brown Feature Film Drama 80 R
## 2 The Dish Feature Film Drama 101 PG-13
## 3 Waiting for Guffman Feature Film Comedy 84 R
## 4 The Age of Innocence Feature Film Drama 139 PG
## 5 Malevolence Feature Film Horror 90 R
## 6 Old Partner Documentary Documentary 78 Unrated
## # ... with 27 more variables: studio <fctr>, thtr_rel_year <dbl>,
## # thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## # dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## # imdb_num_votes <int>, critics_rating <fctr>, critics_score <dbl>,
## # audience_rating <fctr>, audience_score <dbl>, best_pic_nom <fctr>,
## # best_pic_win <fctr>, best_actor_win <fctr>, best_actress_win <fctr>,
## # best_dir_win <fctr>, top200_box <fctr>, director <chr>, actor1 <chr>,
## # actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## # imdb_url <chr>, rt_url <chr>
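Six rows is only the default. As a small sketch, the n argument of head() changes how many rows are returned, and tail() does the same from the bottom of the data set. The data frame toy below is made up for illustration.

```r
# Hypothetical toy data frame with ten rows (values are made up).
toy <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

first_three <- head(toy, n = 3)  # first three rows
last_two    <- tail(toy, n = 2)  # last two rows

first_three
last_two
```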
If you double-click a data frame in RStudio, it will open in the built-in spreadsheet-style viewer.
summary() prints out some descriptive statistics of each column.
summary(movies)
## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58300
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The $ sign after a data frame name returns a column.
movies$imdb_num_votes
## [1] 899 12285 22381 35096 2386 333 5016 2272 880 12496
## [11] 71979 9669 201779 25808 5544 240033 66489 6336 37769 21268
## [21] 56201 3459 16717 9357 4541 1816 163490 19285 3688 3488
## [31] 82851 56128 33101 259822 4908 290356 47297 2145 466400 69338
## [41] 6228 75468 48519 13523 58668 9654 22079 5258 761 33839
## [51] 183 225130 62241 19714 137126 10468 37770 1628 5587 6247
## [61] 8646 8320 2362 1942 110238 21501 14359 12450 21704 4375
## [71] 48324 4768 2944 6788 24783 3745 54726 1493 5616 10492
## [81] 4451 12269 60335 5035 315051 82737 4944 8229 4516 7858
## [91] 10938 279704 11192 3851 1838 6552 390 68429 182983 1147
## [101] 35635 2569 13980 11477 9853 14970 9001 35868 41767 6954
## [111] 893008 3467 117688 2817 32751 5704 318 123989 70209 541
## [121] 132215 127458 27769 52635 12819 6054 9367 17798 26010 4515
## [131] 535 71572 57933 1784 3883 2282 105745 104457 205065 562136
## [141] 1778 7628 17934 2502 13092 1010 4687 48137 164112 30921
## [151] 37640 95327 289825 15525 43268 3153 1887 3096 112216 154674
## [161] 12535 3363 40659 82378 3135 1489 41385 88523 12221 6304
## [171] 9565 15913 414687 33720 325 22245 17133 23821 13285 192052
## [181] 211129 146518 2239 486 7658 1268 64119 24678 6114 9399
## [191] 414650 15291 725 93331 9876 9525 44741 285 3673 79970
## [201] 1058 60220 5591 340 50340 2934 5564 753592 1141 3967
## [211] 287476 94983 56185 3138 25264 9424 16681 1308 87215 121245
## [221] 71112 72295 2289 84191 235529 168032 56329 2849 26360 5762
## [231] 8521 191935 872 8999 184656 2098 375820 19187 1406 16755
## [241] 9003 5014 2818 38076 137405 56361 7244 2701 18712 25054
## [251] 1510 8544 8561 151934 52449 11001 42613 3505 74294 1480
## [261] 703 103789 42842 26731 26628 149437 9291 157701 14901 1361
## [271] 47065 12606 246587 42208 9216 201787 19000 1799 12877 9025
## [281] 5985 30886 4970 2732 3336 252661 30641 22601 7076 161601
## [291] 285328 2598 108598 20655 12322 99192 44257 77762 368799 2830
## [301] 2960 46233 9980 8016 37938 16480 63219 4251 23697 53675
## [311] 9787 1978 9370 10380 73280 7656 11855 172765 23201 8604
## [321] 14949 1995 15806 30694 56888 4821 3145 19115 70994 66233
## [331] 2295 2698 34652 739 64489 49985 3649 3359 10020 78726
## [341] 246907 49374 40001 18670 5149 35577 38076 87652 17329 16511
## [351] 5002 749783 10599 3101 4180 9904 3428 2056 100447 37506
## [361] 109633 21443 4031 47692 47343 24084 13215 105982 34461 15714
## [371] 3358 275125 13280 2551 5863 73219 8319 265725 3859 53535
## [381] 4907 318019 679 806911 3342 3649 27417 1815 9939 4857
## [391] 1428 44248 18141 303529 86953 128361 297034 4143 3128 36909
## [401] 1816 1043 54829 9725 490295 7881 5136 10055 2408 24472
## [411] 115026 651 7710 5425 4550 16262 1571 10271 204042 16824
## [421] 6061 10651 1346 1886 4874 4121 40133 3416 73617 183747
## [431] 123588 124250 11103 33040 11236 9946 1680 122980 19603 1663
## [441] 71141 13790 5374 14589 11259 6472 100416 3866 872 2928
## [451] 3887 1607 4904 19937 17384 25683 3883 99582 2959 134031
## [461] 17190 135840 2897 10126 19383 8030 3461 3970 572236 4072
## [471] 126257 15491 51070 2530 30495 16955 797101 8059 60483 3602
## [481] 34802 3730 30085 32737 34307 66171 8685 54597 7862 68871
## [491] 582091 11156 6345 9990 3473 42295 329613 42408 137222 51366
## [501] 21623 39320 1915 1674 2096 1935 10522 2380 78862 83724
## [511] 34298 830 2869 134510 152216 54771 11838 110540 6343 309494
## [521] 6909 1890 72176 2271 6804 161101 10535 448434 2931 21009
## [531] 6811 1943 32338 19161 54871 2433 21924 128298 14559 34926
## [541] 27097 4021 764 6418 70737 9656 193702 59076 154148 11377
## [551] 3302 180 34253 17960 281 12498 86831 83424 1803 56919
## [561] 10250 18005 62773 15444 48756 14986 13525 246343 290958 3146
## [571] 10886 96471 17101 723 106171 88777 294683 51534 3487 5115
## [581] 15449 2181 9832 247105 13614 78297 4369 6765 16137 101850
## [591] 504 1935 15116 183717 64873 20738 123769 79866 160237 24595
## [601] 16883 390 19539 48718 26301 26943 4077 63511 27601 756602
## [611] 3998 19898 10786 2857 9675 7545 2113 3448 2441 12402
## [621] 3373 6322 9906 15025 30826 309896 7284 58907 57251 3790
## [631] 8818 11125 675907 2120 111132 103378 13682 63672 6946 3584
## [641] 54363 11197 96787 16366 134270 11657 8345 46794 10087 66054
## [651] 43574
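A column pulled out with $ is an ordinary vector, so it can be indexed like one. The sketch below also shows the [[ ]] operator, which does the same job as $ but accepts the column name as a string, which is handy when the name is stored in a variable. The data frame toy and its values are made up for illustration.

```r
# Hypothetical toy data frame (values are made up).
toy <- data.frame(
  votes  = c(899, 12285, 22381),
  rating = c(5.5, 7.3, 7.6)
)

by_dollar <- toy$votes       # column via $
by_name   <- toy[["votes"]]  # same column via [[ ]]

col <- "rating"
by_variable <- toy[[col]]    # column name held in a variable

first_two <- toy$votes[1:2]  # first two entries of the column
```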
The mean() function computes the mean of a vector. There are also median(), var(), min(), and max().
mean(movies$imdb_num_votes)
## [1] 57532.98
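One thing to watch for: the summary() output above reports a missing value (NA) in runtime. When a vector contains NA, mean() and its relatives return NA unless you pass na.rm = TRUE. A minimal sketch with a made-up vector named runtimes:

```r
# Made-up vector with one missing value.
runtimes <- c(80, 101, NA, 139)

mean(runtimes)                         # NA, because of the missing value

avg <- mean(runtimes, na.rm = TRUE)    # drops the NA: (80 + 101 + 139) / 3
med <- median(runtimes, na.rm = TRUE)  # middle of 80, 101, 139
```

With the real data this would be mean(movies$runtime, na.rm = TRUE).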
You can only learn so much by looking at lists of numbers. Let’s make some plots.
There are two popular plotting systems in R. There is the base R system
plot(movies$imdb_rating, movies$critics_score)
and ggplot2.
# ggplot was loaded with tidyverse
ggplot(data = movies) + geom_point(mapping = aes(x = imdb_rating, y = critics_score))
We will use ggplot2 in this course (see readings below about ggplot2 vs base). ggplot2 can be a bit intimidating at first – especially if you are used to base plotting.
ggplot(data=movies, aes(x=audience_score)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
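The message above is ggplot2 pointing out that 30 bins is just a default. As a sketch, you can set the bin width yourself with the binwidth argument of geom_histogram(). The data frame scores below is made up so the example runs on its own, and the width of 10 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for movies$audience_score (values are made up).
scores <- data.frame(
  audience_score = c(11, 27, 46, 47, 65, 66, 73, 76, 76, 80, 81, 86, 89, 91, 97)
)

# Each bar now covers 10 points of audience score.
p <- ggplot(data = scores, aes(x = audience_score)) +
  geom_histogram(binwidth = 10)

p
```

With the real data you would swap scores for movies and try a few widths to see which one shows the shape of the distribution best.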
ggplot2 has a ton of functionality built in, and you will learn to love it once you get used to it.
ggplot(data = movies) + geom_point(mapping = aes(x = imdb_rating, y = critics_score, color=mpaa_rating))
The main textbook for this class is R for Data Science, written by Hadley Wickham and Garrett Grolemund (it’s free online). I have put up a long list of alternative resources (textbooks, Coursera courses, etc).
Google and StackOverflow will become your best friends. If you have a question, chances are someone has already asked and answered it. If R gives you an error message you don’t understand, Google it; someone else has probably figured it out and posted it online.
The course staff is of course here to help you. If you get stuck on something, spend at least 5 minutes Googling/hacking at it. There is a good chance the first thing one of us will do is turn to Google. Don’t spend more than 20 minutes stuck on one thing; at that point you should ask for help.
Yak shaving: any apparently useless activity which, by allowing you to overcome intermediate difficulties, allows you to solve a larger problem.
I was doing a bit of yak shaving this morning, and it looks like it might have paid off.
Programming/data science requires a lot of yak shaving which can be quite frustrating. You will probably come across the following quote at some point
80 percent of data science is data cleaning
Literate programming is a concept introduced by Donald Knuth, saying you should write code that communicates primarily to humans, not computers. Here are some examples:
R Markdown allows you to easily write documents that contain R code, text, images, links, etc. It may sound bland at first, but R Markdown is pretty amazing. The lecture notes and course webpage were done with R Markdown.
Open a new R Markdown document and play around with it. We will use R Markdown quite a bit in the class. You can read more about R Markdown in r4ds. This document may be helpful to get started with R Markdown: http://stat545.com/block007_first-use-rmarkdown.html
If you are using Python then you’ll find Jupyter notebooks are the best thing since sliced bread (there are now R notebooks).