Due Tuesday 1/31/17 before the start of class.
Universities are typically opaque, bureaucratic institutions. Many public schools such as the University of North Carolina system make data about university employees publically available. Using this dataset what can we learn about the dynamics of the university system?
To download the .Rmd file for this assignment go to the github repo, right click und_depts.Rmd and select save link as.
The best way to to learn the dplyr package is to just use it a lot. Read through the following dplyr tutorial the complete questions in this .Rmd document. I suggest doing the coding in an R script then putting the answers into this document.
chapter 5 (data transformation) from r4ds
dplyr
functions to answer the following questions (base R will not be given credit)pay careful attention to the questions
return something means just print it out. Similarly for questions What is … just print out the answer
Q: Return the sum of the first 10 positive integers?
Q: What is the sum of the first 10 positive integers?
sum(1:10)
## [1] 55
Create a variable x
with the sum of the first 10 positive integers.
x <- sum(1:10)
First download the data set from Ryan Thornburg’s website (it is also on stor390’s github repo). You may want to set cache=TRUE
for this chunk so you don’t have to download the data set every time you run the script.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
# Load the UNC departments data
data <- read_csv(url("http://ryanthornburg.com/wp-content/uploads/2015/05/UNC_Salares_NandO_2015-05-06.csv"))
Take a first look at the data
colnames(data)
## [1] "name" "campus" "dept" "position" "exempt2" "employed"
## [7] "hiredate" "fte" "status" "stservyr" "statesal" "nonstsal"
## [13] "totalsal" "age"
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 12287 obs. of 14 variables:
## $ name : chr "AARON, NANCY G" "ABARBANELL, JEFFERY S" "ABARE, BETSY" "ABATE, AARON B" ...
## $ campus : chr "UNC-CH" "UNC-CH" "UNC-CH" "UNC-CH" ...
## $ dept : chr "Romance Languages" "Kenan-Flagler Business School" "Institute of Marine Sciences" "Medicine Administration" ...
## $ position: chr "Senior Lecturer" "Associate Professor" "Research Technician" "Accounting Technician" ...
## $ exempt2 : chr "Exempt" "Exempt" "Subject to State Personnel Act" "Subject to State Personnel Act" ...
## $ employed: int 9 9 12 12 12 12 12 12 12 12 ...
## $ hiredate: int 20030701 19990101 20110912 20090420 20120103 20051003 19960923 20130401 19870101 20120702 ...
## $ fte : num 1 1 1 1 1 1 1 1 1 1 ...
## $ status : chr "Fixed-Term" "Continuing" "Permanent" "Permanent" ...
## $ stservyr: int 11 17 3 5 2 15 34 11 27 2 ...
## $ statesal: int 46350 173000 0 0 41696 56588 41707 0 0 0 ...
## $ nonstsal: int 0 0 38170 50070 0 4412 0 80227 55803 32889 ...
## $ totalsal: int 46350 173000 38170 50070 41696 61000 41707 80227 55803 32889 ...
## $ age : int 55 57 54 29 35 41 62 36 64 26 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 14
## .. ..$ name : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ campus : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ dept : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ position: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ exempt2 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ employed: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ hiredate: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ fte : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ status : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ stservyr: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ statesal: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ nonstsal: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ totalsal: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ age : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
head(data)
## # A tibble: 6 x 14
## name campus dept position exempt2 employed hiredate fte status
## <chr> <chr> <chr> <chr> <chr> <int> <int> <dbl> <chr>
## 1 AARO… UNC-CH Roma… Senior … Exempt 9 20030701 1 Fixed…
## 2 ABAR… UNC-CH Kena… Associa… Exempt 9 19990101 1 Conti…
## 3 ABAR… UNC-CH Inst… Researc… Subjec… 12 20110912 1 Perma…
## 4 ABAT… UNC-CH Medi… Account… Subjec… 12 20090420 1 Perma…
## 5 ABAT… UNC-CH Scho… Student… Subjec… 12 20120103 1 Perma…
## 6 ABBO… UNC-CH Medi… HR Cons… Subjec… 12 20051003 1 Perma…
## # ... with 5 more variables: stservyr <int>, statesal <int>,
## # nonstsal <int>, totalsal <int>, age <int>
Read through the first dplyr tutorial. Every question can be answered using: filter, arrange, select, mutate, summarize and group_by functions.
here are two more dplyr tutorials: https://rpubs.com/justmarkham/dplyr-tutorial and https://www.coursera.org/learn/data-cleaning/home/week/3
fte means full time employee
Return a data frame with columns: name, dept, age,totalsal
#
Return a data frame with columns: statesal, nonstsal,totalsal using one of the select_helpers
functions
#
Rename the fte
column to fulltime
. Make sure this change is saved (i.e. data <- ...
).
#
What is the mean salary in the Neurosurgery department?
#
Return a data frame with employee’s in the Neurosurgery department making more than $500,000. Why might these professors be so well paid?
#
What is the total amount that full time Dermatology employees get paid?
#
Create a data frame called radio_dept whose rows are the employees from the Radiology department.
First without chaining
#
Next with chaining
#
Make a histogram of Radiology salaries
#
Use faceting to make totalsal histograms for the Radiation Oncology, Radiology, Pediatrics, and Orthopaedics departments.
#
Use faceting to make plot age vs. totalsal for the same four departments and color points by status.
#
Create a data frame called dept_summary
whose rows are the departments and whose columns are: department size, mean department salary, median department salary, and maximum salary (using totalsal for salary).
#
Order the departments by highest mean salary and print the 10 highest paid departments.
#
Order the departments by highest median salary and print the 10 highest paid departments.
#
Why do these lists differ? If you were asked for the top 10 best paid departments at UNC which summary would you choose and why?
How many departments have at least 10 employees?
#
Which department hired the most people in 2010? Hint: you probably have to modify hiredate
.
#
Make a list of all the department names and sort this list alphabetically. What is the 42nd department in this list?
#
Plot number of people hired by the CS dept per year vs. year
#
Now add STOR, Math, Biostatistics, SILS and Physics to the above plot
#
Examine the relationship between age and salary at UNC. You can interpret this however you like e.g.
Requirements