Due Tuesday 1/31/17 before the start of class.

Universities are typically opaque, bureaucratic institutions. Many public schools such as the University of North Carolina system make data about university employees publically available. Using this dataset what can we learn about the dynamics of the university system?

dplyr

To download the .Rmd file for this assignment go to the github repo, right click und_depts.Rmd and select save link as.

The best way to to learn the dplyr package is to just use it a lot. Read through the following dplyr tutorial the complete questions in this .Rmd document. I suggest doing the coding in an R script then putting the answers into this document.

Instructions

Q: Return the sum of the first 10 positive integers?

Q: What is the sum of the first 10 positive integers?

sum(1:10)
## [1] 55

Create a variable x with the sum of the first 10 positive integers.

x <- sum(1:10)

The data

First download the data set from Ryan Thornburg’s website (it is also on stor390’s github repo). You may want to set cache=TRUE for this chunk so you don’t have to download the data set every time you run the script.

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
# Load the UNC departments data
data <- read_csv(url("http://ryanthornburg.com/wp-content/uploads/2015/05/UNC_Salares_NandO_2015-05-06.csv"))

Take a first look at the data

colnames(data)
##  [1] "name"     "campus"   "dept"     "position" "exempt2"  "employed"
##  [7] "hiredate" "fte"      "status"   "stservyr" "statesal" "nonstsal"
## [13] "totalsal" "age"
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    12287 obs. of  14 variables:
##  $ name    : chr  "AARON, NANCY G" "ABARBANELL, JEFFERY S" "ABARE, BETSY" "ABATE, AARON B" ...
##  $ campus  : chr  "UNC-CH" "UNC-CH" "UNC-CH" "UNC-CH" ...
##  $ dept    : chr  "Romance Languages" "Kenan-Flagler Business School" "Institute of Marine Sciences" "Medicine Administration" ...
##  $ position: chr  "Senior Lecturer" "Associate Professor" "Research Technician" "Accounting Technician" ...
##  $ exempt2 : chr  "Exempt" "Exempt" "Subject to State Personnel Act" "Subject to State Personnel Act" ...
##  $ employed: int  9 9 12 12 12 12 12 12 12 12 ...
##  $ hiredate: int  20030701 19990101 20110912 20090420 20120103 20051003 19960923 20130401 19870101 20120702 ...
##  $ fte     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ status  : chr  "Fixed-Term" "Continuing" "Permanent" "Permanent" ...
##  $ stservyr: int  11 17 3 5 2 15 34 11 27 2 ...
##  $ statesal: int  46350 173000 0 0 41696 56588 41707 0 0 0 ...
##  $ nonstsal: int  0 0 38170 50070 0 4412 0 80227 55803 32889 ...
##  $ totalsal: int  46350 173000 38170 50070 41696 61000 41707 80227 55803 32889 ...
##  $ age     : int  55 57 54 29 35 41 62 36 64 26 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 14
##   .. ..$ name    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ campus  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ dept    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ position: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ exempt2 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ employed: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ hiredate: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ fte     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ status  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ stservyr: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ statesal: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ nonstsal: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ totalsal: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ age     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"
head(data)
## # A tibble: 6 x 14
##   name  campus dept  position exempt2 employed hiredate   fte status
##   <chr> <chr>  <chr> <chr>    <chr>      <int>    <int> <dbl> <chr> 
## 1 AARO… UNC-CH Roma… Senior … Exempt         9 20030701     1 Fixed…
## 2 ABAR… UNC-CH Kena… Associa… Exempt         9 19990101     1 Conti…
## 3 ABAR… UNC-CH Inst… Researc… Subjec…       12 20110912     1 Perma…
## 4 ABAT… UNC-CH Medi… Account… Subjec…       12 20090420     1 Perma…
## 5 ABAT… UNC-CH Scho… Student… Subjec…       12 20120103     1 Perma…
## 6 ABBO… UNC-CH Medi… HR Cons… Subjec…       12 20051003     1 Perma…
## # ... with 5 more variables: stservyr <int>, statesal <int>,
## #   nonstsal <int>, totalsal <int>, age <int>

hints

Assignment

Question 1

Return a data frame with columns: name, dept, age,totalsal

#

Return a data frame with columns: statesal, nonstsal,totalsal using one of the select_helpers functions

#

Question 2

Rename the fte column to fulltime. Make sure this change is saved (i.e. data <- ...).

#

Question 3

What is the mean salary in the Neurosurgery department?

#

Return a data frame with employee’s in the Neurosurgery department making more than $500,000. Why might these professors be so well paid?

#

Question 4

What is the total amount that full time Dermatology employees get paid?

#

Question 5

Create a data frame called radio_dept whose rows are the employees from the Radiology department.

  • include only the following columns: name, position, age, nonstsal, totalsal.
  • order the employees by salary

First without chaining

#

Next with chaining

#

Make a histogram of Radiology salaries

#

Question 6

Use faceting to make totalsal histograms for the Radiation Oncology, Radiology, Pediatrics, and Orthopaedics departments.

#

Use faceting to make plot age vs. totalsal for the same four departments and color points by status.

#

Question 7

Create a data frame called dept_summary whose rows are the departments and whose columns are: department size, mean department salary, median department salary, and maximum salary (using totalsal for salary).

#

Order the departments by highest mean salary and print the 10 highest paid departments.

#

Order the departments by highest median salary and print the 10 highest paid departments.

#

Why do these lists differ? If you were asked for the top 10 best paid departments at UNC which summary would you choose and why?

Question 8

How many departments have at least 10 employees?

#

Question 9

Which department hired the most people in 2010? Hint: you probably have to modify hiredate.

#

Question 10

Make a list of all the department names and sort this list alphabetically. What is the 42nd department in this list?

#

Question 11

Plot number of people hired by the CS dept per year vs. year

#

Now add STOR, Math, Biostatistics, SILS and Physics to the above plot

#

Open question

Examine the relationship between age and salary at UNC. You can interpret this however you like e.g.

Requirements