The reference for these notes is
Read in the UNC depts data
library(tidyverse)
data <- read_csv(url("http://ryanthornburg.com/wp-content/uploads/2015/05/UNC_Salares_NandO_2015-05-06.csv"))
Read the introduction of the EDA chapter (note the definitions of variable, value, observation, and tabular data).
How does one variable vary across the observations? Two things to keep an eye out for
Look at summary statistics of a variable. There are two types of summary statistics: location (mean, median, mode) and range/scale (min, max, variance).
median(data$totalsal)
## [1] 59342
max(data$totalsal)
## [1] 819069
A summary statistic compresses all \(N\) observations you have into a single number. These numbers can be informative, but you lose information. Visualizing your data can often provide more insight than the summary statistics alone.
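To see what gets lost, here is a minimal sketch with simulated data (the vectors `a` and `b` are made up for illustration and are not the UNC salaries). Both are rescaled to the same mean and standard deviation, yet one has a single peak and the other has two.

```r
# Simulated data (an assumption for illustration; not the UNC salary data)
set.seed(1)
a_raw <- rnorm(500)                                  # one peak
b_raw <- c(rnorm(250, -2, 0.5), rnorm(250, 2, 0.5))  # two peaks

# Rescale each vector to mean 0 and standard deviation 1
a <- as.numeric(scale(a_raw))
b <- as.numeric(scale(b_raw))

# The summary statistics agree ...
c(mean_a = mean(a), mean_b = mean(b), sd_a = sd(a), sd_b = sd(b))

# ... but the shapes do not: share of points near the middle
mean(abs(a) < 0.5)
mean(abs(b) < 0.5)
```

The mean and standard deviation cannot tell these two vectors apart; a plot of the raw values can.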
For one variable, the simplest visualization is to just plot the raw data.
# plot each data point
ggplot(data=data) +
geom_point(aes(x=totalsal, y=0)) +
ylim(-10, 10)
One problem with the above plot is that data points lie on top of each other. A better version of this plot is a jitter plot
# same plot as above but with random y values
ggplot(data=data) +
geom_jitter(aes(x=totalsal, y=0)) +
ylim(-10, 10)
A jitter plot adds a small amount of random noise to the data so the points no longer overlap.
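What jittering does can be sketched by hand in base R. The tied `x` values are simulated, and the noise size `height` is an illustrative choice, not necessarily what geom_jitter uses by default.

```r
# Simulated x values with many ties (an assumption; not the UNC data)
set.seed(2)
x <- sample(c(40000, 50000, 60000), size = 300, replace = TRUE)
y <- rep(0, length(x))

# Jitter by hand: add uniform noise of at most `height` to y only
height <- 0.4  # illustrative choice
y_jit <- y + runif(length(y), min = -height, max = height)

# Before: all points share a handful of (x, y) positions
n_before <- nrow(unique(data.frame(x = x, y = y)))
# After: the points are spread out vertically
n_after <- nrow(unique(data.frame(x = x, y = y_jit)))
c(before = n_before, after = n_after)
```

The x values are untouched, so the horizontal position still carries the data; only the meaningless y coordinate is perturbed.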
A plot with all the data points has a lot of information in it. To gain insight it can be useful to compress the information. For example you might make a box plot.
ggplot(data=data) +
geom_boxplot(aes(x=0, y=totalsal))
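The handful of numbers a box plot compresses the data into can be computed directly. This is a sketch on simulated, salary-like values (an assumption, not the UNC data) using base R's `fivenum()` and `boxplot.stats()`; ggplot2 may compute the hinges slightly differently.

```r
# Simulated right-skewed values (an assumption; not the UNC data)
set.seed(3)
sal <- rlnorm(1000, meanlog = 11, sdlog = 0.5)

five <- fivenum(sal)      # min, lower hinge, median, upper hinge, max
bs <- boxplot.stats(sal)  # whisker ends, hinges, median, and outliers

# Whiskers reach to the most extreme points within 1.5 box lengths
box_length <- five[4] - five[2]
upper_fence <- five[4] + 1.5 * box_length
lower_fence <- five[2] - 1.5 * box_length

bs$stats        # lower whisker, lower hinge, median, upper hinge, upper whisker
length(bs$out)  # number of points drawn individually beyond the whiskers
```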
Or a histogram
ggplot(data=data) +
geom_histogram(aes(x=totalsal), bins = 30)
Histograms have an important parameter: the bin width (or, equivalently, the number of bins). ggplot2 defaults to 30 bins (dividing the range of the data equally). Warning: histograms can look dramatically different depending on the bin width. There is not necessarily one “correct” bin width. You should try a range of bin widths and use your judgement.
This is probably too many bins
ggplot(data=data) +
geom_histogram(aes(x=totalsal), bins = 10000)
Probably too few bins
ggplot(data=data) +
geom_histogram(aes(x=totalsal), bins = 2)
A reasonable number of bins
ggplot(data=data) +
geom_histogram(aes(x=totalsal), bins = 100)
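The effect of the number of bins can also be checked numerically. This sketch uses simulated values and a hypothetical helper `bin_counts()` that cuts the range into equal-width bins; it is not ggplot2's exact binning rule.

```r
# Simulated right-skewed values (an assumption; not the UNC data)
set.seed(4)
x <- rlnorm(1000, meanlog = 11, sdlog = 0.5)

# Hypothetical helper: counts per bin for `bins` equal-width bins
bin_counts <- function(x, bins) {
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  idx <- findInterval(x, breaks, all.inside = TRUE)  # bin index of each point
  tabulate(idx, nbins = bins)
}

# Too few bins: nearly all points land in one bin
bin_counts(x, 2)

# Too many bins: most bins are empty
mean(bin_counts(x, 10000) == 0)

# Bin width implied by each choice
diff(range(x)) / c(2, 100, 10000)
```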
Many data sets are multimodal. Finding modes can be some of the most important discoveries. Setting the bin width too wide can hide modes. Setting the bin width too narrow can show modes that don’t exist.
This simulated data is a mixture of two normal components. (With means only 0.5 apart the true density has a single peak, so any extra modes you see are artifacts.) Depending on the bin width you might see one, two, three (or more?) modes. “Objectively” answering the question “is that mode really there?” is non-trivial (e.g. see SiZer).
set.seed(342)
mix <- tibble(val=c(rnorm(n=200, mean=0, sd=1),
rnorm(n=200, mean=.5, sd=1)))
# wide binwidth
ggplot(data=mix) +
geom_histogram(aes(x=val), bins = 10)
# moderate binwidth
ggplot(data=mix) +
geom_histogram(aes(x=val), bins = 30)
# small binwidth
ggplot(data=mix) +
geom_histogram(aes(x=val), bins = 100)
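A rough way to see how the apparent number of modes depends on the bins is to count local maxima in the bin counts. The helpers `bin_counts()` and `count_peaks()` are hypothetical, written for this sketch, and the equal-width binning is not exactly what ggplot2 does.

```r
# Same simulation as above, kept in a plain vector
set.seed(342)
val <- c(rnorm(n = 200, mean = 0, sd = 1),
         rnorm(n = 200, mean = .5, sd = 1))

# Hypothetical helper: counts per bin for `bins` equal-width bins
bin_counts <- function(x, bins) {
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  tabulate(findInterval(x, breaks, all.inside = TRUE), nbins = bins)
}

# Hypothetical helper: number of local maxima in a vector of counts
count_peaks <- function(counts) {
  runs <- rle(counts)$values     # collapse plateaus of equal counts
  d <- diff(c(-Inf, runs, -Inf)) # pad so the ends can be peaks too
  sum(d[-length(d)] > 0 & d[-1] < 0)
}

# Apparent number of modes for wide, moderate, and small bin widths
sapply(c(10, 30, 100), function(b) count_peaks(bin_counts(val, b)))
```

Counting raw peaks like this is crude; it mostly shows that the answer is a property of the bin width as much as of the data.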
A histogram reduces your \(N\) data points to a discrete distribution. A kernel density estimate (KDE) reduces your data to a continuous density.
# geom_density with its default values
ggplot(data=data) +
geom_density(aes(x=totalsal), kernel="gaussian", adjust=1)
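A Gaussian KDE is just an average of normal density bumps, one centered at each data point. Here is a minimal sketch on the built-in `faithful` data, using the `bw.nrd0()` bandwidth rule (the default rule of base R's `density()`); the grid size of 512 is an arbitrary choice.

```r
x <- faithful$eruptions
h <- bw.nrd0(x)  # bandwidth: the standard deviation of each bump

# Evaluate the estimate on a grid that extends past the data
grid <- seq(min(x) - 4 * h, max(x) + 4 * h, length.out = 512)

# KDE by hand: at each grid point, average the N normal densities
kde_hand <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))

# A density should integrate to (about) one
area <- sum(kde_hand) * (grid[2] - grid[1])
area

# Compare with the built-in estimator at the same bandwidth
d <- density(x, bw = h, kernel = "gaussian")
max(abs(approx(d$x, d$y, xout = grid, rule = 2)$y - kde_hand))
```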
As with a histogram, a KDE has parameter(s) that control the level of data compression. You can read more about the details in the geom_density documentation or on Wikipedia. ggplot will use “smart defaults” but it’s worth playing around with the bandwidth (the adjust argument above). Warning: always be wary of “smart defaults”. No one default value will work well in every (or even a majority of) situations.
# geom_density with a fat window
ggplot(data=data) +
geom_density(aes(x=totalsal), kernel="gaussian", adjust=10)
# geom_density with a skinny window
ggplot(data=data) +
geom_density(aes(x=totalsal), kernel="gaussian", adjust=.1)
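What `adjust` does can be checked with base R's `density()`, assuming geom_density hands `adjust` on to it unchanged: it multiplies the default bandwidth. The sketch uses the built-in `faithful` data rather than the salaries.

```r
x <- faithful$eruptions

d_default <- density(x, kernel = "gaussian", adjust = 1)
d_fat     <- density(x, kernel = "gaussian", adjust = 10)
d_skinny  <- density(x, kernel = "gaussian", adjust = 0.1)

# The bandwidth actually used scales with adjust
c(default = d_default$bw, fat = d_fat$bw, skinny = d_skinny$bw)

# A fat window flattens the curve; a skinny one makes it spiky
c(default = max(d_default$y), fat = max(d_fat$y), skinny = max(d_skinny$y))
```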
Best practice for exploratory analysis is to include the raw points with a histogram and/or KDE
ggplot(data=data) +
geom_histogram(aes(x=totalsal), bins=100) +
geom_point(aes(x=totalsal, y=0), shape='|', color='red') # use vertical points or jitter
A few things to keep an eye out for
For covariation between two numerical variables, correlation is the simplest measure of relationship
cor(data$age, data$totalsal)
## [1] 0.2355144
and a scatter plot is the simplest visualization
ggplot(data=data) +
geom_point(aes(x=age, y=totalsal))
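The correlation that `cor()` reports is the Pearson coefficient, which can be written out by hand. This sketch uses the built-in `faithful` data so it runs without the salary file.

```r
x <- faithful$eruptions
y <- faithful$waiting

# Pearson correlation: co-movement around the means,
# divided by the spread of each variable
r_hand <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_hand
cor(x, y)  # same value
```

Correlation only measures linear association, which is one reason to look at the scatter plot as well.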
Bar plots allow you to look at the relationship between a categorical variable and a continuous variable
data %>%
filter(dept %in% c("Pediatrics", "Orthodontics" , 'Ophthalmology')) %>%
group_by(dept) %>%
summarise(mean_sal = mean(totalsal)) %>%
ggplot() +
geom_bar(aes(x=dept, y=mean_sal), stat='identity')
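The group means behind a bar plot like this can also be computed in base R. The sketch below uses the built-in `mtcars` data as a stand-in (cylinder count plays the role of the department and miles per gallon the role of salary).

```r
# Mean of a continuous variable within each level of a categorical one
means <- tapply(mtcars$mpg, mtcars$cyl, mean)
means

# A bar shows only this one number per group; the spread is hidden
sds <- tapply(mtcars$mpg, mtcars$cyl, sd)
sds

# Group sizes matter too when comparing means
sizes <- table(mtcars$cyl)
sizes
```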
Or you can use a box plot
data %>%
filter(dept %in% c("Pediatrics", "Orthodontics" , 'Ophthalmology')) %>%
ggplot() +
geom_boxplot(aes(x=dept, y=totalsal)) +
coord_flip() # make the labels horizontal so people can read them!
Clusters!
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
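One crude way to put numbers on the two clusters in this plot is to split the eruptions at a cutoff and summarise each side. The cutoff of 3 minutes is an eyeballed assumption made for this sketch, not a fitted value.

```r
# Eyeballed cutoff between short and long eruptions (an assumption)
cutoff <- 3
group <- ifelse(faithful$eruptions < cutoff, "short", "long")

# How many eruptions fall in each cluster?
sizes <- table(group)
sizes

# Average waiting time within each cluster
wait_means <- tapply(faithful$waiting, group, mean)
wait_means
```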
Visualizing the relationships among three or more variables becomes challenging. If you have three continuous variables you can make a 3d scatter plot, but these are typically not very useful.
You can add additional aesthetic mappings, such as color or shape, to a 2d scatter plot
ggplot(data=data) +
geom_point(aes(x=age, y=totalsal, color=status))