Capital Bikeshare releases a large amount of data to the public about bike rentals in Washington, DC. For example, they released a data set with the hourly count of bike rentals in 2011 and 2012, along with corresponding weather/seasonal covariates, on the famous UCI Machine Learning Repository. You can find information on what the variables mean here.
In this assignment you will first do an exploratory analysis of the 2011 data. You will then build a model to predict rental counts for the 2012 data.
Assignment 3 is due Friday March 10th. There are three deliverables: two .Rmd files and one .csv file. Some examples are given below.
Make 10 exploratory figures and include a brief discussion of each figure.
A figure can be a plot, the output of a linear regression, or a non-trivial piece of code you write (e.g. a t-test).
Most of the figures should be plots.
The figures should be fairly different. For example, a faceted plot only counts as one figure.
The axes should be labeled and there should be a legend (if applicable).
The figure discussion should include:
Short description of the figure and the takeaway(s).
Some details supporting/qualifying the takeaway(s).
The deliverable for this part is a .Rmd file with the plots and a discussion.
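As a sketch of what a non-plot figure could look like, the snippet below runs a Welch two-sample t-test comparing mean hourly rentals on working and non-working days. The data frame `toy` is simulated only so that the snippet runs on its own; it is not the real data. With the real data you would pass `hour11` instead, assuming the 2011 file keeps the UCI columns `cnt` and `workingday`.

```r
# Toy stand-in for hour11 (simulated, NOT the real data).
# Assumption: the real file keeps the UCI columns `cnt` and `workingday`.
set.seed(1)
toy <- data.frame(workingday = rep(c(0, 1), each = 200))
toy$cnt <- rpois(400, lambda = ifelse(toy$workingday == 1, 150, 120))

# Welch two-sample t-test of mean hourly count by working-day status
tt <- t.test(cnt ~ workingday, data = toy)
print(tt)
```

The printed output (group means, confidence interval, p-value) would be the figure, and the discussion would say what the difference in means suggests and how much weight it can bear.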
Based on the bike sharing data from 2011 write a model that predicts the ride counts for each hour in 2012. Try at least 2 “different” models and compare them using a test set. You have the full data from 2011 and all the x data from 2012 (see below).
There are two deliverables for this part:
a .Rmd file with the code and a brief discussion
a .csv with your predictions named bike_predictions_2012_ONYEN.csv
For the .Rmd file you should include the code for training each model and comparing each model. You should also include a brief discussion describing each model.
The predictions .csv should be the original 2012 file with a new column added called cnt_pred. Use the write_csv function to write the .csv.
For this task you should first split the 2011 data into a train and test set. First train each of your models on the train set, then compare them using the test set to help decide which model is better.
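The split-and-compare workflow can be sketched as follows. This is a minimal illustration, not a recommended final model: the data frame `toy` is simulated so the snippet runs on its own, the 80/20 split ratio is an arbitrary choice, and the helper `mse()` is a name made up for this example. With the real data you would replace `toy` with `hour11`.

```r
# Toy stand-in for hour11 (simulated, NOT the real data)
set.seed(390)
n <- 1000
toy <- data.frame(hr = rep(0:23, length.out = n), temp = runif(n))
toy$cnt <- rpois(n, lambda = 20 + 100 * toy$temp +
                   80 * exp(-(toy$hr - 8)^2 / 8) +
                   80 * exp(-(toy$hr - 17)^2 / 8))

# Hold out 20% of the rows as the test set (the ratio is an arbitrary choice)
test_idx <- sample(n, size = round(0.2 * n))
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Two candidate models that use different features
mod_a <- lm(cnt ~ hr + temp, data = train)          # hour as a number
mod_b <- lm(cnt ~ factor(hr) + temp, data = train)  # hour as a category

# Hypothetical helper: mean squared error of a model's predictions on `data`
mse <- function(model, data) {
  mean((data$cnt - predict(model, newdata = data))^2)
}

# Compare the models on the held-out test set only
results <- c(model_a = mse(mod_a, test), model_b = mse(mod_b, test))
print(results)
```

The model with the lower test-set error is the better candidate. Both models are fit on `train` only, so the comparison on `test` reflects how each one handles data it has not seen.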
For our purposes “different” means different features. You know how to create new features with variable transformations and interactions. One way to build a model is to hand-engineer some features you think make sense. Another way is to build a ton of features, then apply a variable selection procedure to narrow this large set down. Include a brief discussion of how you built each model and why you made the choices that you made.
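The two approaches might look like the sketch below. Everything specific here is an assumption made for illustration: the data frame `toy` is simulated, the `rush_hour` feature and its hour cutoffs are made up, and backward stepwise selection with `step()` is just one of many possible selection procedures, not one the assignment prescribes.

```r
# Toy stand-in for hour11 (simulated, NOT the real data)
set.seed(390)
n <- 600
toy <- data.frame(hr = rep(0:23, length.out = n),
                  temp = runif(n),
                  hum = runif(n),
                  workingday = rbinom(n, 1, 0.7))
toy$cnt <- rpois(n, lambda = 30 + 120 * toy$temp +
                   60 * toy$workingday * (toy$hr %in% c(8, 17)))

# Approach 1: a few hand-engineered features
# `rush_hour` is a hypothetical feature; the cutoffs are a guess
toy$rush_hour <- as.integer(toy$hr %in% c(7, 8, 9, 16, 17, 18))
hand <- lm(cnt ~ temp + rush_hour * workingday, data = toy)

# Approach 2: build many features, then let a selection procedure prune them
big <- lm(cnt ~ (factor(hr) + temp + hum) * workingday + I(temp^2), data = toy)
selected <- step(big, direction = "backward", trace = 0)  # AIC-based pruning

# Inspect which terms each approach ended up with
print(formula(hand))
print(formula(selected))
```

Either model would then be trained on the train set and scored on the test set like any other candidate.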
For grading purposes we are mainly looking for a reasonable, honest effort. Some don’ts:
your final model should probably have a little more to it than one variable linear regression (unless you include some evidence that this model is in fact the right way to go)
for the second model don’t take the first model, make one minor modification and call it a day
#winning Whoever builds the best predictive model, as measured by mean squared error on the 2012 data, will receive one extra credit point. Additionally, another extra credit point will be awarded for the best insight from the EDA. More extra credit points may be awarded for going above and beyond the assignment requirements.
Read in the data with
library(tidyverse)
# data from 2011
hour11 <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_2011.csv')
# x data from 2012
hour12_x <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/bikes_12_x.csv')
hour11 %>%
  ggplot() +
  geom_jitter(aes(x = hr, y = cnt), width = .2, height = 0, size = .3) +
  ggtitle('Figure 1') +
  labs(x = 'hour', y = 'rental count')
Figure 1 shows bike rental counts per hour for each hour in 2011. As might be expected, the rental traffic is higher during the day than at night. There appear to be two traffic spikes: one in the morning (around 8am) and one in the evening (around 5pm). These spikes might suggest that people use Capital Bikeshare rentals to get to and from work. Looking carefully at the morning spike, there appear to be two clusters: one with higher traffic and one with lower traffic. It could be worthwhile to determine what is driving this possible difference in bike rental behavior in the morning.
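One way to follow up on the two morning clusters is to redraw the same jitter plot with the points colored by a candidate explanatory variable. The sketch below uses `workingday`, assuming the 2011 file keeps that UCI column. The data frame `toy` is simulated so the snippet runs on its own, so the separation it shows is built in by construction; whether working days explain the clusters in the real data is something to check with `hour11`.

```r
library(ggplot2)

# Toy stand-in for hour11 (simulated, NOT the real data)
set.seed(1)
toy <- expand.grid(hr = 0:23, day = 1:60)
toy$workingday <- as.integer(toy$day %% 7 < 5)
toy$cnt <- rpois(nrow(toy),
                 lambda = 20 + 5 * toy$hr +
                   150 * (toy$hr == 8) * toy$workingday)

# Same jitter plot as Figure 1, colored by working-day status
p <- ggplot(toy) +
  geom_jitter(aes(x = hr, y = cnt, color = factor(workingday)),
              width = .2, height = 0, size = .3) +
  labs(x = 'hour', y = 'rental count', color = 'working day')
p
```

If the two colors separate cleanly at the morning spike, that supports the commuting explanation; if they overlap, another variable is worth trying.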
# fit a linear regression of count vs. hour in 2011
lin_reg <- lm(cnt ~ hr, data=hour11)
# get predictions for 2012 data
pred_12 <- predict(lin_reg, newdata = hour12_x)
# add predictions to 2012 data frame
hour12_x <- hour12_x %>%
  mutate(cnt_pred = pred_12)
# save the predictions
write_csv(hour12_x, 'bike_predictions_2012_idcarm.csv')