This course is an application-driven introduction to data science. Statistical and computational tools are valued throughout the modern workplace from Silicon Valley startups, to marine biology labs, to Wall Street firms. These tools require technical skills such as programming and statistics. They also require professional skills such as communication, teamwork, problem solving, and critical thinking.

You will learn these tools and hone these skills through hands-on experience working with datasets such as Museum of Modern Art records, TCGA Gene Expressions, and the script of Beauty and the Beast. The first half of the semester will cover R programming skills. The second half will cover a number of topics through a series of modules: exploratory data analysis, web scraping, text processing, and effective visualization.

Instructor: Iain Carmichael
Instructional Assistant: Brendan Brown
Graduate Research Consultant: Varun Goel

See the course syllabus for more information.

Course Material

Most of the course material can be found in the notes linked to below. The notes are supplemented by readings (mostly from R for Data Science), which are listed in the Reading section.

Date	Lecture	Notes	Slides
January 12	install R, basic commands	getting started	slides
January 17	R Markdown, working directory, R projects	workflow
January 19	select, filter, mutate	dplyr
January 24	pipes, group_by, summarise	dplyr	slides
January 26	if/else, loops, functions	intro programming	slides
January 31	vectors, lists and eda	more prog and EDA	prog and eda
February 2	spread, gather	tidy data
February 7	inner, outer joins (Marshall Markham)	slides
February 9	match, extract, replace with `stringr`	regular expressions	slides
February 14	look ahead/behinds	regular expressions
February 16
February 21	least squares, factors, `lm()`	linear regression
February 23	test set, nonlinear, interactions	predictive modeling	slides
February 28	exploratory data analysis
March 2	more exploratory analysis
March 7	nearest centroid, KNN	classification	slides
March 9	cross-validation	cross-validation	slides
March 21	support vector machine, kernels	more classification
March 23		more classification	slides
March 28	more classification
March 30	APIs (Marshall Markham)	APIs	slides
April 4	web scraping, twitter, ggplot	web scraping, rtweets, custom viz
April 6	Shiny (Frances Tong)	shiny
April 11	communication	effective communication	slides
April 13	natural language processing	TidyText	NLP, guest Dan Yang
April 18	tf-idf, stemming	text classification	slides
April 20	K-means (Ryan Thornburg)	clustering	slides
April 25	in class presentations	in class presentations
April 27	data ethics (Quinn Underriner)	data ethics	course summary

the datasets used in the class are on data.world and github
all of the course material is on the github repo including
- .Rmd files for the notes
- example code
most of the course material is in the lecture notes (linked to above) and the readings; the slides are visual aids for the lectures.
options for extra credit

Reading

Readings should be completed by the following class. There are three (free) primary references:

R for Data Science (r4ds)
Introduction to Statistical Learning with Applications in R (ISLR)
Text Mining with R (TMR)

January 12

read through the getting started notes
read before we start from data carpentry
sections 1, 2, 3.1-3.5 of r4ds
- I suggest copying/pasting and running some of the example code
Data science done well looks easy - and that is a big problem for data scientists
This tutorial may be helpful for getting started with R Markdown
(Optional) For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights and David Mimno’s response
big data is like teenage sex

January 17

sections 3.5 - 3.10 (data viz) and section 8 (workflow) from r4ds
Reproducibility is not just for researchers
- (optionally) The real reason reproducible research is important from Simply Statistics

January 19

section 5 (data transformation) from r4ds
(optionally) the dplyr flights vignettes

January 24

section 18 (pipes) and sections 19.1-19.4 (functions) from r4ds

January 26

sections 19.5-19.7 and sections 21.1-21.3 (loops) from r4ds

January 31

r4ds section 12 (tidy data)

February 2

r4ds section 13.1-13.4 (relational data)
read about non-tidy data
(optional) how to share data with a statistician

February 7

r4ds chapter 14 (strings)
(optional) the rest of the relational data chapter (13.5-13.7)

February 16

r4ds section 22, 23.1-23.3 (models)

February 21

r4ds sections 23.4-23.6 (models)
alternate sources for regression (optional)
- ISLR sections 3.1-3.2
- Machine Learning for Hackers chapter 5
- Machine Learning a Probabilistic Perspective chapter 7

February 23

ISLR sections 2.1-2.2 for an overview of modeling
r4ds chapter 24

February 28

ISLR sections 6.1-6.2 on variable selection

March 2

make sure you have read and understood linear regression and model selection from ISLR, i.e. sections 2.1, 2.2, 3.1, 3.2, 6.1, and 6.2
ISLR sections 4.1-4.3 about classification

March 9

What is artificial intelligence? by Jeff Leek

March 21

An example that isn’t that artificial or intelligent
- be prepared to discuss this article and the other AI article from Jeff Leek in class on Thursday
ISLR sections 9.1-9.2 on support vector machines

March 28

ISLR sections 9.3-9.4 on SVMs
ISLR section 5.1 on cross-validation
read through the custom viz notes

April 4

read the markup section of Wikipedia about HTML

April 11

read the following three short articles about communication
- so what? – convey message
- how to ask questions
- reproducible examples

April 13

chapter 1 of Text Mining with R

April 18

chapter 3 from TMR
(optional) chapter 4 from TMR
stemming and lemmatization

April 20

ISLR sections 10.1, 10.3.1, and 10.3.3 about K-means and clustering

April 25

interview with Cathy O’Neil on weapons of math destruction
Unroll.me Service Faces Backlash Over a Widespread Practice: Selling User Data
Data, privacy, and the greater good (optional, but super interesting)

Homework

The due date is in the link.

Assigned	Labs	Assignments	In class exercises
January 12	data.gov lab 1
January 17	reproducible data.gov lab 2		command line and swirl
January 19		dplyr and unc departments
January 26	prog lab 3
February 2	whales and tidy data lab 4
February 7	joins lab 5
February 9	strings lab 6
February 16		harry potter
February 23	Ira Glass on overfitting
February 28		bikes, EDA and predictive modeling
March 21		classification: does your iPhone know what you’re doing?
March 21		extra credit: naive bayes
March 23			What is AI?
March 30	APIs
April 4	Web scraping
April 11	effective viz
April 25			final project presentation
April 27			data ethics

Final project

See this document for the final project description.

Deliverable	Due Date
proposal	4/6
exploratory analysis	4/25
final analysis	5/7
blog post	5/9

I’ve posted some project ideas in this document.

Additional resources

the extra resources page contains a number of books, blogs, MOOCs and other courses about data science
a collection of lots of datasets
finding a job/internship

Miscellaneous

This course was made possible by a grant from the Data@Carolina initiative and a ton of input from lots of very smart people.

This page was last updated on 2017-09-20 23:41:48 Eastern Time.

STOR 390: Introduction to Data Science