This course is an application-driven introduction to data science. Statistical and computational tools are valued throughout the modern workplace, from Silicon Valley startups to marine biology labs to Wall Street firms. These tools require technical skills such as programming and statistics. They also require professional skills such as communication, teamwork, problem solving, and critical thinking.
You will learn these tools and hone these skills through hands-on experience working with datasets such as Museum of Modern Art records, TCGA Gene Expressions, and the script of Beauty and the Beast. The first half of the semester will cover R programming skills. The second half will cover a number of topics through a series of modules: exploratory data analysis, web scraping, text processing, and effective visualization.
Instructor: Iain Carmichael
Instructional Assistant: Brendan Brown
Graduate Research Consultant: Varun Goel
See the course syllabus for more information.
Most of the course material can be found in the notes linked below. The notes are supplemented by readings (mostly from R for Data Science), which are listed in the #reading section.
Date | Lecture | Notes | Slides |
---|---|---|---|
January 12 | install R, basic commands | getting started | slides |
January 17 | R Markdown, working directory, R projects | workflow | |
January 19 | select, filter, mutate | dplyr | |
January 24 | pipes, group_by, summarise | dplyr | slides |
January 26 | if/else, loops, functions | intro programming | slides |
January 31 | vectors, lists and eda | more prog and EDA | prog and eda |
February 2 | spread, gather | tidy data | |
February 7 | inner, outer joins (Marshall Markham) | slides | |
February 9 | match, extract, replace with stringr | regular expressions | slides |
February 14 | look ahead/behinds | regular expressions | |
February 16 | |||
February 21 | least squares, factors, lm() | linear regression | |
February 23 | test set, nonlinear, interactions | predictive modeling | slides |
February 28 | exploratory data analysis | ||
March 2 | more exploratory analysis | ||
March 7 | nearest centroid, KNN | classification | slides |
March 9 | cross-validation | cross-validation | slides |
March 21 | support vector machine, kernels | more classification | |
March 23 | more classification | slides | |
March 28 | more classification | ||
March 30 | APIs (Marshall Markham) | APIs | slides |
April 4 | web scraping, twitter, ggplot | web scraping, rtweets, custom viz | |
April 6 | Shiny (Frances Tong) | shiny | |
April 11 | communication | effective communication | slides |
April 13 | natural language processing | TidyText | NLP, guest Dan Yang |
April 18 | tf-idf, stemming | text classification | slides |
April 20 | K-means (Ryan Thornburg) | clustering | slides |
April 25 | in class presentations | in class presentations | |
April 27 | data ethics, Quinn Underriner | data ethics | course summary |
The datasets used in the class are on data.world and GitHub.
Most of the course material is in the lecture notes (linked above) and the readings; the slides are visual aids for the lectures.
Options for extra credit
Readings should be completed by the following class. There are three (free) primary references:
January 12
This tutorial may be helpful for getting started with R Markdown
January 17
January 19
January 24
January 26
January 31
February 2
February 7
February 16
February 21
February 23
February 28
March 2
March 9
March 21
March 28
April 4
April 11
April 13
April 18
April 20
April 25
The due date is in the link.
Assigned | Labs | Assignments | In class exercises |
---|---|---|---|
January 12 | data.gov lab 1 | ||
January 17 | reproducible data.gov lab 2 | command line and swirl | |
January 19 | dplyr and unc departments | ||
January 26 | prog lab 3 | ||
February 2 | whales and tidy data lab 4 | ||
February 7 | joins lab 5 | ||
February 9 | strings lab 6 | ||
February 16 | harry potter | ||
February 23 | Ira Glass on overfitting | ||
February 28 | bikes, EDA and predictive modeling | ||
March 21 | classification: does your iPhone know what you’re doing? | ||
March 21 | extra credit: naive bayes | ||
March 23 | What is AI? | ||
March 30 | APIs | ||
April 4 | Web scraping | ||
April 11 | effective viz | ||
April 25 | final project presentation | ||
April 27 | data ethics |
See this document for the final project description.
Deliverable | Due Date |
---|---|
proposal | 4/6 |
exploratory analysis | 4/25 |
final analysis | 5/7 |
blog post | 5/9 |
I’ve posted some project ideas in this document.
This course was made possible by a grant from the Data@Carolina initiative and a ton of input from lots of very smart people.
This page was last updated on 2017-09-20 23:41:48 Eastern Time.