This course is an application-driven introduction to data science. Statistical and computational tools are valued throughout the modern workplace, from Silicon Valley startups to marine biology labs to Wall Street firms. These tools require technical skills such as programming and statistics. They also require professional skills such as communication, teamwork, problem solving, and critical thinking.

You will learn these tools and hone these skills through hands-on experience working with datasets such as Museum of Modern Art records, TCGA Gene Expressions, and the script of Beauty and the Beast. The first half of the semester will cover R programming skills. The second half will cover, through a series of modules, a number of topics including exploratory data analysis, web scraping, text processing, and effective visualization.

See the course syllabus for more information.

Course Material

Most of the course material can be found in the notes linked below. The notes are supplemented by readings (mostly from R for Data Science), which are listed in the Reading section.

Date Lecture Notes Slides
January 12 install R, basic commands getting started slides
January 17 R Markdown, working directory, R projects workflow
January 19 select, filter, mutate dplyr
January 24 pipes, group_by, summarise dplyr slides
January 26 if/else, loops, functions intro programming slides
January 31 vectors, lists and eda more prog and EDA prog and eda
February 2 spread, gather tidy data
February 7 inner, outer joins (Marshall Markham) slides
February 9 match, extract, replace with stringr regular expressions slides
February 14 look ahead/behinds regular expressions
February 16
February 21 least squares, factors, lm() linear regression
February 23 test set, nonlinear, interactions predictive modeling slides
February 28 exploratory data analysis
March 2 more exploratory analysis
March 7 nearest centroid, KNN classification slides
March 9 cross-validation cross-validation slides
March 21 support vector machine, kernels more classification
March 23 more classification slides
March 28 more classification
March 30 APIs (Marshall Markham) APIs slides
April 4 web scraping, twitter, ggplot web scraping, rtweets, custom viz
April 6 Shiny (Frances Tong) shiny
April 11 communication effective communication slides
April 13 natural language processing TidyText NLP, guest Dan Yang
April 18 tf-idf, stemming text classification slides
April 20 K-means (Ryan Thornburg) clustering slides
April 25 in class presentations in class presentations
April 27 data ethics (Quinn Underriner) data ethics course summary
  • the datasets used in the class are on data.world and GitHub

  • all of the course material is on the github repo including
  • most of the course material is in the lecture notes (linked above) and the readings; the slides are visual aids for the lectures.

  • options for extra credit

Reading

Readings should be completed by the following class. There are three (free) primary references:

January 12

January 17

January 19

January 24

January 26

  • sections 19.5-19.7 and 21.1-21.3 (loops) from r4ds

January 31

February 2

February 7

  • r4ds chapter 14 (strings)
  • (optional) the rest of the relational data chapter (13.5-13.7)

February 16

  • r4ds chapter 22 and sections 23.1-23.3 (models)

February 21

February 23

February 28

  • ISLR section 6.1 on variable selection

March 2

  • make sure you have read and understood the linear regression and model selection material from ISLR, i.e. sections 2.1, 2.2, 3.1, 3.2, 6.1, and 6.2
  • ISLR sections 4.1-4.3 about classification

March 9

March 21

March 28

April 4

April 11

April 13

April 18

April 20

  • ISLR sections 10.1, 10.3.1, and 10.3.3 about K-means and clustering

April 25

Homework

The due date for each item is given in its link.

Assigned Labs Assignments In class exercises
January 12 data.gov lab 1
January 17 reproducible data.gov lab 2 command line and swirl
January 19 dplyr and unc departments
January 26 prog lab 3
February 2 whales and tidy data lab 4
February 7 joins lab 5
February 9 strings lab 6
February 16 harry potter
February 23 Ira Glass on overfitting
February 28 bikes, EDA and predictive modeling
March 21 classification: does your iPhone know what you’re doing?
March 21 extra credit: naive bayes
March 23 What is AI?
March 30 APIs
April 4 Web scraping
April 11 effective viz
April 25 final project presentation
April 27 data ethics

Final project

See this document for the final project description.

Deliverable Due Date
proposal 4/6
exploratory analysis 4/25
final analysis 5/7
blog post 5/9

I’ve posted some project ideas in this document.

Additional resources

Miscellaneous

This course was made possible by a grant from the Data@Carolina initiative and a ton of input from lots of very smart people.

This page was last updated on 2017-09-20 23:41:48 Eastern Time.