Assignment 2 is due 2/28/17. You can find the raw .Rmd file at: https://raw.githubusercontent.com/idc9/stor390/master/assignments/harry_potter/harry_potter.Rmd.

The text of all 7 Harry Potter books is available online: http://www.readfreeonline.net/Author/J._K._Rowling/Index.html (the website may now be defunct). In this assignment you will use dplyr, ggplot2 and regular expressions to do an exploratory analysis of Harry Potter and the Philosopher’s Stone.

Here are a couple of examples of similar text analysis projects (which you will be able to do in a couple of weeks!):

The Life-Changing Magic of Tidying Text by Julia Silge (yes janeaustenr is an entire R package devoted to Jane Austen)
Harry Potter aggression by Andrew Heiss

Question 0

Set eval=FALSE for the chunk above and eval=TRUE for the chunk below and all test chunks. The text file comes with the Sakai announcement.

# set up
library(tidyverse)
library(stringr)
text <- read_file('philosophers_stone.txt')

Question 1

How many words are in the book?

Question 2

How many times are each of the following characters mentioned? Display the answer using an appropriate visualization.

Harry, Hermione, Ron, Neville, Dumbledore, Draco, Snape, Hagrid, McGonagall

people <- c('Harry', 'Hermione', 'Ron', 'Neville', 'Dumbledore', 'Draco', 'Snape', 'Hagrid', 'McGonagall')
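
One tool that may help here is str_count from stringr, which counts the matches of a pattern in each string. A minimal sketch on a made-up sentence (the string toy is invented for illustration and is not taken from the book):

```r
library(stringr)

# toy string invented for illustration; not taken from the book
toy <- "Ron waved. Ronald is not Ron, but a plain pattern still matches."

# a plain pattern also matches inside longer words such as "Ronald"
n_plain <- str_count(toy, "Ron")

# \\b word boundaries restrict the match to the whole word
n_whole <- str_count(toy, "\\bRon\\b")

print(c(plain = n_plain, whole_word = n_whole))
```

Whether to count whole words only is a judgment call; either choice is reasonable as long as you say which one you made.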

Question 3

Break the text into paragraphs; create a vector called paragraphs where each entry is a paragraph in the book.

# assume paragraphs end with \r\n

Question 4

Write a function that can break the text up into paragraphs, sentences, or words. This is a preview of what you’ll be doing in a couple weeks.

This function does not need to be perfect. For sentences, give one example where the function you wrote fails.

Hint: the function should probably have an if statement.

# Assume a sentence ends with one of: .!?
# there are multiple valid definitions of word, just do something reasonable

unnest_tokens <- function(text, token='words'){
    # splits a string into tokens
    # input
        # text is a string
        # token can be one of: words, paragraphs, sentences
    # output: a character vector
    
}

# TODO add more
# Test code for the grader -- you don't have to modify these
sum(paragraphs == unnest_tokens(text, 'paragraphs'))

Question 5

Put the data into tidy format with one row per paragraph.

create a tibble called paragraph_df with one column text with the text of each paragraph (hint: you might need to use as.character(paragraphs))
add a new column index that gives the index of each paragraph
without using dplyr add a column called Harry that counts the number of times Harry is referenced in each paragraph

Hint: you can use question 2 to check your answer

Question 6

Write a function called reference_counter that generalizes question 5 for any tidy text data frame and any list of words.

Hint: do this without dplyr, use base R subsetting (i.e. []). If df is a data frame and vec is a vector then df['blah'] <- vec will create a new column for df called blah.
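
The bracket assignment in the hint can be sketched on a toy data frame. The data and the names new_col and vec are invented for illustration; the same syntax should also work on a tibble.

```r
# toy data frame invented for illustration
df <- data.frame(text = c("a b", "c d e"), stringsAsFactors = FALSE)

# the column name is an ordinary string, so it can be held in a variable
new_col <- "blah"
vec <- c(10, 20)

# base R subsetting: creates the column if it does not exist yet
df[new_col] <- vec

print(names(df))
```

Because the column name is just a string, it can come from a loop variable, which is handy when adding one column per word.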

reference_counter <- function(text_df, word_list){
    
    # inputs
        # text_df is a tibble with a column called text
        # word_list is a vector of strings
    # for each word in word_list add a column to text_df counting
    # the number of times that word appears in each row of text df
    # does not modify the original text_df
    # do this WITHOUT using dplyr

}

# test code for grader
test_words <- c('Harry', 'Hagrid', 'wand')
test_df <- reference_counter(paragraph_df, test_words)

test_df %>% select(Harry, Hagrid, wand) %>% summarise_all(sum)

Question 7

Using the reference_counter function, update paragraph_df to include columns counting the number of references to each character from Q2 in each paragraph.

# test code for grader
paragraph_df[,people] %>% summarise_all(sum)

Question 8

Make a new data frame called person_refs with three columns: person, num_refs, index. num_refs is the number of references each person gets in a paragraph and index is the index of the paragraph. Limit this data frame to the following 5 characters: Harry, Hermione, Ron, Draco, Neville.

Hint: use gather.
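
A minimal sketch of gather on a made-up wide table (the tibble wide and the column names letter and count are invented for illustration). gather collapses the listed columns into one key column and one value column. Newer tidyr releases recommend pivot_longer for the same job, but gather still works.

```r
library(tidyr)
library(tibble)

# toy wide table invented for illustration: one column per category
wide <- tibble(index = 1:2, A = c(1, 0), B = c(2, 3))

# collapse columns A and B into a key column and a value column
long <- gather(wide, key = "letter", value = "count", A, B)

print(long)
```

Columns that are not gathered (index here) are repeated once for each gathered column.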

Make a bar plot showing the number of paragraphs that reference each of the 5 characters.

Now we want to examine how characters evolve over “time.” Plot the number of references vs. the paragraph index.

In this question we are using paragraphs for “time windows.” What are other “time windows” we could have used? What are some trade-offs for these different choices?

Question 9

How often are Harry and Hermione referenced together? Plot the number of references per paragraph for Harry vs. Hermione.

one plot using geom_point
one plot using geom_jitter (use the width/height arguments of jitter to make the jitter plot look better)

Why is the jitter plot better than a simple point plot?

Question 10

Do Harry and Hermione tend to co-occur? Fit a linear regression of Harry vs. Hermione references per paragraph. Use the lm() function and print out the summary of the model.

Now use geom_smooth to plot the linear regression line on top of the jitter plot.
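
The mechanics of lm and geom_smooth can be sketched on the built-in mtcars data set, which stands in for the paragraph data here. The variables mpg and wt and the jitter amounts are chosen only for illustration.

```r
library(ggplot2)

# mtcars ships with R; it is a stand-in, not the Harry Potter data
fit <- lm(mpg ~ wt, data = mtcars)
print(summary(fit))

# method = "lm" makes geom_smooth draw the least squares line for y ~ x
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_jitter(width = 0.05, height = 0.05) +
    geom_smooth(method = "lm")
```

geom_smooth fits its line to the data passed to ggplot, not to the jittered positions, so the line should agree with the lm fit.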

Question 11

Is there a relationship between the length of the paragraph and the number of times Harry is mentioned? Add a column called num_words to paragraph_df counting the number of words in each paragraph. Then use a linear regression to answer the question. Provide both a statistical summary and a visualization.

Question 12

Create an indicator variable harry_mentioned that indicates whether or not Harry is mentioned in each paragraph. This indicator variable should be a factor.

Now do a linear regression with harry_mentioned as the x variable instead of the number of times he is mentioned.

Free response

Ask and answer a question with this data set. You should make at least 2 figures (e.g. plot, printout of a regression, etc). Provide a written explanation of the question and the evidence for your answer.

Harry Potter and regular expressions

STOR 390