Assignment 2 is due 2/28/17. You can find the raw .Rmd file at: https://raw.githubusercontent.com/idc9/stor390/master/assignments/harry_potter/harry_potter.Rmd.
The text of all 7 Harry Potter books is available online: http://www.readfreeonline.net/Author/J._K._Rowling/Index.html (very possible the website is now defunct). In this assignment you will use dplyr, ggplot and regular expressions to do an exploratory analysis of Harry Potter and the Philosopher’s Stone.
Here are a couple examples of similar text analysis projects (that you will be able to do in a couple weeks!)
The Life-Changing Magic of Tidying Text by Julia Silge (yes janeaustenr is an entire R package devoted to Jane Austen)
Harry Potter aggression by Andrew Heiss
Set eval=FALSE for the chunk above and eval=TRUE for the chunk below and all test chunks. The text file comes with the Sakai announcement.
# set up
library(tidyverse)
library(stringr)
text <- read_file('philosophers_stone.txt')
How many words are in the book?
#
How many times is each of the following characters mentioned? Display the answer using an appropriate visualization.
people <- c('Harry', 'Hermione', 'Ron', 'Neville', 'Dumbledore', 'Draco', 'Snape', 'Hagrid', 'McGonagall')
Break the text into paragraphs; create a vector called paragraphs where each entry is a paragraph in the book.
# assume paragraphs end with \r\n
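As a rough illustration of that assumption, here is a minimal sketch that splits a made-up string on the Windows-style line ending. The toy_text and toy_paragraphs names are hypothetical and the string is not from the book; the real text may need extra cleanup (empty entries, for example).

```r
library(stringr)

# Toy stand-in for the book text (hypothetical, not from the novel)
toy_text <- "First paragraph.\r\nSecond paragraph, a bit longer.\r\nThird."

# str_split returns a list with one element per input string;
# [[1]] pulls out the character vector of pieces
toy_paragraphs <- str_split(toy_text, "\r\n")[[1]]

length(toy_paragraphs)
toy_paragraphs[2]
```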
Write a function that can break the text up into paragraphs, sentences, or words. This is a preview of what you’ll be doing in a couple weeks.
This function does not need to be perfect. For sentences, give one example where the function you wrote fails.
Hint: the function should probably have an if statement
# Assume a sentence ends with one of: .!?
# there are multiple valid definitions of word, just do something reasonable
unnest_tokens <- function(text, token='words'){
    # splits a string into tokens
    # input
    #   text is a string
    #   token can be one of: words, paragraphs, sentences
    # output: a character vector
}
# TODO add more
# Test code for the grader -- you don't have to modify these
sum(paragraphs == unnest_tokens(text, 'paragraphs'))
Put the data into tidy format with one row per paragraph. Create a data frame called paragraph_df with a column text holding the text of each paragraph (hint: you might need to use as.character(paragraphs)), a column index that gives the index of each paragraph, and a column Harry that counts the number of times Harry is referenced in each paragraph.
#
Hint: you can use question 2 to check your answer
Write a function called reference_counter that generalizes question 5 for any tidy text data frame and any list of words.
Hint: do this without dplyr, use base R subsetting (i.e. []). If df is a data frame and vec is a vector then df['blah'] <- vec will create a new column for df called blah.
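To make the hint concrete, here is a small sketch of base R column assignment on a toy data frame. The data and the column names (id, blah, other) are made up for illustration and are not part of the assignment data.

```r
# Hypothetical toy data frame, not part of the assignment data
df <- data.frame(id = 1:3)
vec <- c(10, 20, 30)

# Assigning to a name that does not exist yet appends a new column
df['blah'] <- vec

# The column name can also come from a variable, which is handy in a loop
col_name <- 'other'
df[col_name] <- vec * 2

names(df)
```

Note that df here is a local copy being modified; inside a function, the caller's data frame is left untouched unless the result is assigned back.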
reference_counter <- function(text_df, word_list){
    # inputs
    #   text_df is a tibble with a column called text
    #   word_list is a vector of strings
    # for each word in word_list add a column to text_df counting
    # the number of times that word appears in each row of text_df
    # does not modify the original text_df
    # do this WITHOUT using dplyr
}
# test code for grader
test_words <- c('Harry', 'Hagrid', 'wand')
test_df <- reference_counter(paragraph_df, test_words)
test_df %>% select(Harry, Hagrid, wand) %>% summarise_all(sum)
Using the reference_counter function, update paragraph_df to include columns counting the number of references to each character from Q2 in each paragraph.
#
# test code for grader
paragraph_df[,people] %>% summarise_all(sum)
Make a new data frame called person_refs with three columns: person, num_refs, index. num_refs is the number of references each person gets in a paragraph and index is the index of that paragraph. Limit this data frame to the following 5 characters: Harry, Hermione, Ron, Draco, Neville.
Hint: use gather.
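If gather is unfamiliar, the sketch below shows what it does on a tiny made-up table. The wide and long names and the counts are hypothetical. (In current versions of tidyr, gather is marked as superseded by pivot_longer, but it still works.)

```r
library(tidyverse)

# Hypothetical wide data: one row per paragraph, one count column per name
wide <- tibble(index = 1:2, Harry = c(3, 0), Ron = c(1, 2))

# gather() stacks the Harry and Ron columns into key/value pairs,
# leaving index as an identifier column
long <- gather(wide, key = person, value = num_refs, Harry, Ron)

long
```

The result has one row per (paragraph, person) pair, which is the shape ggplot generally wants for coloring or faceting by person.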
#
Make a bar plot showing the number of paragraphs that reference each of the 5 characters.
#
Now we want to examine how characters evolve over “time.” Plot the number of references vs. the paragraph index.
#
In this question we are using paragraphs for “time windows.” What are other “time windows” we could have used? What are some trade-offs between these different choices?
How often are Harry and Hermione referenced together? Plot the number of references per paragraph for Harry vs. Hermione. Make the plot two ways: with geom_point and with geom_jitter (use the width/height arguments of jitter to make the jitter plot look better).
#
#
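For reference, here is a minimal sketch of the two geoms on made-up integer counts. The toy data frame and its columns a and b are hypothetical, and the 0.2 values are just a starting point to tune by eye.

```r
library(ggplot2)

# Hypothetical counts per paragraph; several rows share the same (a, b) pair
toy <- data.frame(a = c(0, 0, 0, 1, 1, 2, 2, 2),
                  b = c(0, 0, 1, 1, 1, 0, 2, 2))

# Plain points: each row is drawn exactly at its (a, b) location
p_point <- ggplot(toy, aes(x = a, y = b)) + geom_point()

# Jittered points: width/height set the maximum random offset
# added in each direction (horizontal and vertical)
p_jitter <- ggplot(toy, aes(x = a, y = b)) +
  geom_jitter(width = 0.2, height = 0.2)
```

Printing p_point and p_jitter side by side should make the difference between the two easy to see.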
Why is the jitter plot better than a simple point plot?
Do Harry and Hermione tend to co-occur? Fit a linear regression of Harry vs. Hermione references per paragraph. Use the lm() function and print out the summary of the model.
#
Now use geom_smooth to plot the linear regression line on top of the jitter plot.
#
Is there a relationship between the length of the paragraph and the number of times Harry is mentioned? Add a column called num_words to paragraph_df counting the number of words in each paragraph. Then use a linear regression to answer the question. Provide both a statistical summary and a visualization.
#
Create an indicator variable harry_mentioned that indicates whether or not Harry is mentioned in each paragraph. This indicator variable should be a factor.
#
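One possible way to build a factor indicator from counts, shown on a made-up vector. The counts and mentioned names and the 'no'/'yes' labels are hypothetical choices, and other encodings would work just as well.

```r
# Hypothetical per-paragraph counts, not taken from the book
counts <- c(0, 2, 0, 1)

# Comparing to zero gives TRUE/FALSE; factor() turns that into a
# two-level categorical variable with readable labels
mentioned <- factor(counts > 0, levels = c(FALSE, TRUE),
                    labels = c('no', 'yes'))

levels(mentioned)
table(mentioned)
```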
Now do a linear regression with harry_mentioned as the x variable instead of the number of times he is mentioned.
#
Ask and answer a question with this data set. You should make at least 2 figures (e.g. plot, printout of a regression, etc). Provide a written explanation of the question and the evidence for your answer.