This lecture covers string manipulations. The main references are
Some additional references
These are a bit rough at first. There is no secret to learning them other than a bunch of practice.
A lot of data comes in the form of text. More importantly, messy data often can be cleaned up by treating it as text and editing it—for example, a column that should be numeric but that has commas in it.
For those reasons, you will want to know how to work with text in R, called string manipulation.
You will also see examples of how to apply visualization and summarization to text data, as you've done with other types of data. These notes assume you have read the resources below, or that you will refer back to them as needed.
A few people took the time to type up the scripts for a number of Disney movies: http://www.fpx.de/fp/Disney/Scripts/. This lecture will use the script for Beauty and the Beast.
We will start with a quick application on a cleaned-up movie script, to use a few string manipulation tools from the resources. Then we will step back to show how string manipulation turned the raw text file into something usable.
This lecture is organized as follows
library(tidyverse)
library(stringr) # installed with the tidyverse but not attached by library(tidyverse)
library(RColorBrewer)
# read in the cleaned Beauty and the Beast data frame
# see the .Rmd for the code that creates beauty_clean_df
beauty <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/beauty_clean_df.csv')
GASTON: How can you read this? There’s no pictures!
BELLE: Well, some people use their imaginations.
Disney movie fans have typed up scripts, and with text tools like the ones we’ll learn you can analyze them. For example, Polygraph recently looked at movie dialogue by gender for Disney movies and hundreds of other screenplays.
We will be working with one of those movies as an example: Beauty and the Beast. Here we will use a cleaned-up version of the data already put into a data frame. Later you will see how to get the data frame from the raw script.
Both base R and stringr have string manipulation functions. The stringr functions are easier to remember and use. The following comparison of base R and stringr functions is from Jenny Bryan's notes:

grep(..., value = FALSE), grepl() vs. stringr::str_detect()
grep(..., value = TRUE) vs. stringr::str_extract(), stringr::str_extract_all()
regexpr(), gregexpr() vs. stringr::str_locate(), stringr::str_locate_all()
sub(), gsub() vs. stringr::str_replace(), stringr::str_replace_all()
strsplit() vs. stringr::str_split()

In general you should use stringr.
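To see a few of these pairs side by side, here is a minimal sketch on a made-up vector (x is not part of the movie data):

x <- c("belle", "gaston", "beast")

grepl("bel", x)          # base R detection: TRUE FALSE FALSE
str_detect(x, "bel")     # stringr equivalent

sub("b", "B", x)         # base R: replace the first match in each string
str_replace(x, "b", "B") # stringr equivalent

strsplit("a;b;c", ";")   # base R: split, returns a list
str_split("a;b;c", ";")  # stringr equivalent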
There are two columns:

- person (who spoke the line)
- line (the line they spoke)
beauty[1, 2]
## # A tibble: 1 × 1
## line
## <chr>
## 1 Once upon a time, in a faraway land, a young prince lived in a;shining cast
str(beauty)
## Classes 'tbl_df', 'tbl' and 'data.frame': 697 obs. of 2 variables:
## $ person: chr "NARRATOR" "BELLE" "TOWNSFOLK 1" "TOWNSFOLK 2" ...
## $ line : chr "Once upon a time, in a faraway land, a young prince lived in a;shining castle. Although he had everything his heart desired,;t"| __truncated__ "Little town, it's a quiet village;Every day, like the one before;Little town, full of little people;Waking up to say...;" "Bonjour!" "Bonjour!" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 2
## .. ..$ person: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ line : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
Pattern matching is the main business of string manipulation. You will need it to find, replace or extract strings. Here are some examples of ways you might apply string matching tools from the resources to answer questions about Beauty and the Beast.
For example, who has more lines of dialogue, Gaston or Belle? You can find out by matching each of their names against the person vector, and you can use a few different methods.
# person vector
beauty$person[1:5]
## [1] "NARRATOR" "BELLE" "TOWNSFOLK 1" "TOWNSFOLK 2" "TOWNSFOLK 3"
sum(str_count(beauty$person, "GASTON")) # using stringr
## [1] 66
sum(grepl("GASTON", beauty$person)) # using base R
## [1] 66
sum(str_count(beauty$person, "BELLE"))
## [1] 137
That doesn't tell us much about how much each of them speaks, since we are just counting numbers of uninterrupted lines. Let's instead count the number of text characters each person speaks.
You could do this using tools in dplyr as well, but let's use the string manipulation techniques—first filtering the dataset using a string match for the person we want, then counting characters with nchar. You could also use str_count from the stringr package (see the sketch after the output below).
beauty$line[grepl("GASTON", beauty$person)] %>%
nchar %>% sum
## [1] 4903
beauty$line[str_detect(beauty$person, "GASTON")] %>%
nchar %>% sum
## [1] 4903
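The str_count approach would look something like this—a sketch using the pattern ".", which counts every character except newlines, so the total should agree with the nchar version above:

beauty$line[str_detect(beauty$person, "GASTON")] %>%
  str_count(".") %>% sum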
If this is the first time you are seeing regular expressions, go back and read the textbook. Those references go over the stringr package in detail, so we won't do that here.
A few points on general regular expression logic:
[...] matches any of the items inside the brackets
Frequency quantifiers give constraints on how many times a pattern can be matched; place them after the items they refer to. See the references for details, and the short example after the code chunks below.
Use parentheses for grouping patterns, typically to give specific quantifiers or to control order of operations. You can do more with parentheses, as we'll see in the next section. For example, pl?ot will match pot or plot, but (pl)?ot matches plot, lot, pot or ot.
expression1|expression2 matches expression1 OR expression2. expression1expression2 matches expression1 AND expression2, in the order given.
Think of AND as multiplication, OR as addition; the orders of operations are the same. As in algebra use parentheses to control the order of operations.
A quick example on that last point to demonstrate a string extraction using pattern matching.
# Matches capital letters AND subsequent numbers OR lower case letters AND subsequent punctuation
str_extract_all("TOWNFOLK2 townfolk!", "[A-Z]+[0-9]+|[a-z]+[[:punct:]]+")
## [[1]]
## [1] "TOWNFOLK2" "townfolk!"
# Matches (capitals AND numbers OR lower case) AND punctuation
str_extract_all("TOWNFOLK2 townfolk!", "([A-Z]+[0-9]+|[a-z]+)[[:punct:]]+")
## [[1]]
## [1] "townfolk!"
# Matches capitals AND (numbers OR lower case AND punctuation)
str_extract_all("TOWNFOLK2 townfolk!", "[A-Z]+([0-9]+|[a-z]+[[:punct:]]+)")
## [[1]]
## [1] "TOWNFOLK2"
We will do a lot more extraction in the data cleaning section below.
Let’s use some regular expressions to group the anonymous speakers in the script.
To catch the numbered person names, such as TOWNSFOLK 2, we use the exact beginning of the string—TOWNSFOLK—and allow the rest to match optionally with any digit or any space. The asterisk says to match the item directly preceding it as many times as it can, possibly zero times.
Writing a ? after the S in BIMBETTES? (I cringe to type that) will match the plural or the singular, since the S will be matched optionally.
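For example, a small sketch on made-up names:

str_detect(c("TOWNSFOLK", "TOWNSFOLK 2", "TOWNSFOLK 12"), "TOWNSFOLK[0-9\\s]*")  # all TRUE
str_detect(c("BIMBETTE", "BIMBETTES", "BELLE"), "BIMBETTES?")                    # TRUE TRUE FALSE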
Since one thing you’ll do is to replicate the top-line number of male-female dialogue shares in the movie, we’ll keep gendered anonymous speaker names, such as ‘man.’
tolower puts all text in lower case, which makes for easier reading and referencing. Most text-matching functions—such as grepl—also have an option to ignore case when matching strings, as sketched below.
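For example, both of these should match the 66 GASTON rows we counted above (the second uses stringr's regex() helper to turn off case sensitivity):

sum(grepl("gaston", beauty$person, ignore.case = TRUE))
sum(str_detect(beauty$person, regex("gaston", ignore_case = TRUE)))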
Unless you specify otherwise with your regular expression, patterns are matched from left to right. Many functions affect only the first match, though stringr functions usually have an _all version that operates on every match.
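A tiny illustration of the difference on a made-up string:

str_replace("banana", "an", "AN")      # only the first match: "bANana"
str_replace_all("banana", "an", "AN")  # every match: "bANANa"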
In each case below, we are finding all patterns that match the first expression and replacing them with the second. CRONY 1 is replaced with crony, for example.
beauty$person <- str_replace_all(beauty$person, "TOWNSFOLK[0-9\\s]*", "townsfolk") %>%
str_replace_all("CRONY[0-9\\s]*|CRONIES|OLD CRONIES", "crony") %>%
str_replace_all("WOMAN[0-9\\s]*|BIMBETTES?[0-9\\s]*", "woman") %>%
str_replace_all("MAN[0-9\\s]*|MEN", "man") %>%
str_replace_all("GROUP[0-9\\s]*|ALL|BOTH|CHORUS|OBJECTS|BYSTANDERS|MUGS|MOB", "group") %>%
tolower
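After running the replacements, a quick sanity check that the anonymous speakers were grouped the way we intended is to look at the remaining unique names:

# the grouped, lower-case speaker names
sort(unique(beauty$person))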
The base R functions sub and gsub also allow you to substitute one pattern for another.
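For example, base R versions of one of the replacements above would look like this (gsub replaces every match, sub only the first match in each string):

gsub("MAN[0-9\\s]*|MEN", "man", c("MAN 1", "MEN"))  # "man" "man"
sub("CRONY", "crony", "CRONY 1 CRONY 2")            # only the first CRONY is replaced

Note that WOMAN also contains MAN, which is why the cleaning pipeline above replaces the WOMAN pattern before the MAN pattern.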
Let's use some of the summary and grouping functions you already learned to make a (drab) barplot of the number of text characters spoken by each of the top 10 movie characters.
You will do more in your homework.
group_by(beauty, person) %>%
summarise(N = sum(nchar(line))) %>%
arrange(desc(N)) %>% slice(1:10) %>%
ggplot(data = ., aes(x = person, y = N)) +
geom_bar(stat = "identity") +
theme_minimal() + theme(axis.text.x = element_text(angle=75, vjust=0.5, size=10), axis.title.x = element_blank())
Here's an example of how you could use str_count to plot the average word length for each line of dialogue over 'time.' We don't actually have a time variable, but as a proxy let's use the cumulative number of characters of dialogue. We're also focusing on just the main characters.
# demo to show that str_count here is counting the number of letters in a line from Mrs. Potts
str_count("It's a guest, it's a guest! Sakes alive, well I'll be blessed!", "[A-z]")
## [1] 44
str_replace_all("It's a guest, it's a guest! Sakes alive, well I'll be blessed!", "[^A-z]", "") %>% nchar
## [1] 44
mutate(beauty,
line = str_replace(line, ";", " "),
N = nchar(line),
# Nc acts as a proxy for time
Nc = cumsum(N),
Nw = str_count(line, "\\w+"),
#don't want punctuation etc for word count
w_avg = str_count(line, "[A-z]") / Nw) %>%
filter(grepl("gaston|belle|beast|lumiere|lefou|cogsworth|mrs\\.potts|maurice|chip|featherduster|wardrobe|stove", person)) %>%
ggplot(data = ., aes(x = Nc, y = w_avg, color = person)) + geom_line(size = 2) +
theme_minimal() + scale_color_brewer(palette = 'Paired')
Sadly not too interesting. The average word length doesn’t change much over the course of the movie, except for some spikes. Maybe the word length dips at the end of the movie, as everyone is shouting at each other.
Here are the lines with an average word length above six letters: short lines made of long words.
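The chunk that produced the table below isn't shown in the notes; a sketch that reuses the mutate pipeline above and keeps only the rows with w_avg above 6 would be:

mutate(beauty,
       line = str_replace(line, ";", " "),
       N = nchar(line),
       Nc = cumsum(N),
       Nw = str_count(line, "\\w+"),
       w_avg = str_count(line, "[A-z]") / Nw) %>%
  filter(grepl("gaston|belle|beast|lumiere|lefou|cogsworth|mrs\\.potts|maurice|chip|featherduster|wardrobe|stove", person)) %>%
  filter(w_avg > 6)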
## # A tibble: 5 × 6
## person line N Nc
## <chr> <chr> <int> <int>
## 1 belle Morning monsieur! 17 1743
## 2 maurice Hideously ugly! 15 15899
## 3 lefou Plans to persecute harmless crackpots like Gaston 49 16594
## 4 lumiere Cascades... 11 25166
## 5 cogsworth Encroachers! 12 35731
## # ... with 2 more variables: Nw <int>, w_avg <dbl>
This section uses functions and sampling to get a sense of how rich each character’s vocabulary is. Polygraph did something similar for the vocabularies of rappers, which you should check out.
If you don't understand what's going on with some of the functions, don't worry.
This is just to give you another example of the kinds of things you can do with text data. You can always come back to review this bonus section when you are more comfortable with functions.
Let’s go step by step.
We want one unique word count for each character, not each line. There are several ways to do it. An example:
test <- filter(beauty, grepl("townsfolk|crony", person)) %>%
group_by(person) %>%
summarise(line = str_c(line, collapse = "; "), Nline = n(), Nw = str_count(line, "[\\w']+")) %>%
ungroup
vocab <- str_extract_all(test$line, "[\\w']+") %>% lapply(., FUN = function(s){length(unique(s))})
vocab <- data.frame(person = test$person, words_unique = unlist(vocab), Nline = test$Nline, Nw = test$Nw)
vocab
## person words_unique Nline Nw
## 1 crony 43 8 55
## 2 townsfolk 23 7 29
filter(beauty, grepl("townsfolk|crony", person))
## # A tibble: 15 × 2
## person
## <chr>
## 1 townsfolk
## 2 townsfolk
## 3 townsfolk
## 4 townsfolk
## 5 townsfolk
## 6 townsfolk
## 7 townsfolk
## 8 crony
## 9 crony
## 10 crony
## 11 crony
## 12 crony
## 13 crony
## 14 crony
## 15 crony
## # ... with 1 more variables: line <chr>
Now let's do it on the full data set.
beauty_sum <- group_by(beauty, person) %>%
summarise(line = str_c(line, collapse = "; "), Nline = n(), Nw = str_count(line, "[\\w']+")) %>%
ungroup
vocab <- str_extract_all(beauty_sum$line, "[\\w']+") %>% lapply(., FUN = function(s){length(unique(s))})
vocab <- data.frame(person = beauty_sum$person, words_unique = unlist(vocab), Nline = beauty_sum$Nline, Nw = beauty_sum$Nw)
I’ve decided to handle contractions as one word. This is a good example of when a seemingly small change in your regular expression makes a meaningful difference.
We used this
str_extract_all("Little town, it's a quiet village;Every day, like the one before;Little town, full of little people;Waking up to say...;", "[\\w']+")
## [[1]]
## [1] "Little" "town" "it's" "a" "quiet" "village" "Every"
## [8] "day" "like" "the" "one" "before" "Little" "town"
## [15] "full" "of" "little" "people" "Waking" "up" "to"
## [22] "say"
not this
str_extract_all("Little town, it's a quiet village;Every day, like the one before;Little town, full of little people;Waking up to say...;", "\\w+")
## [[1]]
## [1] "Little" "town" "it" "s" "a" "quiet" "village"
## [8] "Every" "day" "like" "the" "one" "before" "Little"
## [15] "town" "full" "of" "little" "people" "Waking" "up"
## [22] "to" "say"
One possibility is to look at the number of unique words for each continuous line a character speaks.
ggplot(data = vocab, aes(x = person, y = words_unique / Nline)) +
geom_bar(stat = "identity") +
theme_minimal() + theme(axis.text.x = element_text(angle=75, vjust=0.5, size=10), axis.title.x = element_blank())
You could have guessed that didn't make sense, but the graph shows it to you as well. The narrator has only one line, making her the best by this score. We don't want our measure of vocabulary to be so heavily influenced by something totally unrelated, such as the number of times a person speaks.
Maybe dividing by the total number of words spoken would be better.
ggplot(data = vocab, aes(x = person, y = words_unique / Nw)) +
geom_bar(stat = "identity") +
theme_minimal() + theme(axis.text.x = element_text(angle=75, vjust=0.5, size=10), axis.title.x = element_blank())
What's going on here? The problem with that measure is that the number of unique words increases more slowly than the total number of words, since many words are repeated—'and', 'the', etc.
The characters who speak most (Belle, Gaston) automatically will do worse.
ggplot(data = vocab, aes(x = person, y = words_unique / Nw, fill = Nw)) +
geom_bar(stat = "identity") +
theme_minimal() + theme(axis.text.x = element_text(angle=75, vjust=0.5, size=10), axis.title.x = element_blank())
filter(beauty, grepl("belle", person)) %>% slice(1:3)
## # A tibble: 3 × 2
## person
## <chr>
## 1 belle
## 2 belle
## 3 belle
## # ... with 1 more variables: line <chr>
filter(beauty, grepl("wrestler", person))
## # A tibble: 1 × 2
## person line
## <chr> <chr>
## 1 wrestler In a wrestling match, nobody bites like Gaston
What is a more fair way to evaluate richness of vocabulary?
Let’s take a fixed number of words from each person, as the folks at Polygraph did when doing this for rapper vocabularies.
But different characters have different styles of speech in different situations. Sometimes a character uses repetitive exclamations; in Gaston's self-promoting song, he uses a rich variety of words, including 'expectorating'. To address that, you can take a large number of fixed-size samples of each character's words and count the unique words in each sample.
So we will: repeatedly sample a fixed number of words (with replacement) from each character's dialogue, count the unique words in each sample, and compare the resulting distributions.
This example is a good one to review after reading the advanced programming lecture.
word_counter <- function(vocab_list, size){
lapply(vocab_list, FUN = function(s){
sample(s, size, replace = TRUE) %>% unique %>% length
}) %>% unlist
}
vocabulizer <- function(script, samples = 50, size = 100, size_factor = 1){
beauty_sum <- group_by(script, person) %>%
summarise(line = str_c(line, collapse = "; "), Nline = n(), Nw = str_count(line, "\\w+")) %>%
ungroup
beauty_sum <- filter(beauty_sum, Nw >= size * size_factor)
vocab <- str_extract_all(beauty_sum$line, "[\\w']+")
names(vocab) <- beauty_sum$person
out <- replicate(samples, word_counter(vocab, size = size)) %>%
# transpose to get people names in columns, samples in rows
t %>% as.data.frame
return(out)
}
# 200 samples of 100 words each
vocab_samples <- vocabulizer(beauty, samples = 200, size = 100)
gather(vocab_samples, key = person, value = words_unique) %>%
ggplot(aes(x = words_unique, fill = person)) + geom_density(alpha = .2) +
theme_minimal() + scale_fill_brewer(palette = 'Paired')
# with characters who speak more often and more samples
vocab_samples <- vocabulizer(beauty, samples = 1000, size = 100, size_factor = 2)
gather(vocab_samples, key = person, value = words_unique) %>%
ggplot(aes(x = words_unique, fill = person)) + geom_density(alpha = .2) +
theme_minimal() + scale_fill_brewer(palette = 'Paired')
# with bigger numbers of words in each sample
vocab_samples <- vocabulizer(beauty, samples = 1000, size = 500)
gather(vocab_samples, key = person, value = words_unique) %>%
ggplot(aes(x = words_unique, fill = person)) + geom_density(alpha = .5) +
theme_minimal() + scale_fill_brewer(palette = 'Paired')
The final graph shows us Cogsworth tends to have the most unique words per 500 total words spoken, followed by Gaston, Belle and Lumiere.
The data frame above was already set up for us. But the file started as a plain text document on a website, typed by a person associated with Central Michigan University.
Let’s rewind and show how string manipulation lets us take a slightly awkward dataset and turn it into the data frame we used above.
This is a transcribed version of the script for Beauty and the Beast, the cartoon version, in a text file.
beauty <- read_lines('http://www.fpx.de/fp/Disney/Scripts/BeautyAndTheBeast.txt')
beauty[1:10]
## [1] "<pre>"
## [2] "Beauty and the Beast"
## [3] "The Complete Script"
## [4] ""
## [5] "Compiled by Ben Scripps <34rqnpq@cmuvm.csv.cmich.edu>"
## [6] ""
## [7] "NARRATOR: Once upon a time, in a faraway land, a young prince lived in a"
## [8] " shining castle. Although he had everything his heart desired,"
## [9] " the prince was spoiled, selfish, and unkind. But then, one"
## [10] " winter's night, an old beggar woman came to the castle and"
A quick skimming shows some important structure we can use:
Lines of dialogue always begin with ‘SPEAKER NAME:’
Descriptions of the scene that are not dialogue are in parentheses.
Person names are in capital letters in scene descriptions as well but not in dialogue.
Lines in which multiple people speak at the same time have collective dialogue identifiers, such as CUPS or BOTH
A couple of dialogue identifiers have scene descriptions, for example ALL (esp. LUMIERE)
We do not have each cell representing a line of dialogue, since read_lines gave us a new cell in the vector every time text was separated by a return in the file.
Colons are very rarely used other than in dialogue identifiers—but they are used.
Dialogue identifiers sometimes include numbers, spaces and punctuation, such as CRONY 1 and MONSIEUR D’ARQUE.
Dialogue is in chronological order, the order in which lines are spoken.
Unlike in some previous lectures, we didn’t establish some clear goals for our analysis before diving in. But in this case, we know we want to explore dialogue attributed to individual speakers. So we likely will want a data frame with a structure to let us do that.
Goal: A data frame with one row per line of dialogue, with a column for the dialogue text and a column for the speaker name.
Plan ideas:
We can’t make a one-column data frame from our vector then split it into two columns using the tidyr package, since we do not have each cell representing a new line of dialogue.
The basic plan will be to
collapse the entire script into a single string
extract each new line of dialogue as its own cell in a vector, using the distinct structure of the dialogue identifiers
clean up little issues along the way
We usually can’t separate out individual speakers when dialogue is attributed to a group, so we won’t try. We also don’t care about the scene descriptions, so we will remove them.
These functions and regular expressions should look familiar from the resources section. Text that was loaded into different cells will be separated by a semicolon, which is what the collapse = ';' argument to paste does below.
# read in the raw text file
# each line is an entry in a vector
beauty <- read_lines('http://www.fpx.de/fp/Disney/Scripts/BeautyAndTheBeast.txt', skip = 6)
typeof(beauty)
## [1] "character"
# collapse each line into a single string
# separate lines by a ;
beauty <- beauty %>%
str_trim(side = "both") %>%
paste(collapse = ";")
typeof(beauty)
## [1] "character"
# To avoid annoying issues later, since we don't try to distinguish individuals in group dialogue
beauty <- str_replace(beauty, " \\(ex. COGSWORTH\\):", ":") %>% str_replace(" \\(esp. LUMIERE\\):", ":")
We want a data frame with one column for the dialogue identifier (speaker name) and one for the line. Since every line starts with an identifier, we could try to extract every identifier into one vector and everything between identifiers into another vector. That should give us two vectors of equal length, matching the first speaker to the first line of dialogue, the second speaker to the second line and so on.
But that won’t work without extra tweaks. We’ll look at an example to see why, but first we need to learn a new feature of regular expressions.
Let’s take an example from the data, from when Gaston bribes the asylum keeper Monsieur D’Arque to incarcerate Belle’s father.
test <- "I don't usually leave the asylum in the middle of the night, but they said you'd make it worth my while. GASTON: It's like this. I've got my heart set on marrying Belle, but she needs a little persuasion."
How do we extract everything before the pattern GASTON:? Using what’s called a look-ahead.
A rundown of the syntax before we do examples:

pattern1(?=pattern2) will match pattern1 only when it is followed by pattern2
pattern1(?!pattern2) will match pattern1 when it is NOT followed by pattern2
(?<=pattern2)pattern1 and (?<!pattern2)pattern1 are the look-behind versions of these, respectively

Let's start with something simpler than our example. The expression below matches candle only when it is followed by stick.
You need to use perl = TRUE here for base R functions such as grepl, but you do not need to do that for stringr package functions. Don't worry about what perl is right now; this is the only aspect of it we will need.
grepl('candle(?=stick)', c('candlestick', 'candlemaker', 'smart candle'), perl=TRUE)
## [1] TRUE FALSE FALSE
str_detect(c('candle!!?', 'candlemaker', 'smart candle%'), 'candle(?=[[:punct:]]+)')
## [1] TRUE FALSE TRUE
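The negative look-ahead works the same way; it matches candle only when it is NOT followed by stick:

str_detect(c('candlestick', 'candlemaker', 'smart candle'), 'candle(?!stick)')  # FALSE TRUE TRUE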
Look-behinds are similar, except that you can’t use the asterisk or plus quantifiers (which match as large a string as you want) for the pattern in parentheses.
This wouldn’t work: str_extract(c('candle!!?', 'candlemaker', 'smart candle%'), '(?<=[a-z]+)[[:punct:]]+')
Fix it by using a bounded quantifier if you can:
str_extract(c('candle!!?', 'candlemaker', 'smart candle%'), '(?<=[a-z]{1,10})[[:punct:]]+')
## [1] "!!?" NA "%"
Back to the question: how do you extract everything before GASTON: in the test phrase above? Use the catch-all . pattern:
str_extract(test, ".+(?=GASTON:)")
## [1] "I don't usually leave the asylum in the middle of the night, but they said you'd make it worth my while. "
Let’s use an example closer to what we will see in our data. We will need all dialogue between character names. Now there is a problem:
We have to change the look-ahead statement to get any character name, which can have variable length, punctuation and spaces. The pattern [A-Z\\s[:punct:]]+[:] will match that.
test <- "BEAST: What are you doing here? MAURICE: Run, Belle! BEAST: The master of this castle. BELLE: I've come for my father. Please let him out! Can't you see he's sick? BEAST: Then he shouldn't have trespassed here."
str_extract(test, ".+(?=[A-Z\\s[:punct:]]+:)")
## [1] "BEAST: What are you doing here? MAURICE: Run, Belle! BEAST: The master of this castle. BELLE: I've come for my father. Please let him out! Can't you see he's sick? BEAS"
It returned almost everything and missed the final bit of dialogue. Here’s why:
Everything matches . (everything does), so the question is where the match can end while still being followed by the look-ahead expression.
After the S in the last case of BEAST:, the look-ahead statement matched T: and returned everything else before it.

Here are some other failed attempts at getting what we want, to show other things you can do:
Matching only lower case letters, spaces and punctuation in the first pattern fails to pick up the capital letters that start sentences and the proper names in the dialogue.
str_extract_all(test, "[a-z[:punct:]\\s]+(?=[A-Z\\s[:punct:]]+:)")
## [[1]]
## [1] "hat are you doing here? " "elle! "
## [3] "he master of this castle. " "an't you see he's sick? "
Matching specific punctuation and allowing upper case letters, in patterns NOT followed by the character naming pattern, returns nicer chunks but leaves the character identifiers at the end of the preceding string. We want to be able to extract them separately.
str_extract_all(test, "[A-z;.,'!?\\s]+(?![A-Z\\s[:punct:]]+:)")
## [[1]]
## [1] "BEAST"
## [2] " What are you doing here? MAURICE"
## [3] " Run, Belle! BEAST"
## [4] " The master of this castle. BELLE"
## [5] " I've come for my father. Please let him out! Can't you see he's sick? BEAST"
## [6] " Then he shouldn't have trespassed here."
Notice we put specific punctuation characters in the pattern, excluding the colon. But colons in our data do show up outside of person identifiers, so we can't do that either.
Our basic problem is that the person identifiers are not good string separation criteria because they have too much in common with the rest of the text. That makes it difficult to use the look-ahead or other means for splitting the dataset in the way we want.
But the person identifiers are still distinct enough that we can match them—which means we can replace them with identifiers that are different enough from the dialogue to be good split criteria.
Extracting the person identifiers, adding some bogus lines to show this works for character names with punctuation and numbers:
test <- "BEAST: What are you doing here? MAURICE: Run, Belle! BEAST: The master of this castle. BELLE: I've come for my father. Please let him out! Can't you see he's sick? BEAST: Then he shouldn't have trespassed here. TOWNSFOLK 2: He's a monster! MRS. POTTS: Now pipe down!"
str_extract_all(test, "[A-Z]+[\\s0-9[:punct:]]*:|MRS. POTTS:")
## [[1]]
## [1] "BEAST:" "MAURICE:" "BEAST:" "BELLE:"
## [5] "BEAST:" "TOWNSFOLK 2:" "MRS. POTTS:"
To keep the MRS. in MRS. POTTS we have to handle it separately, with a regular expression OR (|) statement.
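As an aside, the dot in MRS. POTTS: is an unescaped regex dot, so it matches any single character; escaping it is slightly safer, though it should make no difference on this script:

str_extract_all(test, "[A-Z]+[\\s0-9[:punct:]]*:|MRS\\. POTTS:")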
Replacing them:
str_replace_all(test, "[A-Z]+[\\s0-9[:punct:]]*:|MRS. POTTS:", "&&&&&&&&")
## [1] "&&&&&&&& What are you doing here? &&&&&&&& Run, Belle! &&&&&&&& The master of this castle. &&&&&&&& I've come for my father. Please let him out! Can't you see he's sick? &&&&&&&& Then he shouldn't have trespassed here. &&&&&&&& He's a monster! &&&&&&&& Now pipe down!"
In the data, we will want to replace each person identifier with a short code that cannot be confused with dialogue. Then we can extract the dialogue and the codes into separate vectors, and swap the codes back for the person names.

Our new plan is:

extract the unique person identifiers
replace each identifier with a distinctive code such as 100>
extract the dialogue and the codes into two vectors of equal length
swap the codes back for the person names and clean up
A useful feature of str_replace_all is that you can pass it a list of things to match and what to replace them with. Review the lecture on data object types if you forget what a list is. In our test example above, we could type:
str_replace_all(test, c("BEAST:" = "001>", "MAURICE:" = "002>", "BELLE:" = "003>"))
## [1] "001> What are you doing here? 002> Run, Belle! 001> The master of this castle. 003> I've come for my father. Please let him out! Can't you see he's sick? 001> Then he shouldn't have trespassed here. TOWNSFOLK 2: He's a monster! MRS. POTTS: Now pipe down!"
Alternatively, we could create a list object whose names are the patterns we want to replace and whose entries are the things we want to replace them with. Then we pass that to the function.
codes <- unique(str_extract_all(test, "[A-Z]+[\\s0-9[:punct:]]*:|MRS. POTTS:")[[1]])
codes_list <- as.list(paste0(seq(from = 100, to = 100 + length(codes) - 1), ">"))
names(codes_list) <- codes
codes_list
## $`BEAST:`
## [1] "100>"
##
## $`MAURICE:`
## [1] "101>"
##
## $`BELLE:`
## [1] "102>"
##
## $`TOWNSFOLK 2:`
## [1] "103>"
##
## $`MRS. POTTS:`
## [1] "104>"
test <- str_replace_all(test, codes_list)
test
## [1] "100> What are you doing here? 101> Run, Belle! 100> The master of this castle. 102> I've come for my father. Please let him out! Can't you see he's sick? 100> Then he shouldn't have trespassed here. 103> He's a monster! 104> Now pipe down!"
And that gives us what we want:
# Dialogue
str_extract_all(test, "[A-z[:punct:][:space:]]+(?![0-9]{3}>)")
## [[1]]
## [1] " What are you doing here?"
## [2] " Run, Belle!"
## [3] " The master of this castle."
## [4] " I've come for my father. Please let him out! Can't you see he's sick?"
## [5] " Then he shouldn't have trespassed here."
## [6] " He's a monster!"
## [7] " Now pipe down!"
#Speakers
str_extract_all(test, "[0-9]{3}>")
## [[1]]
## [1] "100>" "101>" "100>" "102>" "100>" "103>" "104>"
The major problem is solved, so the rest of the code is here for you to review. Little problems crop up and the fixes are noted.
# little annoying OR statement needed for MRS. POTTS and OLD CRONIES; can't rearrange the previous statement without matching more than we want
codes <- unique(str_extract_all(beauty, "[A-Z]+[\\s0-9[:punct:]]*:|MRS. POTTS:|OLD CRONIES:")[[1]])
codes_list <- as.list(paste0(seq(from = 100, to = 100 + length(codes) - 1), ">"))
names(codes_list) <- codes
beauty <- str_replace_all(beauty, codes_list) %>%
# final line ends with this
str_replace("</pre>", "") %>%
# first removing the scene descriptions between two brackets
str_replace_all("(\\(){1}[A-Za-z0-9\\s,.;:'!?]*(\\)){1}", "") %>%
# now remove dangling brackets that are left open at the end of a line
str_replace_all("(\\(){1}[A-Za-z0-9\\s,.;:'!?]*(\\))*", "")
beauty <- data.frame(person = str_extract_all(beauty, "[0-9]{3}>")[[1]],
line = str_extract_all(beauty, "[A-z[:punct:][:space:]]+(?![0-9]{3}>)")[[1]])
# Now switch the codes back to names and clean up a little
names(codes) <- unlist(codes_list)
codes <- as.list(codes)
beauty$person <- str_replace_all(beauty$person, codes) %>% str_replace_all(":", "")
beauty$line <- str_trim(beauty$line, side = "both") %>% str_replace_all(";+", ";")
tail(beauty)
## person
## 692 COGSWORTH
## 693 LUMIERE
## 694 CHIP
## 695 MRS. POTTS
## 696 CHIP
## 697 CHORUS
## line
## 692 You most certainly did not, you pompous parrafin-headed;pea-brain!
## 693 En garde, you overgrown pocket watch!
## 694 Are they gonna live happily ever after, mama?
## 695 Of course, my dear. Of course.
## 696 Do I still have to;sleep in the cupboard?;
## 697 Certain as the sun;Rising in the east;Tale as old as time, song as old as rhyme;Beauty and the beast!;Tale as old as time, song as old as rhyme;Beauty and the beast!;