These notes walk through an example of text classification with the tidytext package in R. They assume you are familiar with chapters 1, 3, and 5 of Tidy Text Mining with R and with the nearest centroid classifier.

The data can be found here.

library(tidyverse)
library(stringr)
library(tidytext)

library(klaR) # mean difference classifier

# see github repo for this script
source('read_text.R')

Load “raw” data

For this section we will use a collection of 40 of George Orwell’s essays scraped from here.

# read in the raw text
books <- read_author('orwell_essays')
books
## # A tibble: 6,144 × 4
##                                                                           text
## *                                                                        <chr>
## 1  Some years ago a friend took me to the little Berkshire church of which the
## 2  was once the incumbent. (Actually it is a few miles from Bray, but perhaps 
## 3  the churchyard there stands a magnificent yew tree which, according to a no
## 4  person than the Vicar of Bray himself. And it struck me at the time as curi
## 5                                                            relic behind him.
## 6  The Vicar of Bray, though he was well equipped to be a leader-writer on THE
## 7  admirable character. Yet, after this lapse of time, all that is left of him
## 8  has rested the eyes of generation after generation and must surely have out
## 9                                                   his political quislingism.
## 10 Thibaw, the last King of Burma, was also far from being a good man. He was 
## # ... with 6,134 more rows, and 3 more variables: linenumber <int>,
## #   author <chr>, book <chr>

The goal of this section is to illustrate a classification example with text data. The classes in this case are the essay titles. Right now we have one observation per class (i.e. the whole text of the essay). To make things more interesting we are going to split each essay into chunks of 10 lines.

# label each line with a chunk id: the essay title plus its 10-line block number
chunks <- books %>% 
    mutate(chunk = str_c(book, '_', linenumber %/% 10))

dim(chunks)
## [1] 6144    5
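
The chunk label is just the essay title pasted onto the line number integer-divided by 10, so consecutive lines share a label. A quick illustration with a made-up essay name and line numbers:

# lines 0-9 map to chunk 0, lines 10-19 to chunk 1, and so on
str_c('some_essay', '_', c(1, 9, 10, 25) %/% 10)
## [1] "some_essay_0" "some_essay_0" "some_essay_1" "some_essay_2"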

Tokenization

Breaking a document up into words is called tokenization.

The chunks data frame has 6,144 rows, one per line of text; the observations for classification are the 10-line chunks (637 of them). Using the unnest_tokens function from the tidytext package we can turn this data frame into a tidy data frame where each row is a single word (i.e. tokenization).

chunk_words <- chunks %>%
    unnest_tokens(word, text) 

chunk_words
## # A tibble: 121,561 × 5
##    linenumber        author                              book
##         <int>         <chr>                             <chr>
## 1           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 2           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 3           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 4           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 5           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 6           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 7           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 8           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 9           1 orwell_essays a_good_word_for_the_vicar_of_bray
## 10          1 orwell_essays a_good_word_for_the_vicar_of_bray
## # ... with 121,551 more rows, and 2 more variables: chunk <chr>,
## #   word <chr>
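
As an aside, unnest_tokens lowercases the tokens and strips punctuation by default, which is why the word column contains clean lowercase words. A toy example with a made-up sentence:

# returns a one-column tibble with the tokens: some, words, with, punctuation
tibble(text = "Some WORDS, with Punctuation!") %>%
    unnest_tokens(word, text)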

Now we count the number of times each word occurs in each chunk:

chunk_words <- chunk_words %>% 
    count(chunk, word, sort = TRUE) %>%
    ungroup() %>% 
    rename(count=n)

chunk_words
## # A tibble: 78,102 × 3
##                                  chunk  word count
##                                  <chr> <chr> <int>
## 1  politics_and_the_english_language_2   the   109
## 2  politics_and_the_english_language_1   the    97
## 3  politics_and_the_english_language_1    of    87
## 4  politics_and_the_english_language_2    of    80
## 5  politics_and_the_english_language_2    to    56
## 6  politics_and_the_english_language_2    is    54
## 7  politics_and_the_english_language_2     a    53
## 8  politics_and_the_english_language_3   the    53
## 9  politics_and_the_english_language_2   and    52
## 10 politics_and_the_english_language_1   and    46
## # ... with 78,092 more rows

TF-IDF

The raw word counts are not always ideal. For example, the word “the” shows up a lot, but this is not very informative. Therefore, we can down-weight commonly occurring words using term frequency-inverse document frequency (tf-idf) scores.
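
Concretely, for a word w in chunk d, tf(w, d) is the count of w in d divided by the total number of words in d, idf(w) is the natural log of the number of chunks divided by the number of chunks containing w, and tf-idf is their product. A hand-rolled sketch of that computation (illustrative only; bind_tf_idf below is the real thing):

# hand-rolled tf-idf, for comparison with bind_tf_idf
n_chunks <- n_distinct(chunk_words$chunk)

chunk_words %>%
    group_by(chunk) %>%
    mutate(tf = count / sum(count)) %>%                  # term frequency within a chunk
    group_by(word) %>%
    mutate(idf = log(n_chunks / n_distinct(chunk))) %>%  # rarer words get larger weights
    ungroup() %>%
    mutate(tf_idf = tf * idf)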

Using the bind_tf_idf function from tidytext we can compute the tf-idf scores.

chunk_words <- chunk_words %>% 
    bind_tf_idf(word, chunk, count)


chunk_words
## # A tibble: 78,102 × 6
##                                  chunk  word count         tf         idf
##                                  <chr> <chr> <int>      <dbl>       <dbl>
## 1  politics_and_the_english_language_2   the   109 0.05366814 0.004720701
## 2  politics_and_the_english_language_1   the    97 0.05918243 0.004720701
## 3  politics_and_the_english_language_1    of    87 0.05308115 0.007880261
## 4  politics_and_the_english_language_2    of    80 0.03938946 0.007880261
## 5  politics_and_the_english_language_2    to    56 0.02757262 0.023829563
## 6  politics_and_the_english_language_2    is    54 0.02658789 0.172635495
## 7  politics_and_the_english_language_2     a    53 0.02609552 0.038404720
## 8  politics_and_the_english_language_3   the    53 0.04944030 0.004720701
## 9  politics_and_the_english_language_2   and    52 0.02560315 0.011049836
## 10 politics_and_the_english_language_1   and    46 0.02806589 0.011049836
## # ... with 78,092 more rows, and 1 more variables: tf_idf <dbl>

Document term matrix

The cast_dtm function turns the tidy data frame into a document term matrix. This object comes from the tm package, which you may need to install first. We actually make two document term matrices: one with the raw word counts and one with the tf-idf scores.

# install.packages('tm')

# convert to dtm matrix
bag_of_words_dtm <- chunk_words %>% cast_dtm(chunk, word, count)
tfidf_dtm <- chunk_words %>% cast_dtm(chunk, word, tf_idf)
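
The resulting DocumentTermMatrix is a sparse (triplet-format) object from tm, so you can peek at it without converting it to a dense matrix. For example, tm's inspect function prints the dimensions, sparsity, and a sample of entries:

# peek at a corner of the sparse document term matrix without densifying it
tm::inspect(bag_of_words_dtm[1:2, 1:5])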

First let’s grab the book titles to use as training labels.

# training classes (i.e. book titles)
# yes this is super hacky
row_names <- bag_of_words_dtm$dimnames$Docs
tr_classes <- str_sub(str_extract(row_names, '[a-z_]+'), start = 1, end = -2) %>% factor()

tr_classes[1:5]
## [1] politics_and_the_english_language        
## [2] politics_and_the_english_language        
## [3] politics_and_the_english_language        
## [4] politics_and_the_english_language        
## [5] down_the_mine__from_the_road_to_wigan_pie
## 40 Levels: a_good_word_for_the_vicar_of_bray ... you_and_the_atomic_bomb
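
To see why this works: the regex '[a-z_]+' matches lowercase letters and underscores, so str_extract grabs everything up to the chunk number and leaves a trailing underscore, which str_sub(end = -2) then drops. For example:

# the regex stops at the first digit, leaving a trailing underscore
str_extract('politics_and_the_english_language_2', '[a-z_]+')
## [1] "politics_and_the_english_language_"

# dropping the last character recovers the essay title
str_sub('politics_and_the_english_language_', start = 1, end = -2)
## [1] "politics_and_the_english_language"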

The package we will use for classification doesn’t like the above format, so let’s turn it into a regular (dense) R matrix. Warning: for a large data set, a dense R matrix will be too slow and memory intensive to work with.

X_bag_of_words <- as.matrix(bag_of_words_dtm)
X_tfidf <- as.matrix(tfidf_dtm)

dim(X_bag_of_words)
## [1]   637 12177
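
To get a feel for the cost of densifying, you can check how much memory the dense matrix takes up (roughly 637 x 12177 x 8 bytes, about 60 MB, if stored as doubles):

# dense numeric matrices store every cell (8 bytes each), unlike the sparse dtm
object.size(X_bag_of_words)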

Nearest centroid classifier

Now we have both the X data (document term matrices) and the Y data (essay titles). Let’s fit the nearest centroid classifier on the document term matrices (this is the nm function from the klaR package). We actually fit two classifiers: one on the raw word counts (bag of words) and another on the tf-idf matrix.
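
For intuition: the nearest centroid classifier computes the mean word vector (centroid) of each class and assigns a document to the class whose centroid is closest in Euclidean distance. A minimal hand-rolled sketch of that idea (illustrative only; nm from klaR is what we actually use below):

# one centroid (mean tf-idf vector) per essay
centroids <- rowsum(X_tfidf, group = tr_classes) / as.vector(table(tr_classes))

# assign each chunk to the essay whose centroid is nearest (Euclidean distance)
nearest <- apply(X_tfidf, 1, function(x) {
    dists <- sqrt(rowSums(sweep(centroids, 2, x)^2))
    rownames(centroids)[which.min(dists)]
})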

# fit mean difference classifier (AKA nearest centroid or nearest mean)
bow_classifier <- nm(x=X_bag_of_words, grouping=tr_classes)
tfidf_classifier <- nm(x=X_tfidf, grouping=tr_classes)

# get training predictions
bow_tr_pred <- predict(bow_classifier, newdata = X_bag_of_words)$class
tfidf_tr_pred <- predict(tfidf_classifier, newdata = X_tfidf)$class

# training error
paste0('bag of words based classifier training error: ', mean(tr_classes != bow_tr_pred))
## [1] "bag of words based classifier training error: 0.19309262166405"
paste0('tf-idf based classifier training error: ', mean(tr_classes != tfidf_tr_pred))
## [1] "tf-idf based classifier training error: 0.0565149136577708"