Web scraping/processing with rvest and stringr

Note: this lecture was created by David Robinson for BIO 260 and borrowed with permission for STOR 390.

We’ve learned how to process ready-made datasets, as well as how to read them in. But what if your data is on a website, formatted to be read by humans rather than by R?

We’re going to learn to extract data from regular web pages so that it can be analyzed in R. This process is sometimes called “web-scraping” or “screen-scraping”, and the rvest package is a powerful tool for doing it.

Resources

rvest/CSS Selectors

stringr/regular expressions

Introduction to stringr
Regular Expressions/stringr tutorial
Regular Expression online tester: explains a regular expression as it is built, and confirms live whether and how it matches particular text.

Amazon Reviews

We’re going to be scraping this page: it just contains the (first page of) reviews of the ggplot2 book by Hadley Wickham.

library(dplyr)
library(stringr)

url <- "http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/product-reviews/0387981403/ref=cm_cr_dp_qt_see_all_top?ie=UTF8&showViewpoints=1&sortBy=helpful"

We use the rvest package to download this page.

library(rvest)

h <- read_html(url)

Now h is an xml_document that contains the contents of the page:

## {xml_document}
## <html class="a-no-js" data-19ax5a9jf="dingo">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body class="a-m-us a-aui_51744-c a-aui_57326-c a-aui_58736-t1 a-aui ...

How can you actually pull the interesting information out? That’s where CSS selectors come in.

CSS Selectors

CSS selectors are a way to specify a subset of nodes (that is, units of content) on a web page (e.g., just getting the titles of reviews). CSS selectors are very powerful and not too challenging to master (here’s a great tutorial). But honestly you can get a lot done even with very little understanding, by using a tool called SelectorGadget.

Install the SelectorGadget on your web browser. (If you use Chrome you can use the Chrome extension, otherwise drag the provided link into your bookmarks bar). Here’s a guide for how to use it with rvest to “point-and-click” your way to a working selector.

For example, if you just wanted the titles, you’ll end up with a selector that looks something like .a-color-base. You can pipe your HTML object along with that selector into the html_nodes function, to select just those nodes:

h %>%
  html_nodes(".a-color-base")

## {xml_nodeset (10)}
##  [1] <a data-hook="review-title" class="a-size-base a-link-normal review ...
##  [2] <a data-hook="review-title" class="a-size-base a-link-normal review ...
##  [3] <a data-hook="review-title" class="a-size-base a-link-normal review ...
##  [4] <a data-hook="review-title" class="a-size-base a-link-normal review ...
##  [5] <a data-hook="review-title" class="a-size-base a-link-normal review ...
##  [6] <a data-hook="review-title" class="a-size-base a-link-normal review ...
##  [7] <a data-hook="review-title" class="a-size-base a-link-normal review ...
##  [8] <a data-hook="review-title" class="a-size-base a-link-normal review ...
##  [9] <a data-hook="review-title" class="a-size-base a-link-normal review ...
## [10] <a data-hook="review-title" class="a-size-base a-link-normal review ...

But you need the text from each of these, not the full tags. Pipe to the html_text function to pull these out:

review_titles <- h %>%
  html_nodes(".a-color-base") %>%
  html_text()

review_titles

##  [1] "Nice resource, but already out of date"                                                      
##  [2] "Still a great package and highly worth learning - but the text is getting quite out of date."
##  [3] "Good book - avoid the kindle edition"                                                        
##  [4] "Tippping point for R data visualization."                                                    
##  [5] "A new era for statistical graphics"                                                          
##  [6] "Ok, but not current"                                                                         
##  [7] "A classic"                                                                                   
##  [8] "Excellent content, poor adaptation to kindle"                                                
##  [9] "Ggplot2 - graphs that made me leave excel and use R"                                         
## [10] "This classic book is now 49 (dog) years old..."

Now we’ve extracted something useful! Similarly, let’s grab the format (for example, paperback or Kindle edition). Some experimentation with SelectorGadget shows it’s:

h %>%
  html_nodes(".a-size-mini.a-color-secondary") %>%
  html_text()

##  [1] "Format: Paperback"      "Format: Paperback"     
##  [3] "Format: Kindle Edition" "Format: Kindle Edition"
##  [5] "Format: Paperback"      "Format: Paperback"     
##  [7] "Format: Paperback"      "Format: Kindle Edition"
##  [9] "Format: Paperback"      "Format: Paperback"

It’s a little annoying that every entry starts with Format:. We can strip that prefix with str_replace from the stringr package.

library(stringr)

formats <- h %>%
  html_nodes(".a-size-mini.a-color-secondary") %>%
  html_text() %>%
  str_replace("Format: ", "")

formats

##  [1] "Paperback"      "Paperback"      "Kindle Edition" "Kindle Edition"
##  [5] "Paperback"      "Paperback"      "Paperback"      "Kindle Edition"
##  [9] "Paperback"      "Paperback"
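The selectors above depend on Amazon’s live markup, which can change at any time. As a minimal offline sketch, the same steps can be tried on a small hand-written page. The HTML, ids, class names, and variable names below are hypothetical and are not Amazon’s real markup; the sketch assumes read_html accepts a literal HTML string, as xml2 documents.

```r
library(rvest)
library(stringr)

# Hypothetical markup, loosely modelled on a review list (not Amazon's real HTML)
toy_page <- read_html('
<html><body>
  <div id="review_list">
    <div class="review">
      <a class="title link">Great book</a>
      <span class="meta small">Format: Paperback</span>
    </div>
    <div class="review">
      <a class="title link">Out of date</a>
      <span class="meta small">Format: Kindle Edition</span>
    </div>
  </div>
  <span class="meta">Format: Footer</span>
</body></html>')

# ".title" matches every node whose class list includes "title"
toy_titles <- toy_page %>%
  html_nodes(".title") %>%
  html_text()

# ".meta.small" (no space) requires both classes on the same node,
# so the footer span (class "meta" only) is not selected
toy_formats <- toy_page %>%
  html_nodes(".meta.small") %>%
  html_text() %>%
  str_replace("Format: ", "")

# "#review_list .meta" (with a space) matches ".meta" nodes nested
# anywhere inside the node whose id is "review_list"
toy_nested <- toy_page %>%
  html_nodes("#review_list .meta") %>%
  html_text()
```

The difference between `.meta.small` and `#review_list .meta` is the space: without it, both conditions apply to one node; with it, the second part is searched for inside the first.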

Number of stars

Next, let’s get the number of stars. Some clicking with SelectorGadget finds a selector expression that will work:

h %>%
  html_nodes("#cm_cr-review_list .review-rating")

## {xml_nodeset (10)}
##  [1] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
##  [2] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
##  [3] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
##  [4] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
##  [5] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
##  [6] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
##  [7] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
##  [8] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
##  [9] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...
## [10] <i data-hook="review-star-rating" class="a-icon a-icon-star a-star- ...

We can confirm these are the right tags (and there are ten of them, just like there are ten titles, which is a good start). There’s more going on in these than we need to worry about (they aren’t just text; they’re replaced with images in the web page), but using html_text still gets the relevant text out:

h %>%
  html_nodes("#cm_cr-review_list .review-rating") %>%
  html_text()

##  [1] "4.0 out of 5 stars" "3.0 out of 5 stars" "3.0 out of 5 stars"
##  [4] "4.0 out of 5 stars" "5.0 out of 5 stars" "2.0 out of 5 stars"
##  [7] "5.0 out of 5 stars" "5.0 out of 5 stars" "4.0 out of 5 stars"
## [10] "3.0 out of 5 stars"

Now we need to pull out just the digit, 1-5. This can be done with regular expressions, which are very powerful tools for working with text through “patterns” (see here for one resource).

We’ll use the stringr package:

h %>%
  html_nodes("#cm_cr-review_list .review-rating") %>%
  html_text() %>%
  str_extract("\\d")

##  [1] "4" "3" "3" "4" "5" "2" "5" "5" "4" "3"

Note that we piped the character vector to the str_extract function, which pulls out the first part of each string that matches a pattern. The \\d pattern means a single digit (that is, 0-9).

Finally, we have to turn them from a character vector to a numeric vector:

number_stars <- h %>%
  html_nodes("#cm_cr-review_list .review-rating") %>%
  html_text() %>%
  str_extract("\\d") %>%
  as.numeric()

number_stars

##  [1] 4 3 3 4 5 2 5 5 4 3
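To see what \\d does in isolation, here is a small sketch on literal strings. The third string and the variable names are made up for illustration; it shows that str_extract keeps only the first match in each string, while str_extract_all returns every match.

```r
library(stringr)

# The first two strings mimic the ratings above; the third is hypothetical
toy_ratings <- c("4.0 out of 5 stars", "3.0 out of 5 stars", "10 out of 10")

# "\\d" matches one digit (0-9); str_extract() returns the first match per string
first_digit <- str_extract(toy_ratings, "\\d")

# str_extract_all() returns every match; [[1]] picks the result for the one input
all_digits <- str_extract_all("4.0 out of 5 stars", "\\d")[[1]]

# Character digits become numbers with as.numeric()
toy_stars <- as.numeric(first_digit)
```

Here first_digit is "4", "3", "1" and all_digits is "4", "0", "5". Taking only the first digit works for the star ratings because the rating always leads the string and is a single digit.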

The same applies to the number of people that found a review useful. Let’s collect that too:

h %>%
  html_nodes("#cm_cr-review_list .review-votes") %>%
  html_text()

##  [1] "\n          42 people found this helpful.\n        " 
##  [2] "\n          12 people found this helpful.\n        " 
##  [3] "\n          14 people found this helpful.\n        " 
##  [4] "\n          7 people found this helpful.\n        "  
##  [5] "\n          7 people found this helpful.\n        "  
##  [6] "\n          6 people found this helpful.\n        "  
##  [7] "\n          2 people found this helpful.\n        "  
##  [8] "\n          One person found this helpful.\n        "
##  [9] "\n          2 people found this helpful.\n        "  
## [10] "\n          2 people found this helpful.\n        "

The difference is that here we don’t want just one digit, since there could be several. We can add a + (meaning “one or more”) after the \\d to match that:

h %>%
  html_nodes("#cm_cr-review_list .review-votes") %>%
  html_text() %>%
  str_extract("\\d+")

##  [1] "42" "12" "14" "7"  "7"  "6"  "2"  NA   "2"  "2"

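The review reading “One person found this helpful” contains no digits, so it comes back as NA. You’ll still need as.numeric() to convert the rest: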

number_helpful <- h %>%
  html_nodes("#cm_cr-review_list .review-votes") %>%
  html_text() %>%
  str_extract("\\d+") %>%
  as.numeric()

number_helpful

##  [1] 42 12 14  7  7  6  2 NA  2  2
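The NA in position 8 comes from the text “One person found this helpful”, which has no digits to extract. One possible way to recover it is sketched below on two literal strings copied from the output above. It assumes that “One person” is the only spelled-out count the page uses, which the source does not confirm; the variable names are illustrative.

```r
library(stringr)

toy_votes <- c(
  "\n          42 people found this helpful.\n        ",
  "\n          One person found this helpful.\n        "
)

# Drop the surrounding whitespace and newlines first
votes_clean <- str_trim(toy_votes)

# Extract the run of digits; strings without digits give NA
toy_helpful <- as.numeric(str_extract(votes_clean, "\\d+"))

# Assumption: a missing count whose text starts with "One person" means 1
toy_helpful[is.na(toy_helpful) & str_detect(votes_clean, "^One person")] <- 1
```

After this, toy_helpful holds 42 and 1. Any other text without digits would still be NA, which is safer than silently guessing a value.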

Now we have all our data, from the first page:

ret <- tibble(review_titles, formats, number_stars, number_helpful)
ret

## # A tibble: 10 × 4
##                                                                  review_titles
##                                                                          <chr>
## 1                                       Nice resource, but already out of date
## 2  Still a great package and highly worth learning - but the text is getting q
## 3                                         Good book - avoid the kindle edition
## 4                                     Tippping point for R data visualization.
## 5                                           A new era for statistical graphics
## 6                                                          Ok, but not current
## 7                                                                    A classic
## 8                                 Excellent content, poor adaptation to kindle
## 9                          Ggplot2 - graphs that made me leave excel and use R
## 10                              This classic book is now 49 (dog) years old...
## # ... with 3 more variables: formats <chr>, number_stars <dbl>,
## #   number_helpful <dbl>

Multiple pages

Take a look at the URL for the second page:

http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/product-reviews/0387981403/ref=undefined_2?ie=UTF8&showViewpoints=1&sortBy=helpful&pageNumber=2

Notice that pageNumber=2 at the end? Try adding a few values there. We see we can get all 5 URLs easily.

url_base <- "http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/product-reviews/0387981403/ref=undefined_2?ie=UTF8&showViewpoints=1&sortBy=helpful&pageNumber="
urls <- paste0(url_base, 1:5)
urls

## [1] "http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/product-reviews/0387981403/ref=undefined_2?ie=UTF8&showViewpoints=1&sortBy=helpful&pageNumber=1"
## [2] "http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/product-reviews/0387981403/ref=undefined_2?ie=UTF8&showViewpoints=1&sortBy=helpful&pageNumber=2"
## [3] "http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/product-reviews/0387981403/ref=undefined_2?ie=UTF8&showViewpoints=1&sortBy=helpful&pageNumber=3"
## [4] "http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/product-reviews/0387981403/ref=undefined_2?ie=UTF8&showViewpoints=1&sortBy=helpful&pageNumber=4"
## [5] "http://www.amazon.com/ggplot2-Elegant-Graphics-Data-Analysis/product-reviews/0387981403/ref=undefined_2?ie=UTF8&showViewpoints=1&sortBy=helpful&pageNumber=5"

We may then want to scrape and combine all reviews. The way I like to do this is to create a read_page_reviews function, then use lapply and dplyr’s bind_rows to combine the results:

read_page_reviews <- function(url) {
  # Download and parse the page for this URL, rather than reusing the global h
  h <- read_html(url)

  title <- h %>%
    html_nodes(".a-color-base") %>%
    html_text()
  
  format <- h %>%
    html_nodes(".a-size-mini.a-color-secondary") %>%
    html_text()
  
  helpful <- h %>%
    html_nodes("#cm_cr-review_list .review-votes") %>%
    html_text() %>%
    str_extract("\\d+") %>%
    as.numeric()
  
  stars <- h %>%
    html_nodes("#cm_cr-review_list .review-rating") %>%
    html_text() %>%
    str_extract("\\d") %>%
    as.numeric()

  tibble(title, format, stars, helpful)
}

ggplot2_reviews <- bind_rows(lapply(urls, read_page_reviews))

# Show the first ten rows, which come from the first page
knitr::kable(head(ggplot2_reviews, 10))

title	format	stars	helpful
Nice resource, but already out of date	Format: Paperback	4	42
Still a great package and highly worth learning - but the text is getting quite out of date.	Format: Paperback	3	12
Good book - avoid the kindle edition	Format: Kindle Edition	3	14
Tippping point for R data visualization.	Format: Kindle Edition	4	7
A new era for statistical graphics	Format: Paperback	5	7
Ok, but not current	Format: Paperback	2	6
A classic	Format: Paperback	5	2
Excellent content, poor adaptation to kindle	Format: Kindle Edition	5	NA
Ggplot2 - graphs that made me leave excel and use R	Format: Paperback	4	2
This classic book is now 49 (dog) years old…	Format: Paperback	3	2
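The function-plus-lapply-plus-bind_rows pattern can be tried offline as well. The sketch below is not the scraper above: the two HTML strings, the class names, and the names parse_page and toy_reviews are hypothetical stand-ins for downloaded pages. The point it illustrates is that the page is parsed inside the function from its argument, so each call works on its own page.

```r
library(rvest)
library(dplyr)
library(stringr)

# Hypothetical stand-ins for two downloaded pages
toy_pages <- c(
  '<html><body><a class="title">First</a><i class="rating">4.0 out of 5 stars</i></body></html>',
  '<html><body><a class="title">Second</a><i class="rating">2.0 out of 5 stars</i></body></html>'
)

parse_page <- function(src) {
  # Parse the argument here so every call reads its own page
  doc <- read_html(src)

  tibble(
    title = doc %>% html_nodes(".title") %>% html_text(),
    stars = doc %>%
      html_nodes(".rating") %>%
      html_text() %>%
      str_extract("\\d") %>%
      as.numeric()
  )
}

# One tibble per page, stacked into a single table
toy_reviews <- bind_rows(lapply(toy_pages, parse_page))
```

A quick check after a run like this is that the combined table has distinct rows from each page; identical blocks repeated once per URL would suggest the function is reading a variable from outside itself instead of its argument.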