This lecture covers web scraping with the rvest package and SelectorGadget.

Say you’ve decided you want the lyrics to every Katy Perry song. There are many websites that have song lyrics (e.g. www.songlyrics.com), but how do you get these lyrics onto your computer and into a useable format without spending the next three days copying and pasting?

Prerequisites

Before reading further, watch this 2 minute video about SelectorGadget and load the following packages:

# library for webscraping
library(rvest)

library(tidyverse)
library(stringr)

Download the SelectorGadget Chrome extension (note that you may need Google Chrome).

This lecture assumes you are familiar with regular expressions and the tidyverse (tibble data frames and the pipe operator).

The web: HTML and CSS

Most of the web is built out of HTML, CSS and JavaScript. When you visit a webpage, your computer sends a request to a web server, which returns a bunch of text in the form of HTML. Your browser then renders that HTML text into the webpage that you see. You can view the HTML code yourself (in Chrome: View > Developer > View Source), for example:

[Figure: a webpage (HBD Annie!) alongside its raw HTML]

HTML stands for HyperText Markup Language. HTML deals with links and basic formatting. Hypertext is text that links to another webpage. A markup language displays text with formatting (e.g. bold, italics, strikethrough, etc.) and turns text into images, tables, etc. R Markdown uses Markdown, which is a lightweight markup language.

CSS stands for Cascading Style Sheets and is what makes webpages pretty. CSS allows for the separation of presentation and content.

CSS vs. no CSS (from http://html.com/css/)

To learn more about HTML, check out Codecademy’s tutorial.

For our purposes we only need to understand enough HTML to access its contents (Iain knows more Katy Perry lyrics than HTML code…). With rvest and SelectorGadget we don’t need to know too much to do many basic tasks. It’s worth learning more so you can a) do more advanced scraping and b) create your own websites.

An HTML document is made up of a hierarchy of tags. The first division in the hierarchy is into the head and the body. The head contains metadata about the webpage and the body contains the contents. The <a> tag identifies a hyperlink. For example,

<a href="http://www.songlyrics.com/katy-perry/i-kissed-a-girl-lyrics/" title="I Kissed A Girl Lyrics Katy Perry">I Kissed A Girl</a>

displays a link to http://www.songlyrics.com/katy-perry/i-kissed-a-girl-lyrics/ with text I Kissed A Girl.

An HTML document is just a text file that follows specific patterns. Once you have the HTML text you could extract the information you want using regular expressions. This is doable, but a pain. Thanks to the open source community you don’t have to. We will use the rvest package, which is similar to the BeautifulSoup package in Python.
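To make the comparison concrete, here is a minimal sketch that pulls the link target out of the example <a> tag twice: once with a regular expression and once with rvest. The variable names (html_string, link) are made up for this illustration, and the HTML is parsed from a string so no web request is needed.

```r
library(rvest)
library(stringr)

# the example link from above, held in a string
html_string <- '<a href="http://www.songlyrics.com/katy-perry/i-kissed-a-girl-lyrics/" title="I Kissed A Girl Lyrics Katy Perry">I Kissed A Girl</a>'

# regular expression route: capture whatever sits between href=" and the next "
# (works here, but breaks easily if the markup changes)
href_regex <- str_match(html_string, 'href="([^"]*)"')[, 2]

# rvest route: parse the HTML, then ask for the tag and attribute by name
link <- read_html(html_string) %>% html_nodes("a")
href_rvest <- html_attr(link, "href")
link_text  <- html_text(link)

href_regex
href_rvest
link_text
```

Both routes return the same URL for this one tag; the rvest version keeps working when attributes are reordered or extra whitespace appears, which is where hand-written patterns tend to fail.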

Scrape HTML

Let’s take a look at the lyrics of I Kissed a Girl from http://www.songlyrics.com/katy-perry/i-kissed-a-girl-lyrics/

[Figure: the lyrics webpage alongside its raw HTML]

Using rvest we can easily grab the HTML from this page:

song_url <- 'http://www.songlyrics.com/katy-perry/i-kissed-a-girl-lyrics/'

html <- read_html(song_url)  # from rvest
html

## {xml_document}
## <html xmlns="https://www.w3.org/1999/xhtml" lang="en-US">
## [1] <head>\n<title>KATY PERRY - I KISSED A GIRL LYRICS</title>\n<meta ht ...
## [2] <body>\n<div id="fb-root"></div>\n<script>\n\n$(window).load(functio ...

Notice the first division in the html object is the <head> and <body>.
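The same head/body split can be checked programmatically. The sketch below uses a tiny made-up document (doc) parsed from a string rather than the real songlyrics page, and lists the names of its top-level tags.

```r
library(rvest)

# a small stand-in document, parsed from a string (no web request)
doc <- read_html('<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>')

# the direct children of the <html> root, and their tag names
top_level <- html_children(doc)
tag_names <- html_name(top_level)
tag_names
```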

Grab the lyrics from one song

SelectorGadget identifies #songLyricsDiv as the CSS selector corresponding to just the song lyrics. Note that you may have to click on a few things to find the exact selector you are looking for. Be aware that SelectorGadget will not always work perfectly.

#songLyricsDiv

Now we can grab the lyric text:

lyrics <- html %>%
           html_nodes("#songLyricsDiv") %>%
           html_text()
lyrics

## [1] "This was never the way I planned\nNot my intention\nI got so brave\nDrink in hand\nLost my discretion\nIt's not what I'm used to\nJust wanna try you on\nI'm curious for you\nCaught my attention\nI kissed a girl\nAnd I liked it\nThe taste of her cherry chap stick\nI kissed a girl just to try it\nI hope my boyfriend don't mind it\nIt felt so wrong\nIt felt so right\nDon't mean I'm in love tonight\nI kissed a girl\nAnd I liked it\nI liked it\nNo, I don't even know your name\nIt doesn't matter\nYou're my experimental game\nJust human nature\nIt's not what\nGood girls do\nNot how they should behave\nMy head gets so confused\nHard to obey\nI kissed a girl\nAnd I liked it\nThe taste of her cherry chap stick\nI kissed a girl\nJust to try it\nI hope my boyfriend don't mind it\nIt felt so wrong\nIt felt so right\nDon't mean I'm in love tonight\nI kissed a girl\nAnd I liked it\nI liked it\nUs girls we are so magical\nSoft skin\nRed lips\nSo kissable\nHard to resist\nSo touchable\nToo good to deny it\nAin't no big deal\nIt's innocent\nI kissed a girl\nAnd I liked it\nThe taste of her cherry chap stick\nI kissed a girl just to try it\nI hope my boyfriend don't mind it\nIt felt so wrong\nIt felt so right\nDon't mean I'm in love tonight\nI kissed a girl\nAnd I liked it\nI liked it"

The html_nodes function grabs the node matching the #songLyricsDiv selector (the # means "the element with this id"); html_text extracts the text from this node.
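As a small sketch of what these two functions do, the example below runs them on a made-up page (demo) parsed from a string, then splits the result on the newline character so that each lyric line becomes its own element. The page content and variable names are invented for illustration.

```r
library(rvest)
library(stringr)

# made-up page: one navigation element and one lyrics element
demo <- read_html('<div id="nav">Home</div><p id="songLyricsDiv">first line\nsecond line</p>')

# select only the element whose id is songLyricsDiv, then pull out its text
demo_text <- demo %>%
             html_nodes("#songLyricsDiv") %>%
             html_text()

# one string per lyric line
demo_lines <- str_split(demo_text, "\n")[[1]]
demo_lines
```

Splitting on "\n" is one possible next step for the lyrics string shown above; whether you want lines, words, or the whole song as one string depends on the analysis.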

Scrape every Katy Perry song

Now we would like to scrape the lyrics to every Katy Perry song. For our purposes “every Katy Perry song” means “every Katy Perry song listed on songlyrics.com”. You should always be aware of data quality issues; songs might be missing, mislabeled, or duplicated.

songlyrics.com lists 91 Katy Perry songs at http://www.songlyrics.com/katy-perry-lyrics/.

[Figure: Katy Perry song list on songlyrics.com]

Note that Last Friday Night is listed four times under:

Last Friday Night (T.G.I.F.) (Single)
Last Friday Night (T.G.I.F.) (Missy Elliott Remix)
Last Friday Night (T.G.I.F.)
Last Friday Night (T.G.I.F.) (featuring Missy Elliott Remix)

We could get rid of some duplicates if we were so inclined using some combination of heuristic deduplication rules (e.g. flag every song whose title contains another song’s title) and manual inspection.
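A minimal sketch of the "title contains another title" heuristic, using the four listings above plus one unrelated song. The vector titles and the flag contains_other are names invented for this example; flagged songs would still need manual inspection.

```r
library(stringr)

titles <- c("Last Friday Night (T.G.I.F.) (Single)",
            "Last Friday Night (T.G.I.F.) (Missy Elliott Remix)",
            "Last Friday Night (T.G.I.F.)",
            "Last Friday Night (T.G.I.F.) (featuring Missy Elliott Remix)",
            "Hook Up")

# TRUE when a title contains some *other* title as a literal substring;
# fixed() turns off regex interpretation of characters like ( and .
contains_other <- vapply(seq_along(titles), function(i) {
  any(str_detect(titles[i], fixed(titles[-i])))
}, logical(1))

# candidate duplicates to inspect by hand
titles[contains_other]
```

Here the plain "Last Friday Night (T.G.I.F.)" entry is the one left unflagged, since the other three versions all contain it.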

Get all song titles

Using SelectorGadget again, we find that #colone-container .tracklist a is the CSS selector for all of the song names.

artist_url <- 'http://www.songlyrics.com/katy-perry-lyrics/'

song_nodes <- read_html(artist_url) %>% # load the html
             html_nodes("#colone-container .tracklist a")

song_nodes[1:3]

## {xml_nodeset (3)}
## [1] <a itemprop="url" href="http://www.songlyrics.com/katy-perry/hook-up ...
## [2] <a itemprop="url" href="http://www.songlyrics.com/katy-perry/e-t-sin ...
## [3] <a itemprop="url" href="http://www.songlyrics.com/katy-perry/hackens ...

Notice the pattern of the artist_url: http://www.songlyrics.com/ARTIST-lyrics/ where ARTIST is the artist’s name in lower case with spaces replaced by -.
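That pattern can be written as a small helper. The function name artist_url_for is hypothetical, and the sketch assumes the pattern holds exactly as described; artists whose names contain punctuation may be listed under a different URL.

```r
library(stringr)

# hypothetical helper: build the songlyrics.com artist page URL from a name
artist_url_for <- function(artist) {
  slug <- artist %>%
          str_to_lower() %>%        # lower case
          str_replace_all(" ", "-") # spaces become -
  str_c("http://www.songlyrics.com/", slug, "-lyrics/")
}

artist_url_for("Katy Perry")
```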

Now that we have the nodes for each song, we want to extract the song title

# grab the song titles
song_titles <-  html_text(song_nodes)
song_titles[1:3]

## [1] "Hook Up"                              
## [2] "E.T. (Single)"                        
## [3] "Hackensack (Fountains Of Wayne Cover)"

and the URL of the song’s webpage.

# grab the link to each song's page (the href attribute)
song_links <-  html_attr(song_nodes, name='href')
song_links[1:3]

## [1] "http://www.songlyrics.com/katy-perry/hook-up-lyrics/"                            
## [2] "http://www.songlyrics.com/katy-perry/e-t-single-lyrics/"                         
## [3] "http://www.songlyrics.com/katy-perry/hackensack-fountains-of-wayne-cover-lyrics/"
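To see how the selector and the two extractors fit together, here is a sketch on a made-up page whose layout only loosely imitates the artist page; the example.com links and the names demo_page and songs are invented. The selector keeps links inside the .tracklist element under #colone-container and ignores the unrelated link.

```r
library(rvest)
library(tibble)

demo_page <- read_html('<div id="colone-container">
  <table class="tracklist">
    <tr><td><a href="http://example.com/song-one/">Song One</a></td></tr>
    <tr><td><a href="http://example.com/song-two/">Song Two</a></td></tr>
  </table>
</div>
<a href="http://example.com/about/">About</a>')

nodes <- html_nodes(demo_page, "#colone-container .tracklist a")

# one row per song: the visible text and the href attribute
songs <- tibble(title = html_text(nodes),
                link  = html_attr(nodes, name = "href"))
songs
```

Keeping titles and links side by side in one tibble makes it harder for the two vectors to drift out of step later on.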

Now we have all the pieces we need to scrape the lyrics to every Katy Perry song with the for loop below.

Warning

Websites often don’t like it when you scrape too much of their data, and you can get banned from a website if you submit requests too frequently. Typically you (really your IP address) will be banned for 24 hours. Websites vary in their permissiveness and vengefulness; rumor on the street is that you can get yourself kicked off Facebook for attempting to scrape it improperly.

A common solution is to have the computer pause between requests (I used 10 second pauses for songlyrics). Websites don’t care about you viewing some of their data; they care about you downloading a lot of it in one fell swoop (think ad views and competitors). Figuring out an acceptable pause time can take some trial and error.

The pauses make this process slow (10 seconds * 91 songs ≈ 15 minutes). If you want to download a lot of songs you will have to wait a while (at 10 seconds per song, roughly 3,000 songs in an eight-hour night). APIs and bulk downloads can make this process much faster (see below).
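The waiting time is simple arithmetic, sketched here with the 10 second pause used in the scraping loop and the 91 songs listed for Katy Perry. The 8-hour night is an assumption made for this example.

```r
pause_seconds <- 10   # pause used in the scraping loop
n_songs <- 91         # songs listed for Katy Perry

# total time spent pausing, in minutes (ignores download time)
total_minutes <- n_songs * pause_seconds / 60
total_minutes

# songs that fit into an assumed 8-hour overnight run
songs_per_night <- floor(8 * 60 * 60 / pause_seconds)
songs_per_night
```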

Putting it together: scrape all KP lyrics

From the above code we can

get a list of Katy Perry songs
scrape the lyrics of an individual song

We can put this together to scrape the lyrics of every Katy Perry song on songlyrics.

# data frame to store the scraped lyrics
lyrics <- tibble()
for (i in seq_along(song_links[1:2])) { # only grab 2 songs in the .rmd document
    
    # always nice to know where a long program is
    message(str_c('scraping song ', i, ' of ', length(song_links) ))
    
    # scrape the text of a song
    lyrics_scraped <- song_links[i] %>%
                      read_html() %>% 
                      html_nodes("#songLyricsDiv") %>%
                      html_text()
    

    # format the song name for the data frame
    song_name <- song_titles[i] %>% 
                 str_to_lower() %>% 
                 gsub("[^[:alnum:] ]", "", .) %>%
                 gsub("[[:space:]]", "_", .)

    # add song to lyrics data frame
    lyrics <- rbind(lyrics, tibble(text=lyrics_scraped, artist = 'katy_perry', song=song_name) )
   
    # pause so we don't get banned!
    Sys.sleep(10) 
}

## scraping song 1 of 2688

## scraping song 2 of 2688

lyrics

## # A tibble: 2 × 3
##                                                                            text
## *                                                                         <chr>
## 1 Oh, sweetheart, put the bottle down\nYou've got too much talent\nI see you th
## 2 You're so hypnotizing\nCould you be the devil? Could you be an angel?\nYour t
## # ... with 2 more variables: artist <chr>, song <chr>

You should include a try-catch statement (tryCatch in R) around the lyrics_scraped expression so that the scraper doesn’t stop if it fails to scrape a couple of pages.
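One way this could look is sketched below. The helper name scrape_lyrics_safely is hypothetical, and returning NA for a failed page is a design choice made for this example, not part of the original loop. The two calls at the end use an HTML string and a non-existent file so the sketch runs without touching the website.

```r
library(rvest)

# hypothetical helper: returns the lyrics text, or NA if anything goes wrong
scrape_lyrics_safely <- function(source) {
  tryCatch({
    text <- read_html(source) %>%
            html_nodes("#songLyricsDiv") %>%
            html_text()
    # a page without the lyrics element gives an empty result
    if (length(text) == 0) NA_character_ else text[1]
  },
  error = function(e) {
    # report the problem and carry on instead of stopping the whole loop
    message("failed to scrape: ", conditionMessage(e))
    NA_character_
  })
}

ok     <- scrape_lyrics_safely('<p id="songLyricsDiv">la la la</p>')
failed <- scrape_lyrics_safely("this-file-does-not-exist.html")
```

Inside the loop, song_links[i] would be passed as source; rows with NA text can then be filtered out or retried afterwards.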

Mission accomplished: we have scraped the lyrics to every Katy Perry song. Ideally you would implement the above code as a function that takes an artist’s name as input and returns the lyrics to all of their songs. Thanks to rvest and SelectorGadget, it’s pretty easy to generalize this code to other websites.

References

The following are good references for using rvest

Web scraping/processing with rvest and stringr (Amazon reviews)
Scraping Wikipedia
Scraping IMDb
SelectorGadget vignette

HTML references

Getting data from the web

Jenny Bryan’s class has a section on web scraping
Using Python to Access Web Data on Coursera covers web scraping in more detail. Although it’s in Python, the lessons still apply.

Other ways to get song lyrics data

Most of the lyrics from the Million Song Dataset are available for download here.
Musixmatch has an API where you can get 2000 song lyrics a day for free (it’s $25,000/year for more).
genius.com also has an API.

Web Scraping and Katy Perry

STOR 390