The acronym API stands for Application Programming Interface. An API is simply a way for the developers of a piece of software to make its functionality available to others programmatically. For instance, the developers of Google Maps, Microsoft Office, and Twitter have created packages in various programming languages so that we can program against their software.

Obviously, these applications already have interfaces that we know and love. The interfaces we generally interact with day to day are GUIs (Graphical User Interfaces). GUIs tend to be intuitive to use and nice to look at, but they are not good for large workloads. Suppose, for instance, you wanted to geocode 30,000 addresses into latitude, longitude points. You could easily do this by hand with Google search for 5 to 10 points, but 30,000 would cause real distress. Fortunately, Google has provided its Maps API, which will do this for you in about 5 minutes. Moreover, the first 2,500 codings are free, and 100,000 is $50! As this example shows, the value of APIs is their ability to remove human interaction from software usage where it is unwanted or unneeded.

In this class we will mainly be concerned with APIs that grant access to data. In addition to Google’s Maps API, one can find APIs by Facebook, Yahoo, Tumblr, and many, many more by Google. Private companies are not the only sources of API data access, either. The US Census Bureau has an API that grants access to demographic information from the various surveys it conducts. Once one starts looking, it becomes apparent that everyone wants to share access to their data.

Prerequisites

You will need to get an API key for Google Maps if you want to run the code in this lecture (the one under Authentication for the standard API, API keys).

library(jsonlite)

library(tidyverse)

# you should keep your API key a secret
my_key <- 'XXXXXXX' # copy and paste your API key here
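
One way to keep the key out of the script itself is to read it from an environment variable. The sketch below is only an illustration: the variable name GOOGLE_MAPS_API_KEY and the helper read_api_key are hypothetical names, and the variable would normally be set outside the script (for example in a ~/.Renviron file).

```r
# Hypothetical helper: read an API key from an environment variable and
# fail loudly if it is missing, so the key never appears in the script.
read_api_key <- function(var = "GOOGLE_MAPS_API_KEY") {
    key <- Sys.getenv(var, unset = "")
    if (!nzchar(key)) {
        stop("Environment variable ", var, " is not set")
    }
    key
}

# For demonstration only: this line stands in for an entry in ~/.Renviron
Sys.setenv(GOOGLE_MAPS_API_KEY = "XXXXXXX")

my_key <- read_api_key()
```

With this approach the script can be shared or committed without exposing the key.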

Access

Alright, well everyone wants us to have their data, and we want to have it. Let’s go get that data and start making magic!

… of course there are details to trip us up.

So how does one solve a real world problem using APIs?

1. recognize the problem, hope that an API exists, and start Googling

Unlike in school, no one in the workaday world is going to tell us to go fix an easy issue with a known API. So when we are presented with a problem that requires data we do not possess, we must imagine the data that we want and check whether it has been made available to us. This, arguably, is the most important step in using APIs for analytics.

2. read the docs

Once a potential API is discovered, read the documentation to determine whether the data we need is available. The particular fields matter, so we must check what they are. While we are determining what data is available, we should also look for a couple of key pieces of info.

  1. What is the request format?
    • Often in the form of an http://….. request; it may also be offered through a package in R, Python, or JavaScript
  2. What is the return type?
    • Expect json or xml

3. sign up if necessary

Many APIs are provided only with authorization, even when access is free.

4. request some sample data and review

To start with the actual programming, it is best to request 10 to 100 data points. This will allow us to practice making requests and to get a look at the file format and the data structure. We should expect text file formats to be returned; common formats include json and xml. The structure of the data may be highly nested and require some work upon receipt to “tidy” up.
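
To make the idea of nested data concrete, here is a minimal sketch using the jsonlite package. The JSON text is a hand-written stand-in, not real API output, and the field names are assumptions chosen for illustration.

```r
library(jsonlite)

# A hand-written stand-in for a small API response (not real API output)
sample_txt <- '{
  "results": [
    {"name": "a", "geometry": {"location": {"lat": 1.5, "lng": -2.5}}},
    {"name": "b", "geometry": {"location": {"lat": 3.0, "lng": -4.0}}}
  ],
  "status": "OK"
}'

parsed <- fromJSON(sample_txt)

# Inspect the nesting: results is a data frame whose geometry column is
# itself a data frame
str(parsed, max.level = 2)

# Use the jsonlite:: prefix because purrr (loaded by tidyverse) also
# exports a function named flatten
flat <- jsonlite::flatten(parsed$results)
names(flat)
```

Flattening turns the nested columns into ordinary ones with dotted names such as geometry.location.lat, which are easier to work with in a tidy workflow.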

5. write formatting functions against the sample data

Use the sample data to write a couple of small functions to turn the API’s data into a “tidy” data set. Working on the small data set for the programming tasks will allow you to solve problems in less time, review function output easily during development, and potentially save you from wasting valuable (and possibly limited) requests against the API.
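
As a rough sketch of this step, a formatting function can be written and checked against a saved sample without touching the API at all. The function name response_to_row and the hand-built sample_parsed object are hypothetical; the field names mirror the Google Maps response examined later in this lecture.

```r
# Hypothetical formatter: reduce one parsed response to a one-row data frame
response_to_row <- function(parsed) {
    geometry <- parsed$results$geometry
    data.frame(
        lat = geometry$location$lat[1],
        lon = geometry$location$lng[1],
        loctype = geometry$location_type[1],
        stringsAsFactors = FALSE
    )
}

# Hand-built stand-in shaped like a parsed sample response
sample_parsed <- list(
    results = list(
        geometry = list(
            location = list(lat = 41.3996784, lng = -75.674358),
            location_type = "ROOFTOP"
        )
    ),
    status = "OK"
)

row <- response_to_row(sample_parsed)

# Quick checks during development: one row, expected columns and types
stopifnot(nrow(row) == 1)
stopifnot(identical(names(row), c("lat", "lon", "loctype")))
stopifnot(is.numeric(row$lat), is.character(row$loctype))
```

Checks like these cost no API requests and catch formatting mistakes before the full pull.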

6. get the full data set and parse

Once step 5 is complete, make a full request against the API and receive the dataset. Use the functions produced in step 5 to parse the data set. Store the dataset in a useful format (csv, json) in the project files before continuing with analysis. Put your API request code and analysis code in separate files.
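
Storing the parsed data can be as simple as the sketch below. The data frame and the file name geocoded_sample.csv are placeholders, and a temporary directory is used here only so the example runs anywhere; in a project the file would live in the project folder.

```r
# Placeholder for a parsed API result
geocoded <- data.frame(
    street = c("1000 S Washington Ave", "1420 5th Ave"),
    city = c("Scranton", "Seattle"),
    state = c("PA", "WA"),
    lat = c(41.39968, 47.61046),
    lon = c(-75.67436, -122.33461),
    stringsAsFactors = FALSE
)

# Write the data to disk so the API does not have to be queried again
out_path <- file.path(tempdir(), "geocoded_sample.csv")
write.csv(geocoded, out_path, row.names = FALSE)

# The analysis script can later start from the stored file
reloaded <- read.csv(out_path, stringsAsFactors = FALSE)
```

Keeping the request script and the analysis script separate means the analysis can be rerun from the stored file without spending more requests.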

Google Maps API

Suppose we are working for a company that has three datasets relating addresses to latitude, longitude coordinates and wants to determine which is the most accurate. What do we do? Geocoding.

STEP 1.

The problem here is that there is no ground truth. How can we know which is THE most accurate if we do not know where the latitudes and longitudes are supposed to point?

…. if only we had data that told us this thing.

Is there an API which provides reliable or defensible address to lat, lon geocoding? Let’s use the Google Maps API.

STEP 2.

The Google machine will tell you that the docs live here: https://developers.google.com/maps/documentation/. It looks like we have access to geocoding here: https://developers.google.com/maps/documentation/geocoding/start. A sample request looks like this:

https://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&key=YOUR_API_KEY

so we need to use the base string:

https://maps.googleapis.com/maps/api/geocode/json

with

address=AN+ADDRESS+OF+INTEREST

and our API key attached

key=YOUR_API_KEY

It looks like data will come back in JSON format.
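
The pieces of the request described in this step (base string, address, key) can be assembled generically. The helper below, build_query_url, is a hypothetical name and only a sketch; it percent-encodes values with utils::URLencode rather than substituting "+" for spaces as the sample request does, on the assumption that the API accepts standard URL encoding.

```r
# Hypothetical helper: build a request URL from a base string and a named
# list of query parameters, percent-encoding each value
build_query_url <- function(base, params) {
    encoded <- vapply(
        params,
        function(v) utils::URLencode(as.character(v), reserved = TRUE),
        character(1)
    )
    query <- paste(names(params), encoded, sep = "=", collapse = "&")
    paste0(base, "?", query)
}

url <- build_query_url(
    "https://maps.googleapis.com/maps/api/geocode/json",
    list(
        address = "1600 Amphitheatre Parkway, Mountain View, CA",
        key = "YOUR_API_KEY"
    )
)
url
```

Encoding every value this way also protects against characters such as "#" or "&" that sometimes appear in street addresses.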

STEP 3.

Let’s get those API keys from Google.

STEP 4.

Most of the work will be done here. First we need the set of addresses to investigate. These are found in the file “place_locs.csv”.

# Load dataset
locs_df = read_csv("place_locs.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   lat = col_double(),
##   lon = col_double(),
##   street = col_character(),
##   city = col_character(),
##   state = col_character()
## )
head(locs_df)
## # A tibble: 6 × 6
##       id      lat        lon                    street         city state
##    <int>    <dbl>      <dbl>                     <chr>        <chr> <chr>
## 1 110367 41.39887  -75.67384     1000 S Washington Ave     Scranton    PA
## 2 164548 47.60989 -122.33487              1420 5th Ave      Seattle    WA
## 3 355416 40.85991  -96.68195            4900 N 27th St      Lincoln    NE
## 4 394769 38.93515  -76.95040       3831 Bladensburg Rd Colmar Manor    MD
## 5 123365 43.31304  -87.92505 1020 Port Washington Road      Grafton    WI
## 6 361732 33.76289 -116.30069    39615 WASHINGTON STE H  PALM DESERT    CA

We are going to need to pull addresses from this df, convert them to URLs, request the data, and then transform it for use. We will address these in order.

  1. We will build a function to create URLs from two arguments, a dataframe row and our Google key. This will cover steps one and two above. Notice the various functions used in building this routine; they will come in handy the next time this task is necessary.
# URL building function
row_to_url <- function(row, key) {
    # set the URL base
    basestring = "https://maps.googleapis.com/maps/api/geocode/json?"
    
    # replace white space with "+" in individual address fields
    street = gsub(" ", "+", row$street)
    city = gsub(" ", "+", row$city) 
    state = gsub(" ", "+", row$state)
    
    # concatenate subfields of address
    address = paste(street, city, state, sep=",+")
    address = paste0("address=", address)

    # create url string from base and address parts
    urlstring = paste0(basestring, address)

    # add key to URL and return
    key = paste0("key=", key)
    urlwithkey = paste(urlstring, key, sep="&")
    
    return(urlwithkey)
}
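
Because gsub and paste are vectorized, the same steps can be applied to whole columns at once. The variant below, with the hypothetical name rows_to_urls, is a sketch of that idea; the two-row data frame is copied from the first rows of the dataset shown above.

```r
# Hypothetical vectorized variant: build one URL per row of df
rows_to_urls <- function(df, key) {
    basestring <- "https://maps.googleapis.com/maps/api/geocode/json?"

    # replace white space with "+" in each address column
    parts <- lapply(
        df[c("street", "city", "state")],
        function(col) gsub(" ", "+", col)
    )

    # concatenate the address fields, then attach the base string and key
    address <- paste(parts$street, parts$city, parts$state, sep = ",+")
    paste0(basestring, "address=", address, "&key=", key)
}

demo_df <- data.frame(
    street = c("1000 S Washington Ave", "1420 5th Ave"),
    city = c("Scranton", "Seattle"),
    state = c("PA", "WA"),
    stringsAsFactors = FALSE
)

urls <- rows_to_urls(demo_df, "XXXXXXX")
urls
```

This returns a character vector with one URL per row, which can be reviewed in bulk before any request is sent.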

Review to ensure proper behavior.

row_to_url(locs_df[1, ], my_key)
## [1] "https://maps.googleapis.com/maps/api/geocode/json?address=1000+S+Washington+Ave,+Scranton,+PA&key=XXXXXXX"

Now let’s request some sample data.

url_1 = row_to_url(locs_df[1, ], my_key)
url_1
## [1] "https://maps.googleapis.com/maps/api/geocode/json?address=1000+S+Washington+Ave,+Scranton,+PA&key=XXXXXXX"

We can use the readLines() function to retrieve data from the API by calling readLines(aURL).

dat = readLines(url_1)
dat
##  [1] "{"                                                                                   
##  [2] "   \"results\" : ["                                                                  
##  [3] "      {"                                                                             
##  [4] "         \"address_components\" : ["                                                 
##  [5] "            {"                                                                       
##  [6] "               \"long_name\" : \"1000\","                                            
##  [7] "               \"short_name\" : \"1000\","                                           
##  [8] "               \"types\" : [ \"street_number\" ]"                                    
##  [9] "            },"                                                                      
## [10] "            {"                                                                       
## [11] "               \"long_name\" : \"South Washington Avenue\","                         
## [12] "               \"short_name\" : \"S Washington Ave\","                               
## [13] "               \"types\" : [ \"route\" ]"                                            
## [14] "            },"                                                                      
## [15] "            {"                                                                       
## [16] "               \"long_name\" : \"South Side\","                                      
## [17] "               \"short_name\" : \"South Side\","                                     
## [18] "               \"types\" : [ \"neighborhood\", \"political\" ]"                      
## [19] "            },"                                                                      
## [20] "            {"                                                                       
## [21] "               \"long_name\" : \"Scranton\","                                        
## [22] "               \"short_name\" : \"Scranton\","                                       
## [23] "               \"types\" : [ \"locality\", \"political\" ]"                          
## [24] "            },"                                                                      
## [25] "            {"                                                                       
## [26] "               \"long_name\" : \"Lackawanna County\","                               
## [27] "               \"short_name\" : \"Lackawanna County\","                              
## [28] "               \"types\" : [ \"administrative_area_level_2\", \"political\" ]"       
## [29] "            },"                                                                      
## [30] "            {"                                                                       
## [31] "               \"long_name\" : \"Pennsylvania\","                                    
## [32] "               \"short_name\" : \"PA\","                                             
## [33] "               \"types\" : [ \"administrative_area_level_1\", \"political\" ]"       
## [34] "            },"                                                                      
## [35] "            {"                                                                       
## [36] "               \"long_name\" : \"United States\","                                   
## [37] "               \"short_name\" : \"US\","                                             
## [38] "               \"types\" : [ \"country\", \"political\" ]"                           
## [39] "            },"                                                                      
## [40] "            {"                                                                       
## [41] "               \"long_name\" : \"18505\","                                           
## [42] "               \"short_name\" : \"18505\","                                          
## [43] "               \"types\" : [ \"postal_code\" ]"                                      
## [44] "            }"                                                                       
## [45] "         ],"                                                                         
## [46] "         \"formatted_address\" : \"1000 S Washington Ave, Scranton, PA 18505, USA\","
## [47] "         \"geometry\" : {"                                                           
## [48] "            \"location\" : {"                                                        
## [49] "               \"lat\" : 41.3996784,"                                                
## [50] "               \"lng\" : -75.674358"                                                 
## [51] "            },"                                                                      
## [52] "            \"location_type\" : \"ROOFTOP\","                                        
## [53] "            \"viewport\" : {"                                                        
## [54] "               \"northeast\" : {"                                                    
## [55] "                  \"lat\" : 41.4010273802915,"                                       
## [56] "                  \"lng\" : -75.67300901970849"                                      
## [57] "               },"                                                                   
## [58] "               \"southwest\" : {"                                                    
## [59] "                  \"lat\" : 41.3983294197085,"                                       
## [60] "                  \"lng\" : -75.67570698029149"                                      
## [61] "               }"                                                                    
## [62] "            }"                                                                       
## [63] "         },"                                                                         
## [64] "         \"place_id\" : \"ChIJoXle6iPfxIkR14xARxe5SvI\","                            
## [65] "         \"types\" : [ \"street_address\" ]"                                         
## [66] "      }"                                                                             
## [67] "   ],"                                                                               
## [68] "   \"status\" : \"OK\""                                                              
## [69] "}"

The value dat is a character vector of JSON text, so we use the jsonlite package to convert it.

datdf = fromJSON(dat)

Looking at the returned data we see that it is, in fact, pretty heavily nested.

datdf$results
##                                                                                                                                                                                                                                                                                                                                                                address_components
## 1 1000, South Washington Avenue, South Side, Scranton, Lackawanna County, Pennsylvania, United States, 18505, 1000, S Washington Ave, South Side, Scranton, Lackawanna County, PA, US, 18505, street_number, route, neighborhood, political, locality, political, administrative_area_level_2, political, administrative_area_level_1, political, country, political, postal_code
##                                formatted_address geometry.location.lat
## 1 1000 S Washington Ave, Scranton, PA 18505, USA              41.39968
##   geometry.location.lng geometry.location_type
## 1             -75.67436                ROOFTOP
##   geometry.viewport.northeast.lat geometry.viewport.northeast.lng
## 1                        41.40103                       -75.67301
##   geometry.viewport.southwest.lat geometry.viewport.southwest.lng
## 1                        41.39833                       -75.67571
##                      place_id          types
## 1 ChIJoXle6iPfxIkR14xARxe5SvI street_address

The latitude can be accessed as follows.

datdf$results$geometry$location$lat
## [1] 41.39968

The longitude is accessed similarly.

datdf$results$geometry$location$lng
## [1] -75.67436

And there appears to be a useful location type field here.

datdf$results$geometry$location_type
## [1] "ROOFTOP"
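
The response also carries a top-level status field ("OK" in the output above), which is worth checking before reading the nested values. The sketch below assumes that an address with no match comes back with an empty results array and a status of "ZERO_RESULTS"; the helper names has_result and safe_lat are hypothetical, and the JSON strings are trimmed, hand-written stand-ins for real responses.

```r
library(jsonlite)

# Trimmed stand-ins for a successful and an empty response
ok_txt <- '{"results":[{"geometry":{"location":{"lat":41.3996784,"lng":-75.674358},"location_type":"ROOFTOP"}}],"status":"OK"}'
empty_txt <- '{"results":[],"status":"ZERO_RESULTS"}'

# Hypothetical guard: TRUE only when the request succeeded and has results
has_result <- function(parsed) {
    identical(parsed$status, "OK") && length(parsed$results) > 0
}

# Hypothetical accessor: latitude of the first result, or NA if none
safe_lat <- function(parsed) {
    if (has_result(parsed)) {
        parsed$results$geometry$location$lat[1]
    } else {
        NA_real_
    }
}

safe_lat(fromJSON(ok_txt))
safe_lat(fromJSON(empty_txt))
```

Returning NA for a failed lookup keeps one output row per input address, which makes the later join simpler.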

STEP 5.

Now we come to the task of transforming the data for use. For any returned piece of data we will need the lat, lon, and a join key to relate the new info to the inputs. Let’s go ahead and collect up the location_type field as well. For a join key we will use the inputs themselves. Ultimately we would like a dataframe that has the join keys and the new info together for use with the initial data. Let’s write a couple of functions to do this for us.

The functions we need are:

# an outer function to control the main process
# receives the dataframe full of info for request and the google key
# returns a dataframe with the join keys and new data together in a row
map_data <- function(df, key)

# make the request, receive the json data, parse and transform to an R dataframe
get_and_parse <- function(urlstring)

# transform a single request data frame into a useful format and append to the dataframe df
# reqdat is the row of request data used as the join key
# returns the extended dataframe
add_row <- function(df, jsndat, reqdat)

Now let’s implement the functions stubbed out above.

get_and_parse <- function(urlstring) {
    dat = readLines(urlstring)
    datdf = fromJSON(dat)
    return(datdf)
}

add_row <- function(df, jsndat, reqdat) {
    # expects df to have columns [street, city, state, lat, lon, loctype]
    # req dat is a row of data used for request [street, city, state]
    resdf = data.frame(
        street = reqdat$street,
        city = reqdat$city,
        state = reqdat$state,
        lat = jsndat$results$geometry$location$lat,
        lon = jsndat$results$geometry$location$lng,
        loctype = jsndat$results$geometry$location_type
    )
    return(rbind(df, resdf))
}

map_data <- function(df, key) {
    # expects df to have columns [street, city, state]
    
    # initialize a dataframe with appropriate columns using the first row of
    # our request data
    reqrow = df[1, c("street", "city", "state")]
    qryurl = row_to_url(reqrow, key)
    jsndf = get_and_parse(qryurl)

    newdf = data.frame(
        street = reqrow$street,
        city = reqrow$city,
        state = reqrow$state,
        lat = jsndf$results$geometry$location$lat,
        lon = jsndf$results$geometry$location$lng,
        loctype = jsndf$results$geometry$location_type
    )

    # populate the dataframe with the rest of the request data
    for (ix in seq_len(nrow(df))[-1]) {
        reqrow = df[ix, c("street", "city", "state")]
        qryurl = row_to_url(reqrow, key)
        jsndf = get_and_parse(qryurl)
        newdf = add_row(newdf, jsndf, reqrow)
    }

    return(newdf)
}
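
A possible variation on the loop above collects one data frame per request in a list and binds them once at the end, recording NA when a request fails instead of stopping. This is only a sketch: collect_rows and fake_fetch are hypothetical names, and fake_fetch simulates the API so the example runs offline.

```r
# Hypothetical variant: fetch is any function that takes a request row and
# returns a parsed response (or raises an error)
collect_rows <- function(df, fetch) {
    rows <- vector("list", nrow(df))
    for (ix in seq_len(nrow(df))) {
        reqrow <- df[ix, c("street", "city", "state")]

        # a failed request yields NULL rather than aborting the whole run
        parsed <- tryCatch(fetch(reqrow), error = function(e) NULL)
        ok <- !is.null(parsed) && identical(parsed$status, "OK")

        rows[[ix]] <- data.frame(
            street = reqrow$street,
            city = reqrow$city,
            state = reqrow$state,
            lat = if (ok) parsed$results$geometry$location$lat[1] else NA_real_,
            lon = if (ok) parsed$results$geometry$location$lng[1] else NA_real_,
            loctype = if (ok) parsed$results$geometry$location_type[1] else NA_character_,
            stringsAsFactors = FALSE
        )
    }
    do.call(rbind, rows)
}

# Simulated fetcher used in place of the real API for this sketch
fake_fetch <- function(row) {
    if (row$city == "Nowhere") stop("simulated network failure")
    list(
        status = "OK",
        results = list(geometry = list(
            location = list(lat = 1, lng = 2),
            location_type = "ROOFTOP"
        ))
    )
}

demo <- data.frame(
    street = c("1 Main St", "2 Side St"),
    city = c("Scranton", "Nowhere"),
    state = c("PA", "PA"),
    stringsAsFactors = FALSE
)

out <- collect_rows(demo, fake_fetch)
out
```

Against the real API, fetch could be function(row) get_and_parse(row_to_url(row, my_key)), reusing the functions defined above.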

Trying our functions on the top 10 rows of the request dataframe yields the desired results.

map_data(locs_df[1:10, ], my_key)
##                       street         city state      lat        lon
## 1      1000 S Washington Ave     Scranton    PA 41.39968  -75.67436
## 2               1420 5th Ave      Seattle    WA 47.61046 -122.33461
## 3             4900 N 27th St      Lincoln    NE 40.85979  -96.67740
## 4        3831 Bladensburg Rd Colmar Manor    MD 38.93476  -76.94978
## 5  1020 Port Washington Road      Grafton    WI 43.32333  -87.92469
## 6     39615 WASHINGTON STE H  PALM DESERT    CA 33.76225 -116.30135
## 7     8849 Villa La Jolla Dr     La Jolla    CA 32.86943 -117.23185
## 8        5705 Deerfield Blvd        Mason    OH 39.30727  -84.31667
## 9        249 Summit Park Dr.   Pittsburgh    PA 40.44956  -80.17794
## 10  10751 WEST OVERLAND ROAD        BOISE    ID 43.58948 -116.31624
##        loctype
## 1      ROOFTOP
## 2      ROOFTOP
## 3      ROOFTOP
## 4  APPROXIMATE
## 5      ROOFTOP
## 6      ROOFTOP
## 7      ROOFTOP
## 8      ROOFTOP
## 9      ROOFTOP
## 10     ROOFTOP
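
With the address fields as the join key, the new coordinates can be related back to the inputs. The sketch below uses base R merge on the first two rows shown in this lecture; the suffixes and the difference columns are illustrative choices, not part of the original analysis.

```r
# First two rows of the input data
inputs <- data.frame(
    id = c(110367, 164548),
    lat = c(41.39887, 47.60989),
    lon = c(-75.67384, -122.33487),
    street = c("1000 S Washington Ave", "1420 5th Ave"),
    city = c("Scranton", "Seattle"),
    state = c("PA", "WA"),
    stringsAsFactors = FALSE
)

# Matching rows of the Google Maps results
google <- data.frame(
    street = c("1000 S Washington Ave", "1420 5th Ave"),
    city = c("Scranton", "Seattle"),
    state = c("PA", "WA"),
    lat = c(41.39968, 47.61046),
    lon = c(-75.67436, -122.33461),
    loctype = c("ROOFTOP", "ROOFTOP"),
    stringsAsFactors = FALSE
)

# Join on the address fields; overlapping lat/lon columns get suffixes
joined <- merge(
    inputs, google,
    by = c("street", "city", "state"),
    suffixes = c("_input", "_google")
)

# Absolute differences in degrees between the two sources
joined$lat_diff <- abs(joined$lat_input - joined$lat_google)
joined$lon_diff <- abs(joined$lon_input - joined$lon_google)
joined
```

Repeating this join for each of the three datasets would give a per-address discrepancy against the Google coordinates.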

STEP 6.

At this point it is appropriate to make the full request and pull down the API data. With this code it takes about a minute and a half to pull 100 requests. We are bound by the speed of the network interactions here. Remember to save off the requested data before continuing with an analysis so that it is not lost.
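
The timing quoted above gives a rough way to plan a full pull. The sketch below simply scales the observed rate, assuming it stays constant; estimate_hours is a hypothetical helper name.

```r
# Figures quoted in the text: about 90 seconds for 100 requests
seconds_per_request <- 90 / 100

# Hypothetical helper: estimated run time in hours for n sequential requests
estimate_hours <- function(n_requests, secs_each = seconds_per_request) {
    n_requests * secs_each / 3600
}

# 30,000 addresses, as in the opening example
estimate_hours(30000)
```

At this sequential pace, 30,000 addresses would take roughly seven and a half hours, which is one more reason to save results as they come in.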

Summary

As we can see, using these web-based APIs requires a number of skills. We must be able to find APIs that meet our needs and determine their correct usage. We will need to do some string manipulation to form query URLs in order to make the requests. Lastly, we will need to transform the returned data to prep it for analysis.

Additional Resources