The acronym API stands for Application Programming Interface. An API is simply a way for the developers of a piece of software to make that software’s functionality available to others programmatically. For instance, the developers of Google Maps, Microsoft Office, and Twitter have created packages in various programming languages so that we can program against their software.
Of course, these applications already have interfaces that we know and love. The interfaces we generally interact with day to day are GUIs (Graphical User Interfaces). GUIs tend to be intuitive to use and nice to look at, but they are not suited to large workloads. Suppose, for instance, you wanted to geocode 30,000 addresses into latitude, longitude points. You could easily do this with a Google search for 5 to 10 points, but 30,000 would cause real distress. Fortunately, Google provides its Maps API, which will do this for you in about 5 minutes. Moreover, the first 2,500 geocodes are free, and 100,000 cost $50! As this example shows, the value of APIs is their ability to remove human interaction from software usage where it is unwanted or unneeded.
In this class we will mainly be concerned with APIs that grant access to data. In addition to Google’s Maps API, a quick Google search turns up APIs from Facebook, Yahoo, Tumblr, and many, many more. Private companies are not the only sources of API data access, either. The US Census Bureau has an API that grants us access to demographic info from the various surveys it conducts. Once one starts looking, it becomes apparent that everyone wants to share access to their data.
You will need to get a Google Maps API key if you want to run the code in this lecture (the one described under “Authentication for the standard API: API keys”).
library(jsonlite)
library(tidyverse)
# you should keep your API key a secret
my_key <- 'XXXXXXX' # copy and paste your API key here
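One common way to keep the key out of your scripts entirely (a sketch; the variable name `GOOGLE_MAPS_KEY` is our own choice, not anything Google requires) is to store it in an environment variable, for example in your `.Renviron` file, and read it at runtime:

```r
# a sketch: read the key from an environment variable so it never appears
# in the script or in version control; set GOOGLE_MAPS_KEY in .Renviron
# (or your shell) before starting R
my_key <- Sys.getenv("GOOGLE_MAPS_KEY", unset = "XXXXXXX")  # placeholder fallback
```

This way the same script can be shared or committed without ever exposing the secret.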
Alright, so everyone wants us to have their data, and we want to have it. Let’s go get that data and start making magic!
… of course there are details to trip us up.
So how does one solve a real world problem using APIs?
1. Unlike school, no one in the workaday world is going to tell us to go fix an easy issue with a known API. So when we are presented with a problem that requires data we do not possess, we must imagine the data that we want and check whether it has been made available to us. This, arguably, is the most important step in using APIs for analytics.
2. Once a potential API is discovered, read the documentation to determine whether the data we need is available. The particular fields matter, so we must check what they are. While we are determining what data is available, we should also look for a couple of key pieces of info.
3. Many APIs are provided only with authorization, even when access is free, so check what credentials are required.
4. When starting the actual programming, begin with 10 to 100 data points. This will allow us to practice making requests and to get a look at the file format and the data structure. We should expect text file formats in return; common formats include JSON and XML. The structure of the data may be highly nested and require some work upon receipt to “tidy” up.
5. Use the sample data to write a couple of small functions that turn the API’s data into a “tidy” data set. Working on the small data set for the programming tasks will allow you to solve problems in less time, review function output easily during development, and potentially save you from wasting valuable (and possibly limited) requests against the API.
6. Once step 5 is complete, make a full request against the API and receive the dataset. Use the functions produced in step 5 to parse the data set. Store the dataset in a useful format (CSV, JSON) in the project files before continuing with analysis. Keep your API request code and analysis code in separate files.
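The nesting mentioned above shows up even in tiny responses. Here is a sketch using jsonlite on a hand-written, hypothetical JSON snippet (not real API output) to see how nested objects become data frames within data frames:

```r
library(jsonlite)

# a toy response (hypothetical data) with the kind of nesting APIs return
txt <- '{"results":[{"name":"A","geometry":{"location":{"lat":41.4,"lng":-75.7}}}],"status":"OK"}'

parsed <- fromJSON(txt)
parsed$status                          # "OK"
parsed$results$geometry$location$lat   # 41.4 -- a data frame nested inside a data frame
```

Walking a small sample like this by hand is how we learn which fields to pull before writing any parsing functions.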
Suppose we are working for a company that has three datasets relating addresses to latitude, longitude coordinates, and we want to determine which is most accurate. What do we do? Geocoding.
The problem here is that there is no ground truth. How can we know which is THE most accurate if we do not know where the latitudes and longitudes are supposed to point?
…. if only we had data that told us this thing.
Is there an API that provides reliable or defensible address-to-lat, lon geocoding? Let’s use the Google Maps API.
The Google machine will tell you that the docs live here: https://developers.google.com/maps/documentation/. It looks like we have access to geocoding here: https://developers.google.com/maps/documentation/geocoding/start. A request is built from the base string
https://maps.googleapis.com/maps/api/geocode/json?
with the address parameter
address=AN+ADDRESS+OF+INTEREST
and our API key attached:
key=YOUR_API_KEY
It looks like the data will come back in JSON format.
Let’s get those API keys from Google.
Most of the work will be done here. First we need the set of addresses to investigate; these are found in the file “place_locs.csv”.
# Load dataset
locs_df = read_csv("place_locs.csv")
## Parsed with column specification:
## cols(
## id = col_integer(),
## lat = col_double(),
## lon = col_double(),
## street = col_character(),
## city = col_character(),
## state = col_character()
## )
head(locs_df)
## # A tibble: 6 × 6
## id lat lon street city state
## <int> <dbl> <dbl> <chr> <chr> <chr>
## 1 110367 41.39887 -75.67384 1000 S Washington Ave Scranton PA
## 2 164548 47.60989 -122.33487 1420 5th Ave Seattle WA
## 3 355416 40.85991 -96.68195 4900 N 27th St Lincoln NE
## 4 394769 38.93515 -76.95040 3831 Bladensburg Rd Colmar Manor MD
## 5 123365 43.31304 -87.92505 1020 Port Washington Road Grafton WI
## 6 361732 33.76289 -116.30069 39615 WASHINGTON STE H PALM DESERT CA
We are going to need to pull addresses from this df, convert them to URLs, request the data, and then transform it for use. We will address these in order.
# URL building function
row_to_url <- function(row, key) {
  # set the URL base
  basestring = "https://maps.googleapis.com/maps/api/geocode/json?"
  # replace white space with "+" in individual address fields
  street = gsub(" ", "+", row$street)
  city = gsub(" ", "+", row$city)
  state = gsub(" ", "+", row$state)
  # concatenate subfields of the address
  address = paste(street, city, state, sep=",+")
  address = paste0("address=", address)
  # create the URL string from the base and address parts
  urlstring = paste0(basestring, address)
  # add the key to the URL and return
  key = paste0("key=", key)
  urlwithkey = paste(urlstring, key, sep="&")
  return(urlwithkey)
}
Review to ensure proper behavior.
row_to_url(locs_df[1, ], my_key)
## [1] "https://maps.googleapis.com/maps/api/geocode/json?address=1000+S+Washington+Ave,+Scranton,+PA&key=XXXXXXX"
Now let’s request some sample data.
url_1 = row_to_url(locs_df[1, ], my_key)
url_1
## [1] "https://maps.googleapis.com/maps/api/geocode/json?address=1000+S+Washington+Ave,+Scranton,+PA&key=XXXXXXX"
We can use the readLines() function to retrieve data from the API by calling readLines(aURL).
dat = readLines(url_1)
dat
## [1] "{"
## [2] " \"results\" : ["
## [3] " {"
## [4] " \"address_components\" : ["
## [5] " {"
## [6] " \"long_name\" : \"1000\","
## [7] " \"short_name\" : \"1000\","
## [8] " \"types\" : [ \"street_number\" ]"
## [9] " },"
## [10] " {"
## [11] " \"long_name\" : \"South Washington Avenue\","
## [12] " \"short_name\" : \"S Washington Ave\","
## [13] " \"types\" : [ \"route\" ]"
## [14] " },"
## [15] " {"
## [16] " \"long_name\" : \"South Side\","
## [17] " \"short_name\" : \"South Side\","
## [18] " \"types\" : [ \"neighborhood\", \"political\" ]"
## [19] " },"
## [20] " {"
## [21] " \"long_name\" : \"Scranton\","
## [22] " \"short_name\" : \"Scranton\","
## [23] " \"types\" : [ \"locality\", \"political\" ]"
## [24] " },"
## [25] " {"
## [26] " \"long_name\" : \"Lackawanna County\","
## [27] " \"short_name\" : \"Lackawanna County\","
## [28] " \"types\" : [ \"administrative_area_level_2\", \"political\" ]"
## [29] " },"
## [30] " {"
## [31] " \"long_name\" : \"Pennsylvania\","
## [32] " \"short_name\" : \"PA\","
## [33] " \"types\" : [ \"administrative_area_level_1\", \"political\" ]"
## [34] " },"
## [35] " {"
## [36] " \"long_name\" : \"United States\","
## [37] " \"short_name\" : \"US\","
## [38] " \"types\" : [ \"country\", \"political\" ]"
## [39] " },"
## [40] " {"
## [41] " \"long_name\" : \"18505\","
## [42] " \"short_name\" : \"18505\","
## [43] " \"types\" : [ \"postal_code\" ]"
## [44] " }"
## [45] " ],"
## [46] " \"formatted_address\" : \"1000 S Washington Ave, Scranton, PA 18505, USA\","
## [47] " \"geometry\" : {"
## [48] " \"location\" : {"
## [49] " \"lat\" : 41.3996784,"
## [50] " \"lng\" : -75.674358"
## [51] " },"
## [52] " \"location_type\" : \"ROOFTOP\","
## [53] " \"viewport\" : {"
## [54] " \"northeast\" : {"
## [55] " \"lat\" : 41.4010273802915,"
## [56] " \"lng\" : -75.67300901970849"
## [57] " },"
## [58] " \"southwest\" : {"
## [59] " \"lat\" : 41.3983294197085,"
## [60] " \"lng\" : -75.67570698029149"
## [61] " }"
## [62] " }"
## [63] " },"
## [64] " \"place_id\" : \"ChIJoXle6iPfxIkR14xARxe5SvI\","
## [65] " \"types\" : [ \"street_address\" ]"
## [66] " }"
## [67] " ],"
## [68] " \"status\" : \"OK\""
## [69] "}"
The value dat is a character vector; we use the jsonlite package to convert it.
datdf = fromJSON(dat)
Looking at the returned data we see that it is, in fact, pretty heavily nested.
datdf$results
## address_components
## 1 1000, South Washington Avenue, South Side, Scranton, Lackawanna County, Pennsylvania, United States, 18505, 1000, S Washington Ave, South Side, Scranton, Lackawanna County, PA, US, 18505, street_number, route, neighborhood, political, locality, political, administrative_area_level_2, political, administrative_area_level_1, political, country, political, postal_code
## formatted_address geometry.location.lat
## 1 1000 S Washington Ave, Scranton, PA 18505, USA 41.39968
## geometry.location.lng geometry.location_type
## 1 -75.67436 ROOFTOP
## geometry.viewport.northeast.lat geometry.viewport.northeast.lng
## 1 41.40103 -75.67301
## geometry.viewport.southwest.lat geometry.viewport.southwest.lng
## 1 41.39833 -75.67571
## place_id types
## 1 ChIJoXle6iPfxIkR14xARxe5SvI street_address
The latitude can be accessed as follows.
datdf$results$geometry$location$lat
## [1] 41.39968
The longitude is accessed similarly.
datdf$results$geometry$location$lng
## [1] -75.67436
And there appears to be a useful location type field here.
datdf$results$geometry$location_type
## [1] "ROOFTOP"
Now we come to the task of transforming the data for use. For any returned piece of data we will need the lat, lon, and a join key to relate the new info to the inputs; let’s collect the location_type field as well. For a join key we will use the inputs themselves. Ultimately we would like a dataframe that has the join keys and the new info together for use with the initial data. Let’s write a couple of functions to do this for us.
The functions we need are:
# an outer function to control the main process
# receives the dataframe full of request info and the Google key
# returns a dataframe with the join keys and new data together in each row
map_data <- function(df, key)
# make the request, receive the JSON data, parse and transform to an R dataframe
get_and_parse <- function(urlstring)
# transform a single request's data into a useful format and append it to the dataframe df
# returns the extended dataframe
add_row <- function(df, jsndat, reqdat)
Now let’s implement the concepts stubbed out above.
get_and_parse <- function(urlstring) {
  dat = readLines(urlstring)
  datdf = fromJSON(dat)
  return(datdf)
}

add_row <- function(df, jsndat, reqdat) {
  # expects df to have columns [street, city, state, lat, lon, loctype]
  # reqdat is a row of data used for the request [street, city, state]
  resdf = data.frame(
    street = reqdat$street,
    city = reqdat$city,
    state = reqdat$state,
    lat = jsndat$results$geometry$location$lat,
    lon = jsndat$results$geometry$location$lng,
    loctype = jsndat$results$geometry$location_type
  )
  return(rbind(df, resdf))
}

map_data <- function(df, key) {
  # expects df to have columns [street, city, state]
  # initialize a dataframe with appropriate columns using the first row of
  # our request data
  reqrow = df[1, c("street", "city", "state")]
  qryurl = row_to_url(reqrow, key)
  jsndf = get_and_parse(qryurl)
  newdf = data.frame(
    street = reqrow$street,
    city = reqrow$city,
    state = reqrow$state,
    lat = jsndf$results$geometry$location$lat,
    lon = jsndf$results$geometry$location$lng,
    loctype = jsndf$results$geometry$location_type
  )
  # populate the dataframe with the rest of the request data
  for (ix in 2:nrow(df)) {
    reqrow = df[ix, c("street", "city", "state")]
    qryurl = row_to_url(reqrow, key)
    jsndf = get_and_parse(qryurl)
    newdf = add_row(newdf, jsndf, reqrow)
  }
  return(newdf)
}
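The functions above assume every request succeeds and returns exactly one result. In practice the response’s status field may report failures such as ZERO_RESULTS, and a query can match more than one place. Here is a sketch of a guard we might add; the helper name `first_location` and the exact failure handling (NA coordinates) are our own choices, not part of the API:

```r
# a sketch of a guard around a parsed response: check the status field and
# take only the first result, returning NA values when the geocode failed
first_location <- function(jsndat) {
  if (is.null(jsndat$status) || jsndat$status != "OK" ||
      length(jsndat$results) == 0) {
    return(data.frame(lat = NA_real_, lon = NA_real_, loctype = NA_character_))
  }
  geo <- jsndat$results$geometry
  data.frame(
    lat = geo$location$lat[1],
    lon = geo$location$lng[1],
    loctype = geo$location_type[1]
  )
}

# usage with a hand-built stand-in for a parsed response
fake <- list(status = "ZERO_RESULTS", results = list())
first_location(fake)   # lat and lon come back NA
```

With a guard like this, one bad address no longer aborts the whole batch; the failed rows can be filtered out or retried afterward.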
Trying our functions on the top 10 rows of the request dataframe yields the desired results.
map_data(locs_df[1:10, ], my_key)
## street city state lat lon
## 1 1000 S Washington Ave Scranton PA 41.39968 -75.67436
## 2 1420 5th Ave Seattle WA 47.61046 -122.33461
## 3 4900 N 27th St Lincoln NE 40.85979 -96.67740
## 4 3831 Bladensburg Rd Colmar Manor MD 38.93476 -76.94978
## 5 1020 Port Washington Road Grafton WI 43.32333 -87.92469
## 6 39615 WASHINGTON STE H PALM DESERT CA 33.76225 -116.30135
## 7 8849 Villa La Jolla Dr La Jolla CA 32.86943 -117.23185
## 8 5705 Deerfield Blvd Mason OH 39.30727 -84.31667
## 9 249 Summit Park Dr. Pittsburgh PA 40.44956 -80.17794
## 10 10751 WEST OVERLAND ROAD BOISE ID 43.58948 -116.31624
## loctype
## 1 ROOFTOP
## 2 ROOFTOP
## 3 ROOFTOP
## 4 APPROXIMATE
## 5 ROOFTOP
## 6 ROOFTOP
## 7 ROOFTOP
## 8 ROOFTOP
## 9 ROOFTOP
## 10 ROOFTOP
At this point it is appropriate to make the full request and pull down the API data. With this code it takes about a minute and a half to pull 100 requests; we are bound by the speed of the network interactions here. Remember to save off the requested data before continuing with an analysis so that it is not lost.
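Since the network dominates the runtime anyway, it also costs little to pause briefly between requests so a large batch stays under per-second rate limits. A sketch of one way to do this; the wrapper name and the 0.1-second delay are arbitrary choices of ours:

```r
# a sketch: wrap each retrieval with a short pause so a large batch of
# requests is spread out rather than fired as fast as possible
polite_readLines <- function(urlstring, delay = 0.1) {
  Sys.sleep(delay)                  # wait before each request
  readLines(urlstring, warn = FALSE)
}
```

Swapping this in for readLines inside get_and_parse would slow the batch only slightly while making throttling errors less likely.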
As we can see, using these web-based APIs requires a number of skills. We must be able to find APIs that meet our needs and determine their correct usage. We need some string manipulation to form query URLs for the requests. Lastly, we need to transform the returned data to prepare it for analysis.
R has a long list of packages for getting data from the internet.
Jenny Bryan’s lecture on getting data from the web is an excellent reference on APIs.
This Coursera course is also a good reference (though it is in python).
Here are some other APIs you might find interesting: