The references for these notes

Prerequisites:

  • the tidyverse library

  • download the bison dataset from data.world (you have to make a profile)

What you will learn

This lecture will go through workhorse functions in the dplyr package, which will make altering, subsetting and summarizing data much easier than it otherwise would be.

Our main textbook source is chapter 5 of R for Data Science

You will learn to:

  • Install and load the tidyverse suite of packages

  • Load csv files into data frames and tibbles with readr and base R functions

  • Subset columns using select

  • Rename columns with rename

  • Subset rows of data frames using filter and slice

  • Reorder rows with arrange

  • Create and alter variables with mutate and mutate_each

  • Bonus: Understand basics of date/time variable types

  • Use ‘pipe’ expressions to combine multiple operations

  • Alter data for grouped rows with group_by

  • Bonus: ‘As-the-crow-flies’ distances from coordinate data

  • Summarize and create subgroup summaries with summarise and summarise_each. Note the British spelling of summarise in the functions

Dataset

American Bison. Source: Defenders of Wildlife. Except for this bison, all photos in this lecture are of living things observed in Orange County, N.C., according to this dataset.

You can find many, many interesting datasets from U.S. governmental agencies through data.gov—including the U.S. Geological Survey’s Biodeversity Serving Our Nation (BISON) resource. Later in the class you will learn how to do even more with resources like BISON using its API.

Today we will work with a static dataset: Species occurance dataset for Orange County, N.C.

Like all information from BISON, it gives only basic information. Each row gives the location, reporting source, and species information for sightings of specific animals in Orange County.

Quoting the help page:

BISON provides access to georeferenced (those with latitude and longitude coordinates) and non-georeferenced data describing the occurrence or presence of terrestrial and aquatic species recorded or collected by a person (or instrument) at a specific time in the United States and its Territories .

We will have 65,499 observations on 37 variables.

Typically you would want to spend some time figuring out exactly what each variable in your dataset represents—with the help, ideally, of a detailed guide from the data source. Here the variables we are going to use are self-explanatory and many variables we will just ignore.

We will demonstrate dplyr functions using this dataset. The textbook chapter linked above has examples on a different dataset, and looking at both can help you get a better sense of how to use these tools.

What do you want to know?

This is always a good question to ask yourself before starting on a data science project. Here, we are just trying to demonstrate key dplyr functions, so the answer is less important.

To help keep us focused some basic questions:

  • What time period do these data represent?

  • What were the most commonly observed species in Orange County, as represented by this dataset?

  • What were the most common sources of reported sightings? Remember each row is a reported sighting of a particular species at a particular time and from a certain source.

  • How physically spread out were the sightings for particular species?

  • How spread out in time were the sightings for particular species?

You could think of others, but those questions will give some direction while trying to show off dplyr functionality.

Into the tidyverse

Hadley Wickham has created many fantastic R packages. The tidyverse package collects Wickham’s tools for R, including dplyr and also many more

If you have not yet installed tidyverse go ahead and do so

install.packages('tidyverse')

A note of warning about ‘non-standard evaluation’ to read later: Standard functions in the tidyverse suite allow you to enter variable names as arguments without quotations. You will see how that works, so skip ahead if you really need to know right now what that means. Understanding non-standard evaluation is helpful when you do more advanced programming tasks. We will come to that, if we have time, near the end of the course.

For now, just be aware that using unquoted variable names in the functions is only the standard for tidyverse functions, not for all functions in R.

First look at the data

Woodland vole. Source: Wikipedia.

We will focus here on the more common and most basic data loading task: loading a comma-separated (csv) file into a data frame.

As you’ve already seen, data frames will be our basic R object for storing data. The textbook talks about ‘tibbles,’ which are data frames with a few convenient modifications. You can read the ‘tibbles’ chapter in the book to see how the two differ. Here, we will not distinguish between data frames and ‘tibbles’ unless necessary.

The key is that:

  • tidyverse functions will work for data frames and tibbles and by will return tibble objects.

  • For most purposes the two work the same.

read_csv

We discussed setting working directories in the last lecture. The code in this lecture assumes the data set and R code are in the same directory.

library(tidyverse)
bison <- read_csv('bison_orangecounty.csv')

class(bison)

str(bison)
head(bison)
tail(bison)

Take a minute to glance at the information, checking out what variable types you have and getting a sense of what some of them look like.

read_csv has done some useful things automatically:

  • it did not attempt to convert character-class vectors into factors, which is the default for read.csv from base R.

  • it converted the eventDate variable into a date/time object. You will learn how to do this. Let’s re-load the dataset using read.csv so you can learn how to create date/time vectors on your own, later.

bison <- read.csv('bison_orangecounty.csv', stringsAsFactors = FALSE)

class(bison)

head(bison)

Notice this is now a data frame object, not a tibble, and you can see one of the minor differences: When printing parts of a tibble, you get only a neat and limited view. Printing data frames is much messier but more complete.

Convert to a tibble

Let’s convert our data frame to a tibble mainly to avoid monstrous print-outs in steps below.

bison <- as_tibble(bison)

read_csv vs. read.csv

The readr package and base R can read many table-type datasets. Let’s focus on csv files, but the basic principles below make sense for the more general read_table and read.table functions.

See chapter 11 of the textbook for more.

  • read_csv is faster so works better on larger datasets.

  • read_csv has some helpful defaults, such as not converting strings to factors, and warning messages.

  • read.csv has less going on, so you might run into errors with read_csv that you don’t get with the base R function. Sometimes starting with read.csv can be a way to see what the problem is.

  • read.csv returns a data frame, not a tibble, which you might want.

Type ?read_csv to learn more. See the textbook chapter 11.1 for some more points of comparison.

Subsetting columns

Davie Poplar, a tulip poplar more than 300 years old. Photo from the 1970s. Source: A View to Hugh blog, maintained by a Wilson Library photo archivist.

To return column a data frame as a vector

bison$eventDate

# or 

bison['eventDate']

But what if you want to return an entire collection of variables as a new data frame? The dplyr function select let’s you do that. For example you might want to remove columns from your data frame is because you only need a few variables. That’s the case here, where we have many columns that are not really information we are interested in. Given the questions we set down above, we reduce the number of columns in our dataset.

To select a particular variable or set of variables by name, type the unquoted variable name in the function after specifying the data frame you are drawing from.

select(bison, eventDate, year)
## # A tibble: 65,499 × 2
##            eventDate  year
##                <chr> <int>
## 1  2012-12-29T00:00Z  2012
## 2  2012-12-21T00:00Z  2012
## 3  2012-12-30T00:00Z  2012
## 4  2012-12-21T00:00Z  2012
## 5  2012-12-29T00:00Z  2012
## 6  2013-02-23T00:00Z  2013
## 7  2013-02-23T00:00Z  2013
## 8  2013-02-23T00:00Z  2013
## 9  2013-02-23T00:00Z  2013
## 10 2013-02-23T00:00Z  2013
## # ... with 65,489 more rows

To select everything but a specified group of variables, use the negative sign in front of their unquoted names.

select(bison, -eventDate, -year)
## # A tibble: 65,499 × 35
##      bisonID basisOfRecord catalogNumber collectorNumber recordedBy
##        <dbl>         <chr>         <chr>           <chr>      <chr>
## 1  844239256   observation  OBS174171744                 obsr143470
## 2  844267315   observation  OBS173343892                 obsr143470
## 3  846018158   observation  OBS174269666                 obsr143470
## 4  846916268   observation  OBS173343881                 obsr143470
## 5  846967932   observation  OBS174171742                 obsr143470
## 6  940794719   observation  OBS182987416                 obsr143470
## 7  940794958   observation  OBS182987420                 obsr143470
## 8  940795061   observation  OBS182987419                 obsr143470
## 9  940795118   observation  OBS182987418                 obsr143470
## 10 940795350   observation  OBS182987421                 obsr143470
## # ... with 65,489 more rows, and 30 more variables: providerID <int>,
## #   institutionCode <chr>, resourceID <chr>,
## #   ownerInstitutionCollectionCode <chr>, providedScientificName <chr>,
## #   scientificName <chr>, ITISscientificName <chr>,
## #   providedCommonName <chr>, ITIScommonName <chr>, ITIStsn <chr>,
## #   validAcceptedITIStsn <chr>, providedTSN <int>, decimalLatitude <dbl>,
## #   decimalLongitude <dbl>, geodeticDatum <chr>,
## #   coordinatePrecision <dbl>, coordinateUncertaintyInMeters <chr>,
## #   verbatimElevation <chr>, verbatimDepth <chr>, centroid <chr>,
## #   higherGeographyID <int>, computedCountyFips <int>,
## #   providedCounty <chr>, calculatedCounty <chr>, providedState <chr>,
## #   calculatedState <chr>, countryCode <chr>, institutionID <chr>,
## #   collectionID <chr>, relatedResourceID <lgl>

You can also select columns using text conditions, with certain special functions that work inside select and other dplyr functions that use select implicitly.

Here is a list of the ones I think you’ll use most often, and examples are below. To see a full list of these functions, type ?select_helpers

  • contains returns all columns containing a character pattern. For example, if you wanted all variables with the word ‘county’ you would type contains('county') in the select arguments.

  • starts_with returns columns beginning with a certain character pattern

  • ends_with returns those ending with a certain pattern

Those helpers are very useful to keep in mind if you have variable names falling into broad categories as we do here.

The code below shows how to extract variables whose names end with ‘name,’ end with ‘provided,’ or include the word ‘common.’

select(bison, ends_with("name"))
## # A tibble: 65,499 × 5
##                      providedScientificName           scientificName
##                                       <chr>                    <chr>
## 1                        Setophaga coronata       Setophaga coronata
## 2            Sialia sialis (Linnaeus, 1758)            Sialia sialis
## 3   Thryothorus ludovicianus (Latham, 1790) Thryothorus ludovicianus
## 4         Zenaida macroura (Linnaeus, 1758)         Zenaida macroura
## 5   Thryothorus ludovicianus (Latham, 1790) Thryothorus ludovicianus
## 6           Sitta carolinensis Latham, 1790       Sitta carolinensis
## 7  Pipilo erythrophthalmus (Linnaeus, 1758)  Pipilo erythrophthalmus
## 8                        Setophaga coronata       Setophaga coronata
## 9          Toxostoma rufum (Linnaeus, 1758)          Toxostoma rufum
## 10   Cardinalis cardinalis (Linnaeus, 1758)    Cardinalis cardinalis
## # ... with 65,489 more rows, and 3 more variables:
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>
select(bison, starts_with("provided"))
## # A tibble: 65,499 × 5
##                      providedScientificName      providedCommonName
##                                       <chr>                   <chr>
## 1                        Setophaga coronata   Yellow-rumped Warbler
## 2            Sialia sialis (Linnaeus, 1758)        Eastern Bluebird
## 3   Thryothorus ludovicianus (Latham, 1790)           Carolina Wren
## 4         Zenaida macroura (Linnaeus, 1758)           Mourning Dove
## 5   Thryothorus ludovicianus (Latham, 1790)           Carolina Wren
## 6           Sitta carolinensis Latham, 1790 White-breasted Nuthatch
## 7  Pipilo erythrophthalmus (Linnaeus, 1758)          Eastern Towhee
## 8                        Setophaga coronata   Yellow-rumped Warbler
## 9          Toxostoma rufum (Linnaeus, 1758)          Brown Thrasher
## 10   Cardinalis cardinalis (Linnaeus, 1758)       Northern Cardinal
## # ... with 65,489 more rows, and 3 more variables: providedTSN <int>,
## #   providedCounty <chr>, providedState <chr>
select(bison, contains("common"))
## # A tibble: 65,499 × 2
##         providedCommonName
##                      <chr>
## 1    Yellow-rumped Warbler
## 2         Eastern Bluebird
## 3            Carolina Wren
## 4            Mourning Dove
## 5            Carolina Wren
## 6  White-breasted Nuthatch
## 7           Eastern Towhee
## 8    Yellow-rumped Warbler
## 9           Brown Thrasher
## 10       Northern Cardinal
## # ... with 65,489 more rows, and 1 more variables: ITIScommonName <chr>

A final point: you can make the helper functions case-sensitive using the ignore.case argument within a helper function which is set to TRUE by default. Compare what you get here to the case-insensitive statement above.

select(bison, contains("Common", ignore.case = FALSE))
## # A tibble: 65,499 × 1
##         providedCommonName
##                      <chr>
## 1    Yellow-rumped Warbler
## 2         Eastern Bluebird
## 3            Carolina Wren
## 4            Mourning Dove
## 5            Carolina Wren
## 6  White-breasted Nuthatch
## 7           Eastern Towhee
## 8    Yellow-rumped Warbler
## 9           Brown Thrasher
## 10       Northern Cardinal
## # ... with 65,489 more rows

Trimming our dataset

Given our loose goals above, let’s trim the variables to include only those we might want to use. This shows you can combine all of the ways we saw to select columns.

According to the BISON help pages, observers providing the data give their own common and scientific names. BISON standardizes them using the Integrated Taxonomic Information System (ITIS), so those could be the only species names we keep.

You should think about that choice for a minute: We eventually will want to group observations by those species names, subsetting and summarizing information using them. That means we will want all rows with a certain name to represent the same thing. Standardized species names gives us that. Without it, our job as data analysts would be more difficult.

However, as you will see in the filtering section below, the standard species names have some blank values. The other names columns help us to see what is going on, so we keep them to assess data quality.

bison <- select(bison, bisonID, contains("name"), providerID, institutionCode, 
                decimalLatitude, decimalLongitude,
                eventDate, year)

Order matters: Variables will appear in the new data frame in the order we name them in select, from left to right.

Renaming variables

For convenience, let’s change some variable names. dplyr’s function rename does that. It wouldn’t be difficult to do without rename, but it does make the code cleaner. The syntax is rename(data frame, var1_new_name = var1_old_name, var2_new_name = var2_old_name, etc.) as shown below.

bison <- rename(bison, 
                latitude = decimalLatitude, longitude = decimalLongitude)

Subsetting rows

Yellow-crowned night-heron. Source: Audobon.

You have already seen how to select rows from a data frame using base R. dplyr gives you a variety of ways to subset rows of data easily and to do so based on logical conditions.

For example, if you wanted to extract the 51st through 60th rows of the data frame, you can do so using slice. See the comparison with base R functionality.

slice(bison, 51:60)
## # A tibble: 10 × 12
##      bisonID                  providedScientificName
##        <dbl>                                   <chr>
## 1  933972761     Baeolophus bicolor (Linnaeus, 1766)
## 2  933980181         Spinus tristis (Linnaeus, 1758)
## 3  933980183        Icterus galbula (Linnaeus, 1758)
## 4  933980192       Corvus brachyrhynchos Brehm, 1822
## 5  933980502 Thryothorus ludovicianus (Latham, 1790)
## 6  933986851        Toxostoma rufum (Linnaeus, 1758)
## 7  933987932    Poecile carolinensis (Audubon, 1834)
## 8  933992282   Melanerpes carolinus (Linnaeus, 1758)
## 9  933992451    Poecile carolinensis (Audubon, 1834)
## 10 933992455         Sitta canadensis Linnaeus, 1766
## # ... with 10 more variables: scientificName <chr>,
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## #   latitude <dbl>, longitude <dbl>, eventDate <chr>, year <int>
bison[51:60, ]
## # A tibble: 10 × 12
##      bisonID                  providedScientificName
##        <dbl>                                   <chr>
## 1  933972761     Baeolophus bicolor (Linnaeus, 1766)
## 2  933980181         Spinus tristis (Linnaeus, 1758)
## 3  933980183        Icterus galbula (Linnaeus, 1758)
## 4  933980192       Corvus brachyrhynchos Brehm, 1822
## 5  933980502 Thryothorus ludovicianus (Latham, 1790)
## 6  933986851        Toxostoma rufum (Linnaeus, 1758)
## 7  933987932    Poecile carolinensis (Audubon, 1834)
## 8  933992282   Melanerpes carolinus (Linnaeus, 1758)
## 9  933992451    Poecile carolinensis (Audubon, 1834)
## 10 933992455         Sitta canadensis Linnaeus, 1766
## # ... with 10 more variables: scientificName <chr>,
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## #   latitude <dbl>, longitude <dbl>, eventDate <chr>, year <int>

which does the same thing as

bison[51:60, ]

You can also grab the first, last or nth element of a vector with functions of that name.

first(bison$bisonID)
## [1] 844239256
last(bison$bisonID)
## [1] 1219788601
nth(bison$bisonID, n = 51)
## [1] 933972761

Conditional subsetting with filter

Carolina chickadee. Source: Cornell Lab of Ornithology.

Much more useful is the ability to subset your data frame not by row indices, which often are meaningless, but by specific logical conditions.

Each argument in filter, separated by commas, should evaluate to a logical vector. filter will return the subset of the data frame for which all statements are TRUE.

Let’s say we wanted to select only those observations marking a sighting of carolina chickadee, which you can find year-round here.

filter(bison, ITIScommonName == "Carolina Chickadee")
## # A tibble: 1,478 × 12
##      bisonID               providedScientificName       scientificName
##        <dbl>                                <chr>                <chr>
## 1  940795743 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 2  943427969 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 3  933987932 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 4  933992451 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 5  936922013 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 6  971994869 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 7  972287684 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 8  973919771 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 9  973929092 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 10 974747631 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## # ... with 1,468 more rows, and 9 more variables:
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## #   latitude <dbl>, longitude <dbl>, eventDate <chr>, year <int>

Let’s see an example with a numerical criterion, returning Carolina Chickadee sightings since 2013.

filter(bison, ITIScommonName == "Carolina Chickadee", year >= 2013)
## # A tibble: 318 × 12
##      bisonID               providedScientificName       scientificName
##        <dbl>                                <chr>                <chr>
## 1  940795743 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 2  943427969 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 3  933992451 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 4  936922013 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 5  971994869 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 6  972287684 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 7  973919771 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 8  973929092 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 9  974747631 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 10 975309669 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## # ... with 308 more rows, and 9 more variables: ITISscientificName <chr>,
## #   providedCommonName <chr>, ITIScommonName <chr>, providerID <int>,
## #   institutionCode <chr>, latitude <dbl>, longitude <dbl>,
## #   eventDate <chr>, year <int>

There are three basic logical operations in R

  • & AND
  • | OR (inclusive)
  • ! NOT

You can have multiple conditions, combining them using the various logical operations you already learned. In filter, the comma acts as the logical AND.

Here is a quick visual guide to logical operations from the textbook.

Logical and set operations. Source: R for Data Science chapter 5.

A way to explore the data, spot-check quality

With large datasets, you will not be able to just scroll through all the data to catch errors. You will need to develop ways to spot possible errors as you explore and tweak the data. Conditional subsetting with filter is a great way to look at a limited part of the data that might contain errors.

Below we filter for all records whose common name is Carolina Chickadee but whose scientific name is not the correct one. That would suggest a data quality issue.

Here, since we used the standardized common and scientific names, we have no problems.

filter(bison, ITIScommonName == "Carolina Chickadee", ITISscientificName != "Poecile carolinensis")
## # A tibble: 0 × 12
## # ... with 12 variables: bisonID <dbl>, providedScientificName <chr>,
## #   scientificName <chr>, ITISscientificName <chr>,
## #   providedCommonName <chr>, ITIScommonName <chr>, providerID <int>,
## #   institutionCode <chr>, latitude <dbl>, longitude <dbl>,
## #   eventDate <chr>, year <int>

Of course, that’s only one particular check on a particular species. We can use the summarize functions below to do more systematic checks.

One systematic check we can do with filter is to see whether there are any missing values in the scientific names or common names columns.

filter(bison, is.na(ITISscientificName) | is.na(ITIScommonName))
## # A tibble: 0 × 12
## # ... with 12 variables: bisonID <dbl>, providedScientificName <chr>,
## #   scientificName <chr>, ITISscientificName <chr>,
## #   providedCommonName <chr>, ITIScommonName <chr>, providerID <int>,
## #   institutionCode <chr>, latitude <dbl>, longitude <dbl>,
## #   eventDate <chr>, year <int>

Great! No problems there, at least. But now look:

filter(bison, ITISscientificName == "" | ITIScommonName == "")
## # A tibble: 2,059 × 12
##       bisonID                                       providedScientificName
##         <dbl>                                                        <chr>
## 1  1219787010                                        Hylodesmum nudiflorum
## 2  1219786998                                         Botrypus virginianus
## 3  1219787016 Fraxinus [americana + biltmoreana + pennsylvanica + smallii]
## 4  1219787044                                Smilax bona-nox var. bona-nox
## 5  1219991828                 Fraxinus [americana + biltmoreana + smallii]
## 6  1219991832                         Hamamelis virginiana var. virginiana
## 7  1220111105                                 Celtis [laevigata + smallii]
## 8  1220068949                                                      Unknown
## 9  1219787903                    Muscadinia rotundifolia var. rotundifolia
## 10 1219787839                                         Endodeca serpentaria
## # ... with 2,049 more rows, and 10 more variables: scientificName <chr>,
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## #   latitude <dbl>, longitude <dbl>, eventDate <chr>, year <int>

You have just used filter to spot and investigate a data quality issue.

This is why we kept multiple names variables, aside from just the one we ideally would use—the standardized (ITIS) species names.

If we were doing some actual analysis with this dataset, you would think about how to resolve the problem of missing standardized species names. This is where content knowledge or extra research and documentation would be very helpful.

For now, we’ll just note the problem and move on.

When to use filter

As we saw above, filter is very useful for exploring the dataset, particularly when looking for data quality problems. Large datasets can be unwieldy, and this gives a way to inspect visually some problematic sections.

Otherwise, this function is best used in combination with different functions as a way of preparing to do something else. It didn’t look like much above, but you will see many examples throughout the course where filter plays an important role.

Common times filter is used include before making a graph, to plot just a particular subset of the data, or to isolate certain sections of the data for use in further analyses.

Reordering with arrange

Eastern garter snake. Source: Wikipedia

When to use arrange

Sometimes it is useful to re-order the rows of your dataset in a particular way. Common reasons you want to do that include:

  • Your data has a natural order such as a time order and you want to use that order.

For example, say you have a dataset with U.S. gross domestic product by year, and you want to calculate a new variable with the year-over-year change in GDP. You want subsequent years to follow each other, so you need to arrange first before using some of the variable creation methods demonstrated below.

  • You want to present a particular order, for example to produce a table with the top 10 observations by some measure.

Ordering by numeric variables

Let’s arrange our data frame rows by year. To show you it worked, we used select to pick out a few variables to be printed.

select(arrange(bison, year), year, ITISscientificName)
## # A tibble: 65,499 × 2
##     year      ITISscientificName
##    <int>                   <chr>
## 1   1700         Lepomis auritus
## 2   1700        Lepomis gibbosus
## 3   1700 Notemigonus crysoleucas
## 4   1700   Cyprinella analostana
## 5   1700      Notropis scepticus
## 6   1700         Notropis procne
## 7   1822                        
## 8   1845                        
## 9   1846                        
## 10  1846                        
## # ... with 65,489 more rows

Use the helper function desc to arrange in descending rather than ascending order.

select(arrange(bison, desc(year)), year, ITISscientificName)
## # A tibble: 65,499 × 2
##     year       ITISscientificName
##    <int>                    <chr>
## 1   2015      Arisaema triphyllum
## 2   2015       Onoclea sensibilis
## 3   2015            Viola striata
## 4   2015       Solidago pinetorum
## 5   2015       Glechoma hederacea
## 6   2015      Fragaria virginiana
## 7   2015 Scutellaria integrifolia
## 8   2015       Onoclea sensibilis
## 9   2015                         
## 10  2015       Goodyera pubescens
## # ... with 65,489 more rows

Ordering by non-numeric variables

This works for characters as well, but you should make note of how the sorting handles blanks and NA values. See the tips section below.

select(arrange(bison, ITISscientificName), year, ITISscientificName)
## # A tibble: 65,499 × 2
##     year ITISscientificName
##    <int>              <chr>
## 1   1977                   
## 2   1977                   
## 3   1977                   
## 4   1977                   
## 5   2002                   
## 6   2002                   
## 7   2008                   
## 8   1992                   
## 9   1977                   
## 10  1977                   
## # ... with 65,489 more rows
select(arrange(bison, desc(ITISscientificName)), year, ITISscientificName)
## # A tibble: 65,499 × 2
##     year     ITISscientificName
##    <int>                  <chr>
## 1   2012 Zonotrichia leucophrys
## 2   2009 Zonotrichia leucophrys
## 3   2010 Zonotrichia leucophrys
## 4   2012 Zonotrichia leucophrys
## 5   2013 Zonotrichia leucophrys
## 6   2012 Zonotrichia leucophrys
## 7   2012 Zonotrichia leucophrys
## 8   2013 Zonotrichia leucophrys
## 9   2011 Zonotrichia leucophrys
## 10  2011 Zonotrichia leucophrys
## # ... with 65,489 more rows

Aside: Create a tibble (or data frame) from scratch

To demonstrate the rules arrange uses for character vectors we can also demonstrate how to create a tibble or data frame from scratch.

See chapter 10 of R for Data Science for more detailed explanations of this process.

Pay attention to what variable is being used to arrange here.

test <- tibble(name_lower = c("a", "q", "s", "z", "", NA, "ab", "aq", "!", "?"), 
               name_upper = c("Q", NA, "A", "", "B", "Z", "AB", "AQ", "!", "?"),
               name_all = c("A", NA, "a", "", "z", "Z", "Az", "AZ", "!?", "!!"))

arrange(test, name_lower)
## # A tibble: 10 × 3
##    name_lower name_upper name_all
##         <chr>      <chr>    <chr>
## 1                      B        z
## 2           !          !       !?
## 3           ?          ?       !!
## 4           a          Q        A
## 5          ab         AB       Az
## 6          aq         AQ       AZ
## 7           q       <NA>     <NA>
## 8           s          A        a
## 9           z                    
## 10       <NA>          Z        Z
arrange(test, desc(name_lower))
## # A tibble: 10 × 3
##    name_lower name_upper name_all
##         <chr>      <chr>    <chr>
## 1           z                    
## 2           s          A        a
## 3           q       <NA>     <NA>
## 4          aq         AQ       AZ
## 5          ab         AB       Az
## 6           a          Q        A
## 7           ?          ?       !!
## 8           !          !       !?
## 9                      B        z
## 10       <NA>          Z        Z
arrange(test, name_upper)
## # A tibble: 10 × 3
##    name_lower name_upper name_all
##         <chr>      <chr>    <chr>
## 1           z                    
## 2           !          !       !?
## 3           ?          ?       !!
## 4           s          A        a
## 5          ab         AB       Az
## 6          aq         AQ       AZ
## 7                      B        z
## 8           a          Q        A
## 9        <NA>          Z        Z
## 10          q       <NA>     <NA>
arrange(test, desc(name_upper))
## # A tibble: 10 × 3
##    name_lower name_upper name_all
##         <chr>      <chr>    <chr>
## 1        <NA>          Z        Z
## 2           a          Q        A
## 3                      B        z
## 4          aq         AQ       AZ
## 5          ab         AB       Az
## 6           s          A        a
## 7           ?          ?       !!
## 8           !          !       !?
## 9           z                    
## 10          q       <NA>     <NA>
arrange(test, name_all)
## # A tibble: 10 × 3
##    name_lower name_upper name_all
##         <chr>      <chr>    <chr>
## 1           z                    
## 2           ?          ?       !!
## 3           !          !       !?
## 4           s          A        a
## 5           a          Q        A
## 6          ab         AB       Az
## 7          aq         AQ       AZ
## 8                      B        z
## 9        <NA>          Z        Z
## 10          q       <NA>     <NA>
arrange(test, desc(name_all))
## # A tibble: 10 × 3
##    name_lower name_upper name_all
##         <chr>      <chr>    <chr>
## 1        <NA>          Z        Z
## 2                      B        z
## 3          aq         AQ       AZ
## 4          ab         AB       Az
## 5           a          Q        A
## 6           s          A        a
## 7           !          !       !?
## 8           ?          ?       !!
## 9           z                    
## 10          q       <NA>     <NA>

Those demos lead to a few handy tips.

Ordering rules for arrange

Some things to keep in mind when using arrange:

  • NA values are placed at the end even when sorting in descending order. That holds for both numeric and character vectors.

  • Blank characters are “smaller” than any other character.

  • Be careful with punctuation ordering. Experiment. But really, you shouldn’t be trying to order by punctuation because… what does that even mean?

  • Letters appearing later in the alphabet are “bigger”. When using increasing order, “a” appears before “z” for example.

  • Capital letters are “bigger” than lower-case letters. So “A” appears before “a” when using descending order.

  • Ordering begins with the first letter then proceeds letter-by-letter. “ab” therefore is “smaller” than “az” and will appear before it when using ascending order.

Using multiple criteria.

You can arrange using multiple criteria, as well, where the order in which you list variables in arrange determines on which variables you sort first.

select(arrange(bison, desc(year), desc(ITISscientificName)), year, ITISscientificName)
## # A tibble: 65,499 × 2
##     year   ITISscientificName
##    <int>                <chr>
## 1   2015  Woodwardia areolata
## 2   2015        Viola striata
## 3   2015        Viola striata
## 4   2015        Viola sororia
## 5   2015        Viola bicolor
## 6   2015 Veronica hederifolia
## 7   2015  Verbascum blattaria
## 8   2015   Vaccinium tenellum
## 9   2015   Turdus migratorius
## 10  2015 Triodanis perfoliata
## # ... with 65,489 more rows
select(arrange(bison, desc(ITISscientificName), desc(year)), year, ITISscientificName)
## # A tibble: 65,499 × 2
##     year     ITISscientificName
##    <int>                  <chr>
## 1   2013 Zonotrichia leucophrys
## 2   2013 Zonotrichia leucophrys
## 3   2013 Zonotrichia leucophrys
## 4   2013 Zonotrichia leucophrys
## 5   2013 Zonotrichia leucophrys
## 6   2012 Zonotrichia leucophrys
## 7   2012 Zonotrichia leucophrys
## 8   2012 Zonotrichia leucophrys
## 9   2012 Zonotrichia leucophrys
## 10  2012 Zonotrichia leucophrys
## # ... with 65,489 more rows

mutate: alter and create variables

Common buckeye butterfly. Source: iNaturalist

When to use mutate, mutate_each

You’ll want to use mutate just about any time you want to create a new variable in your data frame or alter and overwrite an existing variable.

If you plan to use the same function on multiple variables, use mutate_each.

Altering variables

The syntax for mutate is \[\text{data_frame_2 <- mutate(data_frame_1, unquoted_variable_name = something)} \]

Syntax for mutate_each is \[\text{data_frame_2 <- mutate(data_frame_1, funs(function1_name, function2_name), unquoted variable name or helper function as in } select)\]

mutate_each will apply each function in the funs brackets to each of the columns selected. If no columns are selected, it applies each function to all. You can make custom functions to use with mutate_each, but we will only see that later in the class.

Often, data frame one and two in the statement above will be the same: You are overwriting your existing data frame and adding or altering a variable in it.

Some things we want to do using mutate:

  • Convert latitude and longitude (currently in degrees) to radians, to calculate “as-the-crow-flies” distances between points using the haversine formula. See the grouping section below for how we apply that formula.

  • Make the eventDate variable a date/time type object. See the bonus section below for more on dates an times. Otherwise you don’t need to worry yet about what we did, other than to recognize the basic syntax for mutate.

Some things we want to do using mutate_each:

  • Change all species name variables to lower-case, making them easier to use for all sorts of purposes.
bison <- mutate(bison, 
                latitude_rad = pi * latitude / 180,
                longitude_rad = pi * longitude / 180,
                eventDate = parse_datetime(eventDate))

bison <- mutate_each(bison, funs(tolower), contains("name"))

select(bison, contains("tude"), eventDate)
## # A tibble: 65,499 × 5
##    latitude longitude latitude_rad longitude_rad  eventDate
##       <dbl>     <dbl>        <dbl>         <dbl>     <dttm>
## 1   35.9283   -79.084    0.6270671     -1.380276 2012-12-29
## 2   35.9283   -79.084    0.6270671     -1.380276 2012-12-21
## 3   35.9283   -79.084    0.6270671     -1.380276 2012-12-30
## 4   35.9283   -79.084    0.6270671     -1.380276 2012-12-21
## 5   35.9283   -79.084    0.6270671     -1.380276 2012-12-29
## 6   35.9283   -79.084    0.6270671     -1.380276 2013-02-23
## 7   35.9283   -79.084    0.6270671     -1.380276 2013-02-23
## 8   35.9283   -79.084    0.6270671     -1.380276 2013-02-23
## 9   35.9283   -79.084    0.6270671     -1.380276 2013-02-23
## 10  35.9283   -79.084    0.6270671     -1.380276 2013-02-23
## # ... with 65,489 more rows
select(bison, contains("name"))
## # A tibble: 65,499 × 5
##                      providedScientificName           scientificName
##                                       <chr>                    <chr>
## 1                        setophaga coronata       setophaga coronata
## 2            sialia sialis (linnaeus, 1758)            sialia sialis
## 3   thryothorus ludovicianus (latham, 1790) thryothorus ludovicianus
## 4         zenaida macroura (linnaeus, 1758)         zenaida macroura
## 5   thryothorus ludovicianus (latham, 1790) thryothorus ludovicianus
## 6           sitta carolinensis latham, 1790       sitta carolinensis
## 7  pipilo erythrophthalmus (linnaeus, 1758)  pipilo erythrophthalmus
## 8                        setophaga coronata       setophaga coronata
## 9          toxostoma rufum (linnaeus, 1758)          toxostoma rufum
## 10   cardinalis cardinalis (linnaeus, 1758)    cardinalis cardinalis
## # ... with 65,489 more rows, and 3 more variables:
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>

Creating new variables

The syntax is exactly the same, except that to create a new variable use a new, unquoted variable name on the left-hand side of the equality statement.

This is a just a dummy example.

select(mutate(bison, double_year = year * 2), year, double_year)
## # A tibble: 65,499 × 2
##     year double_year
##    <int>       <dbl>
## 1   2012        4024
## 2   2012        4024
## 3   2012        4024
## 4   2012        4024
## 5   2012        4024
## 6   2013        4026
## 7   2013        4026
## 8   2013        4026
## 9   2013        4026
## 10  2013        4026
## # ... with 65,489 more rows

As a quick fix to our missing scientific name problem, let’s use mutate to create a new scientific name variable using the ITIS names when available and the alternate scientific name variable when not.

We are creating an alternate variable instead of over-writing the current one so we can investigate the scope of the problem later.

bison <- mutate(bison, 
                scientificName_complete = ifelse(ITISscientificName == "", scientificName, ITISscientificName))

filter(bison, scientificName_complete == "" | is.na(scientificName_complete))
## # A tibble: 0 × 15
## # ... with 15 variables: bisonID <dbl>, providedScientificName <chr>,
## #   scientificName <chr>, ITISscientificName <chr>,
## #   providedCommonName <chr>, ITIScommonName <chr>, providerID <int>,
## #   institutionCode <chr>, latitude <dbl>, longitude <dbl>,
## #   eventDate <dttm>, year <int>, latitude_rad <dbl>, longitude_rad <dbl>,
## #   scientificName_complete <chr>

That’s all we’ll say for now.

Throughout the course and the rest of this lecture, you will see many more applications of mutate`

Pipes to keep your code clean

Northern hog sucker. Source: Wikipedia

The ‘pipe’ operation is a handy tool to make your code more legible: %>%. Key points:

  • It takes the output of your previous operation and uses it as an input to your next operation.

  • You can determine where the previous argument goes with the period symbol, . , which acts as a placeholder.

  • Understand how to use it by replacing the pipe operation with ‘then’ (in your mind, not in the code). For example, filter(data, ...) %>% select(...) filters first then selects columns from the output of filter.

When to use a pipe

Often, you will be doing multiple things at a time to your data set and the intermediate steps don’t matter.

In R, for reasons we won’t talk about, you want to avoid creating lots of unnecessary objects in your environment. Using pipes help you do that.

Finally, pipes make your code more legible. For example

c(1, 2, 3, 4) ^ 2 %>% 
    sum %>% 
    sqrt %>% 
    log %>% 
    sin
## [1] 0.9915875

is more legible than

sin(log(sqrt(sum(c(1, 2, 3, 4) ^ 2))))
## [1] 0.9915875

and it only becomes worse when using functions with lots of arguments.

From now on, we will use pipes whenever convenient. Check out the textbook and experiment with pipes to learn their uses and limitations.

Grouping

Yellow crownbeard. Source: virginiawildflowers.org

dplyr function group_by is a helper function for a number of other tools in the package. It essentially creates hidden links between observations in your dataset based on variables you determine.

The group identifications themselves are not variables but instead allow you to do operations constrained to each group separately in the mutate, summarise and filter functions.

That’s abstract, so let’s move to some examples with mutate.

When to use group_by

You will always use group_by first, before running mutate, summarise or filter and their related functions.

Some common situations for when you will want to have grouping variables:

  • With filter: Filter for characteristics by subgroup, such as the best or worst observation by some measure within each group. That returns all observations meeting the criteria within their groups.

You can also filter based on conditions related to the group as a whole, such as the number of observations in the group. That returns a dataset with all observations in a group meeting the criterion.

  • With mutate: Create or alter variables in a way that depends on group-level characteristics, for example calculating the difference between a particular observation’s value and average for that observation’s group. That’s vague, so see the examples in the book and below.

  • With summarise: Create group-level summaries, such as the total number of observations by species in the BISON dataset. The result is a data frame (actually a tibble) with the summary statistics for each group.

We will do ungrouped and grouped summaries all together in the section on summarise.

A WARNING: Grouped operations for filter and mutate can become tricky, so you should be extra careful to be sure what you tried to do is in fact what you did.

So-called window functions work best for grouped operations with mutate and filter, and summary functions—such as sum or mean—are what you want for summarise.

Window functions

As this demo on window functions says,

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and return a single value, a window function returns n values.

The output of a window function depends on all its input values, so window functions don’t include functions that work element-wise, like + or round().

Some useful ‘window’ functions that are appropriate for the group_by plus mutate or filter combination:

  • ranking: such as min_rank and percent_rank and cume_dist

  • Cumulative: such as cumsum

  • Time series: such as lead and lag functions that offset the input vector by negative one and one positions, respectively. Be sure that whatever order your vector is in when using those actually has meaning, such as a time-dependent order.

Learn more about those functions and check out some examples at the window functions demo link above.

Using group_by with filter

Let’s filter our dataset to include only the species that appear at least 200 times in the Orange County biodiversity dataset.

To do so, we’ll use the helper function n() which just counts the number of observations within a group. It only works inside other dplyr functions.

group_by(bison, scientificName_complete) %>%
  filter(n() > 200)
## Source: local data frame [39,150 x 15]
## Groups: scientificName_complete [73]
## 
##      bisonID                   providedScientificName
##        <dbl>                                    <chr>
## 1  844239256                       setophaga coronata
## 2  844267315           sialia sialis (linnaeus, 1758)
## 3  846018158  thryothorus ludovicianus (latham, 1790)
## 4  846916268        zenaida macroura (linnaeus, 1758)
## 5  846967932  thryothorus ludovicianus (latham, 1790)
## 6  940794719          sitta carolinensis latham, 1790
## 7  940794958 pipilo erythrophthalmus (linnaeus, 1758)
## 8  940795061                       setophaga coronata
## 9  940795118         toxostoma rufum (linnaeus, 1758)
## 10 940795350   cardinalis cardinalis (linnaeus, 1758)
## # ... with 39,140 more rows, and 13 more variables: scientificName <chr>,
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## #   latitude <dbl>, longitude <dbl>, eventDate <dttm>, year <int>,
## #   latitude_rad <dbl>, longitude_rad <dbl>, scientificName_complete <chr>

You can combine the grouped filter with ungrouped filtering. But this is where you have to think carefully.

Here we’ll filter for the species seen at least 3 times in the entire dataset, but we will only want the observations of those species that appear in 2015.

group_by(bison, scientificName_complete) %>%
  filter(n() > 3, year == 2015)
## Source: local data frame [125 x 15]
## Groups: scientificName_complete [90]
## 
##       bisonID                     providedScientificName
##         <dbl>                                      <chr>
## 1  1092883743            arisaema triphyllum (l.) schott
## 2  1092883702                      onoclea sensibilis l.
## 3  1143531724                   solidago pinetorum small
## 4  1088924502                       glecoma hederacea l.
## 5  1088947111                  fragaria virginiana mill.
## 6  1143531759                scutellaria integrifolia l.
## 7  1092895216                      onoclea sensibilis l.
## 8  1134968123          goodyera pubescens (willd.) r.br.
## 9  1092905293                         clematis viorna l.
## 10 1088924837 erythronium umbilicatum c.r.parks & hardin
## # ... with 115 more rows, and 13 more variables: scientificName <chr>,
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## #   latitude <dbl>, longitude <dbl>, eventDate <dttm>, year <int>,
## #   latitude_rad <dbl>, longitude_rad <dbl>, scientificName_complete <chr>

Here we’ll filter for the species seen at least 3 times in 2015. To do so we need to group by year as well as scientific name.

group_by(bison, scientificName_complete, year) %>%
  filter(n() > 3, year == 2015)
## Source: local data frame [8 x 15]
## Groups: scientificName_complete, year [2]
## 
##      bisonID                                providedScientificName
##        <dbl>                                                 <chr>
## 1 1088951040                                claytonia virginica l.
## 2 1088924555                                claytonia virginica l.
## 3 1088924584                                claytonia virginica l.
## 4 1088926416                                claytonia virginica l.
## 5 1088924185 asplenium platyneuron (l.) britton, sterns & poggenb.
## 6 1088924277 asplenium platyneuron (l.) britton, sterns & poggenb.
## 7 1088898166 asplenium platyneuron (l.) britton, sterns & poggenb.
## 8 1088957895 asplenium platyneuron (l.) britton, sterns & poggenb.
## # ... with 13 more variables: scientificName <chr>,
## #   ITISscientificName <chr>, providedCommonName <chr>,
## #   ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## #   latitude <dbl>, longitude <dbl>, eventDate <dttm>, year <int>,
## #   latitude_rad <dbl>, longitude_rad <dbl>, scientificName_complete <chr>

Notice that in each case we get back all observations in the groups meeting the criteria.

Using group_by with mutate

Ebony spleenwort, one of only two species seen at least three times in 2015 in the Orange County BISON data. Source: Stephen F. Austin State University

We’ll see a few examples to give you a flavor for what you can do.

Unfortunately, this dataset isn’t great for demonstrating grouped mutate functions. See the textbook or the window function vignette above for more examples.

Let’s give it a shot though, with a contrived example. A more interesting and more advanced application of this kind of thing is in the bonus section on as-the-crow-flies distances.

We will:

  • group by organization providing a species sighting using provider ID and by year.

  • calculate the time difference in weeks between the first and final observations of the year for that organization.

Notice that you can also use pipes within mutate.

group_by(bison, providerID, year) %>%
  mutate(d_time = as.numeric(max(eventDate) - min(eventDate), "weeks")) %>%
  select(year, providerID, d_time)
## Source: local data frame [65,499 x 3]
## Groups: providerID, year [520]
## 
##     year providerID   d_time
##    <int>      <int>    <dbl>
## 1   2012        602 52.14286
## 2   2012        602 52.14286
## 3   2012        602 52.14286
## 4   2012        602 52.14286
## 5   2012        602 52.14286
## 6   2013        602 52.00000
## 7   2013        602 52.00000
## 8   2013        602 52.00000
## 9   2013        602 52.00000
## 10  2013        602 52.00000
## # ... with 65,489 more rows

Now do something similar but for species by year, but also computing the sightings per week for each species, each year, for the span over which they were observed.

Notice that mutate allows you to use newly created variables in subsequent statements. That is common throughout the dplyr functions.

group_by(bison, scientificName_complete, year) %>%
  mutate(d_time = as.numeric(max(eventDate) - min(eventDate), "weeks"),
         avg_obs = n() / d_time) %>%
  select(year, scientificName_complete, ITIScommonName, d_time, avg_obs)
## Source: local data frame [65,499 x 5]
## Groups: scientificName_complete, year [9,159]
## 
##     year  scientificName_complete
##    <int>                    <chr>
## 1   2012       setophaga coronata
## 2   2012            sialia sialis
## 3   2012 thryothorus ludovicianus
## 4   2012         zenaida macroura
## 5   2012 thryothorus ludovicianus
## 6   2013       sitta carolinensis
## 7   2013  pipilo erythrophthalmus
## 8   2013       setophaga coronata
## 9   2013          toxostoma rufum
## 10  2013    cardinalis cardinalis
## # ... with 65,489 more rows, and 3 more variables: ITIScommonName <chr>,
## #   d_time <dbl>, avg_obs <dbl>

Lastly, we will calculate the time between observations for each species in a year using the lag function. Remember that first we have to arrange the data in increasing time order for this to make sense.

Using arrange before vs. after the grouping statement will make a difference.

Arranging before the grouping puts all observations in time order. Arranging after grouping puts observations in time order within their groups.

For example, say we have observations for a Carolina wren on Dec. 12, 2013 and Jan. 2, 2014, and for a bluebird on Dec. 20, 2013 and Dec. 31 2013.

  • Arranging before grouping will give this order: wren Dec. 12, bluebird Dec. 20, bluebird Dec. 31, and wren Jan. 2

  • Arranging after grouping gives this order: (wren Dec. 12, wren Jan. 2) and (bluebird Dec. 20, bluebird Dec. 31)

On the data:

group_by(bison, scientificName_complete) %>%
  mutate(d_time = as.numeric(eventDate - lag(eventDate), "weeks")) %>%
  select(scientificName_complete, d_time) %>%
  arrange(desc(d_time))
## Source: local data frame [65,499 x 2]
## Groups: scientificName_complete [3,161]
## 
##     scientificName_complete    d_time
##                       <chr>     <dbl>
## 1           notropis procne 14933.143
## 2          lepomis gibbosus 14171.571
## 3   notemigonus crysoleucas 13550.857
## 4     cyprinella analostana 13550.857
## 5           lepomis auritus 13072.571
## 6        notropis scepticus 13005.000
## 7              carya glabra  4127.429
## 8        spizella passerina  3759.429
## 9    helmitheros vermivorum  3622.286
## 10 dichanthelium acuminatum  3555.429
## # ... with 65,489 more rows

We will get NA values if there are no previous observations.

Aside: Checking your work

Swallowtail shiner. More than 250 years passed between observations of it in the BISON dataset. Does that make sense? Source: USGS

You often have to check that you did the somewhat complicated grouped mutation operations correctly. Let’s do that using some of the tools we learned.

First, let’s check that crazy number in the final step above. The swallow-tailed shiner apparently had almost 300 years between observations in our data. Is that true?

filter(bison, scientificName_complete == "notropis procne")$eventDate %>% sort
##  [1] "1700-01-01 UTC" "1955-04-21 UTC" "1961-08-02 UTC" "1961-08-02 UTC"
##  [5] "1961-08-02 UTC" "1961-08-02 UTC" "1961-08-16 UTC" "1961-08-16 UTC"
##  [9] "1962-09-25 UTC" "1972-06-12 UTC" "1977-10-21 UTC" "1984-02-18 UTC"
## [13] "1986-01-25 UTC" "1986-03-15 UTC" "1987-06-28 UTC" "1987-06-28 UTC"
## [17] "1992-03-16 UTC" "1992-03-16 UTC" "1998-05-18 UTC" "1998-08-03 UTC"
## [21] "1998-08-03 UTC" "2003-06-12 UTC" "2003-06-12 UTC" "2003-06-12 UTC"
## [25] "2003-06-12 UTC" "2003-06-12 UTC" "2003-09-30 UTC" "2003-09-30 UTC"
## [29] "2003-09-30 UTC" "2003-09-30 UTC" "2005-04-14 UTC" "2005-04-14 UTC"
## [33] "2005-04-14 UTC" "2005-04-15 UTC" "2005-04-15 UTC"

Yes, that makes sense. There’s a single observation in 1700, and the next one is more than 250 years later.

It would be worth checking with the BISON documentation to see where they’d even get a record from Jan. 1, 1700.

Now we will check calculations on time difference between observations submitted by provider.

test <- filter(bison, providerID == 291, year == 1984)

select(test, providerID, institutionCode, eventDate)
## # A tibble: 47 × 3
##    providerID                                 institutionCode  eventDate
##         <int>                                           <chr>     <dttm>
## 1         291 North Carolina State Museum of Natural Sciences 1984-08-15
## 2         291 North Carolina State Museum of Natural Sciences 1984-08-15
## 3         291 North Carolina State Museum of Natural Sciences 1984-08-15
## 4         291 North Carolina State Museum of Natural Sciences 1984-08-15
## 5         291 North Carolina State Museum of Natural Sciences 1984-07-09
## 6         291 North Carolina State Museum of Natural Sciences 1984-07-09
## 7         291 North Carolina State Museum of Natural Sciences 1984-07-09
## 8         291 North Carolina State Museum of Natural Sciences 1984-07-09
## 9         291 North Carolina State Museum of Natural Sciences 1984-07-09
## 10        291 North Carolina State Museum of Natural Sciences 1984-07-09
## # ... with 37 more rows
max(test$eventDate)
## [1] "1984-12-30 UTC"
min(test$eventDate)
## [1] "1984-02-18 UTC"
# Note we don't need group_by since we've filtered to a single group
mutate(test, d_time = as.numeric(max(eventDate) - min(eventDate), "weeks")) %>%
  select(year, providerID, d_time)
## # A tibble: 47 × 3
##     year providerID   d_time
##    <int>      <int>    <dbl>
## 1   1984        291 45.14286
## 2   1984        291 45.14286
## 3   1984        291 45.14286
## 4   1984        291 45.14286
## 5   1984        291 45.14286
## 6   1984        291 45.14286
## 7   1984        291 45.14286
## 8   1984        291 45.14286
## 9   1984        291 45.14286
## 10  1984        291 45.14286
## # ... with 37 more rows

which looks correct.

Quirks of group_by

White-tinged sedge. Source: Missouri Botanical Gardens.

Some minor notes:

  • grouped operations like those above return data frames (tibbles) with group information associated with them

  • Remove group labels using ungroup(data frame). You cannot alter grouping variables in mutate and will get an error message. Other functions sometimes disallow operations on grouping variables.

  • return a list of grouping variables from a grouped data frame using groups(data frame)

  • use group_size to return a vector, the same length as the number of observations, giving the group size of the group associated with that observation

  • group_indices gives a vector with the group index associated to a particular observation

Summarizing and subset summaries

Whitebanded crab spider. Source: carolinanature.com

When to use summarise and summarise_each

Summary functions are for creating summary information, of course.

In other words use summarise to return a single summary statistic for a dataset or for subgroups of a dataset (using group_by first)

Use summarise_each if you want a series of the same summary statistics for different variables.

Summaries for basic information

Finally, we can answer some basic questions about our dataset using summarise and some of the functions you saw above.

How many observations are there in the data per species?

group_by(bison, scientificName_complete) %>%
  summarise(n())
## # A tibble: 3,161 × 2
##    scientificName_complete `n()`
##                      <chr> <int>
## 1              ablabesmyia     6
## 2     abutilon theophrasti     1
## 3                 acalypha     1
## 4       acalypha gracilens     1
## 5     acalypha ostryifolia     1
## 6      acalypha rhomboidea     1
## 7       acalypha virginica     4
## 8          acanthodrilidae     1
## 9                    acari     4
## 10      accipiter cooperii    58
## # ... with 3,151 more rows

What are the top-10 most observed species across all years?

Notice we can rename the variables to be something more legible in the output table.

group_by(bison, scientificName_complete) %>%
  summarise(N_species = n()) %>%
  arrange(desc(N_species))
## # A tibble: 3,161 × 2
##     scientificName_complete N_species
##                       <chr>     <int>
## 1     cardinalis cardinalis      1654
## 2  thryothorus ludovicianus      1521
## 3      poecile carolinensis      1478
## 4        baeolophus bicolor      1459
## 5               pinus taeda      1286
## 6      melanerpes carolinus      1208
## 7     corvus brachyrhynchos      1186
## 8   liquidambar styraciflua      1133
## 9               acer rubrum       995
## 10         zenaida macroura       946
## # ... with 3,151 more rows

Northern cardinal female, the most frequently observed species in our dataset. Source: Wikipedia.

We can combine the grouped mutate and summarize operations to learn average number of weeks between observations per species.

group_by(bison, scientificName_complete) %>%
  mutate(d_time = as.numeric(eventDate - lag(eventDate), "weeks")) %>%
  summarise(avg_d_time = mean(d_time, na.rm = TRUE)) %>%
  arrange(desc(avg_d_time))
## # A tibble: 3,161 × 2
##              scientificName_complete avg_d_time
##                                <chr>      <dbl>
## 1               rhizopogon rubescens   3285.714
## 2                ancyloxypha numitor   3143.101
## 3                    amanita excelsa   2658.286
## 4                penstemon digitalis   2388.714
## 5               clasmatodon parvulus   2145.714
## 6                      marmota monax   1980.857
## 7         luzula acuminata carolinae   1722.143
## 8               lasmigona subviridis   1695.286
## 9  carex muehlenbergii muehlenbergii   1673.714
## 10                  glyceria striata   1626.429
## # ... with 3,151 more rows

Rhizopogon rubescens has the largest inter-observation time of any species in the dataset. Source: Wikipedia

The next example shows you can

  • use computed values as grouping variables

  • change groupings between operations

  • assign group names within group_by

group_by(bison, scientificName_complete) %>%
  mutate(d_time = as.numeric(eventDate - lag(eventDate), "weeks")) %>%
  ungroup %>%
  group_by(long_btw_obs = d_time > 52) %>%
  summarise(N_obs = n())
## # A tibble: 3 × 2
##   long_btw_obs N_obs
##          <lgl> <int>
## 1        FALSE 48956
## 2         TRUE 11694
## 3           NA  4849

The table shows that by far most observations were within a year of the previous observation for that same species.

NA values are likely from those first observations for each species, though we should check there are no others.

summarise_each has essentially the same syntax as mutate_each and represents the same concept, just for summarization.

The silly example below returns the maximum length between observations as well as the maximum providerID number for each subspecies.

group_by(bison, scientificName_complete) %>%
  mutate(d_time = as.numeric(eventDate - lag(eventDate), "weeks")) %>%
  summarise_each(funs(max = max(., na.rm = TRUE)), d_time, providerID)
## # A tibble: 3,161 × 3
##    scientificName_complete  d_time_max providerID_max
##                      <chr>       <dbl>          <int>
## 1              ablabesmyia   0.5714286            440
## 2     abutilon theophrasti          NA            440
## 3                 acalypha          NA            440
## 4       acalypha gracilens          NA            440
## 5     acalypha ostryifolia          NA            407
## 6      acalypha rhomboidea          NA            440
## 7       acalypha virginica 104.4285714            440
## 8          acanthodrilidae          NA            220
## 9                    acari   0.5714286            440
## 10      accipiter cooperii 579.2857143            602
## # ... with 3,151 more rows

The only new trick there was that I defined a new function within the funs helper, specifying I wanted to remove NA values and using the dot as a placeholder for where my variable should go.

We still got some NA values for species with only one observation.

Summaries as data quality checks

Grouped summaries can be an important tool for assessing data quality. Once you have spotted something weird, you can ask yourself:

  • are the errors concentrated?

  • if so, with which observations?

Doing so can help you think about patterns, which in turn lead you to possible reasons and fixes for the data quality problems.

First, we can now look more closely at the missing ITIS scientific names problem.

group_by(bison, providerID) %>%
  summarise(N_errors = sum(ITISscientificName == "")) %>%
  arrange(desc(N_errors))
## # A tibble: 64 × 2
##    providerID N_errors
##         <int>    <int>
## 1         440     1271
## 2         214      472
## 3         136      130
## 4         167       53
## 5         407       33
## 6         120       32
## 7         291       19
## 8         220       14
## 9         602       10
## 10        111        4
## # ... with 54 more rows
filter(bison, providerID == 440) %>% select(institutionCode)
## # A tibble: 30,563 × 1
##    institutionCode
##              <chr>
## 1            BISON
## 2            BISON
## 3            BISON
## 4            BISON
## 5            BISON
## 6            BISON
## 7            BISON
## 8            BISON
## 9            BISON
## 10           BISON
## # ... with 30,553 more rows

You could then refer to the dataset documentation, you would hope, to understand why observations from observer BISON have so many missing scientific names. You could also try grouping by year as well to further isolate the problem.

Bonus: Approximate ‘as the crow flies’ distances

Fish crow. Source: Audobon

A question we might be interested for this dataset is: What is the maximum distance between two observations of the same species within Orange County?

Distances between points are not yet a variable. But we have coordinates for where species were sighted.

We can approximate those distances using the haversine formula.

haversine <- function(lat1, lat2, long1, long2, in_miles = TRUE){
  
  message("Returns great-circle distance between two points on earth in km or miles. Latitude and longitude should be in radians.")
  
  h <- sin(.5 * (lat2 - lat1)) ^ 2 + cos(lat1) * cos(lat2) * (sin(.5 * (long2 - long1)) ^ 2)
  
  # earth diameter taken from http://imagine.gsfc.nasa.gov/features/cosmic/earth_info.html
  
  r <- 12756 / 2
  
  if (in_miles)
    r <- r * 0.621371
  
  return(2 * r * asin(sqrt(h)))
}

Don’t worry about the details of this, or read about them on Wikipedia.

The haversine formula is an old nautical tool to calculate distances along the surface of a sphere—a good approximation to distances along the surface of the almost-spherical earth.

Let’s test it out. We’ll compare the approximation to the NASA terrestrial planet mileage calculator, an exact calculation.

Coordinates of Hanes building at UNC-Chapel Hill are Lat: 35.910810, Long: -79.051205 , and rough coordinates at Nuuk, capital of Greenland are Lat: 64.182179, Long: -51.688468.

haversine(lat1 = pi * 35.910810 / 180, lat2 = pi * 64.182179 / 180, long1 = pi * -79.051205 / 180, long2 = pi * -51.688468 / 180)
## Returns great-circle distance between two points on earth in km or miles. Latitude and longitude should be in radians.
## [1] 2263.758

Answer from NASA: 2274.5 miles.

The haversine formula gives a very good approximation to NASA’s figure, particularly since it is over such a long distance.

All of our points are in Orange County, so the error won’t even matter. For example, here’s the haversine distance estimate the Hanes building to Duke Chapel.

haversine(lat1 = pi * 35.910810 / 180, lat2 = pi * 36.001756 / 180, long1 = pi * -79.051205 / 180, long2 = pi * -78.939975 / 180)
## [1] 8.851984

Answer from NASA: 8.9 miles.

NASA’s tool rounds to the first decimal, so the answer are likely much closer.

Why bother with this bonus section?

Geo-location information is some of the most common and interesting data you will come across. Now, just with latitude and longitude, you can extract as-the-crow flies distances!