The references for these notes
Prerequisites:
the tidyverse
library
download the bison dataset from data.world (you have to make a profile)
This lecture will go through workhorse functions in the dplyr
package, which will make altering, subsetting and summarizing data much easier than it otherwise would be.
Our main textbook source is chapter 5 of R for Data Science
You will learn to:
Install and load the tidyverse
suite of packages
Load csv files into data frames and tibbles
with readr
and base R functions
Subset columns using select
Rename columns with rename
Subset rows of data frames using filter
and slice
Reorder rows with arrange
Create and alter variables with mutate
and mutate_each
Bonus: Understand basics of date/time variable types
Use ‘pipe’ expressions to combine multiple operations
Alter data for grouped rows with group_by
Bonus: ‘As-the-crow-flies’ distances from coordinate data
Summarize and create subgroup summaries with summarise
and summarise_each
. Note the British spelling of summarise in the functions
American Bison. Source: Defenders of Wildlife. Except for this bison, all photos in this lecture are of living things observed in Orange County, N.C., according to this dataset.
You can find many, many interesting datasets from U.S. governmental agencies through data.gov—including the U.S. Geological Survey’s Biodeversity Serving Our Nation (BISON) resource. Later in the class you will learn how to do even more with resources like BISON using its API.
Today we will work with a static dataset: Species occurance dataset for Orange County, N.C.
Like all information from BISON, it gives only basic information. Each row gives the location, reporting source, and species information for sightings of specific animals in Orange County.
Quoting the help page:
BISON provides access to georeferenced (those with latitude and longitude coordinates) and non-georeferenced data describing the occurrence or presence of terrestrial and aquatic species recorded or collected by a person (or instrument) at a specific time in the United States and its Territories .
We will have 65,499 observations on 37 variables.
Typically you would want to spend some time figuring out exactly what each variable in your dataset represents—with the help, ideally, of a detailed guide from the data source. Here the variables we are going to use are self-explanatory and many variables we will just ignore.
We will demonstrate dplyr
functions using this dataset. The textbook chapter linked above has examples on a different dataset, and looking at both can help you get a better sense of how to use these tools.
This is always a good question to ask yourself before starting on a data science project. Here, we are just trying to demonstrate key dplyr
functions, so the answer is less important.
To help keep us focused some basic questions:
What time period do these data represent?
What were the most commonly observed species in Orange County, as represented by this dataset?
What were the most common sources of reported sightings? Remember each row is a reported sighting of a particular species at a particular time and from a certain source.
How physically spread out were the sightings for particular species?
How spread out in time were the sightings for particular species?
You could think of others, but those questions will give some direction while trying to show off dplyr
functionality.
Hadley Wickham has created many fantastic R packages. The tidyverse package collects Wickham’s tools for R, including dplyr
and also many more
If you have not yet installed tidyverse go ahead and do so
install.packages('tidyverse')
A note of warning about ‘non-standard evaluation’ to read later: Standard functions in the tidyverse suite allow you to enter variable names as arguments without quotations. You will see how that works, so skip ahead if you really need to know right now what that means. Understanding non-standard evaluation is helpful when you do more advanced programming tasks. We will come to that, if we have time, near the end of the course.
For now, just be aware that using unquoted variable names in the functions is only the standard for tidyverse functions, not for all functions in R.
Woodland vole. Source: Wikipedia.
We will focus here on the more common and most basic data loading task: loading a comma-separated (csv) file into a data frame.
As you’ve already seen, data frames will be our basic R object for storing data. The textbook talks about ‘tibbles,’ which are data frames with a few convenient modifications. You can read the ‘tibbles’ chapter in the book to see how the two differ. Here, we will not distinguish between data frames and ‘tibbles’ unless necessary.
The key is that:
tidyverse functions will work for data frames and tibbles and by will return tibble objects.
For most purposes the two work the same.
read_csv
We discussed setting working directories in the last lecture. The code in this lecture assumes the data set and R code are in the same directory.
library(tidyverse)
bison <- read_csv('bison_orangecounty.csv')
class(bison)
str(bison)
head(bison)
tail(bison)
Take a minute to glance at the information, checking out what variable types you have and getting a sense of what some of them look like.
read_csv
has done some useful things automatically:
it did not attempt to convert character-class vectors into factors, which is the default for read.csv
from base R.
it converted the eventDate variable into a date/time object. You will learn how to do this. Let’s re-load the dataset using read.csv so you can learn how to create date/time vectors on your own, later.
bison <- read.csv('bison_orangecounty.csv', stringsAsFactors = FALSE)
class(bison)
head(bison)
Notice this is now a data frame object, not a tibble, and you can see one of the minor differences: When printing parts of a tibble, you get only a neat and limited view. Printing data frames is much messier but more complete.
Let’s convert our data frame to a tibble mainly to avoid monstrous print-outs in steps below.
bison <- as_tibble(bison)
read_csv
vs. read.csv
The readr
package and base R can read many table-type datasets. Let’s focus on csv files, but the basic principles below make sense for the more general read_table
and read.table
functions.
See chapter 11 of the textbook for more.
read_csv
is faster so works better on larger datasets.
read_csv
has some helpful defaults, such as not converting strings to factors, and warning messages.
read.csv
has less going on, so you might run into errors with read_csv
that you don’t get with the base R function. Sometimes starting with read.csv
can be a way to see what the problem is.
read.csv
returns a data frame, not a tibble, which you might want.
Type ?read_csv
to learn more. See the textbook chapter 11.1 for some more points of comparison.
Davie Poplar, a tulip poplar more than 300 years old. Photo from the 1970s. Source: A View to Hugh blog, maintained by a Wilson Library photo archivist.
To return column a data frame as a vector
bison$eventDate
# or
bison['eventDate']
But what if you want to return an entire collection of variables as a new data frame? The dplyr
function select
let’s you do that. For example you might want to remove columns from your data frame is because you only need a few variables. That’s the case here, where we have many columns that are not really information we are interested in. Given the questions we set down above, we reduce the number of columns in our dataset.
To select a particular variable or set of variables by name, type the unquoted variable name in the function after specifying the data frame you are drawing from.
select(bison, eventDate, year)
## # A tibble: 65,499 × 2
## eventDate year
## <chr> <int>
## 1 2012-12-29T00:00Z 2012
## 2 2012-12-21T00:00Z 2012
## 3 2012-12-30T00:00Z 2012
## 4 2012-12-21T00:00Z 2012
## 5 2012-12-29T00:00Z 2012
## 6 2013-02-23T00:00Z 2013
## 7 2013-02-23T00:00Z 2013
## 8 2013-02-23T00:00Z 2013
## 9 2013-02-23T00:00Z 2013
## 10 2013-02-23T00:00Z 2013
## # ... with 65,489 more rows
To select everything but a specified group of variables, use the negative sign in front of their unquoted names.
select(bison, -eventDate, -year)
## # A tibble: 65,499 × 35
## bisonID basisOfRecord catalogNumber collectorNumber recordedBy
## <dbl> <chr> <chr> <chr> <chr>
## 1 844239256 observation OBS174171744 obsr143470
## 2 844267315 observation OBS173343892 obsr143470
## 3 846018158 observation OBS174269666 obsr143470
## 4 846916268 observation OBS173343881 obsr143470
## 5 846967932 observation OBS174171742 obsr143470
## 6 940794719 observation OBS182987416 obsr143470
## 7 940794958 observation OBS182987420 obsr143470
## 8 940795061 observation OBS182987419 obsr143470
## 9 940795118 observation OBS182987418 obsr143470
## 10 940795350 observation OBS182987421 obsr143470
## # ... with 65,489 more rows, and 30 more variables: providerID <int>,
## # institutionCode <chr>, resourceID <chr>,
## # ownerInstitutionCollectionCode <chr>, providedScientificName <chr>,
## # scientificName <chr>, ITISscientificName <chr>,
## # providedCommonName <chr>, ITIScommonName <chr>, ITIStsn <chr>,
## # validAcceptedITIStsn <chr>, providedTSN <int>, decimalLatitude <dbl>,
## # decimalLongitude <dbl>, geodeticDatum <chr>,
## # coordinatePrecision <dbl>, coordinateUncertaintyInMeters <chr>,
## # verbatimElevation <chr>, verbatimDepth <chr>, centroid <chr>,
## # higherGeographyID <int>, computedCountyFips <int>,
## # providedCounty <chr>, calculatedCounty <chr>, providedState <chr>,
## # calculatedState <chr>, countryCode <chr>, institutionID <chr>,
## # collectionID <chr>, relatedResourceID <lgl>
You can also select columns using text conditions, with certain special functions that work inside select
and other dplyr
functions that use select
implicitly.
Here is a list of the ones I think you’ll use most often, and examples are below. To see a full list of these functions, type ?select_helpers
contains
returns all columns containing a character pattern. For example, if you wanted all variables with the word ‘county’ you would type contains('county')
in the select
arguments.
starts_with
returns columns beginning with a certain character pattern
ends_with
returns those ending with a certain pattern
Those helpers are very useful to keep in mind if you have variable names falling into broad categories as we do here.
The code below shows how to extract variables whose names end with ‘name,’ end with ‘provided,’ or include the word ‘common.’
select(bison, ends_with("name"))
## # A tibble: 65,499 × 5
## providedScientificName scientificName
## <chr> <chr>
## 1 Setophaga coronata Setophaga coronata
## 2 Sialia sialis (Linnaeus, 1758) Sialia sialis
## 3 Thryothorus ludovicianus (Latham, 1790) Thryothorus ludovicianus
## 4 Zenaida macroura (Linnaeus, 1758) Zenaida macroura
## 5 Thryothorus ludovicianus (Latham, 1790) Thryothorus ludovicianus
## 6 Sitta carolinensis Latham, 1790 Sitta carolinensis
## 7 Pipilo erythrophthalmus (Linnaeus, 1758) Pipilo erythrophthalmus
## 8 Setophaga coronata Setophaga coronata
## 9 Toxostoma rufum (Linnaeus, 1758) Toxostoma rufum
## 10 Cardinalis cardinalis (Linnaeus, 1758) Cardinalis cardinalis
## # ... with 65,489 more rows, and 3 more variables:
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>
select(bison, starts_with("provided"))
## # A tibble: 65,499 × 5
## providedScientificName providedCommonName
## <chr> <chr>
## 1 Setophaga coronata Yellow-rumped Warbler
## 2 Sialia sialis (Linnaeus, 1758) Eastern Bluebird
## 3 Thryothorus ludovicianus (Latham, 1790) Carolina Wren
## 4 Zenaida macroura (Linnaeus, 1758) Mourning Dove
## 5 Thryothorus ludovicianus (Latham, 1790) Carolina Wren
## 6 Sitta carolinensis Latham, 1790 White-breasted Nuthatch
## 7 Pipilo erythrophthalmus (Linnaeus, 1758) Eastern Towhee
## 8 Setophaga coronata Yellow-rumped Warbler
## 9 Toxostoma rufum (Linnaeus, 1758) Brown Thrasher
## 10 Cardinalis cardinalis (Linnaeus, 1758) Northern Cardinal
## # ... with 65,489 more rows, and 3 more variables: providedTSN <int>,
## # providedCounty <chr>, providedState <chr>
select(bison, contains("common"))
## # A tibble: 65,499 × 2
## providedCommonName
## <chr>
## 1 Yellow-rumped Warbler
## 2 Eastern Bluebird
## 3 Carolina Wren
## 4 Mourning Dove
## 5 Carolina Wren
## 6 White-breasted Nuthatch
## 7 Eastern Towhee
## 8 Yellow-rumped Warbler
## 9 Brown Thrasher
## 10 Northern Cardinal
## # ... with 65,489 more rows, and 1 more variables: ITIScommonName <chr>
A final point: you can make the helper functions case-sensitive using the ignore.case
argument within a helper function which is set to TRUE by default. Compare what you get here to the case-insensitive statement above.
select(bison, contains("Common", ignore.case = FALSE))
## # A tibble: 65,499 × 1
## providedCommonName
## <chr>
## 1 Yellow-rumped Warbler
## 2 Eastern Bluebird
## 3 Carolina Wren
## 4 Mourning Dove
## 5 Carolina Wren
## 6 White-breasted Nuthatch
## 7 Eastern Towhee
## 8 Yellow-rumped Warbler
## 9 Brown Thrasher
## 10 Northern Cardinal
## # ... with 65,489 more rows
Given our loose goals above, let’s trim the variables to include only those we might want to use. This shows you can combine all of the ways we saw to select columns.
According to the BISON help pages, observers providing the data give their own common and scientific names. BISON standardizes them using the Integrated Taxonomic Information System (ITIS), so those could be the only species names we keep.
You should think about that choice for a minute: We eventually will want to group observations by those species names, subsetting and summarizing information using them. That means we will want all rows with a certain name to represent the same thing. Standardized species names gives us that. Without it, our job as data analysts would be more difficult.
However, as you will see in the filtering section below, the standard species names have some blank values. The other names columns help us to see what is going on, so we keep them to assess data quality.
bison <- select(bison, bisonID, contains("name"), providerID, institutionCode,
decimalLatitude, decimalLongitude,
eventDate, year)
Order matters: Variables will appear in the new data frame in the order we name them in select, from left to right.
For convenience, let’s change some variable names. dplyr’s function rename
does that. It wouldn’t be difficult to do without rename, but it does make the code cleaner. The syntax is rename(data frame, var1_new_name = var1_old_name, var2_new_name = var2_old_name, etc.)
as shown below.
bison <- rename(bison,
latitude = decimalLatitude, longitude = decimalLongitude)
Yellow-crowned night-heron. Source: Audobon.
You have already seen how to select rows from a data frame using base R. dplyr
gives you a variety of ways to subset rows of data easily and to do so based on logical conditions.
For example, if you wanted to extract the 51st through 60th rows of the data frame, you can do so using slice
. See the comparison with base R functionality.
slice(bison, 51:60)
## # A tibble: 10 × 12
## bisonID providedScientificName
## <dbl> <chr>
## 1 933972761 Baeolophus bicolor (Linnaeus, 1766)
## 2 933980181 Spinus tristis (Linnaeus, 1758)
## 3 933980183 Icterus galbula (Linnaeus, 1758)
## 4 933980192 Corvus brachyrhynchos Brehm, 1822
## 5 933980502 Thryothorus ludovicianus (Latham, 1790)
## 6 933986851 Toxostoma rufum (Linnaeus, 1758)
## 7 933987932 Poecile carolinensis (Audubon, 1834)
## 8 933992282 Melanerpes carolinus (Linnaeus, 1758)
## 9 933992451 Poecile carolinensis (Audubon, 1834)
## 10 933992455 Sitta canadensis Linnaeus, 1766
## # ... with 10 more variables: scientificName <chr>,
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## # latitude <dbl>, longitude <dbl>, eventDate <chr>, year <int>
bison[51:60, ]
## # A tibble: 10 × 12
## bisonID providedScientificName
## <dbl> <chr>
## 1 933972761 Baeolophus bicolor (Linnaeus, 1766)
## 2 933980181 Spinus tristis (Linnaeus, 1758)
## 3 933980183 Icterus galbula (Linnaeus, 1758)
## 4 933980192 Corvus brachyrhynchos Brehm, 1822
## 5 933980502 Thryothorus ludovicianus (Latham, 1790)
## 6 933986851 Toxostoma rufum (Linnaeus, 1758)
## 7 933987932 Poecile carolinensis (Audubon, 1834)
## 8 933992282 Melanerpes carolinus (Linnaeus, 1758)
## 9 933992451 Poecile carolinensis (Audubon, 1834)
## 10 933992455 Sitta canadensis Linnaeus, 1766
## # ... with 10 more variables: scientificName <chr>,
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## # latitude <dbl>, longitude <dbl>, eventDate <chr>, year <int>
which does the same thing as
bison[51:60, ]
You can also grab the first, last or nth element of a vector with functions of that name.
first(bison$bisonID)
## [1] 844239256
last(bison$bisonID)
## [1] 1219788601
nth(bison$bisonID, n = 51)
## [1] 933972761
Carolina chickadee. Source: Cornell Lab of Ornithology.
Much more useful is the ability to subset your data frame not by row indices, which often are meaningless, but by specific logical conditions.
Each argument in filter
, separated by commas, should evaluate to a logical vector. filter
will return the subset of the data frame for which all statements are TRUE.
Let’s say we wanted to select only those observations marking a sighting of carolina chickadee, which you can find year-round here.
filter(bison, ITIScommonName == "Carolina Chickadee")
## # A tibble: 1,478 × 12
## bisonID providedScientificName scientificName
## <dbl> <chr> <chr>
## 1 940795743 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 2 943427969 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 3 933987932 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 4 933992451 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 5 936922013 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 6 971994869 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 7 972287684 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 8 973919771 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 9 973929092 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 10 974747631 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## # ... with 1,468 more rows, and 9 more variables:
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## # latitude <dbl>, longitude <dbl>, eventDate <chr>, year <int>
Let’s see an example with a numerical criterion, returning Carolina Chickadee sightings since 2013.
filter(bison, ITIScommonName == "Carolina Chickadee", year >= 2013)
## # A tibble: 318 × 12
## bisonID providedScientificName scientificName
## <dbl> <chr> <chr>
## 1 940795743 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 2 943427969 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 3 933992451 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 4 936922013 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 5 971994869 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 6 972287684 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 7 973919771 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 8 973929092 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 9 974747631 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## 10 975309669 Poecile carolinensis (Audubon, 1834) Poecile carolinensis
## # ... with 308 more rows, and 9 more variables: ITISscientificName <chr>,
## # providedCommonName <chr>, ITIScommonName <chr>, providerID <int>,
## # institutionCode <chr>, latitude <dbl>, longitude <dbl>,
## # eventDate <chr>, year <int>
There are three basic logical operations in R
&
AND|
OR (inclusive)!
NOTYou can have multiple conditions, combining them using the various logical operations you already learned. In filter
, the comma acts as the logical AND.
Here is a quick visual guide to logical operations from the textbook.
Logical and set operations. Source: R for Data Science chapter 5.
With large datasets, you will not be able to just scroll through all the data to catch errors. You will need to develop ways to spot possible errors as you explore and tweak the data. Conditional subsetting with filter
is a great way to look at a limited part of the data that might contain errors.
Below we filter for all records whose common name is Carolina Chickadee but whose scientific name is not the correct one. That would suggest a data quality issue.
Here, since we used the standardized common and scientific names, we have no problems.
filter(bison, ITIScommonName == "Carolina Chickadee", ITISscientificName != "Poecile carolinensis")
## # A tibble: 0 × 12
## # ... with 12 variables: bisonID <dbl>, providedScientificName <chr>,
## # scientificName <chr>, ITISscientificName <chr>,
## # providedCommonName <chr>, ITIScommonName <chr>, providerID <int>,
## # institutionCode <chr>, latitude <dbl>, longitude <dbl>,
## # eventDate <chr>, year <int>
Of course, that’s only one particular check on a particular species. We can use the summarize functions below to do more systematic checks.
One systematic check we can do with filter
is to see whether there are any missing values in the scientific names or common names columns.
filter(bison, is.na(ITISscientificName) | is.na(ITIScommonName))
## # A tibble: 0 × 12
## # ... with 12 variables: bisonID <dbl>, providedScientificName <chr>,
## # scientificName <chr>, ITISscientificName <chr>,
## # providedCommonName <chr>, ITIScommonName <chr>, providerID <int>,
## # institutionCode <chr>, latitude <dbl>, longitude <dbl>,
## # eventDate <chr>, year <int>
Great! No problems there, at least. But now look:
filter(bison, ITISscientificName == "" | ITIScommonName == "")
## # A tibble: 2,059 × 12
## bisonID providedScientificName
## <dbl> <chr>
## 1 1219787010 Hylodesmum nudiflorum
## 2 1219786998 Botrypus virginianus
## 3 1219787016 Fraxinus [americana + biltmoreana + pennsylvanica + smallii]
## 4 1219787044 Smilax bona-nox var. bona-nox
## 5 1219991828 Fraxinus [americana + biltmoreana + smallii]
## 6 1219991832 Hamamelis virginiana var. virginiana
## 7 1220111105 Celtis [laevigata + smallii]
## 8 1220068949 Unknown
## 9 1219787903 Muscadinia rotundifolia var. rotundifolia
## 10 1219787839 Endodeca serpentaria
## # ... with 2,049 more rows, and 10 more variables: scientificName <chr>,
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## # latitude <dbl>, longitude <dbl>, eventDate <chr>, year <int>
You have just used filter
to spot and investigate a data quality issue.
This is why we kept multiple names variables, aside from just the one we ideally would use—the standardized (ITIS) species names.
If we were doing some actual analysis with this dataset, you would think about how to resolve the problem of missing standardized species names. This is where content knowledge or extra research and documentation would be very helpful.
For now, we’ll just note the problem and move on.
filter
As we saw above, filter
is very useful for exploring the dataset, particularly when looking for data quality problems. Large datasets can be unwieldy, and this gives a way to inspect visually some problematic sections.
Otherwise, this function is best used in combination with different functions as a way of preparing to do something else. It didn’t look like much above, but you will see many examples throughout the course where filter plays an important role.
Common times filter
is used include before making a graph, to plot just a particular subset of the data, or to isolate certain sections of the data for use in further analyses.
arrange
Eastern garter snake. Source: Wikipedia
arrange
Sometimes it is useful to re-order the rows of your dataset in a particular way. Common reasons you want to do that include:
For example, say you have a dataset with U.S. gross domestic product by year, and you want to calculate a new variable with the year-over-year change in GDP. You want subsequent years to follow each other, so you need to arrange first before using some of the variable creation methods demonstrated below.
Let’s arrange our data frame rows by year. To show you it worked, we used select to pick out a few variables to be printed.
select(arrange(bison, year), year, ITISscientificName)
## # A tibble: 65,499 × 2
## year ITISscientificName
## <int> <chr>
## 1 1700 Lepomis auritus
## 2 1700 Lepomis gibbosus
## 3 1700 Notemigonus crysoleucas
## 4 1700 Cyprinella analostana
## 5 1700 Notropis scepticus
## 6 1700 Notropis procne
## 7 1822
## 8 1845
## 9 1846
## 10 1846
## # ... with 65,489 more rows
Use the helper function desc
to arrange in descending rather than ascending order.
select(arrange(bison, desc(year)), year, ITISscientificName)
## # A tibble: 65,499 × 2
## year ITISscientificName
## <int> <chr>
## 1 2015 Arisaema triphyllum
## 2 2015 Onoclea sensibilis
## 3 2015 Viola striata
## 4 2015 Solidago pinetorum
## 5 2015 Glechoma hederacea
## 6 2015 Fragaria virginiana
## 7 2015 Scutellaria integrifolia
## 8 2015 Onoclea sensibilis
## 9 2015
## 10 2015 Goodyera pubescens
## # ... with 65,489 more rows
This works for characters as well, but you should make note of how the sorting handles blanks and NA values. See the tips section below.
select(arrange(bison, ITISscientificName), year, ITISscientificName)
## # A tibble: 65,499 × 2
## year ITISscientificName
## <int> <chr>
## 1 1977
## 2 1977
## 3 1977
## 4 1977
## 5 2002
## 6 2002
## 7 2008
## 8 1992
## 9 1977
## 10 1977
## # ... with 65,489 more rows
select(arrange(bison, desc(ITISscientificName)), year, ITISscientificName)
## # A tibble: 65,499 × 2
## year ITISscientificName
## <int> <chr>
## 1 2012 Zonotrichia leucophrys
## 2 2009 Zonotrichia leucophrys
## 3 2010 Zonotrichia leucophrys
## 4 2012 Zonotrichia leucophrys
## 5 2013 Zonotrichia leucophrys
## 6 2012 Zonotrichia leucophrys
## 7 2012 Zonotrichia leucophrys
## 8 2013 Zonotrichia leucophrys
## 9 2011 Zonotrichia leucophrys
## 10 2011 Zonotrichia leucophrys
## # ... with 65,489 more rows
To demonstrate the rules arrange
uses for character vectors we can also demonstrate how to create a tibble or data frame from scratch.
See chapter 10 of R for Data Science for more detailed explanations of this process.
Pay attention to what variable is being used to arrange here.
test <- tibble(name_lower = c("a", "q", "s", "z", "", NA, "ab", "aq", "!", "?"),
name_upper = c("Q", NA, "A", "", "B", "Z", "AB", "AQ", "!", "?"),
name_all = c("A", NA, "a", "", "z", "Z", "Az", "AZ", "!?", "!!"))
arrange(test, name_lower)
## # A tibble: 10 × 3
## name_lower name_upper name_all
## <chr> <chr> <chr>
## 1 B z
## 2 ! ! !?
## 3 ? ? !!
## 4 a Q A
## 5 ab AB Az
## 6 aq AQ AZ
## 7 q <NA> <NA>
## 8 s A a
## 9 z
## 10 <NA> Z Z
arrange(test, desc(name_lower))
## # A tibble: 10 × 3
## name_lower name_upper name_all
## <chr> <chr> <chr>
## 1 z
## 2 s A a
## 3 q <NA> <NA>
## 4 aq AQ AZ
## 5 ab AB Az
## 6 a Q A
## 7 ? ? !!
## 8 ! ! !?
## 9 B z
## 10 <NA> Z Z
arrange(test, name_upper)
## # A tibble: 10 × 3
## name_lower name_upper name_all
## <chr> <chr> <chr>
## 1 z
## 2 ! ! !?
## 3 ? ? !!
## 4 s A a
## 5 ab AB Az
## 6 aq AQ AZ
## 7 B z
## 8 a Q A
## 9 <NA> Z Z
## 10 q <NA> <NA>
arrange(test, desc(name_upper))
## # A tibble: 10 × 3
## name_lower name_upper name_all
## <chr> <chr> <chr>
## 1 <NA> Z Z
## 2 a Q A
## 3 B z
## 4 aq AQ AZ
## 5 ab AB Az
## 6 s A a
## 7 ? ? !!
## 8 ! ! !?
## 9 z
## 10 q <NA> <NA>
arrange(test, name_all)
## # A tibble: 10 × 3
## name_lower name_upper name_all
## <chr> <chr> <chr>
## 1 z
## 2 ? ? !!
## 3 ! ! !?
## 4 s A a
## 5 a Q A
## 6 ab AB Az
## 7 aq AQ AZ
## 8 B z
## 9 <NA> Z Z
## 10 q <NA> <NA>
arrange(test, desc(name_all))
## # A tibble: 10 × 3
## name_lower name_upper name_all
## <chr> <chr> <chr>
## 1 <NA> Z Z
## 2 B z
## 3 aq AQ AZ
## 4 ab AB Az
## 5 a Q A
## 6 s A a
## 7 ! ! !?
## 8 ? ? !!
## 9 z
## 10 q <NA> <NA>
Those demos lead to a few handy tips.
arrange
Some things to keep in mind when using arrange:
NA values are placed at the end even when sorting in descending order. That holds for both numeric and character vectors.
Blank characters are “smaller” than any other character.
Be careful with punctuation ordering. Experiment. But really, you shouldn’t be trying to order by punctuation because… what does that even mean?
Letters appearing later in the alphabet are “bigger”. When using increasing order, “a” appears before “z” for example.
Capital letters are “bigger” than lower-case letters. So “A” appears before “a” when using descending order.
Ordering begins with the first letter then proceeds letter-by-letter. “ab” therefore is “smaller” than “az” and will appear before it when using ascending order.
You can arrange using multiple criteria, as well, where the order in which you list variables in arrange
determines on which variables you sort first.
select(arrange(bison, desc(year), desc(ITISscientificName)), year, ITISscientificName)
## # A tibble: 65,499 × 2
## year ITISscientificName
## <int> <chr>
## 1 2015 Woodwardia areolata
## 2 2015 Viola striata
## 3 2015 Viola striata
## 4 2015 Viola sororia
## 5 2015 Viola bicolor
## 6 2015 Veronica hederifolia
## 7 2015 Verbascum blattaria
## 8 2015 Vaccinium tenellum
## 9 2015 Turdus migratorius
## 10 2015 Triodanis perfoliata
## # ... with 65,489 more rows
select(arrange(bison, desc(ITISscientificName), desc(year)), year, ITISscientificName)
## # A tibble: 65,499 × 2
## year ITISscientificName
## <int> <chr>
## 1 2013 Zonotrichia leucophrys
## 2 2013 Zonotrichia leucophrys
## 3 2013 Zonotrichia leucophrys
## 4 2013 Zonotrichia leucophrys
## 5 2013 Zonotrichia leucophrys
## 6 2012 Zonotrichia leucophrys
## 7 2012 Zonotrichia leucophrys
## 8 2012 Zonotrichia leucophrys
## 9 2012 Zonotrichia leucophrys
## 10 2012 Zonotrichia leucophrys
## # ... with 65,489 more rows
mutate:
alter and create variablesCommon buckeye butterfly. Source: iNaturalist
mutate
, mutate_eachYou’ll want to use mutate
just about any time you want to create a new variable in your data frame or alter and overwrite an existing variable.
If you plan to use the same function on multiple variables, use mutate_each
.
The syntax for mutate
is \[\text{data_frame_2 <- mutate(data_frame_1, unquoted_variable_name = something)} \]
Syntax for mutate_each
is \[\text{data_frame_2 <- mutate(data_frame_1, funs(function1_name, function2_name), unquoted variable name or helper function as in } select)\]
mutate_each
will apply each function in the funs brackets to each of the columns selected. If no columns are selected, it applies each function to all. You can make custom functions to use with mutate_each
, but we will only see that later in the class.
Often, data frame one and two in the statement above will be the same: You are overwriting your existing data frame and adding or altering a variable in it.
Some things we want to do using mutate
:
Convert latitude and longitude (currently in degrees) to radians, to calculate “as-the-crow-flies” distances between points using the haversine formula. See the grouping section below for how we apply that formula.
Make the eventDate variable a date/time type object. See the bonus section below for more on dates an times. Otherwise you don’t need to worry yet about what we did, other than to recognize the basic syntax for mutate.
Some things we want to do using mutate_each
:
bison <- mutate(bison,
latitude_rad = pi * latitude / 180,
longitude_rad = pi * longitude / 180,
eventDate = parse_datetime(eventDate))
bison <- mutate_each(bison, funs(tolower), contains("name"))
select(bison, contains("tude"), eventDate)
## # A tibble: 65,499 × 5
## latitude longitude latitude_rad longitude_rad eventDate
## <dbl> <dbl> <dbl> <dbl> <dttm>
## 1 35.9283 -79.084 0.6270671 -1.380276 2012-12-29
## 2 35.9283 -79.084 0.6270671 -1.380276 2012-12-21
## 3 35.9283 -79.084 0.6270671 -1.380276 2012-12-30
## 4 35.9283 -79.084 0.6270671 -1.380276 2012-12-21
## 5 35.9283 -79.084 0.6270671 -1.380276 2012-12-29
## 6 35.9283 -79.084 0.6270671 -1.380276 2013-02-23
## 7 35.9283 -79.084 0.6270671 -1.380276 2013-02-23
## 8 35.9283 -79.084 0.6270671 -1.380276 2013-02-23
## 9 35.9283 -79.084 0.6270671 -1.380276 2013-02-23
## 10 35.9283 -79.084 0.6270671 -1.380276 2013-02-23
## # ... with 65,489 more rows
select(bison, contains("name"))
## # A tibble: 65,499 × 5
## providedScientificName scientificName
## <chr> <chr>
## 1 setophaga coronata setophaga coronata
## 2 sialia sialis (linnaeus, 1758) sialia sialis
## 3 thryothorus ludovicianus (latham, 1790) thryothorus ludovicianus
## 4 zenaida macroura (linnaeus, 1758) zenaida macroura
## 5 thryothorus ludovicianus (latham, 1790) thryothorus ludovicianus
## 6 sitta carolinensis latham, 1790 sitta carolinensis
## 7 pipilo erythrophthalmus (linnaeus, 1758) pipilo erythrophthalmus
## 8 setophaga coronata setophaga coronata
## 9 toxostoma rufum (linnaeus, 1758) toxostoma rufum
## 10 cardinalis cardinalis (linnaeus, 1758) cardinalis cardinalis
## # ... with 65,489 more rows, and 3 more variables:
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>
The syntax is exactly the same, except that to create a new variable use a new, unquoted variable name on the left-hand side of the equality statement.
This is a just a dummy example.
select(mutate(bison, double_year = year * 2), year, double_year)
## # A tibble: 65,499 × 2
## year double_year
## <int> <dbl>
## 1 2012 4024
## 2 2012 4024
## 3 2012 4024
## 4 2012 4024
## 5 2012 4024
## 6 2013 4026
## 7 2013 4026
## 8 2013 4026
## 9 2013 4026
## 10 2013 4026
## # ... with 65,489 more rows
As a quick fix to our missing scientific name problem, let’s use mutate
to create a new scientific name variable using the ITIS names when available and the alternate scientific name variable when not.
We are creating an alternate variable instead of over-writing the current one so we can investigate the scope of the problem later.
bison <- mutate(bison,
scientificName_complete = ifelse(ITISscientificName == "", scientificName, ITISscientificName))
filter(bison, scientificName_complete == "" | is.na(scientificName_complete))
## # A tibble: 0 × 15
## # ... with 15 variables: bisonID <dbl>, providedScientificName <chr>,
## # scientificName <chr>, ITISscientificName <chr>,
## # providedCommonName <chr>, ITIScommonName <chr>, providerID <int>,
## # institutionCode <chr>, latitude <dbl>, longitude <dbl>,
## # eventDate <dttm>, year <int>, latitude_rad <dbl>, longitude_rad <dbl>,
## # scientificName_complete <chr>
That’s all we’ll say for now.
Throughout the course and the rest of this lecture, you will see many more applications of mutate`
Northern hog sucker. Source: Wikipedia
The ‘pipe’ operation is a handy tool to make your code more legible: %>%
. Key points:
It takes the output of your previous operation and uses it as an input to your next operation.
You can determine where the previous argument goes with the period symbol, .
, which acts as a placeholder.
Understand how to use it by replacing the pipe operation with ‘then’ (in your mind, not in the code). For example, filter(data, ...) %>% select(...)
filters first then selects columns from the output of filter.
Often, you will be doing multiple things at a time to your data set and the intermediate steps don’t matter.
In R, for reasons we won’t talk about, you want to avoid creating lots of unnecessary objects in your environment. Using pipes help you do that.
Finally, pipes make your code more legible. For example
c(1, 2, 3, 4) ^ 2 %>%
sum %>%
sqrt %>%
log %>%
sin
## [1] 0.9915875
is more legible than
sin(log(sqrt(sum(c(1, 2, 3, 4) ^ 2))))
## [1] 0.9915875
and it only becomes worse when using functions with lots of arguments.
From now on, we will use pipes whenever convenient. Check out the textbook and experiment with pipes to learn their uses and limitations.
Yellow crownbeard. Source: virginiawildflowers.org
dplyr
function group_by
is a helper function for a number of other tools in the package. It essentially creates hidden links between observations in your dataset based on variables you determine.
The group identifications themselves are not variables but instead allow you to do operations constrained to each group separately in the mutate
, summarise
and filter
functions.
That’s abstract, so let’s move to some examples with mutate
.
group_by
You will always use group_by
first, before running mutate
, summarise
or filter
and their related functions.
Some common situations for when you will want to have grouping variables:
filter
: Filter for characteristics by subgroup, such as the best or worst observation by some measure within each group. That returns all observations meeting the criteria within their groups.You can also filter based on conditions related to the group as a whole, such as the number of observations in the group. That returns a dataset with all observations in a group meeting the criterion.
With mutate
: Create or alter variables in a way that depends on group-level characteristics, for example calculating the difference between a particular observation’s value and average for that observation’s group. That’s vague, so see the examples in the book and below.
With summarise
: Create group-level summaries, such as the total number of observations by species in the BISON dataset. The result is a data frame (actually a tibble) with the summary statistics for each group.
We will do ungrouped and grouped summaries all together in the section on summarise
.
A WARNING: Grouped operations for filter and mutate can become tricky, so you should be extra careful to be sure what you tried to do is in fact what you did.
So-called window functions work best for grouped operations with mutate
and filter
, and summary functions—such as sum
or mean
—are what you want for summarise
.
As this demo on window functions says,
A window function is a variation on an aggregation function. Where an aggregation function, like
sum()
andmean()
, takes n inputs and return a single value, a window function returns n values.
The output of a window function depends on all its input values, so window functions don’t include functions that work element-wise, like + or
round()
.
Some useful ‘window’ functions that are appropriate for the group_by
plus mutate
or filter
combination:
ranking: such as min_rank
and percent_rank and cume_dist
Cumulative: such as cumsum
Time series: such as lead
and lag
functions that offset the input vector by negative one and one positions, respectively. Be sure that whatever order your vector is in when using those actually has meaning, such as a time-dependent order.
Learn more about those functions and check out some examples at the window functions demo link above.
group_by
with filter
Let’s filter our dataset to include only the species that appear at least 200 times in the Orange County biodiversity dataset.
To do so, we’ll use the helper function n()
which just counts the number of observations within a group. It only works inside other dplyr
functions.
group_by(bison, scientificName_complete) %>%
filter(n() > 200)
## Source: local data frame [39,150 x 15]
## Groups: scientificName_complete [73]
##
## bisonID providedScientificName
## <dbl> <chr>
## 1 844239256 setophaga coronata
## 2 844267315 sialia sialis (linnaeus, 1758)
## 3 846018158 thryothorus ludovicianus (latham, 1790)
## 4 846916268 zenaida macroura (linnaeus, 1758)
## 5 846967932 thryothorus ludovicianus (latham, 1790)
## 6 940794719 sitta carolinensis latham, 1790
## 7 940794958 pipilo erythrophthalmus (linnaeus, 1758)
## 8 940795061 setophaga coronata
## 9 940795118 toxostoma rufum (linnaeus, 1758)
## 10 940795350 cardinalis cardinalis (linnaeus, 1758)
## # ... with 39,140 more rows, and 13 more variables: scientificName <chr>,
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## # latitude <dbl>, longitude <dbl>, eventDate <dttm>, year <int>,
## # latitude_rad <dbl>, longitude_rad <dbl>, scientificName_complete <chr>
You can combine the grouped filter with ungrouped filtering. But this is where you have to think carefully.
Here we’ll filter for the species seen at least 3 times in the entire dataset, but we will only want the observations of those species that appear in 2015.
group_by(bison, scientificName_complete) %>%
filter(n() > 3, year == 2015)
## Source: local data frame [125 x 15]
## Groups: scientificName_complete [90]
##
## bisonID providedScientificName
## <dbl> <chr>
## 1 1092883743 arisaema triphyllum (l.) schott
## 2 1092883702 onoclea sensibilis l.
## 3 1143531724 solidago pinetorum small
## 4 1088924502 glecoma hederacea l.
## 5 1088947111 fragaria virginiana mill.
## 6 1143531759 scutellaria integrifolia l.
## 7 1092895216 onoclea sensibilis l.
## 8 1134968123 goodyera pubescens (willd.) r.br.
## 9 1092905293 clematis viorna l.
## 10 1088924837 erythronium umbilicatum c.r.parks & hardin
## # ... with 115 more rows, and 13 more variables: scientificName <chr>,
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## # latitude <dbl>, longitude <dbl>, eventDate <dttm>, year <int>,
## # latitude_rad <dbl>, longitude_rad <dbl>, scientificName_complete <chr>
Here we’ll filter for the species seen at least 3 times in 2015. To do so we need to group by year as well as scientific name.
group_by(bison, scientificName_complete, year) %>%
filter(n() > 3, year == 2015)
## Source: local data frame [8 x 15]
## Groups: scientificName_complete, year [2]
##
## bisonID providedScientificName
## <dbl> <chr>
## 1 1088951040 claytonia virginica l.
## 2 1088924555 claytonia virginica l.
## 3 1088924584 claytonia virginica l.
## 4 1088926416 claytonia virginica l.
## 5 1088924185 asplenium platyneuron (l.) britton, sterns & poggenb.
## 6 1088924277 asplenium platyneuron (l.) britton, sterns & poggenb.
## 7 1088898166 asplenium platyneuron (l.) britton, sterns & poggenb.
## 8 1088957895 asplenium platyneuron (l.) britton, sterns & poggenb.
## # ... with 13 more variables: scientificName <chr>,
## # ITISscientificName <chr>, providedCommonName <chr>,
## # ITIScommonName <chr>, providerID <int>, institutionCode <chr>,
## # latitude <dbl>, longitude <dbl>, eventDate <dttm>, year <int>,
## # latitude_rad <dbl>, longitude_rad <dbl>, scientificName_complete <chr>
Notice that in each case we get back all observations in the groups meeting the criteria.
group_by
with mutate
Ebony spleenwort, one of only two species seen at least three times in 2015 in the Orange County BISON data. Source: Stephen F. Austin State University
We’ll see a few examples to give you a flavor for what you can do.
Unfortunately, this dataset isn’t great for demonstrating grouped mutate functions. See the textbook or the window function vignette above for more examples.
Let’s give it a shot though, with a contrived example. A more interesting and more advanced application of this kind of thing is in the bonus section on as-the-crow-flies distances.
We will:
group by organization providing a species sighting using provider ID and by year.
calculate the time difference in weeks between the first and final observations of the year for that organization.
Notice that you can also use pipes within mutate
.
group_by(bison, providerID, year) %>%
mutate(d_time = as.numeric(max(eventDate) - min(eventDate), "weeks")) %>%
select(year, providerID, d_time)
## Source: local data frame [65,499 x 3]
## Groups: providerID, year [520]
##
## year providerID d_time
## <int> <int> <dbl>
## 1 2012 602 52.14286
## 2 2012 602 52.14286
## 3 2012 602 52.14286
## 4 2012 602 52.14286
## 5 2012 602 52.14286
## 6 2013 602 52.00000
## 7 2013 602 52.00000
## 8 2013 602 52.00000
## 9 2013 602 52.00000
## 10 2013 602 52.00000
## # ... with 65,489 more rows
Now do something similar but for species by year, but also computing the sightings per week for each species, each year, for the span over which they were observed.
Notice that mutate
allows you to use newly created variables in subsequent statements. That is common throughout the dplyr
functions.
group_by(bison, scientificName_complete, year) %>%
mutate(d_time = as.numeric(max(eventDate) - min(eventDate), "weeks"),
avg_obs = n() / d_time) %>%
select(year, scientificName_complete, ITIScommonName, d_time, avg_obs)
## Source: local data frame [65,499 x 5]
## Groups: scientificName_complete, year [9,159]
##
## year scientificName_complete
## <int> <chr>
## 1 2012 setophaga coronata
## 2 2012 sialia sialis
## 3 2012 thryothorus ludovicianus
## 4 2012 zenaida macroura
## 5 2012 thryothorus ludovicianus
## 6 2013 sitta carolinensis
## 7 2013 pipilo erythrophthalmus
## 8 2013 setophaga coronata
## 9 2013 toxostoma rufum
## 10 2013 cardinalis cardinalis
## # ... with 65,489 more rows, and 3 more variables: ITIScommonName <chr>,
## # d_time <dbl>, avg_obs <dbl>
Lastly, we will calculate the time between observations for each species in a year using the lag function. Remember that first we have to arrange the data in increasing time order for this to make sense.
Using arrange
before vs. after the grouping statement will make a difference.
Arranging before the grouping puts all observations in time order. Arranging after grouping puts observations in time order within their groups.
For example, say we have observations for a Carolina wren on Dec. 12, 2013 and Jan. 2, 2014, and for a bluebird on Dec. 20, 2013 and Dec. 31 2013.
Arranging before grouping will give this order: wren Dec. 12, bluebird Dec. 20, bluebird Dec. 31, and wren Jan. 2
Arranging after grouping gives this order: (wren Dec. 12, wren Jan. 2) and (bluebird Dec. 20, bluebird Dec. 31)
On the data:
group_by(bison, scientificName_complete) %>%
mutate(d_time = as.numeric(eventDate - lag(eventDate), "weeks")) %>%
select(scientificName_complete, d_time) %>%
arrange(desc(d_time))
## Source: local data frame [65,499 x 2]
## Groups: scientificName_complete [3,161]
##
## scientificName_complete d_time
## <chr> <dbl>
## 1 notropis procne 14933.143
## 2 lepomis gibbosus 14171.571
## 3 notemigonus crysoleucas 13550.857
## 4 cyprinella analostana 13550.857
## 5 lepomis auritus 13072.571
## 6 notropis scepticus 13005.000
## 7 carya glabra 4127.429
## 8 spizella passerina 3759.429
## 9 helmitheros vermivorum 3622.286
## 10 dichanthelium acuminatum 3555.429
## # ... with 65,489 more rows
We will get NA values if there are no previous observations.
Swallowtail shiner. More than 250 years passed between observations of it in the BISON dataset. Does that make sense? Source: USGS
You often have to check that you did the somewhat complicated grouped mutation operations correctly. Let’s do that using some of the tools we learned.
First, let’s check that crazy number in the final step above. The swallow-tailed shiner apparently had almost 300 years between observations in our data. Is that true?
filter(bison, scientificName_complete == "notropis procne")$eventDate %>% sort
## [1] "1700-01-01 UTC" "1955-04-21 UTC" "1961-08-02 UTC" "1961-08-02 UTC"
## [5] "1961-08-02 UTC" "1961-08-02 UTC" "1961-08-16 UTC" "1961-08-16 UTC"
## [9] "1962-09-25 UTC" "1972-06-12 UTC" "1977-10-21 UTC" "1984-02-18 UTC"
## [13] "1986-01-25 UTC" "1986-03-15 UTC" "1987-06-28 UTC" "1987-06-28 UTC"
## [17] "1992-03-16 UTC" "1992-03-16 UTC" "1998-05-18 UTC" "1998-08-03 UTC"
## [21] "1998-08-03 UTC" "2003-06-12 UTC" "2003-06-12 UTC" "2003-06-12 UTC"
## [25] "2003-06-12 UTC" "2003-06-12 UTC" "2003-09-30 UTC" "2003-09-30 UTC"
## [29] "2003-09-30 UTC" "2003-09-30 UTC" "2005-04-14 UTC" "2005-04-14 UTC"
## [33] "2005-04-14 UTC" "2005-04-15 UTC" "2005-04-15 UTC"
Yes, that makes sense. There’s a single observation in 1700, and the next one is more than 250 years later.
It would be worth checking with the BISON documentation to see where they’d even get a record from Jan. 1, 1700.
Now we will check calculations on time difference between observations submitted by provider.
test <- filter(bison, providerID == 291, year == 1984)
select(test, providerID, institutionCode, eventDate)
## # A tibble: 47 × 3
## providerID institutionCode eventDate
## <int> <chr> <dttm>
## 1 291 North Carolina State Museum of Natural Sciences 1984-08-15
## 2 291 North Carolina State Museum of Natural Sciences 1984-08-15
## 3 291 North Carolina State Museum of Natural Sciences 1984-08-15
## 4 291 North Carolina State Museum of Natural Sciences 1984-08-15
## 5 291 North Carolina State Museum of Natural Sciences 1984-07-09
## 6 291 North Carolina State Museum of Natural Sciences 1984-07-09
## 7 291 North Carolina State Museum of Natural Sciences 1984-07-09
## 8 291 North Carolina State Museum of Natural Sciences 1984-07-09
## 9 291 North Carolina State Museum of Natural Sciences 1984-07-09
## 10 291 North Carolina State Museum of Natural Sciences 1984-07-09
## # ... with 37 more rows
max(test$eventDate)
## [1] "1984-12-30 UTC"
min(test$eventDate)
## [1] "1984-02-18 UTC"
# Note we don't need group_by since we've filtered to a single group
mutate(test, d_time = as.numeric(max(eventDate) - min(eventDate), "weeks")) %>%
select(year, providerID, d_time)
## # A tibble: 47 × 3
## year providerID d_time
## <int> <int> <dbl>
## 1 1984 291 45.14286
## 2 1984 291 45.14286
## 3 1984 291 45.14286
## 4 1984 291 45.14286
## 5 1984 291 45.14286
## 6 1984 291 45.14286
## 7 1984 291 45.14286
## 8 1984 291 45.14286
## 9 1984 291 45.14286
## 10 1984 291 45.14286
## # ... with 37 more rows
which looks correct.
White-tinged sedge. Source: Missouri Botanical Gardens.
Some minor notes:
grouped operations like those above return data frames (tibbles) with group information associated with them
Remove group labels using ungroup(data frame)
. You cannot alter grouping variables in mutate and will get an error message. Other functions sometimes disallow operations on grouping variables.
return a list of grouping variables from a grouped data frame using groups(data frame)
use group_size
to return a vector, the same length as the number of observations, giving the group size of the group associated with that observation
group_indices
gives a vector with the group index associated to a particular observation
Whitebanded crab spider. Source: carolinanature.com
summarise
and summarise_each
Summary functions are for creating summary information, of course.
In other words use summarise
to return a single summary statistic for a dataset or for subgroups of a dataset (using group_by
first)
Use summarise_each
if you want a series of the same summary statistics for different variables.
Finally, we can answer some basic questions about our dataset using summarise
and some of the functions you saw above.
How many observations are there in the data per species?
group_by(bison, scientificName_complete) %>%
summarise(n())
## # A tibble: 3,161 × 2
## scientificName_complete `n()`
## <chr> <int>
## 1 ablabesmyia 6
## 2 abutilon theophrasti 1
## 3 acalypha 1
## 4 acalypha gracilens 1
## 5 acalypha ostryifolia 1
## 6 acalypha rhomboidea 1
## 7 acalypha virginica 4
## 8 acanthodrilidae 1
## 9 acari 4
## 10 accipiter cooperii 58
## # ... with 3,151 more rows
What are the top-10 most observed species across all years?
Notice we can rename the variables to be something more legible in the output table.
group_by(bison, scientificName_complete) %>%
summarise(N_species = n()) %>%
arrange(desc(N_species))
## # A tibble: 3,161 × 2
## scientificName_complete N_species
## <chr> <int>
## 1 cardinalis cardinalis 1654
## 2 thryothorus ludovicianus 1521
## 3 poecile carolinensis 1478
## 4 baeolophus bicolor 1459
## 5 pinus taeda 1286
## 6 melanerpes carolinus 1208
## 7 corvus brachyrhynchos 1186
## 8 liquidambar styraciflua 1133
## 9 acer rubrum 995
## 10 zenaida macroura 946
## # ... with 3,151 more rows
Northern cardinal female, the most frequently observed species in our dataset. Source: Wikipedia.
We can combine the grouped mutate and summarize operations to learn average number of weeks between observations per species.
group_by(bison, scientificName_complete) %>%
mutate(d_time = as.numeric(eventDate - lag(eventDate), "weeks")) %>%
summarise(avg_d_time = mean(d_time, na.rm = TRUE)) %>%
arrange(desc(avg_d_time))
## # A tibble: 3,161 × 2
## scientificName_complete avg_d_time
## <chr> <dbl>
## 1 rhizopogon rubescens 3285.714
## 2 ancyloxypha numitor 3143.101
## 3 amanita excelsa 2658.286
## 4 penstemon digitalis 2388.714
## 5 clasmatodon parvulus 2145.714
## 6 marmota monax 1980.857
## 7 luzula acuminata carolinae 1722.143
## 8 lasmigona subviridis 1695.286
## 9 carex muehlenbergii muehlenbergii 1673.714
## 10 glyceria striata 1626.429
## # ... with 3,151 more rows
Rhizopogon rubescens has the largest inter-observation time of any species in the dataset. Source: Wikipedia
The next example shows you can
use computed values as grouping variables
change groupings between operations
assign group names within group_by
group_by(bison, scientificName_complete) %>%
mutate(d_time = as.numeric(eventDate - lag(eventDate), "weeks")) %>%
ungroup %>%
group_by(long_btw_obs = d_time > 52) %>%
summarise(N_obs = n())
## # A tibble: 3 × 2
## long_btw_obs N_obs
## <lgl> <int>
## 1 FALSE 48956
## 2 TRUE 11694
## 3 NA 4849
The table shows that by far most observations were within a year of the previous observation for that same species.
NA values are likely from those first observations for each species, though we should check there are no others.
summarise_each
has essentially the same syntax as mutate_each
and represents the same concept, just for summarization.
The silly example below returns the maximum length between observations as well as the maximum providerID number for each subspecies.
group_by(bison, scientificName_complete) %>%
mutate(d_time = as.numeric(eventDate - lag(eventDate), "weeks")) %>%
summarise_each(funs(max = max(., na.rm = TRUE)), d_time, providerID)
## # A tibble: 3,161 × 3
## scientificName_complete d_time_max providerID_max
## <chr> <dbl> <int>
## 1 ablabesmyia 0.5714286 440
## 2 abutilon theophrasti NA 440
## 3 acalypha NA 440
## 4 acalypha gracilens NA 440
## 5 acalypha ostryifolia NA 407
## 6 acalypha rhomboidea NA 440
## 7 acalypha virginica 104.4285714 440
## 8 acanthodrilidae NA 220
## 9 acari 0.5714286 440
## 10 accipiter cooperii 579.2857143 602
## # ... with 3,151 more rows
The only new trick there was that I defined a new function within the funs helper, specifying I wanted to remove NA values and using the dot as a placeholder for where my variable should go.
We still got some NA values for species with only one observation.
Grouped summaries can be an important tool for assessing data quality. Once you have spotted something weird, you can ask yourself:
are the errors concentrated?
if so, with which observations?
Doing so can help you think about patterns, which in turn lead you to possible reasons and fixes for the data quality problems.
First, we can now look more closely at the missing ITIS scientific names problem.
group_by(bison, providerID) %>%
summarise(N_errors = sum(ITISscientificName == "")) %>%
arrange(desc(N_errors))
## # A tibble: 64 × 2
## providerID N_errors
## <int> <int>
## 1 440 1271
## 2 214 472
## 3 136 130
## 4 167 53
## 5 407 33
## 6 120 32
## 7 291 19
## 8 220 14
## 9 602 10
## 10 111 4
## # ... with 54 more rows
filter(bison, providerID == 440) %>% select(institutionCode)
## # A tibble: 30,563 × 1
## institutionCode
## <chr>
## 1 BISON
## 2 BISON
## 3 BISON
## 4 BISON
## 5 BISON
## 6 BISON
## 7 BISON
## 8 BISON
## 9 BISON
## 10 BISON
## # ... with 30,553 more rows
You could then refer to the dataset documentation, you would hope, to understand why observations from observer BISON have so many missing scientific names. You could also try grouping by year as well to further isolate the problem.
Fish crow. Source: Audobon
A question we might be interested for this dataset is: What is the maximum distance between two observations of the same species within Orange County?
Distances between points are not yet a variable. But we have coordinates for where species were sighted.
We can approximate those distances using the haversine formula.
haversine <- function(lat1, lat2, long1, long2, in_miles = TRUE){
message("Returns great-circle distance between two points on earth in km or miles. Latitude and longitude should be in radians.")
h <- sin(.5 * (lat2 - lat1)) ^ 2 + cos(lat1) * cos(lat2) * (sin(.5 * (long2 - long1)) ^ 2)
# earth diameter taken from http://imagine.gsfc.nasa.gov/features/cosmic/earth_info.html
r <- 12756 / 2
if (in_miles)
r <- r * 0.621371
return(2 * r * asin(sqrt(h)))
}
Don’t worry about the details of this, or read about them on Wikipedia.
The haversine formula is an old nautical tool to calculate distances along the surface of a sphere—a good approximation to distances along the surface of the almost-spherical earth.
Let’s test it out. We’ll compare the approximation to the NASA terrestrial planet mileage calculator, an exact calculation.
Coordinates of Hanes building at UNC-Chapel Hill are Lat: 35.910810, Long: -79.051205 , and rough coordinates at Nuuk, capital of Greenland are Lat: 64.182179, Long: -51.688468.
haversine(lat1 = pi * 35.910810 / 180, lat2 = pi * 64.182179 / 180, long1 = pi * -79.051205 / 180, long2 = pi * -51.688468 / 180)
## Returns great-circle distance between two points on earth in km or miles. Latitude and longitude should be in radians.
## [1] 2263.758
Answer from NASA: 2274.5 miles.
The haversine formula gives a very good approximation to NASA’s figure, particularly since it is over such a long distance.
All of our points are in Orange County, so the error won’t even matter. For example, here’s the haversine distance estimate the Hanes building to Duke Chapel.
haversine(lat1 = pi * 35.910810 / 180, lat2 = pi * 36.001756 / 180, long1 = pi * -79.051205 / 180, long2 = pi * -78.939975 / 180)
## [1] 8.851984
Answer from NASA: 8.9 miles.
NASA’s tool rounds to the first decimal, so the answer are likely much closer.
Why bother with this bonus section?
Geo-location information is some of the most common and interesting data you will come across. Now, just with latitude and longitude, you can extract as-the-crow flies distances!