This lecture covers tidy data. The main reference is

The focus of this lecture is how to reshape datasets to be ready for analysis using functions from the tidyr package package - separate - unite - gather - spread

The Museum of Modern Art (MoMA) publicly releases data about its entire collection (see MoMA collection). This lecture uses a couple data files from their collection.

Some initial data cleaning

We first do some initial data cleaning to make life easier

library(tidyverse)
library(stringr) # this should come with tidyverse, but code wont run without import

# load data
artists <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/moma_artists_jan2017.csv')
art <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/moma_art_jan2017.csv')

# Annoying weird issue with first column name
colnames(art) <- c('title', tolower(colnames(art))[-1])
colnames(artists) <- c('id', tolower(colnames(artists))[-1])

# To give us something to do, removing artist information leaving only the artist uniqueid (constituentid) for joining
art <- select(art, -artist, -artistbio, -nationality, -begindate, -enddate, -gender)

# Removing artwork with NA constituent ID. Lots of NA values in their entries, and not worth finding out what's up for this demonstration.
# There are
art <- filter(art, !is.na(constituentid))

art <- mutate(art, 
classification = str_replace_all(classification, "[[:punct:]]", ""),
classification = str_replace_all(classification, "\\s", "_"),
constituentid = str_trim(constituentid, side = "both"),
objectid = str_trim(objectid, side = "both"))

artists <- mutate(artists, 
                  id = str_trim(id, side = "both"))

# For some reason, certain artworks list the same constituentid twice. Not worth explaining in the lecture. 

# We correct this in the lecture, but you could correct it here with the code below

# art$constituentid <- str_split(art$constituentid, ",") %>% lapply(., function(x){unique(x) %>% paste(., collapse = ",")}) %>% unlist

These notes instead will walk through how to apply those concepts on our sample dataset for the lecture. The goal is to demonstrate how you might apply tidy and relational data concepts, and to show some issues you could run into.

If you do something cool with this dataset let the instructor know. The MoMA is interested in showing off what people do with it Show in New WindowClear OutputExpand/Collapse Output [1] “27” “8102, 6703, 4312” “8102, 6703, 4312” “8102, 4312, 6703” “8102, 6703, 4312” [6] “2702” “2702” “8219” “2702” “8219”
[11] “8101”
Show in New WindowClear OutputExpand/Collapse Output Show Traceback Error in eval(expr, envir, enclos) : object ‘constituentid’ not found Show in New WindowClear OutputExpand/Collapse Output Too many values at 31 locations: 1, 2, 3, 4, 5, 6, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, … R Console

1
2
3
4
5
6
7
8
9
10
Next12345Previous 1-10 of 41 rows | 1-1 of 24 columns tbl_df 41 x 24

1
2
3
4
5
6
7
8
9
10
Next12345Previous 1-10 of 41 rows | 1-1 of 24 columns Show in New WindowClear OutputExpand/Collapse Output [1] 30 Show in New WindowClear OutputExpand/Collapse Output

artistid1 artistid2 artistid3 artistid4 artistid5 1 6956 7559 7558 6957 NA 2 8213 6132 8214 NA NA 3 8213 6132 8214 8216 NA 4 8215 6132 8214 8216 NA 5 8215 6132 8214 8216 NA 6 8215 6132 8214 8216 NA 7 6076 8112 NA NA NA 8 6934 8103 NA NA NA 9 6934 8103 NA NA NA 10 8218 22884 NA NA NA Next12345Previous 1-10 of 41 rows Show in New WindowClear OutputExpand/Collapse Output Show Traceback Error in 1:n : result would be too long a vectors data!

Dataset

Wangechi Mutu, Untitled from Eve, 2006 MoMA

Wangechi Mutu, Untitled from Eve, 2006 MoMA

The Museum of Modern Art in New York published datasets of the art and artists in its collection on GitHub.

  • the art dataset has almost 130,000 records, one for each work and includes information about the piece title, the artists responsible, its physical properties and a classification into art types, such as painting.

  • the artist dataset has one record per artist, with basic biographical information.

Datasets are updated regularly: The ones we will use are from the first week of July, 2016. See the link to MoMA’s GitHub page above for the museum’s request for how to use the data.

What do we want to know?

Any project starts with a series of preliminary questions, things we want to know about the data.

Ones we will focus on are listed below. You could think of a number of others.

  • Who is the most prevalent artist in the collection in terms of number of works?
  • What is the nationality distribution i.e. artist nationality of works in the collection?
  • Which artist has the most diverse body of work in terms of classifications? For example, we want to know which artist has the most number of works from different media, such as painting or photography.
  • How do the classifications of pieces vary? Is the MoMA mainly a painting museum or a sculpture museum or what?
  • Which years of art are the most represented?

We will focus on the first three questions in this lecture since those will require us to do the most joining and reshaping.

Tidy data

‘Tidy’ is Hadley Wikham’s shorthand for putting data into the best structure for analysis in R. There are three requirements for a tidy dataset:

  • Every observation has its own row.
  • Every variable has its own column.
  • Every cell has a single value.

This is a tidy dataset, from R for Data Science

Read the tidy data chapter in R for Data Science. It has many more helpful pictures and some example data to clarify what tidy data is and isn’t. These notes will assume you have read the tidy section.

We will focus on

  • interpreting the concept of tidy data relative to your analysis goal
  • applying some of the tools to the MoMA data which brings up a few challenges.

What will we need the data to look like?

We need information from both the art and artist datasets, and the structures of the dataset we need will depend on the question we ask. Exactly what a tidy dataset looks like depends on what you will consider observations and values in your analysis.

Often it is obvious what you should do, but our situation shows you can need datasets with different structures to answer different questions:

Questions one through three we will be able to answer with a dataset that has one observation per artist per work. If two artists collaborate, we’ll need to count the piece once for each.

Questions four and five need one observation per piece.

A first look at the MoMA

Paul Signac, Opus 217. Against the Enamel of a Background Rhythmic with Beats and Angles, Tones, and Tints, Portrait of M. Félix Fénéon in 1890, 1890 MoMA

Paul Signac, Opus 217. Against the Enamel of a Background Rhythmic with Beats and Angles, Tones, and Tints, Portrait of M. Félix Fénéon in 1890, 1890 MoMA

We have already read in the data – not directly from the MoMA site but from a slightly cleaned up version.

str(art)
## Classes 'tbl_df', 'tbl' and 'data.frame':    128743 obs. of  23 variables:
##  $ title             : chr  "Ferdinandsbrücke Project, Vienna, Austria, Elevation, preliminary version" "City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard" "Villa near Vienna Project, Outside Vienna, Austria, Elevation" "The Manhattan Transcripts Project, New York, New York, Introductory panel to Episode 1: The Park" ...
##  $ constituentid     : chr  "6210" "7470" "7605" "7056" ...
##  $ date              : chr  "1896" "1987" "1903" "1980" ...
##  $ medium            : chr  "Ink and cut-and-pasted painted pages on paper" "Paint and colored pencil on print" "Graphite, pen, color pencil, ink, and gouache on tracing paper" "Photographic reproduction with colored synthetic laminate" ...
##  $ dimensions        : chr  "19 1/8 x 66 1/2\" (48.6 x 168.9 cm)" "16 x 11 3/4\" (40.6 x 29.8 cm)" "13 1/2 x 12 1/2\" (34.3 x 31.8 cm)" "20 x 20\" (50.8 x 50.8 cm)" ...
##  $ creditline        : chr  "Fractional and promised gift of Jo Carole and Ronald S. Lauder" "Gift of the architect in honor of Lily Auchincloss" "Gift of Jo Carole and Ronald S. Lauder" "Purchase and partial gift of the architect in honor of Lily Auchincloss" ...
##  $ accessionnumber   : chr  "885.1996" "1.1995" "1.1997" "2.1995" ...
##  $ classification    : chr  "Architecture" "Architecture" "Architecture" "Architecture" ...
##  $ department        : chr  "Architecture & Design" "Architecture & Design" "Architecture & Design" "Architecture & Design" ...
##  $ dateacquired      : Date, format: "1996-04-09" "1995-01-17" ...
##  $ cataloged         : chr  "Y" "Y" "Y" "Y" ...
##  $ objectid          : chr  "2" "3" "4" "5" ...
##  $ url               : chr  "http://www.moma.org/collection/works/2" "http://www.moma.org/collection/works/3" "http://www.moma.org/collection/works/4" "http://www.moma.org/collection/works/5" ...
##  $ thumbnailurl      : chr  "http://www.moma.org/media/W1siZiIsIjU5NDA1Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=137b8455b1ec6167" "http://www.moma.org/media/W1siZiIsIjk3Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=55b65fa4368fe00a" "http://www.moma.org/media/W1siZiIsIjk4Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=fdcfca4db3acac1f" "http://www.moma.org/media/W1siZiIsIjEyNCJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=c89b9071486760a5" ...
##  $ circumference (cm): chr  NA NA NA NA ...
##  $ depth (cm)        : chr  NA NA NA NA ...
##  $ diameter (cm)     : chr  NA NA NA NA ...
##  $ height (cm)       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ length (cm)       : chr  NA NA NA NA ...
##  $ weight (kg)       : chr  NA NA NA NA ...
##  $ width (cm)        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ seat height (cm)  : chr  NA NA NA NA ...
##  $ duration (sec.)   : chr  NA NA NA NA ...

Some obstacles

The art dataset is not tidy enough for our purposes. Some pieces have more than one artist listed in the constituentid field—which should be a unique artist identifier. An example is objectid 145126, a live performance by Ben Vautier and `various artists.’ Others have specific artists listed in collaborations.

We will have to give in and make a caveat already: We can’t know who ‘various artists’ refers to, so we will have to handle that somehow in our analysis, and there are not many good options. For example, once we split the records by artist, we can filter the ‘various artists’ pieces using the constituentid 24409.

Come up with a plan

With a set of questions we want to answer, we can form a plan to make a tidy dataset that has all the information we need to answer the first three questions above. Our plan is to

  • separate the unique artist identifier, constituentid, into different columns, so we have one value per cell and not multiple.
  • gather the new artist id columns we just created into a single column, removing the individual artist id columns we created before and leaving two new columns: One marking which number collaborator the artist was on the project, say the second artist, and the other giving that artist’s unique identifier. That is what we will see is called a key-value pair—which the textbook chapter talks about in detail.

That gives us one record per artist per piece. In the next section, on relational data, we will

  • left_join the artist data to the art data using the unique artist identifier.

That leaves us with a tidy dataset tying artist biographical information to the art pieces they worked on—as we need it to be.

Separating variables to get one value per cell

Some of the constituentid values contain multiple values violating one entry per cell

art$constituentid[110:120]
##  [1] "27"               "8102, 6703, 4312" "8102, 6703, 4312"
##  [4] "8102, 4312, 6703" "8102, 6703, 4312" "2702"            
##  [7] "2702"             "8219"             "2702"            
## [10] "8219"             "8101"

The separate function in tidyr is built to fix this. It will take a data frame, a column name, a character on which to split each cell, and a character vector of new column names.

You might have guessed something funny was up with the constituentid field. The str command above showed the variable appeared to be numeric but in fact was a character.

Different ids within cells seem separated by a comma and a space. We will use those to split the variable into first and second artist columns. If there are more than two artists, we will repeat the process.

With a dataset of any reasonable size, it can sometimes be helpful to do your reshaping on small subset including the problems you’re trying to fix. That way you can assure yourself it worked not just with spot checks or summarisation but with a visual check. Ultimately, you will want a more systematic check that everything is as it should be, which you can apply to the whole dataset.

Let’s try separating on a subset first. Note the grepl tells you if a string matches a certain pattern.

Warning: these notes use some string manipulation commands called regular expressions which we will cover later. If you want to read ahead checkout using regular expressions in R.

# get a small subset of the data to play with
# restrict ourselves to rows whose constituentid contains a comma
test <- filter(art, grepl(',', constituentid)) %>% 
            slice(10:50)

Let’s take a first crack at the problem

# first try to fix the problem
separate(test, col = constituentid, into = c('artist1', 'artist2'), sep = ",")
## Warning: Expected 2 pieces. Additional pieces discarded in 31 rows [1, 2,
## 3, 4, 5, 6, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, ...].
## # A tibble: 41 x 24
##    title artist1 artist2 date  medium dimensions creditline accessionnumber
##    <chr> <chr>   <chr>   <chr> <chr>  <chr>      <chr>      <chr>          
##  1 Egg … 6956    " 7559" 1975  Ink a… "44 1/2 x… Gift of K… 110.1992       
##  2 City… 8213    " 6132" 1965  Graph… "20 1/2 x… Gift of V… 120.1988       
##  3 Hous… 8213    " 6132" 1978  Felt-… "12 x 23 … Gift of V… 121.1988       
##  4 Gord… 8215    " 6132" 1981  Felt-… "12 x 15 … Gift of V… 122.1988       
##  5 Gord… 8215    " 6132" 1981  Graph… "12 x 13 … Gift of V… 123.1988       
##  6 Lewi… 8215    " 6132" 1983  Felt-… 12 x 28 (… Gift of V… 124.1988       
##  7 Cont… 6076    " 8112" 1923  Gouac… "22 1/2 x… Gift of E… 149.1947       
##  8 Ibir… 6934    " 8103" 1953  Gouac… "39 1/2  … Gift of R… 157.1991       
##  9 Ibir… 6934    " 8103" 1953  Gouac… "39 x 59 … Gift of R… 158.1991       
## 10 Sixt… 8218    " 2288… 1987  Serig… "40 x 30\… Given ano… 272.1999.1     
## # ... with 31 more rows, and 16 more variables: classification <chr>,
## #   department <chr>, dateacquired <date>, cataloged <chr>,
## #   objectid <chr>, url <chr>, thumbnailurl <chr>, `circumference
## #   (cm)` <chr>, `depth (cm)` <chr>, `diameter (cm)` <chr>, `height
## #   (cm)` <dbl>, `length (cm)` <chr>, `weight (kg)` <chr>, `width
## #   (cm)` <dbl>, `seat height (cm)` <chr>, `duration (sec.)` <chr>

Oops. We told separate to split the cells every time there was a column, but we gave it only two destination columns. Works with more artists than that saw their extra values dropped by default.

We have two options: If we know the maximum number of columns we will need, we can set that to be the number of columns we separate into, filling un-needed columns with NA. Alternatively, we could do our splitting column by column.

The second option is more cumbersome, so let’s figure out how many columns we need. Here’s one way to do it, using the stringr package for manipulating character vectors. str_replace_all here removes all occurrences of non-commas, nchar tells us how many characters are in each string, and max returns the maximum number.

str_replace_all(art$constituentid, "[^,]", "") %>% nchar %>% max
## [1] 30

This approach comes at a cost: We will have to create 30 more columns for a total of 31 possible artist ids—since the number of commas plus one is the number of artists. But that’s not exactly even what we want. We want one row per artist per piece, which we will do later.

For now, let’s run separate on our test dataset, using only the maximum number of columns we need there. We also specify that we want separate to fill in NA values for missing values on the right.

# how many columns to add
n <- str_replace_all(test$constituentid, "[^,]", "") %>% nchar %>% max + 1

# maybe this will work...
test <- separate(test, col = constituentid, into = paste0("artistid", c(1:n)), sep = ",", fill = "right")

# print out new artistid columns
select(test, contains("artistid"))
## # A tibble: 41 x 5
##    artistid1 artistid2 artistid3 artistid4 artistid5
##    <chr>     <chr>     <chr>     <chr>     <chr>    
##  1 6956      " 7559"   " 7558"   " 6957"   <NA>     
##  2 8213      " 6132"   " 8214"   <NA>      <NA>     
##  3 8213      " 6132"   " 8214"   " 8216"   <NA>     
##  4 8215      " 6132"   " 8214"   " 8216"   <NA>     
##  5 8215      " 6132"   " 8214"   " 8216"   <NA>     
##  6 8215      " 6132"   " 8214"   " 8216"   <NA>     
##  7 6076      " 8112"   <NA>      <NA>      <NA>     
##  8 6934      " 8103"   <NA>      <NA>      <NA>     
##  9 6934      " 8103"   <NA>      <NA>      <NA>     
## 10 8218      " 22884"  <NA>      <NA>      <NA>     
## # ... with 31 more rows

Once you’ve looked at the test data and come up with some good automatic checks if needed to be sure it worked, you can run it on the full data and re-run those checks.

# apply the solution to the full data set
n <- str_replace_all(art$constituentid, "[^,]", "") %>% nchar %>% max + 1
art <- separate(art, col = constituentid, into = paste0("artistid", c(1:n)), sep = ",", fill = "right")

Gathering and spreading: One observation per row, one variable per column

art now has 30 extra columns

select(art, contains("artistid"))
## # A tibble: 128,743 x 31
##    artistid1 artistid2 artistid3 artistid4 artistid5 artistid6 artistid7
##    <chr>     <chr>     <chr>     <chr>     <chr>     <chr>     <chr>    
##  1 6210      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  2 7470      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  3 7605      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  4 7056      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  5 7605      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  6 7056      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  7 7056      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  8 7056      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
##  9 7056      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
## 10 7056      <NA>      <NA>      <NA>      <NA>      <NA>      <NA>     
## # ... with 128,733 more rows, and 24 more variables: artistid8 <chr>,
## #   artistid9 <chr>, artistid10 <chr>, artistid11 <chr>, artistid12 <chr>,
## #   artistid13 <chr>, artistid14 <chr>, artistid15 <chr>,
## #   artistid16 <chr>, artistid17 <chr>, artistid18 <chr>,
## #   artistid19 <chr>, artistid20 <chr>, artistid21 <chr>,
## #   artistid22 <chr>, artistid23 <chr>, artistid24 <chr>,
## #   artistid25 <chr>, artistid26 <chr>, artistid27 <chr>,
## #   artistid28 <chr>, artistid29 <chr>, artistid30 <chr>, artistid31 <chr>

Our data frame has one value per cell, but it still only has one observation per piece. That’s not our unit of analysis for this project. We want one observation per piece per artist.

We need to gather those artist id columns into key-value pairs. As the R for Data Science chapter puts it, gathering “collects a set of column names and places them into a single “key” column”. It also collects the field of cells associated with those columns and places them into a single value column.’

gather(`1999`, `2000`, key = "year", value = "cases")
Gathering columns into key-value pairs, R for Data Science

Gathering columns into key-value pairs, R for Data Science

gather in the tidyr package gives an easy way to deal with that. See the textbook for details, but the image above tells you what’s going on.

We are going to gather the artistid columns into two new columns: artistnum (key) and id (value). The resulting data frame no longer has the artistid column, but has two new columns (artistnum and id)

gather(test, key = 'artistnum', value = 'id', contains("artistid")) %>% select(objectid, artistnum, id)
## # A tibble: 205 x 3
##    objectid artistnum id   
##    <chr>    <chr>     <chr>
##  1 204      artistid1 6956 
##  2 210      artistid1 8213 
##  3 211      artistid1 8213 
##  4 212      artistid1 8215 
##  5 213      artistid1 8215 
##  6 215      artistid1 8215 
##  7 232      artistid1 6076 
##  8 252      artistid1 6934 
##  9 253      artistid1 6934 
## 10 296      artistid1 8218 
## # ... with 195 more rows

We now have one record per artist per piece—except we also have a number of records for pieces with no artist past a certain number. For example, objectid 448, ‘Video Clip Folly Project,’ has only two collaborators, yet it has five rows in the new gathered dataset.

gather gives an easy way to deal with that, using the na.rm = TRUE option. We’ll go ahead and run this on our full dataset.

# gather artists but kill NAs
art <- gather(art, key = artistnum, value = constituentid, contains("artistid"), na.rm = TRUE)

Inverse of separate and gather

spread does the opposite of gather, putting key-value pairs into columns. We didn’t need that here.

unite is the inverse of separate, combining multiple columns into one.

We can use unite, for example, to check we have only one observation per piece and artist: use the inverse of the separate function, unite, to create artist-piece id pairs in a single variable, then count how many records we have for each one. We only care that none is bigger than 1, so a summary will be fine.

tidy_art <- unite(art, col = id, constituentid, objectid)
tidy_art %>% select(id)
## # A tibble: 142,218 x 1
##    id     
##    <chr>  
##  1 6210_2 
##  2 7470_3 
##  3 7605_4 
##  4 7056_5 
##  5 7605_6 
##  6 7056_7 
##  7 7056_8 
##  8 7056_9 
##  9 7056_10
## 10 7056_11
## # ... with 142,208 more rows

In theory we would be done, but there are a few more data quality issues. At this point in the notes we have covered the core tidy content. If you want to see some more real world examples keep reading.

You can check that tidy_art is what you want it to be as follows

unite(art, col = id, constituentid, objectid) %>% count(id) %>% summary
##       id                  n    
##  Length:142211      Min.   :1  
##  Class :character   1st Qu.:1  
##  Mode  :character   Median :1  
##                     Mean   :1  
##                     3rd Qu.:1  
##                     Max.   :2

More problems

There are seven observations with two artist-piece pairs. So we have to see what’s up.

id_count <- unite(art, col = id, constituentid, objectid) %>% count(id)
errors <- filter(id_count, n > 1) %>% separate(id, c("constituentid", "objectid"), sep = "_")

filter(art, constituentid %in% errors$constituentid, objectid %in% errors$objectid) %>% 
  arrange(constituentid, objectid)
## # A tibble: 14 x 24
##    title date  medium dimensions creditline accessionnumber classification
##    <chr> <chr> <chr>  <chr>      <chr>      <chr>           <chr>         
##  1 Max … 1921  Palla… "7 1/2 x … Purchase   577.1971        Photograph    
##  2 Max … 1921  Palla… "7 1/2 x … Purchase   577.1971        Photograph    
##  3 Einf… 1984  Artis… "page (ea… Committee… 550.2015        Illustrated_B…
##  4 Einf… 1984  Artis… "page (ea… Committee… 550.2015        Illustrated_B…
##  5 Rubb… 1992  7-inc… "cover: 7… Acquired … 1089.2014.58.x… Audio         
##  6 Rubb… 1992  7-inc… "cover: 7… Acquired … 1089.2014.58.x… Audio         
##  7 Sür … 1998  CD     "overall:… Acquired … 1089.2014.65    Audio         
##  8 Sür … 1998  CD     "overall:… Acquired … 1089.2014.65    Audio         
##  9 Il P… 1938  Graph… "9 7/8 x … Purchase   324.1984        Architecture  
## 10 Il P… 1938  Graph… "9 7/8 x … Purchase   324.1984        Architecture  
## 11 Il P… 1938  Graph… "9 7/8 x … Purchase   324.1984        Architecture  
## 12 Il P… 1938  Graph… "9 7/8 x … Purchase   324.1984        Architecture  
## 13 ANNI… 1910  Carbo… "9 1/16 x… Gift of M… 361.1961        Photograph    
## 14 ANNI… 1910  Carbo… "9 1/16 x… Gift of M… 361.1961        Photograph    
## # ... with 17 more variables: department <chr>, dateacquired <date>,
## #   cataloged <chr>, objectid <chr>, url <chr>, thumbnailurl <chr>,
## #   `circumference (cm)` <chr>, `depth (cm)` <chr>, `diameter (cm)` <chr>,
## #   `height (cm)` <dbl>, `length (cm)` <chr>, `weight (kg)` <chr>, `width
## #   (cm)` <dbl>, `seat height (cm)` <chr>, `duration (sec.)` <chr>,
## #   artistnum <chr>, constituentid <chr>

Looking at the artist name field tells you the answer: The artists were listed twice! You can reload the original dataset to confirm the constituentid values in fact were repeated. You can see the entire dataset using the View function in RStudio.

We were lucky this was a small problem with an easy solution—to use the dplyr function distinct to keep only one value per constituentid and objectid pair. The rest we know now are duplicates. You can check to be sure your data frame loses only seven rows.

art <- distinct(art, constituentid, objectid, .keep_all = TRUE)

You could also go back and correct that problem at the beginning, but that’s a little bit more tricky.

Relational data and joining tables

Otto Wagner, Ferdinandsbr&uucke Project, Vienna, Austria, Elevation, preliminary version, 1896 MoMA

Otto Wagner, Ferdinandsbr&uucke Project, Vienna, Austria, Elevation, preliminary version, 1896 MoMA

Relational data just refers to multiple datasets that are related by some of the information in them. In the MoMA data, the relation of art pieces to artists is by the unique artist identifier code—constituentid in the art data and id in the artist data.

Typically no one dataset contains all the information you need, so you have to join them using one or more variables that define the relations between the datasets.

R for Data Science describes different kinds of joins. Here, we only care about what the book calls mutating joins. Those take columns from one dataset and add them to another, matching rows based on a set of criteria defining their relationship.

Read the textbook chapter or refer back to it when you need to below. Let’s get started.

Left join: artists to artworks

As often is the case, information we want for the MoMA project is not all in the same place. We need to join the artwork dataset we just cleaned up to a list of artists’ biographical information—already tidy for what we want to do.

To do so, we need a left join. We will use one or more key variables common to both datasets. A left join attempts to match rows in one dataset—the `left’ one—to rows in a second dataset on that key or set of keys.

Left, right and full joins, from R for Data Science

It will return all rows of the left dataset and all columns from the right dataset. Rows on the left with no match on the right will have NA values in the newly joined columns.

There are several types of joins, which you can read about in detail in the textbook. Joins are not just something you can do in R. They are bread and butter for SQL, a common tool for handling relational data.

In R, the dplyr function left_join is what we need. We will give it two datasets and a list of key variables with which to join them.

Our key variable will be the artist id, which is called constituentid' in the art data frame and 'id' in the artist data frame. Theby` argument for key variables will tell the function those two are the same. If they were named the same thing, we could enter just one of them.

As before, let’s look at a small version—this time for D.C. native Alma Thomas.

Alma Thomas, Fiery Sunset, 1973 MoMA

Alma Thomas, Fiery Sunset, 1973 MoMA

test <- filter(art, constituentid == 47098)

left_join(test, artists, by = c("constituentid" = "id"))
## # A tibble: 2 x 32
##   title date  medium dimensions creditline accessionnumber classification
##   <chr> <chr> <chr>  <chr>      <chr>      <chr>           <chr>         
## 1 Unti… c. 1… Synth… "19 1/8 x… Gift of D… 299.2015        Drawing       
## 2 Fier… 1973  Synth… "41 1/4 x… Committee… 505.2015        Painting      
## # ... with 25 more variables: department <chr>, dateacquired <date>,
## #   cataloged <chr>, objectid <chr>, url <chr>, thumbnailurl <chr>,
## #   `circumference (cm)` <chr>, `depth (cm)` <chr>, `diameter (cm)` <chr>,
## #   `height (cm)` <dbl>, `length (cm)` <chr>, `weight (kg)` <chr>, `width
## #   (cm)` <dbl>, `seat height (cm)` <chr>, `duration (sec.)` <chr>,
## #   artistnum <chr>, constituentid <chr>, displayname <chr>,
## #   artistbio <chr>, nationality <chr>, gender <chr>, begindate <int>,
## #   enddate <int>, `wiki qid` <chr>, ulan <int>

To see the full data frame in RStudio, type %>% View after the left join command. We get back a data frame with the two rows representing Alma Thomas’s work at the MoMA, plus the columns of basic biographical information from the artists dataset.

As an aside, notice that we might have a data quality problem. Thomas’s nationality is missing, but her short bio variable lists it as American. We will not worry about that in this lecture.

Let’s do the full data frame.

art <- left_join(art, artists, by = c("constituentid" = "id")) 

select(art, constituentid, displayname, title)
## # A tibble: 142,211 x 3
##    constituentid displayname        title                                 
##    <chr>         <chr>              <chr>                                 
##  1 6210          Otto Wagner        Ferdinandsbrücke Project, Vienna, Aus…
##  2 7470          Christian de Port… City of Music, National Superior Cons…
##  3 7605          Emil Hoppe         Villa near Vienna Project, Outside Vi…
##  4 7056          Bernard Tschumi    The Manhattan Transcripts Project, Ne…
##  5 7605          Emil Hoppe         Villa, project, outside Vienna, Austr…
##  6 7056          Bernard Tschumi    The Manhattan Transcripts Project, Ne…
##  7 7056          Bernard Tschumi    The Manhattan Transcripts Project, Ne…
##  8 7056          Bernard Tschumi    The Manhattan Transcripts Project, Ne…
##  9 7056          Bernard Tschumi    The Manhattan Transcripts Project, Ne…
## 10 7056          Bernard Tschumi    The Manhattan Transcripts Project, Ne…
## # ... with 142,201 more rows