This lecture covers tidy data. The main reference is
The focus of this lecture is how to reshape datasets to be ready for analysis using functions from the tidyr
package package - separate
- unite
- gather
- spread
The Museum of Modern Art (MoMA) publicly releases data about its entire collection (see MoMA collection). This lecture uses a couple data files from their collection.
We first do some initial data cleaning to make life easier
library(tidyverse)
library(stringr) # this should come with tidyverse, but code wont run without import
# load data
artists <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/moma_artists_jan2017.csv')
art <- read_csv('https://raw.githubusercontent.com/idc9/stor390/master/data/moma_art_jan2017.csv')
# Annoying weird issue with first column name
colnames(art) <- c('title', tolower(colnames(art))[-1])
colnames(artists) <- c('id', tolower(colnames(artists))[-1])
# To give us something to do, removing artist information leaving only the artist uniqueid (constituentid) for joining
art <- select(art, -artist, -artistbio, -nationality, -begindate, -enddate, -gender)
# Removing artwork with NA constituent ID. Lots of NA values in their entries, and not worth finding out what's up for this demonstration.
# There are
art <- filter(art, !is.na(constituentid))
art <- mutate(art,
classification = str_replace_all(classification, "[[:punct:]]", ""),
classification = str_replace_all(classification, "\\s", "_"),
constituentid = str_trim(constituentid, side = "both"),
objectid = str_trim(objectid, side = "both"))
artists <- mutate(artists,
id = str_trim(id, side = "both"))
# For some reason, certain artworks list the same constituentid twice. Not worth explaining in the lecture.
# We correct this in the lecture, but you could correct it here with the code below
# art$constituentid <- str_split(art$constituentid, ",") %>% lapply(., function(x){unique(x) %>% paste(., collapse = ",")}) %>% unlist
These notes instead will walk through how to apply those concepts on our sample dataset for the lecture. The goal is to demonstrate how you might apply tidy and relational data concepts, and to show some issues you could run into.
If you do something cool with this dataset let the instructor know. The MoMA is interested in showing off what people do with it Show in New WindowClear OutputExpand/Collapse Output [1] “27” “8102, 6703, 4312” “8102, 6703, 4312” “8102, 4312, 6703” “8102, 6703, 4312” [6] “2702” “2702” “8219” “2702” “8219”
[11] “8101”
Show in New WindowClear OutputExpand/Collapse Output Show Traceback Error in eval(expr, envir, enclos) : object ‘constituentid’ not found Show in New WindowClear OutputExpand/Collapse Output Too many values at 31 locations: 1, 2, 3, 4, 5, 6, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, … R Console
1
2
3
4
5
6
7
8
9
10
Next12345Previous 1-10 of 41 rows | 1-1 of 24 columns tbl_df 41 x 24
1
2
3
4
5
6
7
8
9
10
Next12345Previous 1-10 of 41 rows | 1-1 of 24 columns Show in New WindowClear OutputExpand/Collapse Output [1] 30 Show in New WindowClear OutputExpand/Collapse Output
artistid1
The Museum of Modern Art in New York published datasets of the art and artists in its collection on GitHub.
the art dataset has almost 130,000 records, one for each work and includes information about the piece title, the artists responsible, its physical properties and a classification into art types, such as painting.
the artist dataset has one record per artist, with basic biographical information.
Datasets are updated regularly: The ones we will use are from the first week of July, 2016. See the link to MoMA’s GitHub page above for the museum’s request for how to use the data.
Any project starts with a series of preliminary questions, things we want to know about the data.
Ones we will focus on are listed below. You could think of a number of others.
We will focus on the first three questions in this lecture since those will require us to do the most joining and reshaping.
‘Tidy’ is Hadley Wikham’s shorthand for putting data into the best structure for analysis in R. There are three requirements for a tidy dataset:
This is a tidy dataset, from R for Data Science
Read the tidy data chapter in R for Data Science. It has many more helpful pictures and some example data to clarify what tidy data is and isn’t. These notes will assume you have read the tidy section.
We will focus on
We need information from both the art and artist datasets, and the structures of the dataset we need will depend on the question we ask. Exactly what a tidy dataset looks like depends on what you will consider observations and values in your analysis.
Often it is obvious what you should do, but our situation shows you can need datasets with different structures to answer different questions:
Questions one through three we will be able to answer with a dataset that has one observation per artist per work. If two artists collaborate, we’ll need to count the piece once for each.
Questions four and five need one observation per piece.
We have already read in the data – not directly from the MoMA site but from a slightly cleaned up version.
str(art)
## Classes 'tbl_df', 'tbl' and 'data.frame': 128743 obs. of 23 variables:
## $ title : chr "Ferdinandsbrücke Project, Vienna, Austria, Elevation, preliminary version" "City of Music, National Superior Conservatory of Music and Dance, Paris, France, View from interior courtyard" "Villa near Vienna Project, Outside Vienna, Austria, Elevation" "The Manhattan Transcripts Project, New York, New York, Introductory panel to Episode 1: The Park" ...
## $ constituentid : chr "6210" "7470" "7605" "7056" ...
## $ date : chr "1896" "1987" "1903" "1980" ...
## $ medium : chr "Ink and cut-and-pasted painted pages on paper" "Paint and colored pencil on print" "Graphite, pen, color pencil, ink, and gouache on tracing paper" "Photographic reproduction with colored synthetic laminate" ...
## $ dimensions : chr "19 1/8 x 66 1/2\" (48.6 x 168.9 cm)" "16 x 11 3/4\" (40.6 x 29.8 cm)" "13 1/2 x 12 1/2\" (34.3 x 31.8 cm)" "20 x 20\" (50.8 x 50.8 cm)" ...
## $ creditline : chr "Fractional and promised gift of Jo Carole and Ronald S. Lauder" "Gift of the architect in honor of Lily Auchincloss" "Gift of Jo Carole and Ronald S. Lauder" "Purchase and partial gift of the architect in honor of Lily Auchincloss" ...
## $ accessionnumber : chr "885.1996" "1.1995" "1.1997" "2.1995" ...
## $ classification : chr "Architecture" "Architecture" "Architecture" "Architecture" ...
## $ department : chr "Architecture & Design" "Architecture & Design" "Architecture & Design" "Architecture & Design" ...
## $ dateacquired : Date, format: "1996-04-09" "1995-01-17" ...
## $ cataloged : chr "Y" "Y" "Y" "Y" ...
## $ objectid : chr "2" "3" "4" "5" ...
## $ url : chr "http://www.moma.org/collection/works/2" "http://www.moma.org/collection/works/3" "http://www.moma.org/collection/works/4" "http://www.moma.org/collection/works/5" ...
## $ thumbnailurl : chr "http://www.moma.org/media/W1siZiIsIjU5NDA1Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=137b8455b1ec6167" "http://www.moma.org/media/W1siZiIsIjk3Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=55b65fa4368fe00a" "http://www.moma.org/media/W1siZiIsIjk4Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=fdcfca4db3acac1f" "http://www.moma.org/media/W1siZiIsIjEyNCJdLFsicCIsImNvbnZlcnQiLCItcmVzaXplIDMwMHgzMDBcdTAwM2UiXV0.jpg?sha=c89b9071486760a5" ...
## $ circumference (cm): chr NA NA NA NA ...
## $ depth (cm) : chr NA NA NA NA ...
## $ diameter (cm) : chr NA NA NA NA ...
## $ height (cm) : num NA NA NA NA NA NA NA NA NA NA ...
## $ length (cm) : chr NA NA NA NA ...
## $ weight (kg) : chr NA NA NA NA ...
## $ width (cm) : num NA NA NA NA NA NA NA NA NA NA ...
## $ seat height (cm) : chr NA NA NA NA ...
## $ duration (sec.) : chr NA NA NA NA ...
The art dataset is not tidy enough for our purposes. Some pieces have more than one artist listed in the constituentid field—which should be a unique artist identifier. An example is objectid 145126, a live performance by Ben Vautier and `various artists.’ Others have specific artists listed in collaborations.
We will have to give in and make a caveat already: We can’t know who ‘various artists’ refers to, so we will have to handle that somehow in our analysis, and there are not many good options. For example, once we split the records by artist, we can filter the ‘various artists’ pieces using the constituentid 24409.
With a set of questions we want to answer, we can form a plan to make a tidy dataset that has all the information we need to answer the first three questions above. Our plan is to
separate
the unique artist identifier, constituentid, into different columns, so we have one value per cell and not multiple.gather
the new artist id columns we just created into a single column, removing the individual artist id columns we created before and leaving two new columns: One marking which number collaborator the artist was on the project, say the second artist, and the other giving that artist’s unique identifier. That is what we will see is called a key-value pair—which the textbook chapter talks about in detail.That gives us one record per artist per piece. In the next section, on relational data, we will
left_join
the artist data to the art data using the unique artist identifier.That leaves us with a tidy dataset tying artist biographical information to the art pieces they worked on—as we need it to be.
Some of the constituentid values contain multiple values violating one entry per cell
art$constituentid[110:120]
## [1] "27" "8102, 6703, 4312" "8102, 6703, 4312"
## [4] "8102, 4312, 6703" "8102, 6703, 4312" "2702"
## [7] "2702" "8219" "2702"
## [10] "8219" "8101"
The separate
function in tidyr
is built to fix this. It will take a data frame, a column name, a character on which to split each cell, and a character vector of new column names.
You might have guessed something funny was up with the constituentid field. The str
command above showed the variable appeared to be numeric but in fact was a character.
Different ids within cells seem separated by a comma and a space. We will use those to split the variable into first and second artist columns. If there are more than two artists, we will repeat the process.
With a dataset of any reasonable size, it can sometimes be helpful to do your reshaping on small subset including the problems you’re trying to fix. That way you can assure yourself it worked not just with spot checks or summarisation but with a visual check. Ultimately, you will want a more systematic check that everything is as it should be, which you can apply to the whole dataset.
Let’s try separating on a subset first. Note the grepl
tells you if a string matches a certain pattern.
Warning: these notes use some string manipulation commands called regular expressions which we will cover later. If you want to read ahead checkout using regular expressions in R.
# get a small subset of the data to play with
# restrict ourselves to rows whose constituentid contains a comma
test <- filter(art, grepl(',', constituentid)) %>%
slice(10:50)
Let’s take a first crack at the problem
# first try to fix the problem
separate(test, col = constituentid, into = c('artist1', 'artist2'), sep = ",")
## Warning: Expected 2 pieces. Additional pieces discarded in 31 rows [1, 2,
## 3, 4, 5, 6, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, ...].
## # A tibble: 41 x 24
## title artist1 artist2 date medium dimensions creditline accessionnumber
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Egg … 6956 " 7559" 1975 Ink a… "44 1/2 x… Gift of K… 110.1992
## 2 City… 8213 " 6132" 1965 Graph… "20 1/2 x… Gift of V… 120.1988
## 3 Hous… 8213 " 6132" 1978 Felt-… "12 x 23 … Gift of V… 121.1988
## 4 Gord… 8215 " 6132" 1981 Felt-… "12 x 15 … Gift of V… 122.1988
## 5 Gord… 8215 " 6132" 1981 Graph… "12 x 13 … Gift of V… 123.1988
## 6 Lewi… 8215 " 6132" 1983 Felt-… 12 x 28 (… Gift of V… 124.1988
## 7 Cont… 6076 " 8112" 1923 Gouac… "22 1/2 x… Gift of E… 149.1947
## 8 Ibir… 6934 " 8103" 1953 Gouac… "39 1/2 … Gift of R… 157.1991
## 9 Ibir… 6934 " 8103" 1953 Gouac… "39 x 59 … Gift of R… 158.1991
## 10 Sixt… 8218 " 2288… 1987 Serig… "40 x 30\… Given ano… 272.1999.1
## # ... with 31 more rows, and 16 more variables: classification <chr>,
## # department <chr>, dateacquired <date>, cataloged <chr>,
## # objectid <chr>, url <chr>, thumbnailurl <chr>, `circumference
## # (cm)` <chr>, `depth (cm)` <chr>, `diameter (cm)` <chr>, `height
## # (cm)` <dbl>, `length (cm)` <chr>, `weight (kg)` <chr>, `width
## # (cm)` <dbl>, `seat height (cm)` <chr>, `duration (sec.)` <chr>
Oops. We told separate to split the cells every time there was a column, but we gave it only two destination columns. Works with more artists than that saw their extra values dropped by default.
We have two options: If we know the maximum number of columns we will need, we can set that to be the number of columns we separate into, filling un-needed columns with NA
. Alternatively, we could do our splitting column by column.
The second option is more cumbersome, so let’s figure out how many columns we need. Here’s one way to do it, using the stringr
package for manipulating character vectors. str_replace_all
here removes all occurrences of non-commas, nchar
tells us how many characters are in each string, and max returns the maximum number.
str_replace_all(art$constituentid, "[^,]", "") %>% nchar %>% max
## [1] 30
This approach comes at a cost: We will have to create 30 more columns for a total of 31 possible artist ids—since the number of commas plus one is the number of artists. But that’s not exactly even what we want. We want one row per artist per piece, which we will do later.
For now, let’s run separate
on our test dataset, using only the maximum number of columns we need there. We also specify that we want separate
to fill in NA
values for missing values on the right.
# how many columns to add
n <- str_replace_all(test$constituentid, "[^,]", "") %>% nchar %>% max + 1
# maybe this will work...
test <- separate(test, col = constituentid, into = paste0("artistid", c(1:n)), sep = ",", fill = "right")
# print out new artistid columns
select(test, contains("artistid"))
## # A tibble: 41 x 5
## artistid1 artistid2 artistid3 artistid4 artistid5
## <chr> <chr> <chr> <chr> <chr>
## 1 6956 " 7559" " 7558" " 6957" <NA>
## 2 8213 " 6132" " 8214" <NA> <NA>
## 3 8213 " 6132" " 8214" " 8216" <NA>
## 4 8215 " 6132" " 8214" " 8216" <NA>
## 5 8215 " 6132" " 8214" " 8216" <NA>
## 6 8215 " 6132" " 8214" " 8216" <NA>
## 7 6076 " 8112" <NA> <NA> <NA>
## 8 6934 " 8103" <NA> <NA> <NA>
## 9 6934 " 8103" <NA> <NA> <NA>
## 10 8218 " 22884" <NA> <NA> <NA>
## # ... with 31 more rows
Once you’ve looked at the test data and come up with some good automatic checks if needed to be sure it worked, you can run it on the full data and re-run those checks.
# apply the solution to the full data set
n <- str_replace_all(art$constituentid, "[^,]", "") %>% nchar %>% max + 1
art <- separate(art, col = constituentid, into = paste0("artistid", c(1:n)), sep = ",", fill = "right")
art now has 30 extra columns
select(art, contains("artistid"))
## # A tibble: 128,743 x 31
## artistid1 artistid2 artistid3 artistid4 artistid5 artistid6 artistid7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 6210 <NA> <NA> <NA> <NA> <NA> <NA>
## 2 7470 <NA> <NA> <NA> <NA> <NA> <NA>
## 3 7605 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 7056 <NA> <NA> <NA> <NA> <NA> <NA>
## 5 7605 <NA> <NA> <NA> <NA> <NA> <NA>
## 6 7056 <NA> <NA> <NA> <NA> <NA> <NA>
## 7 7056 <NA> <NA> <NA> <NA> <NA> <NA>
## 8 7056 <NA> <NA> <NA> <NA> <NA> <NA>
## 9 7056 <NA> <NA> <NA> <NA> <NA> <NA>
## 10 7056 <NA> <NA> <NA> <NA> <NA> <NA>
## # ... with 128,733 more rows, and 24 more variables: artistid8 <chr>,
## # artistid9 <chr>, artistid10 <chr>, artistid11 <chr>, artistid12 <chr>,
## # artistid13 <chr>, artistid14 <chr>, artistid15 <chr>,
## # artistid16 <chr>, artistid17 <chr>, artistid18 <chr>,
## # artistid19 <chr>, artistid20 <chr>, artistid21 <chr>,
## # artistid22 <chr>, artistid23 <chr>, artistid24 <chr>,
## # artistid25 <chr>, artistid26 <chr>, artistid27 <chr>,
## # artistid28 <chr>, artistid29 <chr>, artistid30 <chr>, artistid31 <chr>
Our data frame has one value per cell, but it still only has one observation per piece. That’s not our unit of analysis for this project. We want one observation per piece per artist.
We need to gather those artist id columns into key-value pairs. As the R for Data Science chapter puts it, gathering “collects a set of column names and places them into a single “key” column”. It also collects the field of cells associated with those columns and places them into a single value column.’
gather(`1999`, `2000`, key = "year", value = "cases")
gather
in the tidyr
package gives an easy way to deal with that. See the textbook for details, but the image above tells you what’s going on.
We are going to gather the artistid columns into two new columns: artistnum (key) and id (value). The resulting data frame no longer has the artistid column, but has two new columns (artistnum and id)
gather(test, key = 'artistnum', value = 'id', contains("artistid")) %>% select(objectid, artistnum, id)
## # A tibble: 205 x 3
## objectid artistnum id
## <chr> <chr> <chr>
## 1 204 artistid1 6956
## 2 210 artistid1 8213
## 3 211 artistid1 8213
## 4 212 artistid1 8215
## 5 213 artistid1 8215
## 6 215 artistid1 8215
## 7 232 artistid1 6076
## 8 252 artistid1 6934
## 9 253 artistid1 6934
## 10 296 artistid1 8218
## # ... with 195 more rows
We now have one record per artist per piece—except we also have a number of records for pieces with no artist past a certain number. For example, objectid 448, ‘Video Clip Folly Project,’ has only two collaborators, yet it has five rows in the new gathered dataset.
gather
gives an easy way to deal with that, using the na.rm = TRUE
option. We’ll go ahead and run this on our full dataset.
# gather artists but kill NAs
art <- gather(art, key = artistnum, value = constituentid, contains("artistid"), na.rm = TRUE)
spread
does the opposite of gather
, putting key-value pairs into columns. We didn’t need that here.
unite
is the inverse of separate
, combining multiple columns into one.
We can use unite
, for example, to check we have only one observation per piece and artist: use the inverse of the separate
function, unite
, to create artist-piece id pairs in a single variable, then count how many records we have for each one. We only care that none is bigger than 1, so a summary will be fine.
tidy_art <- unite(art, col = id, constituentid, objectid)
tidy_art %>% select(id)
## # A tibble: 142,218 x 1
## id
## <chr>
## 1 6210_2
## 2 7470_3
## 3 7605_4
## 4 7056_5
## 5 7605_6
## 6 7056_7
## 7 7056_8
## 8 7056_9
## 9 7056_10
## 10 7056_11
## # ... with 142,208 more rows
In theory we would be done, but there are a few more data quality issues. At this point in the notes we have covered the core tidy content. If you want to see some more real world examples keep reading.
You can check that tidy_art is what you want it to be as follows
unite(art, col = id, constituentid, objectid) %>% count(id) %>% summary
## id n
## Length:142211 Min. :1
## Class :character 1st Qu.:1
## Mode :character Median :1
## Mean :1
## 3rd Qu.:1
## Max. :2
There are seven observations with two artist-piece pairs. So we have to see what’s up.
id_count <- unite(art, col = id, constituentid, objectid) %>% count(id)
errors <- filter(id_count, n > 1) %>% separate(id, c("constituentid", "objectid"), sep = "_")
filter(art, constituentid %in% errors$constituentid, objectid %in% errors$objectid) %>%
arrange(constituentid, objectid)
## # A tibble: 14 x 24
## title date medium dimensions creditline accessionnumber classification
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Max … 1921 Palla… "7 1/2 x … Purchase 577.1971 Photograph
## 2 Max … 1921 Palla… "7 1/2 x … Purchase 577.1971 Photograph
## 3 Einf… 1984 Artis… "page (ea… Committee… 550.2015 Illustrated_B…
## 4 Einf… 1984 Artis… "page (ea… Committee… 550.2015 Illustrated_B…
## 5 Rubb… 1992 7-inc… "cover: 7… Acquired … 1089.2014.58.x… Audio
## 6 Rubb… 1992 7-inc… "cover: 7… Acquired … 1089.2014.58.x… Audio
## 7 Sür … 1998 CD "overall:… Acquired … 1089.2014.65 Audio
## 8 Sür … 1998 CD "overall:… Acquired … 1089.2014.65 Audio
## 9 Il P… 1938 Graph… "9 7/8 x … Purchase 324.1984 Architecture
## 10 Il P… 1938 Graph… "9 7/8 x … Purchase 324.1984 Architecture
## 11 Il P… 1938 Graph… "9 7/8 x … Purchase 324.1984 Architecture
## 12 Il P… 1938 Graph… "9 7/8 x … Purchase 324.1984 Architecture
## 13 ANNI… 1910 Carbo… "9 1/16 x… Gift of M… 361.1961 Photograph
## 14 ANNI… 1910 Carbo… "9 1/16 x… Gift of M… 361.1961 Photograph
## # ... with 17 more variables: department <chr>, dateacquired <date>,
## # cataloged <chr>, objectid <chr>, url <chr>, thumbnailurl <chr>,
## # `circumference (cm)` <chr>, `depth (cm)` <chr>, `diameter (cm)` <chr>,
## # `height (cm)` <dbl>, `length (cm)` <chr>, `weight (kg)` <chr>, `width
## # (cm)` <dbl>, `seat height (cm)` <chr>, `duration (sec.)` <chr>,
## # artistnum <chr>, constituentid <chr>
Looking at the artist name field tells you the answer: The artists were listed twice! You can reload the original dataset to confirm the constituentid values in fact were repeated. You can see the entire dataset using the View function in RStudio.
We were lucky this was a small problem with an easy solution—to use the dplyr
function distinct
to keep only one value per constituentid and objectid pair. The rest we know now are duplicates. You can check to be sure your data frame loses only seven rows.
art <- distinct(art, constituentid, objectid, .keep_all = TRUE)
You could also go back and correct that problem at the beginning, but that’s a little bit more tricky.
Relational data just refers to multiple datasets that are related by some of the information in them. In the MoMA data, the relation of art pieces to artists is by the unique artist identifier code—constituentid in the art data and id in the artist data.
Typically no one dataset contains all the information you need, so you have to join them using one or more variables that define the relations between the datasets.
R for Data Science describes different kinds of joins. Here, we only care about what the book calls mutating joins. Those take columns from one dataset and add them to another, matching rows based on a set of criteria defining their relationship.
Read the textbook chapter or refer back to it when you need to below. Let’s get started.
As often is the case, information we want for the MoMA project is not all in the same place. We need to join the artwork dataset we just cleaned up to a list of artists’ biographical information—already tidy for what we want to do.
To do so, we need a left join. We will use one or more key variables common to both datasets. A left join attempts to match rows in one dataset—the `left’ one—to rows in a second dataset on that key or set of keys.
Left, right and full joins, from R for Data Science
It will return all rows of the left dataset and all columns from the right dataset. Rows on the left with no match on the right will have NA
values in the newly joined columns.
There are several types of joins, which you can read about in detail in the textbook. Joins are not just something you can do in R. They are bread and butter for SQL, a common tool for handling relational data.
In R, the dplyr
function left_join
is what we need. We will give it two datasets and a list of key variables with which to join them.
Our key variable will be the artist id, which is called constituentid' in the art data frame and 'id' in the artist data frame. The
by` argument for key variables will tell the function those two are the same. If they were named the same thing, we could enter just one of them.
As before, let’s look at a small version—this time for D.C. native Alma Thomas.
test <- filter(art, constituentid == 47098)
left_join(test, artists, by = c("constituentid" = "id"))
## # A tibble: 2 x 32
## title date medium dimensions creditline accessionnumber classification
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Unti… c. 1… Synth… "19 1/8 x… Gift of D… 299.2015 Drawing
## 2 Fier… 1973 Synth… "41 1/4 x… Committee… 505.2015 Painting
## # ... with 25 more variables: department <chr>, dateacquired <date>,
## # cataloged <chr>, objectid <chr>, url <chr>, thumbnailurl <chr>,
## # `circumference (cm)` <chr>, `depth (cm)` <chr>, `diameter (cm)` <chr>,
## # `height (cm)` <dbl>, `length (cm)` <chr>, `weight (kg)` <chr>, `width
## # (cm)` <dbl>, `seat height (cm)` <chr>, `duration (sec.)` <chr>,
## # artistnum <chr>, constituentid <chr>, displayname <chr>,
## # artistbio <chr>, nationality <chr>, gender <chr>, begindate <int>,
## # enddate <int>, `wiki qid` <chr>, ulan <int>
To see the full data frame in RStudio, type %>% View
after the left join command. We get back a data frame with the two rows representing Alma Thomas’s work at the MoMA, plus the columns of basic biographical information from the artists
dataset.
As an aside, notice that we might have a data quality problem. Thomas’s nationality is missing, but her short bio variable lists it as American. We will not worry about that in this lecture.
Let’s do the full data frame.
art <- left_join(art, artists, by = c("constituentid" = "id"))
select(art, constituentid, displayname, title)
## # A tibble: 142,211 x 3
## constituentid displayname title
## <chr> <chr> <chr>
## 1 6210 Otto Wagner Ferdinandsbrücke Project, Vienna, Aus…
## 2 7470 Christian de Port… City of Music, National Superior Cons…
## 3 7605 Emil Hoppe Villa near Vienna Project, Outside Vi…
## 4 7056 Bernard Tschumi The Manhattan Transcripts Project, Ne…
## 5 7605 Emil Hoppe Villa, project, outside Vienna, Austr…
## 6 7056 Bernard Tschumi The Manhattan Transcripts Project, Ne…
## 7 7056 Bernard Tschumi The Manhattan Transcripts Project, Ne…
## 8 7056 Bernard Tschumi The Manhattan Transcripts Project, Ne…
## 9 7056 Bernard Tschumi The Manhattan Transcripts Project, Ne…
## 10 7056 Bernard Tschumi The Manhattan Transcripts Project, Ne…
## # ... with 142,201 more rows