Shiny app

Look through the Shiny gallery. Shiny makes it easy to build impressive, interactive visualizations in R. While this is not a specific project proposal, these apps might serve as inspiration.

State voter registrations

Ryan Thornburg and the Reese News lab have a number of suggestions for final projects. If you are interested in these, let me know. If you choose this project, there may be a job offer at the end…

Below is an email from Professor Thornburg:

Here are a few things I’m interested in doing with the state voter file, which includes a variety of data tables: one with demographic information about each registered voter, one with information about whether a particular voter participated in a particular election, and one file per election showing the number of votes each candidate received in each precinct.

  • How reliably Democratic or Republican is each voting precinct? (See the code sketch after this list for one rough way to start on this.)
  • Which precincts are “trending” in one direction or another? Becoming more or less Democratic, for example?
  • When a precinct votes against its party preference, are there significant differences in the demographics of the voters who turn out?
  • Based on the home address, age, gender, and race for each voter, use census data to find their expected educational attainment, income, employment, insurance coverage status, language spoken, marital status, migration status, national origin, family size, household relationships, veteran status, rent amount, mortgage amount, what utilities they have at home (such as high speed internet and indoor plumbing).
  • Based on the preference of a precinct, and the demographic makeup of voters in a particular election, can we say how much various demographic factors of voters generally contribute to candidate preference?
  • Does a precinct’s presidential, US Senate, or gubernatorial candidate preference have any value in predicting the precinct’s preferences in local, nonpartisan elections such as mayor and school board?
  • What demographic factors in each precinct are the best indicators of whether someone will turn out to vote in a specific election (local primary, local general, legislative/congressional general/primary, US Senate general/primary, gubernatorial general/primary, presidential general/primary)?
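
As a very rough starting point on the first question above, here is a minimal sketch in R. Everything in it is made up: the table, its column names (precinct, party, votes, year), and the numbers are hypothetical stand-ins, since the real voter file would first need to be read in, cleaned, and joined, and its column names will differ.

```r
# A minimal sketch, NOT the actual voter file: a made-up precinct-level
# results table with hypothetical columns precinct, party, votes, year.
library(dplyr)

results <- tibble(
  precinct = c("P1", "P1", "P2", "P2"),
  party    = c("DEM", "REP", "DEM", "REP"),
  votes    = c(640, 360, 220, 780),
  year     = 2016
)

# two-party Democratic vote share by precinct; shares far from 0.5
# suggest a reliably partisan precinct
precinct_lean <- results %>%
  filter(party %in% c("DEM", "REP")) %>%
  group_by(precinct, year) %>%
  summarise(dem_share = sum(votes[party == "DEM"]) / sum(votes)) %>%
  ungroup()

precinct_lean
```

The second bullet (“trending” precincts) would then amount to computing dem_share for several elections and looking at how it changes over year within each precinct.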

word2vec and shiny

Make a shiny app exploring “word addition” with word2vec. The only prerequisite for this project is linear algebra. Let me know if your group is interested in this project.

Word embeddings are a way of assigning each word a vector that captures the “meaning” of the word in some sense. Suppose we somehow assign each word (e.g. cat, program, beer, etc.) a vector in \(\mathbb{R}^d\), where \(d\) might be any positive integer (typically somewhere between 100 and 1000). We might hope the following happens:

if two words are close in meaning then their vectors are close together and vice versa.
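
One standard way to make “close” precise (not spelled out above, but the usual choice for word2vec) is cosine similarity: for two word vectors \(u, v \in \mathbb{R}^d\),

\[
\operatorname{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\]

which is near 1 when the vectors point in nearly the same direction and near \(-1\) when they point in opposite directions.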

The word2vec algorithm takes a large corpus of text (e.g. Wikipedia) and uses a neural network to assign each word in the corpus a vector. For a more in-depth explanation see TensorFlow’s word2vec tutorial. You do not need to understand neural networks to do this project!

A good word embedding has some pretty amazing features; it might capture semantic and syntactic relationships. Since each word has a vector associated with it, we can do “word arithmetic”. For example, take the vectors associated with “brother”, “man” and “woman” and compute

“brother” - “man” + “woman”

The resulting vector will then be very close to the vector corresponding to “sister” (a semantic relationship). Similarly,

“walked” - “walking” + “swimming”

will be close to “swam” (a syntactic relationship). You even get analogies:

“Spain” - “Madrid” + “Italy”

will be close to “Rome”. The fact that a well-trained word2vec model actually exhibits these properties is kind of amazing.
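
To make the arithmetic concrete, here is a rough R sketch of the lookup, assuming a hypothetical embedding matrix `emb` with one column per word and the words as column names (the same shape described in the paragraph below). The result of A - B + C is generally not exactly one of the word vectors, so we return the vocabulary words most similar to it by cosine similarity; the names `emb`, `cosine_sim`, and `analogy` are my own, not from any particular package.

```r
# Sketch only: assumes `emb` is a d x W matrix of word vectors,
# with colnames(emb) giving the words.

# cosine similarity between a query vector and every column of emb
cosine_sim <- function(query, emb) {
  drop(crossprod(query, emb)) / (sqrt(sum(query^2)) * sqrt(colSums(emb^2)))
}

# "A - B + C" word arithmetic: the n words closest to the resulting vector
analogy <- function(a, b, c, emb, n = 5) {
  query <- emb[, a] - emb[, b] + emb[, c]
  sims  <- cosine_sim(query, emb)
  sims  <- sims[!names(sims) %in% c(a, b, c)]  # drop the input words
  head(sort(sims, decreasing = TRUE), n)
}

# e.g. analogy("brother", "man", "woman", emb) should rank "sister" near the top
```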

The point of this project is to build a Shiny app that lets the user explore word arithmetic, i.e. the user inputs words A, B, and C and the app spits out the resulting word for A - B + C. You can find pre-trained word2vec models online. Such a model is just a matrix with one column per word, where each column is that word’s vector (if there are \(W\) words in the corpus and we embed the words into \(\mathbb{R}^d\), then the matrix is \(d \times W\)).
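
A bare-bones version of the app might look like the sketch below. It assumes `emb` and the `analogy()` helper from the previous code block are already loaded (e.g. the pre-trained vectors are read into `emb` once when the app starts); the input IDs and layout are just one possible choice.

```r
library(shiny)

# Assumes `emb` (d x W embedding matrix with words as column names) and
# analogy() from the sketch above are already in the workspace.
ui <- fluidPage(
  titlePanel("Word arithmetic: A - B + C"),
  textInput("a", "Word A", value = "brother"),
  textInput("b", "Word B", value = "man"),
  textInput("c", "Word C", value = "woman"),
  tableOutput("result")
)

server <- function(input, output) {
  output$result <- renderTable({
    # only compute once all three words are in the vocabulary
    req(input$a %in% colnames(emb),
        input$b %in% colnames(emb),
        input$c %in% colnames(emb))
    sims <- analogy(input$a, input$b, input$c, emb)
    data.frame(word = names(sims), similarity = as.numeric(sims))
  })
}

shinyApp(ui = ui, server = server)
```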

For a related idea see this 538 article on subreddit algebra.