Look through the Shiny gallery. Shiny makes it easy to build impressive, interactive visualizations in R. While this is not a specific project proposal, these apps might serve as inspiration.
Ryan Thornburg and the Reese News Lab have a number of suggestions for final projects. If you are interested in these, let me know. If you choose one of these projects there may be a job offer at the end…
Below is an email from Professor Thornburg:
Here are a few things I’m interested in doing with the state voter file, which includes a variety of data tables: one table with demographic information about each registered voter, one table recording whether a particular voter participated in a particular election, and one file (for each election) showing the number of votes each candidate received in each precinct.
Make a Shiny app exploring “word arithmetic” with word2vec. The only prerequisite for this project is linear algebra. Let me know if your group is interested.
Word embeddings are a way of assigning each word a vector that captures the “meaning” of the word in some sense. Suppose we somehow assign each word (e.g., cat, program, beer) a vector in \(\mathbb{R}^d\), where \(d\) is some fixed dimension (typically somewhere between 100 and 1000). We might hope the following happens:
if two words are close in meaning, then their vectors are close together, and vice versa.
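“Close” here is usually measured with cosine similarity, i.e., the cosine of the angle between the two vectors, rather than raw Euclidean distance. A minimal sketch in base R, with made-up vectors purely for illustration:

```r
# Cosine similarity between two word vectors: values near 1 mean the
# vectors point in nearly the same direction (similar "meaning"),
# values near 0 mean the words are unrelated.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy 3-dimensional "embeddings" (real embeddings have hundreds of dimensions)
cat_vec <- c(0.2, 0.9, 0.1)
dog_vec <- c(0.3, 0.8, 0.2)
cosine_similarity(cat_vec, dog_vec)  # close to 1
```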
The word2vec algorithm takes a large corpus of text (e.g., Wikipedia) and uses a neural network to assign each word in the corpus a vector. For a more in-depth explanation, see TensorFlow’s word2vec tutorial. You do not need to understand neural networks to do this project!
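You also do not need to train a model yourself. Pre-trained vectors are often distributed as plain text, one word per line followed by its \(d\) coordinates (GloVe files use exactly this format; text exports of word2vec models add a header line you would skip first). Here is a sketch of loading such a file in base R; the file name is just an example, so substitute whatever model you download:

```r
# Read plain-text embeddings: each line is a word followed by its d
# coordinates, separated by spaces. Returns a d x W matrix with one
# column per word (the layout described later in this project).
read_embeddings <- function(path) {
  parts <- strsplit(readLines(path), " ", fixed = TRUE)
  words <- vapply(parts, `[[`, character(1), 1)
  vecs  <- vapply(parts, function(p) as.numeric(p[-1]),
                  numeric(length(parts[[1]]) - 1))
  colnames(vecs) <- words
  vecs
}

embeddings <- read_embeddings("glove.6B.100d.txt")  # example file name
```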
A good word embedding has some pretty amazing features: it can capture semantic and syntactic relationships between words. Since each word has a vector associated with it, we can do “word arithmetic”. For example, take the vectors associated with “brother”, “man”, and “woman” and compute
“brother” - “man” + “woman”
The resulting vector will then be very close to the vector corresponding to “sister” (a semantic relationship). Similarly,
“walked” - “walking” + “swimming”
will be close to “swam” (a syntactic relationship). You even get analogies:
“Spain” - “Madrid” + “Italy”
will be close to “Rome”. The fact that a well-trained word2vec model actually exhibits these properties is kind of amazing.
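To make the arithmetic concrete, here is a sketch of how you might compute A - B + C and look up the nearest word. It assumes an `embeddings` matrix like the one loaded above, with one column per word; the `nearest_word()` helper is hypothetical, not part of any package:

```r
# Find the word(s) whose vectors have the highest cosine similarity
# to a query vector. The input words themselves are excluded, since
# they are usually the nearest neighbors of the combined vector.
nearest_word <- function(embeddings, query, exclude = character(0), n = 1) {
  norms <- sqrt(colSums(embeddings^2))
  sims  <- as.vector(crossprod(embeddings, query)) / (norms * sqrt(sum(query^2)))
  names(sims) <- colnames(embeddings)
  sims <- sims[!names(sims) %in% exclude]
  names(sort(sims, decreasing = TRUE))[seq_len(n)]
}

# "brother" - "man" + "woman" should land near "sister"
query <- embeddings[, "brother"] - embeddings[, "man"] + embeddings[, "woman"]
nearest_word(embeddings, query, exclude = c("brother", "man", "woman"))
```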
The point of this project is to build a Shiny app that lets the user explore word arithmetic: the user inputs the words A, B, and C, and the app spits out the word whose vector is closest to A - B + C. You can find pre-trained word2vec models online. Such a model is just a matrix with one column per word, where each column is that word’s vector (if there are \(W\) words in the corpus and we embed the words into \(\mathbb{R}^d\), then the matrix is \(d \times W\)).
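A minimal skeleton of such an app, assuming the `embeddings` matrix and `nearest_word()` helper sketched above are already loaded:

```r
library(shiny)

# Three word inputs for A - B + C, one text output for the nearest word
ui <- fluidPage(
  titlePanel("Word arithmetic with word2vec"),
  textInput("a", "A", value = "brother"),
  textInput("b", "B", value = "man"),
  textInput("c", "C", value = "woman"),
  textOutput("result")
)

server <- function(input, output) {
  output$result <- renderText({
    words <- c(input$a, input$b, input$c)
    # Guard against words that are not in the model's vocabulary
    if (!all(words %in% colnames(embeddings))) return("word not in vocabulary")
    query <- embeddings[, input$a] - embeddings[, input$b] + embeddings[, input$c]
    paste("A - B + C is closest to:", nearest_word(embeddings, query, exclude = words))
  })
}

shinyApp(ui, server)
```

The interesting work in the project is everything around this skeleton: handling a large vocabulary efficiently, showing the top few neighbors instead of one, and presenting the results nicely.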
For a related idea, see this 538 article on subreddit algebra.