This lecture is about communication in various contexts in data science. It first provides four general principles of communication, then discusses some strategies that implement these principles, and finally examines how these suggestions apply in a few examples. The primary reference is Trees, Maps and Theorems by Jean-luc Doumont; other references are listed at the bottom of the page.
Effective communication is optimization under constraints. – Trees, Maps and Theorems
What makes for effective communication is context dependent; the context determines the constraints. The context often depends upon:
Communicating well is challenging and takes lots of practice. In my experience communication involves design, engineering and empathy as well as the ability to work with words. In data science it requires the ability to work in different mediums, for example:
The principles discussed in this lecture apply to all of these mediums. These principles also apply to more contexts than just presenting results:
This section presents four general rules for communicating effectively:
The first three of these come from Trees, Maps and Theorems (these rules might remind you of information and coding theory). They are general principles which apply to many different contexts.
Adapting to the audience means you take responsibility for the success or failure of your message reaching the audience. It comes naturally to us; you talk to your parents differently from how you talk to your best friend. The act of adapting requires empathy; you have to understand how the recipient perceives the information you are conveying. Adapting also requires some persistence; if the first strategy does not succeed then try another one.
Adapting is partially an act of generosity. How many hours of your life have been wasted sitting through a lecture that you didn’t get much out of? However, communicating well is also beneficial to your career.
Much like being customer-minded in business or being user-friendly in software development, adapting to one’s audience is really a question of effectiveness more than one of selflessness. – Trees, Maps and Theorems
Understanding who is in the audience is a critical step. Some common types of audience members you may face include:
Many audiences are heterogeneous, which presents an extra challenge.
Nothing is neutral in communication. – Trees, Maps and Theorems
The audience sees every dot in a graphic and hears every word you speak. Anything that does not convey your message to the audience hampers your message. Often message optimization is more about minimizing noise than maximizing signal.
For visualization, simple and focused is often better than fancy and detailed. For example, watch the progression of a poor graphic to an effective graphic in this blog post.
Documents/presentations/webpages should be consistent and minimal. All formatting (font, text size, structure, webpage style, graph colors, etc) should remain uniform throughout. Changes in formatting will draw the audience’s attention, so use them selectively. Similarly, bolding and emphasis words (very, really, etc) should be used only occasionally.
Optimizing a message’s delivery first requires a clear understanding of exactly what the message is. Figuring out your thesis is not always easy; it sometimes takes several rounds of revision to home in on and understand your thesis.
If you convey your message via multiple channels the audience has more than one chance to understand the message. When I lecture in class I communicate orally and visually (with a slideshow). A stop sign conveys its message in three ways: color, text and shape.
Redundancy can also mean repetition. It can be worth repeating important points multiple times. For example, my boss this past summer gave me this advice about presentations (often attributed to Aristotle):
Tell them what you are going to tell them. Tell them. Then tell them what you told them.
In oral presentations stating the main points both at the beginning and end will help the audience remember them.
You may have come across the famous quote in an Economics class that means there are usually trade-offs to decisions:
There ain’t no such thing as a free lunch. – (popularized by) Milton Friedman
In the context of communication the biggest “cost” is likely to be time spent preparing; it takes time to communicate well. Tweaking a single graphic can take hours. Your goal is not perfection; your goal is “good enough” for the purpose at hand.
“Good enough” is once again context dependent. An email to your mom might get a quick glance over while an email to a potential employer likely takes multiple rounds of revision.
Other trade-offs might include:
This section discusses a number of strategies for effective communication. Many of these suggestions are corollaries of the four principles discussed above and are not mutually exclusive.
Do many rounds of revision. This applies to coding, writing, oral presentations and making visualizations. Revision is one of the best ways to improve something. Stepping away then coming back will give you a fresh perspective. It will also help you catch errors.
You should attack any work you are editing (your own or someone else’s). To quote my father:
When editing go for the jugular. – Calum Carmichael
If you are publishing something publicly you should revise it several times. If it’s something that really matters then you should get outside feedback from several people.
An effective presentation/document states the message before the details of the argument. The message is (usually) more important than the details. Therefore the structure of the document should reflect this inequality. The audience may not appreciate the message without motivation so providing context should come before the message.
Most professional communications are structured as:
When designing a graphic, presentation or document you should have a relentless focus on conveying the message – the “so what?” The following quote comes from a blog post about conveying your message that is worth reading (see here).
Too often, when we communicate with data, we don’t make our point clear. We leave our audience guessing. Your audience should never have to guess what message you want them to know. The onus is on the person communicating the information (you!) to make that clear. – Cole Knaflic
State the upshot of your presentation explicitly and at the beginning. It is tempting to save the conclusions until after presenting the details of the analysis or to assume the audience will just understand the point without you explicitly stating it. While these strategies may be effective for writing a novel, they are not effective for technical communication.
The rule of message before details applies both at a macro-scale (e.g. executive summary) and at a micro-scale (e.g. plot titles). Some concrete recommendations include:
Informative function names (e.g. str_extract (good) vs grep (bad)).
Focusing on the message first requires that you understand the message. My high school English teacher’s favorite quote was:
How can I know what I think until I see what I say.
You sometimes have to write most of the paper until you understand what the thesis is. It’s ok to leave deciding or clarifying the message until you are almost done composing a document or graphic.
Humans tend to process hierarchical information better than sequential information.
We categorize living creatures into a hierarchical taxonomy (kingdom, phylum, …). Textbooks are organized into: chapters, sections, and subsections. A complex function is composed of many helper functions.
For a more concrete example, compare:
My research has both theoretical and applied components: dimensionality reduction for network valued random variables, temporally evolving preferential attachment models, support vector machine in high dimensional settings, DTI structural connectivity networks, text analysis of Supreme Court decisions.
to
My research has two components.
Theory
Application
The depth of the hierarchy you use is dependent upon the medium. For a written document try not to use more than three levels (chapters, sections, and subsections). For an oral presentation two levels is better.
A technical document often communicates to different types of audiences; the same document might be read by both executives and data scientists. The same person might shift categories; the first time I read a paper I’m looking for the upshot/core ideas while the second time I read the paper I try to understand the technical details. Therefore, a document (paper, PowerPoint, etc) should communicate its message at multiple levels.
Stating the message then the details is an example of communicating at different levels. Academic papers include details both in the body of the paper and the appendix.
This section compares exploratory vs. communicatory plots and then discusses some ways in which plots can be misleading.
The first two plots below compare different ways of visualizing the same data: one version for exploration and one version for communication. The data for these plots are rental counts per hour over the course of one year for a bike sharing service (see previous lecture). An exploratory analysis discovered that the rental trends throughout the day were qualitatively different between working and non-working days (e.g. M-F vs. weekends).
The code that creates these plots can be found here.
When you first analyze a data set you will rapidly make many exploratory plots. These plots should contain as much information as possible. Exploratory plots emphasize details over message and quantity over quality.
The target audience for an exploratory plot is the person making it (and maybe their collaborators). It is created quickly (2 lines of code) and contains lots of information (e.g. every data point).
Since minimizing time spent and maximizing information are important for an exploratory plot, I just kept the default values for ggplot. This plot would be better as an exploratory plot if I had used a jitter plot.
After concluding an analysis, the final presentation(s) will likely have several figures in them whose purpose is to effectively communicate the findings to the intended audience. The purpose of a communicatory plot is message, not details. Often creating a good communicatory visualization is about decluttering i.e. do less! For example, watch this blog post step through decluttering a poor visualization.
The plot below is designed to communicate the findings to a general audience (e.g. business executives). This plot took longer to make (30 lines of code) and focuses on the message.
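The contrast between the two kinds of plot can be sketched in a few lines. This is only an illustration under stated assumptions: it uses base R graphics instead of ggplot, the data are simulated (not the bike sharing data), and the names `rentals`, `day_type`, `plot_exploratory`, `mean_count_by_hour` and `plot_communicatory` are hypothetical.

```r
# Simulated stand-in for hourly rental counts; NOT the real bike sharing data.
set.seed(1)
rentals <- expand.grid(hour = 0:23, day = 1:28)
rentals$day_type <- ifelse(rentals$day %% 7 < 5, "working", "non-working")
rentals$count <- rpois(nrow(rentals),
                       lambda = ifelse(rentals$day_type == "working", 60, 40))

# Exploratory: quick, default settings, every data point shown.
plot_exploratory <- function(rentals) {
  plot(rentals$hour, rentals$count,
       col = ifelse(rentals$day_type == "working", "steelblue", "darkorange"))
}

# Summarise to one mean count per hour and day type.
mean_count_by_hour <- function(rentals) {
  aggregate(count ~ hour + day_type, data = rentals, FUN = mean)
}

# Communicatory: fewer marks, readable labels, and the message in the title.
plot_communicatory <- function(rentals) {
  hourly <- mean_count_by_hour(rentals)
  working <- hourly[hourly$day_type == "working", ]
  non_working <- hourly[hourly$day_type == "non-working", ]
  plot(working$hour, working$count, type = "l", col = "steelblue",
       ylim = range(hourly$count),
       xlab = "Hour of day", ylab = "Mean rentals per hour",
       main = "Rental patterns differ on working and non-working days")
  lines(non_working$hour, non_working$count, col = "darkorange")
  legend("topleft", legend = c("Working day", "Non-working day"),
         col = c("steelblue", "darkorange"), lty = 1, bty = "n")
}
```

Calling `plot_exploratory(rentals)` and then `plot_communicatory(rentals)` draws the two versions. Most of the extra lines in the second function are labels and decluttering, not analysis.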
Takeaways from the communicatory plot:
It is possible to mislead both yourself and others with visualizations.
Some messages are most effectively communicated with a non-static plot. These visualizations may add a time dimension to a 2 dimensional, static plot (e.g. a gif or a movie). They also might give the user the ability to interact with the visualization.
Here are some examples worth looking through:
The majority of concepts are best communicated through a simple, static plot. Creating a fancy plot often adds noise to your message. There are some ways in which a dynamic plot can be effective.
Time adds a third dimension to a static, 2D plot.
There might be several related points you want to convey with similar visualizations. You could convey these points with a number of plots listed sequentially. An interactive plot where the audience can tweak some parameters creates hierarchy.
The audience might want to look through the data themselves. An interactive plot adapts to the audience by letting them look into what they are interested in.
The audience might want to dig into particular details. An interactive plot where the audience can mouse over points and pull up details both adapts to the audience and uses hierarchy.
Industry loves dashboards.
You already have the ability to make interactive and dynamic plots. Shiny allows you to create interactive visualizations. Creating a gif in R is not hard (see here).
Many of the visualizations above were created with D3 which is a JavaScript library for creating amazing visualizations in a web browser.
Writing code is an act of communication with two audiences: the computer and other programmers (including future you). This section discusses some strategies for writing better code (difficult to understand code == buggy code).
Code is a set of instructions that the computer will follow literally. If you make a syntax mistake the code won’t run. If you write code that runs but isn’t what you meant to write, the computer will still do exactly what you wrote.
Most code will be revisited by you or someone else in the future (e.g. for revisions, modification, use as an example, re-factoring, etc). It is important to write code so that this future person can understand what the code is doing and why you made the choices that you did.
You will pick up good software engineering principles with practice and through working with people who are more experienced than you. While software engineering is beyond the purview of this course, many software engineering principles are really about design and communication: writing code in a way that minimizes errors and maximizes human efficiency.
You should write a lot of functions when you code. Functions promote code reuse, which makes your code faster to write and more likely to work well (see section 19 from r4ds).
Here are some suggestions that will make your code easier to understand:
Break a complex function into several helper functions (an example of hierarchy).
Functions and variables should have informative names (e.g. str_extract, mean_income <- mean(data$income)).
Variable names (and file names) should use a consistent, standard format (CamelCase or snake_case).
Use line breaks to visually organize code into smaller sections.
Write a comment for each function describing what the input is and what the function does.
Comment your code: over-commenting > under-commenting. Use comments to explain design choices that might not be obvious.
Your code should be easy for a human to read and understand.
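Several of the suggestions above can be seen together in a short sketch. Everything here is made up for illustration: the toy data and the names `drop_missing`, `mean_income_by_group` and `income_data` are assumptions, not code from any package.

```r
# Input:  a numeric vector.
# Output: the same vector with NA values removed.
drop_missing <- function(values) {
  values[!is.na(values)]
}

# Input:  a data frame with columns `group` and `income`.
# Output: a named numeric vector holding the mean income of each group.
mean_income_by_group <- function(income_data) {
  income_by_group <- split(income_data$income, income_data$group)

  # Missing incomes are dropped rather than imputed; this is a design
  # choice that a reader could not guess, so it gets a comment.
  sapply(income_by_group, function(incomes) mean(drop_missing(incomes)))
}

# Toy data, invented for this example only.
income_data <- data.frame(
  group  = c("a", "a", "b", "b"),
  income = c(10, 20, 30, NA)
)

mean_income <- mean_income_by_group(income_data)
```

The helper keeps the main function short, the snake_case names say what each object holds, and the blank lines separate the steps.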
For a complex coding project you should organize your work into folders and sub-folders. Consider the GitHub repository for the tidytext package (see here).
The repository contains a README document (analogous to the executive summary). The package is organized hierarchically using folders and sub-folders. For example, the core R code behind the package is in the /R sub-folder and the test code is in /tests.
The code itself is organized into many functions and separate scripts (see here). The package also has well written vignettes (see here or here). These vignettes contain minimal, reproducible snippets of example code that demonstrate how to get started with the package.
R Markdown (RMD) is a powerful medium for communication that allows you to weave code into a text document with some basic formatting. This section first discusses RMD’s capabilities as a text editor then literate programming.
For instructions on how to use more of R Markdown’s capabilities see the R Markdown cheatsheet.
With R Markdown you can easily create:
See the RMD gallery for more ideas and example code.
R Markdown can be used as a basic text editor that easily creates HTML documents with some light formatting. Its capabilities include:
These tools should be used in the way that best communicates the message to the audience, i.e. following the principles discussed above. Some more concrete recommendations include:
Traditional literate programming is about making a complex program easy to read for a programmer by including documentation/commentary in the code. In the context of data science literate programming means you can weave code, figures and text together into one document. This presentation gives a good overview of using RMD for literate programming and I will quote a few of the slides in this section.
In the context of data science reproducibility means the ability for someone else to access your data, run your code and get the same results you got. This is surprisingly challenging. With R Markdown you can write all of the analysis code in the same document that you use to communicate it. The .Rmd file is now reproducible research!
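As a minimal sketch of two habits that help here, the code below fixes the random seed and records the R version. It could sit in a code chunk of an .Rmd file; the seed value and the variable names are arbitrary choices for this illustration.

```r
# Fix the random seed so that simulated results are identical on every run.
set.seed(2024)  # arbitrary seed chosen for this sketch
simulated_sample <- rnorm(100)
sample_mean <- mean(simulated_sample)

# Record the R version (and loaded packages) alongside the results, so a
# reader can see what environment produced them.
session_details <- sessionInfo()
print(session_details$R.version$version.string)
```

A fixed seed and a recorded environment do not guarantee reproducibility on their own (the data and package versions must also be available), but they remove two common sources of differing results.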
Reproducibility is also about communication. In data science the content of an analysis is often the code of the analysis. Consider the following quote:
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. – David Donoho
A complicated data analysis can be described by a workflow chart (e.g. see here). However, a verbal or visual description of an analysis is only a summary of the analysis. The actual analysis is contained in the code. If I really want to know what you did I need to be able to see your code.
Asking questions effectively will improve the chances that you get a good answer. Whether you email a colleague or post a question to StackOverflow – ineffective questions waste your time and other people’s time.
For instructions on how to ask effective questions about programming read the following three short posts:
There are two posts about reproducible examples because they are so important.
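For a rough idea of what a minimal reproducible example can look like, here is a sketch built on the mtcars data set that ships with R. The variable names and the question itself are hypothetical.

```r
# Small data that anyone can recreate: three rows of a built-in data set.
small_cars <- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints code that rebuilds the object, ready to paste into a question.
dput(small_cars)

# What I ran:
mean_mpg <- mean(small_cars$mpg)

# What I expected: a single number, the average mpg of the three rows.
# State here what you actually got if it differs.
```

The snippet runs on its own, uses only a few rows of data, and says both what was run and what was expected, so whoever answers does not have to guess.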
Unfortunately, Trees, Maps and Theorems is not free online; however, you can find supplementary material here.
Edward Tufte has a number of famous books on communicating visual evidence (see here).
viz.wtf/ has some wonderful examples of bad visualizations.
Storytelling with Data is an excellent blog/book on communicating with data.
This article explains why interactive visualizations are now becoming effective forms of journalism.
The graphics for communication section of r4ds has good recommendations for visualization and demonstrates how to customize ggplot.
Mike Bostock’s blog post on what makes good software illustrates how some of the principles discussed in this lecture apply to writing code (see here). He also has a post on using visualization to understand algorithms (see here).
This blog post discusses why the New York Times is so successful and offers lessons you can learn from.
Kieran Healy’s class on data visualization has a lot of good resources and advice (see here).
Effective writing in mathematical statistics by Steve Marron.
For more on reproducible research see these articles (1, 2, 3).
Flowing Data is a wonderful blog on visualizations.