This lecture is about communication in various contexts in data science. It first provides four general principles of communication, then discusses some strategies that implement these principles, and finally examines how these suggestions apply in a few examples. The primary reference is Trees, Maps and Theorems by Jean-luc Doumont and other references are listed at the bottom of the page.

Effective communication is optimization under constraints. – Trees, Maps and Theorems

What makes for effective communication is context dependent; the context determines the constraints. The context often depends upon:

  • Audience (business executives, data scientists, etc)
  • Medium (slide show, text document, etc)
  • Purpose (convey results, impress an employer, etc)
  • Content (we built a predictive model, drug A doesn’t work, etc)
  • Time (do you have a day or a week to prepare?)

Communicating well is challenging and takes lots of practice. In my experience communication involves design, engineering and empathy as well as the ability to work with words. In data science it requires the ability to work in different mediums, for example:

  • Written document
  • Static visualizations
  • Dynamic visualizations
  • Interactive application (e.g. Shiny)
  • Slideshow
  • Web page
  • Speaking
  • Literate programming (e.g. R Markdown)

The principles discussed in this lecture apply to all of these mediums. These principles also apply to more contexts than just presenting results:

  • Coding an algorithm (other people, including a future you, need to be able to understand your code)
  • Coordinating with collaborators
  • Asking for help

Four general principles for commuication

This section presents four general rules for communicating effectively:

  1. Adapt to your audience.
  2. Maximize the signal to noise ratio.
  3. Use effective redundancy.
  4. There are usually trade-offs

The first three of these come from Trees, Maps and Theorems (these rules might remind you of information and coding theory). They are general principles which apply to many different contexts.

1. Adapt to your audience

Adapting to the audience means you take responsibility for the success or failure of your message reaching the audience. It comes naturally to us; you talk to your parents differently from how you talk to your best friend. The act of adapting requires empathy; you have to understand how the recipient perceives the information you are conveying. Adapting also requires some persistence; if the first strategy does not succeed then try another one.

Adapting is partially an act of generosity. How many hours of your life have been wasted sitting through a lecture that you didn’t get much out of? However, communicating well is also beneficial to your career.

Much like being customer-minded in business or being user-friendly in software development, adapting to one’s audience is really a question of effectiveness more than one of selflessness. – Trees, Maps and Theorems

Understanding who is in the audience is a critical step. Some common types of audience members you may face include:

  • Familiar or unfamiliar with the topic
  • Technical or non-technical
  • Expert in the topic at hand or in a similar but distinct topic
  • Native or non-native language speakers
  • Interested or uninterested in the topic

Many audiences are heterogenous which presents an extra challenge.

2. Maximize the signal to noise ratio

Nothing is neutral in communication. – Trees, Maps and Theorems

The audience sees every dot in a graphic and hears every word you speak. Anything that does not convey your message to the audience hampers your message. Often message optimization is more about minimizing noise than maximizing signal.

For visualization, simple and focused is often better than fancy and detailed. For example watch the progression of a poor graphic to an effective graphic in this blog post.

Documents/presentations/webpages should be consistent and minimal. All formatting (font, text size, structure, webpage style, graph colors, etc) should remain uniform throughout. Changes in formatting will draw the audience’s attention so use it selectively. Similarly, bolding and emphasis words (very, really, etc) should be used only occasionally.

Optimizing a message’s delivery first requires a clear understanding of exactly what the message is. Figuring out your thesis is not always easy; it sometimes takes several rounds of revision to home in on and understand your thesis.

3. Use effective redundancy

If you convey your message via multiple channels the audience has more than one chance to understand the message. When I lecture in class I communicate orally and visually (with a slideshow). A stop sign conveys its message in three ways: color, text and shape.

Redundancy can also mean repetition. It can be worth repeating important points multiple times. For example, my boss this past summer gave me this advice about presentations (originally from Aristotle):

Tell them what you are going to tell them. Tell them. Then tell them what you told them.

In oral presentations stating the main points both at the beginning and end will help the audience remember them.

4. Trade-offs

You may have come across the famous quote in an Economics class that means there are usually trade-offs to decisions:

There ain’t no such thing as a free lunch. – (popularized by) Milton Friedman

In the context of communication the biggest “cost” is likely to be time spent preparing; it takes time to communicate well. Tweaking a single graphic can take hours. Your goal is not perfection; your goal is “good enough” for the purpose at hand.

“Good enough” is once again context dependent. An email to your mom might get a quick glace over while an email to a potential employer likely takes multiple rounds of revision.

Other trade-offs might include:

  • Provide more detail vs. make the content easier to understand
  • Targeting one audience vs. another audience

Communication strategies

This section discusses a number of strategies for effective communication. Many of these suggestions are corollaries of the four principles discussed above and are not mutually exclusive.

Revise, revise, revise

Do many rounds of revision. This applies to coding, writing, oral presentations and making visualizations. Revision is one of the best ways to improve something. Stepping away then coming back will give your a fresh perspective. It will also help you catch errors.

You should attack any work you are editing (your own or someone else’s). To quote my father:

When editing go for the jugular. – Calum Carmichael

If you are publishing something publicly you should revise it several times. If it’s something that really matters then you should get outside feedback from several people.

State the message first then state the details

An effective presentation/document states the message before the details of the argument. The message is (usually) more important than the details. Therefore the structure of the document should reflect this inequality. The audience may not appreciate the message without motivation so providing context should come before the message.

Most professional communications are structured as:

  1. Motivation (what is the problem and why should the audience care?)
  2. The main message (how did you solve the problem?)
  3. The details supporting the main results.

When designing a graphic, presentation or document you should have a relentless focus on conveying the message – the “so what?” The following quote comes from a blog post about conveying your message that is worth reading (see here).

Too often, when we communicate with data, we don’t make our point clear. We leave our audience guessing. Your audience should never have to guess what message you want them to know. The onus is on the person communicating the information (you!) to make that clear. – Cole Knaflic

State the upshot of your presentation explicitly and at the beginning. It is tempting to save the conclusions until after presenting the details of the analysis or to assume the audience will just understand the point without you explicitly stating it. While these strategies may be be effective for writing a novel, they are not effective for technical communication.

The rule of message before details applies both at a macro-scale (e.g. executive summary) and at a micro-scale (e.g. plot titles). Some concrete recommendations include:

  • Always include an executive summary/abstract at the beginning of the document.
  • State the upshot of the graphic in the title (e.g. see this post).
  • Similarly for a PowerPoint state the upshot in the title of a slide.
  • Functions should have informative names (e.g. str_extract (good) vs grep (bad)).
  • When communicating a complex concept (e.g. a theorem) state the intuition before formally defining the concept.

Focusing on the message first requires that you understand the message. My high school English teacher’s favorite quote was

How can I know what I think until I see what I say.

You sometimes have to write most of the paper until you understand what the thesis is. It’s ok to leave deciding or clarifying the message until you are almost done composing a document or graphic.

Hierarchical is better than sequential

Humans tend to process hierarchical information better than sequential information.

We categorize living creatures into a hierarchical taxonomy (kingdom, phylum, …). Textbooks are organized into: chapters, sections, and subsections. A complex function is composed of many helper functions.

For a more concrete example, compare:

My research has both theoretical and applied components: dimensionality reduction for network valued random variables, temporally evolving preferential attachment models, support vector machine in high dimensional settings, DTI structural connectivity networks, text analysis of Supreme Court decisions.

to

My research has two components.

Theory

  • Dimensionality reduction for network valued random variables.
  • Temporally evolving preferential attachment models.
  • Support vector machine in high dimensional settings.

Application

  • DTI structural connectivity networks.
  • Text analysis of Supreme Court decisions.

The depth of the hierarchy you use is dependent upon the medium. For a written document try not to use more than three levels (chapters, sections, and subsections). For an oral presentation two levels is better.

Make the structure easy to navigate and understand

The audience should be able to navigate easily around a document. Furthermore, the audience should know where they are at all times.

My lecture notes have a floating table of contents on the left side. Textbooks provide a table of contents, number section headings and provide page numbers at the bottom. Websites typically have a site map at the top. In oral presentations you should show the outline at the beginning and periodically state where you are in the lecture.

Communicate at different levels

A technical document often communicates to different types of audiences; the same document might be read by both executives and data scientists. The same person might shift categories; the first time I read a paper I’m looking for the upshot/core ideas while the second time I read the paper I try to understand the technical details. Therefore, a document (paper, PowerPoint, etc) should communicate its message at multiple levels.

Stating the message then the details is an example of communicating at different levels. Academic papers include details both in the body of the paper and the appendix.

Visual communication: static plots

This section compares exploratory vs. communicatory plots and then discusses some ways in which plots can be misleading.

The first two plots below compare different ways of visualizing the same data – one version for exploration and one version for communication. The data for these plots are rental counts per hour over the course of one year for a bike sharing service (see previous lecture). An exploratory analysis discovered that the rental trends throughout the day was qualitatively different between working and non-working days (e.g. M-F vs. weekends).

The code that creates these plots can be found here.

Exploratory plot: speed and details

When you first analyze a data set you will rapidly make many exploratory plots. These plots should contain as much information as possible. Exploratory plots emphasize details over message and quantity over quality.

The target audience for an exploratory plot is the person making it (and maybe their collaborators). It is created quickly (2 lines of code) and contains lots of information (e.g. every data point).

Since minimizing speed and maximizing information are important for an exploratory plot I just kept the default values for ggplot. This plot would be better as an exploratory plot if I had used a jitter plot.

Communicatory plot: message and quality

After concluding an analysis, the final presentation(s) will likely have several figures in them whose purpose is to effectively communicate the findings to the intended audience. The purpose of a communicatory plot is message, not details. Often creating a good communicatory visualization is about decluttering i.e. do less! For example, watch this blog post step through decluttering a poor visualization.

The plot below is designed to communicate the findings to a general audience (e.g. business executives). This plot took longer to make (30 lines of code) and focuses on the message.

## Warning: package 'bindrcpp' was built under R version 3.4.4

Takeaways from the communicatory plot:

  • Title states the message of the plot.
  • Use median count per hour as a summary (i.e. we don’t plot every data point).
  • Axes are labeled properly.
  • No background grid and fewer axis ticks.
    • The precise numbers are not important for the message.
  • Rush hour peaks are labeled for emphasis.
  • Working data is coded in two ways:
    • Color
    • Line type

Misleading figures

It is possible to mislead both yourself and others with visualizations.

Dynamic visualizations

Some messages are most effectively communicated with a non-static plot. These visualizations may add a time dimension to a 2 dimensional, static plot (e.g. a gif or a movie). They also might give the user the ability to interact with the visualization.

Here are some examples worth looking through:

The majority of concepts are best communicated through a simple, static plot. Creating a fancy plot often adds noise to your message. There are some ways in which a dynamic plot can be effective.

  • Time adds a third dimension to a static, 2D plot.

  • There might be several related points you want to convey with similar visualizations. You could convey these points with a number of plots listed sequentially. An interactive plot where the audience can tweak some parameters creates hierarchy.

  • The audience might want to look through the data themselves. An interactive plot adapts to the audience by letting them look into what they are interested in.

  • The audience might want to dig into particular details. An interactive plot where the audience can mouse over points and pull up details both adapts to the audience and uses hierarchy.

  • Industry loves dashboards.

You already have the ability to make interactive and dynamic plots. Shiny allows you to create interactive visualizations. Creating a gif in R is not hard (see here).

Many of the visualizations above were created with D3 which is a JavaScript library for creating amazing visualizations in a web browser.

Programming

Writing code is an act of communication with two audiences: the computer and other programmers (including future you). This section discusses some strategies for writing better code (difficult to understand code == buggy code).

Code is a set of instructions that the computer will follow literally. If you make a syntax mistake the code won’t run. If you write code that runs, but isn’t what you meant to write the computer will still listen to you.

Most code will be revisited by you or someone else in the future (e.g. revisions, modification, as an example, re-factoring, etc). It is important to write code in such a way so this future person can understand what the code is doing and why you made the choices that you did.

You will pick up good software engineering principles with practice and through working with people who are more experienced than you. While software engineering is beyond the purview of this course, many software engineering principles are really about design and communication: writing code in a way that minimizes errors and maximizes human efficiency.

Functions and readable code

You should write a lot of functions when your code. Functions promote code reuse which makes your code faster to write and more likely to work well (see section 19 from r4ds).

Here are some suggestions that will make your code easier to understand:

  • Break a complex function into several helper functions (an example of hierarchy).

  • Functions and variables should have informative names (e.g. str_extract, mean_income <- mean(data$income)).

  • Variable names (and file names) should use a consistent, standard format (CammelCase or snake_case).

  • Use line breaks to visually organize code into smaller sections.

  • Write a comment for each function describing what the input is and what the function does.

  • Comment your code: over-commenting > under-commenting. Use comments to explain design choices that might not be obvious.

You code should be easy for a human to read and understand.

Folder organization

For a complex coding project you should organize your work into folders and sub-folders. Consider the github repositories for the tidytext package (see here).

The repository contains a README document (analogous to the executive summary). The package is organized hierarchically using folders and sub-folders. For example, the core R code behind the package is in the /R sub-folder, test code is in /tests.

The code itself is organized into many functions and separate scripts (see here). The package also has well written vignettes (see here or here). These vignettes contain minimal, reproducible snippets of example code that demonstrate how to get started with the package.

R Markdown

R Markdown (RMD) is a powerful medium for communication that allows you to weave code into a text document with some basic formatting. This section first discusses RMD’s capabilities as a text editor then literate programming.

For instructions how to use more of R Markdown’s capabilities see the R Markdown cheetsheet.

With R Markdown you can easily create:

  • Websites (390’s website was composed entirely in RMD)
  • Data analysis reports
  • Books (r4ds was written in RMD)
  • Resumes or academic papers
  • Slide show presentation
  • R Package Vignettes (see here)
  • Dashboards (e.g. see here)

See the RMD gallary for more ideas and example code.

Markdown: basic text editor

R Markdown can be used as a basic text editor that easily creates HTML documents with some light formatting. It’s capabilities include:

  • Text formatting
    • bold, italics, strike-through`
  • Add links
  • Create sections and subsections
  • Add block quotes
  • Create lists
  • Create tables
  • Add images
  • Include R code
  • Knit to different document formats (pdf, html, word document)
  • Customize the HTML document with the YAML header

These tools should used in the way the best communicates the message to the audience i.e. following the principles discussed above. Some more concrete recommendations include:

  • Be selective with emphasis (this article gets carried away with bolding – I’m certainly guilty of this occasionally).
  • Draw attention to important links. For example you might explicitly tell an audience to see a link (for important information see here).
  • Use sections/subsections to create a hierarchical document.
  • Add a floating table of contents to a longer document so the user has a map.

R Markdown: literate programming

Traditional literate programming is about making a complex program easy to read for a programmer by including documentation/commentary in the code. In the context of data science literate programming means you can weave code, figures and text together into one document. This presentation gives a good overview of using RMD for literate programming and I will quote a few of the slides in this section.

In the context of data science reproducibility means the ability for someone else to access your data, run your code and get the same results you got. This is surprisingly challenging. With R Markdown you can write all of the analysis code in the same document that you use to communicate it. The .Rmd file is now reproducible research!

Reproducibility is also about communication. In data science the content of an analysis is often the code of the analysis. Consider the following quote:

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. – David Donoho

A complicated data analysis can be described by a workflow chart (e.g. see here). However a verbal or visual description of an analysis is only a summary of the analysis. The actual analysis is contained in the code. If I really want to know what you did I need to be able to see your code.

Asking questions

Asking questions effectively will improve the chances that you get a good answer. Whether you email a colleague or post a question to StackOverflow – ineffective questions waste your time and other people’s time.

For instructions on how to ask effective questions about programming read the following three short posts:

There are two posts about reproducible examples because they are so important.

Additional References

Unfortunately Trees, Maps and Theorems is not free online, however you can find suplementary material here.

Edward Tufte has a number of famous books on communicating visual evidence (see here).

Chart dos and don’ts

Tips for Giving Clear Talks

viz.wtf/ has some wonderful examples of bad visualizations

Story Telling with Data is an excellent blog/book on communicating with data.

This article explains why interactive visualizations are now becoming effective forms of journalism.

The graphics for communication section of r4ds has good recommendations for visualization and demonstrates how to customize ggplot.

Mike Bostock’s blog post on what makes good software illustrates how some of the principles discussed in this lecture apply to writing code (see here). He also has a post on using visualization to understand algorithms (see here).

This blog post discusses reasons why the New York Times is so successful which you can learn from.

Kieran Healy’s class on data visualization has a lot of good resources and advice (see here).

Effective writing in mathematical statistics by Steve Marron.

For more on reproducible research see these articles (1, 2, 3).

Flowing Data is a wonderful blog on visualizations.