4  Choosing the Right Plot

A data visualization represents data in a simplified way by encoding that data with textual and graphical design elements which are then combined in a specific layout. Each plot type has a certain combination of the design elements and an associated typical configuration. The design elements act as building blocks for the overall data visualizations.

Therefore, deciding which design elements and plot type to use depends on the type of data you have and the question you want the visualization to answer.

The previous chapter discussed the types of data and the visual design elements appropriate for encoding each type. This chapter discusses the questions typical visualizations can answer and brings these considerations together in describing standard plot types.

4.1 Learning Objectives

This chapter will demonstrate how to:

  • Describe questions that data visualizations attempt to answer
  • Analyze which plot works with your data types and the question you want to answer
  • Evaluate strengths and weaknesses of various standard plots
  • Identify resources that can help you choose which plot works for you

4.2 Goals of Data Visualization

There are many possible goals for creating a data visualization. These include:

  • looking at a distribution
  • showing a correlation or other relationship between variables
  • finding a ranking
  • displaying change over time or after an intervention
  • indicating location and spatial relationships or distributions

4.3 Standard Graph Types

Standard graph types include conventional graphs (histograms, scatter plots, bar plots, and line charts) as well as other graphs used to look at distributions, find rankings, or consider group composition, complex relationships, and response. For the conventional plots, this section describes the kinds and amount of data needed to use each plot type as well as the questions that plot type can answer. For the graphs beyond those four, later sections provide additional information such as a description, strengths, weaknesses, and alternative plots.

Each of the four conventional graph types is used to answer a different question.

Plot type                   | Histogram                 | Scatter plot          | Bar plot                | Line chart
Primary type of data        | Numerical                 | Numerical             | Numerical & Categorical | Numerical
Minimum number of variables | 1                         | 2                     | 2                       | 2
Primary data encoding       | Area and Position         | Point and Position    | Area & Labels/Position  | Line and Position
Goal                        | looking at a distribution | showing a correlation | finding a ranking       | displaying change over time

Additional data and data encodings for conventional graph types

A histogram may visualize more than one distribution within the same visualization. In that case, the dataset will have categorical data and more than one variable. Such a visualization would likely use color or patterns to distinguish the distributions and should increase transparency (that is, reduce opacity) so that overlapping distributions remain visible.
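As a rough sketch of overlaying two distributions in one histogram, the example below draws two groups with shared bins and partial transparency. It assumes matplotlib is available; the group names, simulated values, colors, and the alpha value of 0.5 are illustrative choices, not recommendations from this chapter.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: two simulated groups measured on the same numerical variable.
rng = random.Random(0)
group_a = [rng.gauss(50, 10) for _ in range(200)]
group_b = [rng.gauss(65, 12) for _ in range(200)]

# Use the same bin range for both groups so the bars are directly comparable.
low = min(group_a + group_b)
high = max(group_a + group_b)

fig, ax = plt.subplots()
# alpha < 1 makes the bars partly transparent, so overlapping regions stay visible.
counts_a, _, _ = ax.hist(group_a, bins=20, range=(low, high),
                         alpha=0.5, color="tab:blue", label="Group A")
counts_b, _, _ = ax.hist(group_b, bins=20, range=(low, high),
                         alpha=0.5, color="tab:orange", label="Group B")
ax.set_xlabel("Measurement")
ax.set_ylabel("Count")
ax.legend()
```

Calling `fig.savefig(...)` with a file name of your choice would write the figure to disk.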

A scatter plot may have categorical data that can be encoded with shape or color. Any additional numerical variables beyond the minimum of 2 (one for the x-axis and one for the y-axis) could be encoded with size or color (using a sequential color palette).

A bar plot tends to have categories and counts for those categories (two variables, one categorical and one numerical), though a dataset may contain just the categories, which then need to be counted (some visualization software can do this for the researcher). Additional data could represent some grouping or categorization of the categorical data (another categorical variable), in which case color may be used to encode that additional variable.

A line chart may or may not use points in addition to lines. It is good practice for the researcher to display the data points and not just the lines connecting them. If the dataset contains additional variables beyond the minimum of 2 (one for the x-axis and one for the y-axis), categorical data could be encoded with shape or color while numerical data could be encoded with size or color (using a sequential color palette).
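The table of conventional graph types and the notes on additional encodings can be read as a small lookup. The sketch below transcribes them into plain Python dictionaries; the function names and dictionary layout are hypothetical conveniences, not part of any plotting library.

```python
# Conventional graph types, transcribed from the table in this section.
CONVENTIONAL_PLOTS = {
    "histogram": {"min_variables": 1, "encoding": "area and position",
                  "goal": "looking at a distribution"},
    "scatter plot": {"min_variables": 2, "encoding": "point and position",
                     "goal": "showing a correlation"},
    "bar plot": {"min_variables": 2, "encoding": "area and labels/position",
                 "goal": "finding a ranking"},
    "line chart": {"min_variables": 2, "encoding": "line and position",
                   "goal": "displaying change over time"},
}

# Encodings suggested in the text for variables beyond the minimum
# (scatter plots and line charts).
EXTRA_ENCODINGS = {
    "categorical": ["shape", "color"],
    "numerical": ["size", "color (sequential palette)"],
}


def plots_for_goal(goal):
    """Return the conventional plots whose primary goal matches `goal`."""
    return [name for name, info in CONVENTIONAL_PLOTS.items()
            if info["goal"] == goal]


def plots_for_variable_count(n_variables):
    """Return the conventional plots usable with `n_variables` variables."""
    return sorted(name for name, info in CONVENTIONAL_PLOTS.items()
                  if info["min_variables"] <= n_variables)


ranking_plots = plots_for_goal("finding a ranking")
single_variable_plots = plots_for_variable_count(1)
```

A lookup like this only covers the four conventional plots; the later sections describe alternatives for the same goals.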

4.4 Ranking and Distribution Plots

Of the conventional plots described above, bar plots and histograms both fit within this category. Bar plots are typically used to find a ranking while histograms display a distribution. Ranking and distribution plots are often, but not exclusively, used for exploratory data analysis and internal visualizations. If presented as an expository visualization, these plots will usually supplement or support a more complex figure, perhaps as a subpanel within that larger figure. Even if ranking and distribution plots are rarely the focal point, they are a powerful tool in the researcher’s data visualization toolbox to profile data characteristics and compare groups.

4.4.1 Violin Plots

Description: Displays the density of data where narrow sections correspond to lower density and wider sections correspond to higher density. The plots generally resemble the outline of a violin.

When to use Violin plots:

  • Numerical data (may have categorical data if looking at multiple distributions)
  • Minimum of 1 variable
  • Best with minimal number of categories
  • Primary data encodings are area and position
  • Used to show distributions and rank groups

Strengths of Violin plots include:

  • Better at comparing multiple distributions than histograms (the conventional plot type)
  • Displays distribution shape such as skew and modes better than boxplots do

Weaknesses of Violin plots include:

  • Doesn’t show individual data points
  • Limited usefulness when there is little data (e.g., few records/patients/samples)
  • The method for creating the density visualization is a bit arbitrary (the shape depends on smoothing choices such as the bandwidth)
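To illustrate why the density outline is somewhat arbitrary, the sketch below implements a basic Gaussian kernel density estimate and evaluates it with two different bandwidths. This is an assumption about how a violin outline is typically computed, not a description of any specific library; the sample values and bandwidths are made up.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a density function built from Gaussian kernels of the given bandwidth."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of one Gaussian bump centered on every data point.
        return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)

    return density


# Hypothetical sample with two clusters (around 1.2 and around 4.2).
samples = [1.0, 1.2, 1.4, 4.0, 4.2, 4.4]

narrow = gaussian_kde(samples, bandwidth=0.3)  # keeps the two modes separate
wide = gaussian_kde(samples, bandwidth=2.0)    # smooths them into one broad hump

# With the narrow bandwidth the density dips between the clusters;
# with the wide bandwidth the midpoint becomes the highest region.
dip_visible = narrow(1.2) > narrow(2.7)
dip_smoothed_away = wide(2.7) > wide(1.2)
```

The same six values produce a bimodal or a unimodal outline depending only on the bandwidth, which is why it helps to show the individual points or report the smoothing settings alongside a violin plot.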

Alternatives for Violin plots include:

  • Boxplot (especially with jitter to show individual data points)
  • Ridgeline plot
  • Raincloud plot

An example of a violin plot in the biomedical, bioinformatics, or cancer informatics literature is Figure 5c in this paper, which shows the expression of a tumor suppressor gene based on the genotype of an eQTL-associated SNP: https://www.nature.com/articles/s41586-024-07708-2.

4.4.2 Boxplots (with jitter)

Description: Boxplots use a box to delineate the median and quartiles of the data with whiskers to display the range and potentially points for outliers. A “jitter” can be utilized to include points for each data point.
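The summary statistics behind a boxplot can be sketched with the standard library. The example assumes the common convention of whiskers reaching the most extreme points within 1.5 times the interquartile range, with anything beyond drawn as an outlier; plotting tools differ in both the whisker rule and the quartile method, so treat the numbers as one possible result. The function name and data are hypothetical.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Compute quartiles, whisker ends, and outliers for one group of values."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population; other methods
    # can give slightly different quartiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": outliers,
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

Here the value 30 falls outside the upper fence, so it would be drawn as a separate outlier point rather than stretching the whisker.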

Jitter and Randomness

Inclusion of jitter typically uses randomness to space the points over the box so you may need to set a random seed in your analysis to ensure the visualization appears exactly one way consistently. However, it’s not a major problem if the jitter points move on a plot as they’ll have the same numerical axis location, only moving within the categorical axis.
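A minimal sketch of seeded jitter, assuming you compute the horizontal offsets yourself rather than relying on a plotting library's jitter option. The function name, width, and seed value are hypothetical.

```python
import random


def jittered_positions(n_points, center, width=0.3, seed=42):
    """Return n_points positions spread uniformly within center +/- width/2."""
    rng = random.Random(seed)  # local generator; leaves the global random state alone
    return [center + rng.uniform(-width / 2, width / 2) for _ in range(n_points)]


# Hypothetical measurements for one category drawn at position 1 on the categorical axis.
values = [4.1, 5.3, 5.9, 6.2, 7.8]

# The same seed reproduces the same offsets, so the figure looks identical on every run.
xs_first = jittered_positions(len(values), center=1)
xs_second = jittered_positions(len(values), center=1)

# A different seed moves the points sideways only; `values` (the numerical axis) is unchanged.
xs_other_seed = jittered_positions(len(values), center=1, seed=7)
```

Pairing each entry of `xs_first` with the matching entry of `values` gives the point coordinates to draw over the box.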

When to use Boxplots:

  • Numerical data (may have categorical data if looking at multiple distributions)
  • Minimum of 1 variable
  • Best if used with intermediate to large sample sizes
  • Primary data encodings are lines and points
  • Used to show distributions and rank groups

Strengths of Boxplots include:

  • Shows the data as well as helpful summary statistics and overall variation
  • Can quickly spot outliers
  • Can compare groups efficiently

Weaknesses of Boxplots include:

  • Doesn’t show multiple modes as well as violin plots
  • Quartiles may not be familiar concepts for general audiences
  • Outliers need further investigation

Alternatives for Boxplots include:

  • Violin plot
  • Histogram
  • Ridgeline plot
  • Strip plot (shows the individual data points only)
  • Raincloud plot
  • Swarm plot

Include an example plot

4.4.3 Ridgeline Plots

Description: Shows the distribution of multiple categories of numerical data, using densities (half a violin). Looks like stacked, but slightly offset mountain ridges.

When to use Ridgeline plots:

  • Numerical data (with categorical data for groups)
  • Minimum of 2 variables
  • Best with a higher number of categories
  • Primary data encodings are area and position
  • Used to show distributions and rank groups

Strengths of Ridgeline plots include:

  • Efficiently displays distributions for many groups or categories
  • Shows distribution shape (e.g., skew and modes)

Weaknesses of Ridgeline plots include:

  • Can be cluttered if there is too much overlap between groups or not a clear ordering of the categories
  • Good for general comparison among groups, but not for finding precise quantitative differences
  • The method for creating the density visualizations is a bit arbitrary

Alternatives for Ridgeline plots include:

  • Boxplot (especially with jitter to show individual data points)
  • Strip plot
  • In some cases, a sorted heatmap
  • A table may be better if you need to make quantitative comparisons
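When the categories have no natural order, one option is to sort the ridges by a summary statistic so that neighboring ridges are similar. The sketch below orders groups by median and assigns each a vertical baseline. Sorting by median is an assumption chosen for illustration, and the function name, spacing value, and data are hypothetical.

```python
import statistics


def ridgeline_layout(groups, spacing=1.0):
    """Order groups by median and give each ridge a vertical baseline offset."""
    ordered = sorted(groups, key=lambda name: statistics.median(groups[name]))
    # Each ridge is drawn `spacing` units above the previous one; a smaller
    # spacing increases overlap between neighboring ridges.
    return {name: index * spacing for index, name in enumerate(ordered)}


# Hypothetical measurements for three groups.
groups = {
    "Group C": [7, 8, 9],
    "Group A": [1, 2, 3],
    "Group B": [4, 5, 6],
}

baselines = ridgeline_layout(groups)
```

Each group's density curve would then be drawn on top of its baseline, with the lowest median at the bottom.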

Include an example plot

4.4.4 Raincloud Plots

Description: Combines 3 plots: density, boxplot, individual data points. Generally looks like a raining cloud.

When to use Raincloud plots:

  • Numerical data (with categorical data for groups)
  • Minimum of 1 variable
  • Best with minimal number of categories
  • Primary data encodings are area, lines, and points
  • Used to show distributions and rank groups

Strengths of Raincloud plots include:

  • Shows density/distribution shape as well as individual data points and common summary statistics, packing a lot of information into one plot
  • Works well with small datasets

Weaknesses of Raincloud plots include:

  • A lot is going on in a single plot
  • Harder to interpret and cluttered or squished if there are more than a few groups

Include an example plot

4.4.5 Strengths Overview

The strengths of the plots that help you visualize distributions and compare groups

Violin plots are good if you want to look at distribution characteristics (e.g., skew or multimodality) at a glance.

Boxplots are great if you want to see the data and general summary statistics (e.g., median).

Ridgeline plots are great at comparing distributions of lots of groups.

Raincloud plots are great when you want to go deep and see everything for a small number of groups (exploratory analysis).

4.5 Group Composition, Complex Relationships, and Response Plots

Of the conventional plots described above, scatter plots and line charts both fit within this category. Scatter plots are usually used to display a response as well as complex relationships while line charts may show a response or change over time. Plots within this category are more frequently expository data visualizations than exploratory data visualizations. These plots are often less intuitive for an audience to interpret without guidance. Therefore the polishing and refinement steps as well as including informative titles, captions, and labels are especially important when designing and building visualizations within this category.

4.5.1 Stacked Bar Charts

Description: Displays multiple categories within a single bar, stacked on top (vertical) or next to (horizontal) each other. Each overall bar represents a group. Within bars, each stack is a sub-category or subgroup.

When to use Stacked Bar charts:

  • Numerical data with categorical data for groups
  • Minimum of 2 variables
  • Primary data encodings are bars, position, and color or pattern
  • Used to find a ranking or show change over time
  • Compare sub-categories within a bar or consider how sub-category contribution to the total bar composition changes over time/across bars
  • Avoid if you have a lot of sub-categories or want to compare individual sub-categories to each other
  • If using more than a couple of categories, consider using proportions (that sum to 100%) on the numerical axis

Strengths of Stacked Bar charts include:

  • Summarizes complex group composition in a compact way
  • Can compare composition over time or across groups

Weaknesses of Stacked Bar charts include:

  • Difficulty comparing subgroups that aren’t aligned or are in the middle of the overall bar
  • If one subgroup contributes too much, it can dwarf other subgroups; similarly, if the groups have uneven sizes, larger groups can dwarf smaller ones
  • If the numerical axis uses proportions, raw numbers aren’t visualized

Alternatives for Stacked Bar charts include:

  • Grouped bar chart
  • Lollipop plot
  • Line chart
  • A table
  • A waterfall plot for the most important sub-category

An example of a stacked bar chart in the biomedical, bioinformatics, or cancer informatics literature: https://www.mdpi.com/2079-9721/13/2/37
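Converting raw counts to proportions that sum to 100%, and working out where each stacked segment starts, can be sketched as follows. The group and sub-category names and the counts are made up for illustration.

```python
def to_percentages(counts):
    """Convert one group's sub-category counts to percentages of the group total."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}


def stack_bottoms(heights):
    """Return where each segment starts so segments stack without overlapping."""
    bottoms, running = {}, 0.0
    for name, height in heights.items():
        bottoms[name] = running
        running += height
    return bottoms


# Hypothetical counts: two groups of uneven size, two sub-categories each.
raw = {
    "Group 1": {"subtype A": 30, "subtype B": 10},
    "Group 2": {"subtype A": 5, "subtype B": 15},
}

percent = {group: to_percentages(counts) for group, counts in raw.items()}
bottoms = {group: stack_bottoms(values) for group, values in percent.items()}
```

On the percentage scale both bars have the same total height, which removes the dwarfing effect of uneven group sizes but hides the fact that Group 1 has twice as many observations as Group 2.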

4.5.2 Heatmaps

Description: A visualization that always uses color to display numbers or categories, typically laid out in two dimensions as the rows and columns of a table or matrix. The two-dimensional matrix is typical but not a requirement, because the two dimensions could instead be a map projection or a tissue sample.
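In the typical matrix layout, a heatmap draws on three variables: one that defines the rows, one that defines the columns, and one that is encoded as color. The sketch below reshapes long-format records into such a matrix; the gene and sample labels and the values are hypothetical.

```python
def pivot_to_matrix(records):
    """Reshape (row_label, column_label, value) records into a row-by-column matrix."""
    row_labels = sorted({row for row, _, _ in records})
    col_labels = sorted({col for _, col, _ in records})
    lookup = {(row, col): value for row, col, value in records}
    # Missing combinations become None so gaps stay visible instead of defaulting to 0.
    matrix = [[lookup.get((row, col)) for col in col_labels] for row in row_labels]
    return row_labels, col_labels, matrix


# Hypothetical long-format data: one record per (gene, sample) pair.
records = [
    ("gene_1", "sample_a", 2.0),
    ("gene_1", "sample_b", 3.5),
    ("gene_2", "sample_a", 0.5),
    ("gene_2", "sample_b", 1.0),
]

rows, cols, matrix = pivot_to_matrix(records)
```

Each cell of `matrix` would then be mapped to a color, with `rows` and `cols` providing the axis labels.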

When to use Heatmaps:

  • Categorical or numerical data or a mixture
  • Minimum of 3 variables
  • Primary data encodings are color and position
  • Used to look at a distribution across samples or change over time/after an intervention
  • May also exhibit pairwise correlation for many groups
  • Not recommended if you need to know or display actual values
  • Often used together with hierarchical clustering or reordering sample location to highlight trends or areas of similarity

Strengths of Heatmaps include:

  • Provides a broad overview to identify trends, areas of density, sparsity, or outliers
  • Efficiently and compactly simplifies complex data into one visualization

Weaknesses of Heatmaps include:

  • The “matrix” aspect can be difficult for viewers to grasp
  • Rearranging data can be helpful but also misleading if not described
  • Actual values and summary stats are rare/difficult to display if there are many rows and columns

Pairwise correlation: Full matrix vs lower triangle

When using a heatmap to display pairwise correlation, the heatmap is symmetrical and presents redundant information (each correlation is displayed twice) because the correlation of A to B is the same as the correlation of B to A. The diagonal will always have perfect correlation because it’s comparing a sample or variable to itself. If looking at just the lower triangle, it presents the pairwise correlation only once and removes the redundant information.
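The redundancy can be seen by computing the correlations directly. The sketch below keeps only the entries strictly below the diagonal and leaves everything else empty. Dropping the diagonal as well is a choice made here because it is always 1; the variable names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lower_triangle_correlations(variables):
    """Return labels and a matrix holding each pairwise correlation exactly once."""
    names = list(variables)
    matrix = []
    for i, row_name in enumerate(names):
        row = []
        for j, col_name in enumerate(names):
            if j < i:
                row.append(pearson(variables[row_name], variables[col_name]))
            else:
                row.append(None)  # diagonal and upper triangle are left blank
        matrix.append(row)
    return names, matrix


# Hypothetical measurements for three variables across four samples.
data = {
    "gene_a": [1, 2, 3, 4],
    "gene_b": [2, 4, 6, 8],
    "gene_c": [4, 3, 2, 1],
}

names, lower = lower_triangle_correlations(data)
shown = [value for row in lower for value in row if value is not None]
```

With three variables the full matrix has nine cells, but only three distinct pairwise correlations need to be shown.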

4.5.3 Sankey Diagrams

Description: Represents the “flow” or movement of quantities (e.g., patients or samples) among different categories through different steps (e.g., demographics or steps of an analysis or trial).

When to use Sankey diagrams:

  • Categorical data and sometimes numerical data/counts
  • Requires several variables
  • Primary data encodings are lines, area, and color
  • Used to show change over time
  • Best with datasets with some sort of longitudinal data or time points with categories at each time point
  • Example uses include tracking patient response or symptom trajectories as well as patient cohort stratification

Strengths of Sankey diagrams include:

  • Provides a lot of information about the study population and shows all possible paths
  • Provides a visual or graphical representation of data that would otherwise be in a table or just described in text

Weaknesses of Sankey diagrams include:

  • Difficult or complex to interpret
  • Flows that cross or overlap or flows with similar widths can be difficult to compare though interactivity can alleviate this
  • Diagrams don’t always clearly and explicitly display sample dropout (discontinue treatment, etc.)
  • Often require specialized tools to create visualization

Alternatives for Sankey diagrams include:

  • A table
  • Stacked bar charts, but these lose the connection/possible path aspect
  • Flowcharts
  • Treemaps
  • Other graphical visualizations or some custom made visualization
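Before a Sankey diagram can be drawn, the data usually has to be summarized as flows: how many units move from each category at one step to each category at the next. The sketch below counts those flows from per-patient trajectories. The stage names and trajectories are hypothetical, and the tuple layout is an illustrative choice rather than the input format of any particular tool.

```python
from collections import Counter


def sankey_links(trajectories):
    """Count how many trajectories move from one category to the next at each step."""
    links = Counter()
    for path in trajectories:
        # Pair each category with the one that follows it in the same trajectory.
        for step, (source, target) in enumerate(zip(path, path[1:])):
            links[(step, source, target)] += 1
    return links


# Hypothetical trajectories: treatment, then response, then outcome.
patients = [
    ["Treatment A", "Responder", "Remission"],
    ["Treatment A", "Responder", "Relapse"],
    ["Treatment A", "Non-responder", "Relapse"],
    ["Treatment B", "Responder", "Remission"],
]

links = sankey_links(patients)
```

Each count becomes the width of one flow. Patients who drop out before the final step simply contribute no later links, which is one reason dropout can be hard to see unless it is added as an explicit category.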

Include an example of the plot

4.5.4 Strengths Overview

The strengths of the plots for group composition, complex relationships, and response

Stacked bar charts are good if you want to see at a glance how subgroup composition changes over time or across groups, but they are less useful with lots of subgroups.

Heatmaps are good if you have at least 3 variables that you need to visualize and want to use color as the primary means to communicate the point of the visualization (e.g., the range of values, correlation, areas of density, patterns across groups or time).

Sankey diagrams are good if you want to look at the possible trajectories of symptoms, treatments, etc. and it’s important to know what combinations occur.

4.6 Field Standard Plots

There are several common plots that are “field standard,” or customarily used, within the biomedical field. These plots (including Waterfall plots, Caterpillar plots, Kaplan-Meier curves, Volcano plots, and MA diagrams) will be discussed in the final chapter. They often align with the category of group composition, complex relationships, and response plots. When deciding what plot is best for you and your research question, a field standard plot may be the answer given that the audience will be more familiar with such plots.

4.7 Galleries

Galleries can be useful to browse standard plots and learn information about them.

  • from Data to Viz Gallery: The from Data to Viz gallery provides visual examples along with explanations of the data types each plot requires.
  • R Graph Gallery: The R Graph Gallery provides visual examples and templates for creating standard plots with the R programming language.
  • Python Graph Gallery: The Python Graph Gallery provides visual examples and templates for creating standard plots with the Python programming language.
  • SMART App Gallery: For more specialized galleries related to biomedical applications, consider the Substitutable Medical Apps, Reusable Technology (SMART) App Gallery which provides free and open source apps for healthcare providers and researchers to work with and visualize health data. The integrated apps describe specific data types and goals and can provide dashboards and visualizations specifically for those data types/goals.
  • Genomics Data Visualization course website: The Genomics Data Visualization course website is another more specialized gallery from the Johns Hopkins University Fan lab which provides featured visualizations meant for working with large-scale single-cell and spatially-resolved omics datasets using R.
  • Publications: Journal publications provide a myriad of example visualizations working with field specific data. Reading publications, studying their figures, and critically evaluating the design choices is a wonderful way to improve your own data visualization skills. In the near future, as you read journal articles and look at the figures, ask yourself the following questions:
    • What was the goal or overarching message this visualization was meant to communicate?
    • What was difficult for me to understand about this figure and what design choices could have made this figure easier for me to follow along?
    • What did I like best about this visualization?
    • Are these visualizations standard/common plot types or were there any new (at least to me) visualizations? What was unique about them?

4.8 Summary

There are strengths and weaknesses to every kind of plot. Some plots are better for exploratory analysis than for sharing with wide audiences. If exact quantitative values need to be considered, don’t underestimate the power of a good table. When preparing an expository visualization, focus on one message.

Choosing which plot to use may be field, context, and data dependent. Use the simplest or most mainstream plot you can! Make sure that the plot works with your data type(s), number of variables, number of observations (e.g., patients or samples), and the overall goal or research question. Use design tools to develop your visualization before switching to a more complex plot. These design tools include utilizing facets or subplots, dashed lines and text labels, and varying colors, textures, or shapes. Finally, consider browsing a visualization gallery for inspiration!