Alluvial Plot

Overview #

An alluvial plot is a way to show proportions of observations across different categories.

Visually, an alluvial plot is made up of stacked bars, and between those bars are links.

Each set of stacked bars represents a categorical field. The separations within the stacked bars represent different categories.

These stacked bars are more precisely referred to as strata (singular, stratum), and the vertical stacked bar themselves are called axis.

The links between the bars conveys the set of observations shared between one categorical field and the next.

The entirety of these links (extending all the way from the left to the right of the plot) are more properly referred to as alluvia (alluvium in singular form).

The segments of these alluvia that connect axis to axis are referred to as flows. This of these flows as shorter sugments of longer alluvia.

The heights of the bars and the widths of the links represent the number of observations. Note that the total height remains consistent all along the plot, moving from left to right.

That’s a lot of words so far. Put simply, an alluvial plot shows proportions across different categories.

Example #

A very popular example of an alluvial plot is the composition of Titanic passengers, broken down by class, gender, and survival.

Here’s a subset of that data:

Class	Sex	Age	Survived	Freq
1st	Male	Child	No	0
2nd	Male	Child	No	0
3rd	Male	Child	No	35
Crew	Male	Child	No	0
1st	Female	Child	No	0
2nd	Female	Child	No	0
3rd	Female	Child	No	17
Crew	Female	Child	No	0
1st	Male	Adult	No	118
2nd	Male	Adult	No	154

In this dataset, the fields CLASS, SEX, AGE, and SURVIVED are categorical fields. FREQ is a numerical field.

An alluvial plot of this data might look like this:

## Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Compared to Sankey Diagram #

Visually, an alluvial plot looks quite a bit like a sankey diagram.

They both have bars and links of varying widths. On first glance, it’s very easy to mistake one for the other.

In fact, there’s a lot of sloppiness in the usage of the terms “alluvial” and “sankey”. The two terms are often used interchangeably, when they are in fact very different visualizations.

An alluvial plot is used to convey information about categories, whereas a sankey diagram is used to convey information of flows.

In an alluvial plot, the ordering of the bars do not necessarily matter as there is no logical sequencing to the data. In the Titanic data for instance, there is no reason why the CREW field needs to come before the SURVIVED field, and vice versa.

In contrast, order does matter in a Sankey diagram, since there is a logical sequencing to how one observation moves from one state to the next.

When to use #

Use an alluvial plot when you want to show the proportion of observations that are shared across different categorical fields.

Alluvial plots are good for conveying approximate proportions. They’re less effective for communicating precise measures.

Data #

An alluvial plot requires at least two categorical fields and one numerical field.

There can be more than two categorical fields, but if there are too many, the plot will look pretty busy and could be hard to read.

The data can be structured as lodes, but we won’t talk about that here. For more details, check out the ggalluvial vignette.

R #

My preferred way of generating alluvial plots in R is with the ggalluvial package.

ggalluvial builds upon ggplot2, which means we can use the different pieces of tooling available through ggplot2 as well.

# install.packages("ggalluvial") # run this if the package hasn't already been installed
library(ggalluvial)

Let’s walk through how to build the Titanic plot from above.

Here’s the Titanic data.

str(Titanic)

##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"

head(Titanic)

## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20

Note that the Titanic data is in a table format. Let’s work with as as a data frame instead.

head(as.data.frame(Titanic))

##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

First, let’s draw the alluvia. We’ll arbitrarily order the categories to go from Class, to Sex, to Age, and we’ll have the widths map to the frequency (Freq). Let’s also map the color to survival status.

as.data.frame(Titanic) %>%
  ggplot(
    aes(y = Freq, axis1 = Class, axis2 =Sex, axis3 = Age)
  ) +
  geom_alluvium(aes(fill = Survived))

The axis parameters in the aesthetic mapping define the ordering of the categories.

That’s visually interesting, but not very functional. Let’s add some strata with labels.

as.data.frame(Titanic) %>%
  ggplot(
    aes(y = Freq, axis1 = Class, axis2 =Sex, axis3 = Age)
  ) +
  geom_alluvium(aes(fill = Survived)) +
  geom_stratum(alpha = .5) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum)))

A little tip: I like making the strata semi-transparent with the alpha parameter to allow the alluvia to peak out from behind.

For more details about the stat = "stratum" bit, check out the documentation here.

This is now functional, but let’s clean it up some by adding some more labels and removing extraneous bits.

as.data.frame(Titanic) %>%
  ggplot(
    aes(y = Freq, axis1 = Class, axis2 =Sex, axis3 = Age)
  ) +
  geom_alluvium(aes(fill = Survived)) +
  geom_stratum(alpha = .5) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Class", "Gender", "Age")) + # change x-axis labels
  theme(
    axis.ticks.y = element_blank(), # remove tick marks on the vertical
    axis.text.y = element_blank(), # remove numbers from the vertical
    panel.background = element_blank(), # remove the gray background
    panel.grid.major = element_blank(), # remove the major grid lines
    panel.grid.minor = element_blank() # remove the minor grid lines
  ) +
  labs(
    title = "Passengers of the Titanic" # add a plot title
  )

And that’s it. We now have a fairly clean and functional alluvial plot.

Conclusion #

Alluvial plots are fairly sophisticated, and the amount of customization possible can be pretty overwhelming.

Despite their sophistication, I would say that alluvial plots aren’t terribly functional. They suffer from the same flaw as pie charts or donut charts in the sense that the form presented in alluvial plots aren’t great for humans to draw accurate comparisons.

They can be aesthetically pleasing to look at though, and sometimes approximately communicating data in a pretty form is good enough.