Sankey Diagram

Overview #

A sankey diagram is a visualization used to represent flows that move through stages.

A sankey diagram requires three different types of information:

* Nodes – these are the stages along a flow.
* Links – these are the connections between different nodes, and show where the flow goes. Links are also scaled by some value to communicate the size of flows.
* Stage – these are details about the relative positioning of nodes. There can be multiple nodes along a given stage.

The concept of the Sankey diagram was created by engineer Matthew Henry Phineas Riall Sankey. The original Sankey diagram was used to show energy flow in a steam engine.

A sankey diagram has direction, and there is a logical ordering to how its parts are sequenced.

Charles Minard’s map of Napoleon’s 1812 campaign against Russia is probably the most famous rendition of what we now think of as a sankey diagram. In this example, the nodes are major geographic landmarks or moments, and the links represent the size of Napoleon’s army.

The different paths of a sankey diagram can merge, split, and loop backwards.

When to use #

A sankey diagram can be used to show how something flows through different stages.

Visually, a sankey diagram looks very similar to an alluvial plot, but they are functionally very different. Whereas a sankey diagram conveys information about flow, an alluvial plot displays information about categories.

Data #

The data for a sankey diagram is essentially graph (or network) data.

In particular, the data should be structured as an edgelist, with a “from” field and a “to” field. The “from” and “to” naming is not a strict requirement.

The elements in the “from” and “to” fields are essentially nodes.

“From” can also be referred to as “source”, and “to” can be referred to as “target”.

Certain implementations expect a numerical value for each record of the edgelist, which maps to the thickness of the links.

Depending on the implementation, you may also need to supply the stage of each “from” and “to” node, where the stages again represent the relative sequencing of nodes. See the ggsankey implementation below.
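When stage information is not already in the data, it can often be derived from the edgelist itself. The following is a minimal sketch in base R, assuming the flows never loop backwards (no cycles): each node is placed one stage after its latest predecessor. The object names `edgelist`, `node_names`, and `stage` are illustrative choices, not part of any package.

```r
# Illustrative edgelist with "from", "to", and a link value
edgelist <- data.frame(
  from  = c("A", "A", "B", "D"),
  to    = c("C", "D", "D", "E"),
  value = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Every distinct node starts in stage 1
node_names <- unique(c(edgelist$from, edgelist$to))
stage <- setNames(rep(1L, length(node_names)), node_names)

# Push each "to" node at least one stage past its "from" node.
# The number of passes is capped at the number of nodes, which is
# enough for acyclic data; with cycles the result is not meaningful.
for (pass in seq_along(node_names)) {
  changed <- FALSE
  for (i in seq_len(nrow(edgelist))) {
    from_node <- edgelist$from[i]
    to_node   <- edgelist$to[i]
    if (stage[[to_node]] <= stage[[from_node]]) {
      stage[[to_node]] <- stage[[from_node]] + 1L
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Attach the stage of each end of every link
edgelist$from_stage <- unname(stage[edgelist$from])
edgelist$to_stage   <- unname(stage[edgelist$to])
edgelist
```

Here A and B land in stage 1, C and D in stage 2, and E in stage 3.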

R #

There are a number of different packages available within the R ecosystem that can be used to generate sankey diagrams. They each differ slightly in their data structures and in the mechanics of building sankey diagrams.

networkD3 #

The networkD3 package is probably the most established sankey diagram implementation in the R ecosystem.

The output from networkD3 is interactive.

The data for networkD3 can be structured as an edgelist or as a matrix. I personally prefer working with edgelists.

from to value
A C 1
A D 1
B D 1
D E 2
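As a minimal sketch, the edgelist shown here could be created as a plain data frame. The object name `edgelist` matches the code later in this section; the construction is an assumption, since any method that produces these three columns would work.

```r
# Build the edgelist: one row per link, with the link's value
edgelist <- data.frame(
  from  = c("A", "A", "B", "D"),
  to    = c("C", "D", "D", "E"),
  value = c(1, 1, 1, 2),
  stringsAsFactors = FALSE # keep node names as character strings
)

edgelist
```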

We also need to pull out the distinct nodes.

##   node
## 1    A
## 2    B
## 3    D
## 4    C
## 5    E
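One way to get this node table, sketched under the assumption that the edgelist is a data frame with character `from` and `to` columns, is to stack both columns and keep the distinct values. The edgelist is rebuilt here so the snippet runs on its own.

```r
# Same edgelist as in the table earlier in this section
edgelist <- data.frame(
  from  = c("A", "A", "B", "D"),
  to    = c("C", "D", "D", "E"),
  value = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Distinct nodes, in order of first appearance across "from" then "to"
nodes <- data.frame(
  node = unique(c(edgelist$from, edgelist$to)),
  stringsAsFactors = FALSE
)

nodes
```

The row order matters later, because each node's position in this table becomes its numerical identifier.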

networkD3 identifies the nodes at each end of a link by a numerical index rather than by name. We can convert the original node names to numerical indices using the base R match() function.

edgelist$num_from <- match(edgelist$from, nodes$node)-1
edgelist$num_to <- match(edgelist$to, nodes$node)-1

knitr::kable(edgelist)
from to value num_from num_to
A C 1 0 3
A D 1 0 2
B D 1 1 2
D E 2 2 4

Note the -1. The match() function returns one-based positions, whereas networkD3 expects zero-based indices, so we subtract 1 to line the two up.

Now that we have the data properly structured, we’re ready to generate a sankey diagram.

library(networkD3)

sankeyNetwork(
  Links = edgelist,
  Nodes = nodes,
  Source = "num_from",
  Target = "num_to",
  Value = "value",
  NodeID = "node",
  sinksRight = FALSE, # boolean, if TRUE, the last nodes are moved to the right
  height = 600, # height in pixels
  width = 800, # width in pixels
  fontSize = 20
)

Bear in mind that this is interactive. You can hover over the links for more details, or drag the nodes around.

plotly #

The plotly package offers the ability to generate sankey diagrams in a manner very similar to networkD3.

For more details, check out the plotly documentation.
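As a rough sketch of what this can look like, the example below reuses the edgelist and zero-based node indices from the networkD3 section and passes them to `plot_ly()` with `type = "sankey"`. The `pad` and `thickness` values are arbitrary choices for illustration.

```r
library(plotly)

# Same illustrative data as in the networkD3 section
edgelist <- data.frame(
  from  = c("A", "A", "B", "D"),
  to    = c("C", "D", "D", "E"),
  value = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)
nodes <- data.frame(
  node = unique(c(edgelist$from, edgelist$to)),
  stringsAsFactors = FALSE
)

# plotly also refers to nodes by zero-based position in the label vector
edgelist$num_from <- match(edgelist$from, nodes$node) - 1
edgelist$num_to   <- match(edgelist$to, nodes$node) - 1

fig <- plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(
    label = nodes$node,
    pad = 15,       # spacing between nodes (arbitrary)
    thickness = 20  # node width (arbitrary)
  ),
  link = list(
    source = edgelist$num_from,
    target = edgelist$num_to,
    value  = edgelist$value
  )
)

fig
```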

ggsankey #

Sankey diagrams can be generated using the ggsankey package by David Sjoberg.

As the name suggests, this builds upon the ggplot2 system.

There is potential in ggsankey, but at this time, it does feel like the package requires some more polishing.

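# devtools::install_github("davidsjoberg/ggsankey")
library(ggplot2) # ggplot(), aes(), theme(), labs()
library(dplyr)   # provides the %>% pipe used below
library(ggsankey)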

The data format for ggsankey is very particular. Specifically:

* Each row must have four fields: a “from” node, a “to” node, a stage for the “from” node, and a stage for the “to” node.
* The stage details must be factors, where the factor levels are ordered sequentially.
* The total value of links passing through each stage should be equal. For instance, if there are three rows appearing in one stage, then there should be three rows appearing in the next stage as well.
* Nodes in the terminal stage (i.e., the last stage) should be presented as another row, but pointing at an NA node and an NA stage.

Here’s an example dataset:
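As a minimal sketch, a dataset following these rules could be built by hand as shown here. The node names (A to E) and stage labels are hypothetical, and the object name `example` and its column names are assumptions chosen to line up with the plotting code in this section; the values are illustrative only.

```r
# Ordered stage labels, used as factor levels (hypothetical names)
stages <- c("stage_1", "stage_2", "stage_3")

example <- data.frame(
  # Three units of flow, each passing through all three stages
  from = c("A", "A", "B",   # stage 1 nodes
           "C", "D", "D",   # stage 2 nodes
           "E", "E", "E"),  # stage 3 (terminal) nodes
  to   = c("C", "D", "D",
           "E", "E", "E",
           NA, NA, NA),     # terminal rows point at an NA node
  from_stage = factor(
    rep(stages, each = 3),
    levels = stages
  ),
  to_stage = factor(
    c(rep("stage_2", 3), rep("stage_3", 3), rep(NA, 3)),
    levels = stages         # terminal rows point at an NA stage
  ),
  stringsAsFactors = FALSE
)

example
```

Each stage has exactly three rows, so the total flow is the same at every stage.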

ggsankey does include a make_long() helper function to convert wide data into a properly formatted long data structure.
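A short sketch of that helper, assuming ggsankey is installed and using a hypothetical wide data frame with one column per stage. The rename at the end is only there to match the column names used in the plotting code in this section.

```r
library(ggsankey)

# Hypothetical wide data: one row per unit of flow, one column per stage
wide <- data.frame(
  stage_1 = c("A", "A", "B"),
  stage_2 = c("C", "D", "D"),
  stage_3 = c("E", "E", "E"),
  stringsAsFactors = FALSE
)

# make_long() returns one row per unit of flow per stage, with the
# columns x, node, next_x, and next_node
long <- make_long(wide, stage_1, stage_2, stage_3)

# Map those columns onto the naming used in this section
example_long <- data.frame(
  from       = long$node,
  to         = long$next_node,
  from_stage = long$x,
  to_stage   = long$next_x
)

example_long
```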

Let’s turn that example data into a sankey diagram.

ggplot(
  example,
  aes(
    x = from_stage,
    next_x = to_stage,
    node = from,
    next_node = to,
    fill =  from
  )
) +
  geom_sankey() 

We can make the flows a bit transparent.

ggplot(
  example,
  aes(
    x = from_stage,
    next_x = to_stage,
    node = from,
    next_node = to,
    fill =  from
  )
) +
  geom_sankey(flow.alpha = .5) 

Let’s also apply some labels and drop the legend.

example %>%  
  ggplot(
    aes(
      x = from_stage,
      next_x = to_stage,
      node = from,
      next_node = to,
      fill =  from
    )
  ) +
  geom_sankey(flow.alpha = .5) +
  geom_sankey_label(aes(label = from)) +
  theme(
    legend.position = "none"
  )

Let’s also use the included theme_sankey theme and clean up the labeling.

example %>%  
  ggplot(
    aes(
      x = from_stage,
      next_x = to_stage,
      node = from,
      next_node = to,
      fill =  from
    )
  ) +
  geom_sankey(flow.alpha = .5) +
  geom_sankey_label(aes(label = from)) +
  theme_sankey() +
  theme(
    legend.position = "none"
  ) +
  labs(
    x = NULL,
    y = NULL
  )

Resources #