Bubble Plot

Overview

A bubble plot is simply an enhanced version of a scatter plot.

A simple, flat scatter plot typically only utilizes two numerical fields and conveys data based on the position of points.

A bubble plot, on the other hand, uses a third numerical field: the size of each point changes based on the value of that field.

Data

A bubble plot requires at least three numerical fields:

  • A numerical field for the horizontal x-axis
  • A numerical field for the vertical y-axis
  • A numerical field for the size of the point (or in this case, the bubble)

R

Let’s mock up a simple example using the tribble function, which is available once R’s tidyverse collection of packages is loaded:

library(tidyverse)

A simple dataset example that would work with a bubble plot could look like this:

example_dat <- tribble(
  ~age, ~"years of work", ~income,
  20, 3, 50000,
  30, 5, 60000,
  35, 7, 100000
)

example_dat
## # A tibble: 3 × 3
##     age `years of work` income
##   <dbl>           <dbl>  <dbl>
## 1    20               3  50000
## 2    30               5  60000
## 3    35               7 100000

A simple scatter plot using that example data would merely be points plotted along an x- and y-axis with no size scaling.

example_dat %>%
  ggplot(
    aes(
      x = age,
      y = `years of work` # note that these are backticks, not single quotes; they are necessary because of the spaces in the field name
    )
  ) +
  geom_point()

This view doesn’t tell a reader much. All a reader can take from it is that there are three data points representing individuals, positioned by their age and years of work experience.

We can map the size of the points to income to convey a sense of how income changes with different combinations of age and years of work experience:

example_dat %>%
  ggplot(
    aes(
      x = age,
      y = `years of work`, # note that these are backticks, not single quotes; they are necessary because of the spaces in the field name
      size = income
    )
  ) +
  geom_point()
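By default, ggplot2 scales the area of each point (not its radius) to the mapped value, and the smallest value in the data still gets a visible point. As a minimal sketch, assuming you want larger bubbles and a value of zero to correspond to zero area, scale_size_area() can be added to the plot. The max_size = 12 setting and the bubble_plot name are arbitrary choices made here for illustration, not part of the example above.

```r
library(tidyverse)

# Same mock data as above, repeated so this block runs on its own
example_dat <- tribble(
  ~age, ~"years of work", ~income,
  20, 3, 50000,
  30, 5, 60000,
  35, 7, 100000
)

bubble_plot <- example_dat %>%
  ggplot(
    aes(
      x = age,
      y = `years of work`,
      size = income
    )
  ) +
  geom_point() +
  # Area is proportional to income; max_size = 12 is an illustrative choice
  scale_size_area(max_size = 12, labels = scales::comma)

bubble_plot
```

Raising or lowering max_size changes how large the biggest bubble is drawn; the relative areas of the bubbles stay the same.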

A more interesting example

The late Hans Rosling gave a very inspiring TED Talk about different countries, life expectancy, population size, and GDP per capita. His presentation features bubble plots heavily.

An excerpt of the data he used can be accessed through the gapminder package.

# install.packages("gapminder") # if you don't already have the package installed
library(gapminder)

Let’s take a look at the dataset.

gapminder
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

Rosling’s presentation includes animated bubble plots, where time is featured as another dimension. For now, we’ll focus on a simpler, non-animated plot that only shows one year. Let’s arbitrarily focus on 2007, the final year in the included gapminder dataset. The plot we’ll make is one that utilizes three numerical dimensions from the dataset:

  • GDP per capita
  • life expectancy
  • population size
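As a quick check before plotting, the sketch below pulls out the 2007 rows and the three fields listed above. The name gapminder_2007 is introduced here for illustration only.

```r
library(dplyr)
library(gapminder)

# Keep only 2007 and the fields used for x, y, and size
gapminder_2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, gdpPercap, lifeExp, pop)

range(gapminder$year)    # first and last years in the dataset
nrow(gapminder_2007)     # one row per country
summary(gapminder_2007)  # spread of each numerical field
```

The summary is worth a glance because the spread of each field hints at which axis might need a non-linear scale later.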

As a bonus, let’s throw in the continent field, which can be mapped to color.

gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  geom_point(
    aes(
      x = pop,
      y = gdpPercap,
      size = lifeExp,
      color = continent
    )
  )

That works, but it’s difficult to read: there’s a lot of point overlap, there’s a huge cluster on the left that makes separation difficult, the axis labels are not readily human-readable, and without a title or caption there’s no contextual framing for a reader. Let’s fix those issues.

gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  geom_point(
    aes(
      x = pop,
      y = gdpPercap,
      size = lifeExp,
      color = continent
    ),
    alpha = .4 # makes points semi-transparent. Values range from 0 (fully transparent) to 1 (fully opaque)
  ) +
  scale_x_continuous(label = scales::comma) +
  scale_y_continuous(label = scales::comma) +
  labs(
    title = "Life Expectancy by GDP Per Capita and Population Size in 2007",
    x = "Population",
    y = "GDP Per Capita",
    caption = "Source: Gapminder"
  )

That’s better, but there’s still a huge cluster towards the left. Most countries have broadly similar population sizes, while two outliers, India and China, sit far to the right. We can fix this by replacing the standard linear x-axis scale with a logarithmic scale. Logarithmic scales are useful for displaying values that span several orders of magnitude.
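To see why a logarithmic scale spreads the points out, here is a small base R sketch. The population values are round numbers chosen for illustration, not figures taken from the gapminder data.

```r
# Illustrative round numbers, not taken from the gapminder data
populations <- c(1e6, 1e7, 1e8, 1e9)

# On a linear axis, position is proportional to the raw value, so the
# three smallest values all fall within the first tenth of the axis
linear_position <- populations / max(populations)
linear_position

# On a log10 axis, every tenfold increase covers the same distance
log_position <- log10(populations)
log_position
diff(log_position)
```

The equal gaps between the log10 values are what keep the smaller countries from collapsing into one corner of the plot.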

In this case, we’ll use a log10 transformation. Specifically, we’ll replace scale_x_continuous(label = scales::comma) with scale_x_log10(label = scales::comma) to change the x-axis scale.

gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  geom_point(
    aes(
      x = pop,
      y = gdpPercap,
      size = lifeExp,
      color = continent
    ),
    alpha = .4 # makes points semi-transparent. Values range from 0 (fully transparent) to 1 (fully opaque)
  ) +
  scale_x_log10(label = scales::comma) +
  scale_y_continuous(label = scales::comma) +
  labs(
    title = "Life Expectancy by GDP Per Capita and Population Size in 2007",
    x = "Population",
    y = "GDP Per Capita",
    caption = "Source: Gapminder"
  )

There’s more that we can do with this plot, including faceting, labeling, and applying more theming, but what we now have is a minimum viable plot that is readable and conveys useful information.
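As one possible next step, here is a minimal sketch of faceting, assuming one panel per continent is the goal. facet_wrap(), theme_minimal(), and guides() are standard ggplot2 functions. Hiding the color legend (the panel titles already name each continent) and the name faceted_plot are choices made here for illustration.

```r
library(tidyverse)
library(gapminder)

faceted_plot <- gapminder %>%
  filter(year == 2007) %>%
  ggplot() +
  geom_point(
    aes(
      x = pop,
      y = gdpPercap,
      size = lifeExp,
      color = continent
    ),
    alpha = .4
  ) +
  # `labels` is the full name of the argument that formats the axis values
  scale_x_log10(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(~continent) + # one panel per continent
  guides(color = "none") + # panel titles make the color legend redundant
  theme_minimal() +
  labs(
    title = "Life Expectancy by GDP Per Capita and Population Size in 2007",
    x = "Population",
    y = "GDP Per Capita",
    caption = "Source: Gapminder"
  )

faceted_plot
```

Faceting trades a single dense panel for several sparser ones, which can make within-continent patterns easier to see at the cost of direct cross-continent comparison.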