Violin Plot

Overview

A violin plot displays the distribution of a numerical variable. The width of the violin at any given value reflects the estimated density of data points around that value: the wider the violin, the more observations fall near it.

A single chart can hold several violins side by side, one per category, which makes it easy to compare distributions across categories.
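To make the idea of "width" concrete, here is a minimal sketch in R using the diamonds dataset from ggplot2, which the R example in this article also uses. It computes a kernel density estimate for the prices in one category with base R's density() function. The names ideal_prices, dens, and widest_at are hypothetical, and the density computed by a plotting library may differ in bandwidth and scaling details.

```r
library(ggplot2)  # provides the diamonds dataset

# Prices for a single category (one violin)
ideal_prices <- diamonds$price[diamonds$cut == "Ideal"]

# Kernel density estimate: the curve a violin mirrors on both sides
dens <- density(ideal_prices)

# Price at which the estimated density, and so the violin, is widest
widest_at <- dens$x[which.max(dens$y)]
widest_at
```

Plotting dens with plot(dens) shows one half of the violin shape lying on its side.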

Data

A violin plot requires at least one categorical variable and one numerical variable.

R

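Let’s make a violin plot in R with the ggplot2 package. We’ll also load the tidyverse for data manipulation. ggplot2 is part of the tidyverse, so the second library() call is redundant but harmless.

library(tidyverse)  # attaches ggplot2, dplyr, tidyr, and other core packages
library(ggplot2)    # already attached by tidyverse; kept to make the dependency explicit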

For this example, we’ll use the built-in diamonds dataset, which ships with ggplot2.
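As a minimal sketch, assuming ggplot2 is attached, printing the dataset by name gives a preview of its dimensions, column types, and first rows:

```r
library(ggplot2)  # diamonds ships with ggplot2

# Printing a tibble shows its dimensions, column types, and first rows
diamonds
```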

## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ℹ 53,930 more rows
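The column types in the header row (<ord> for ordered factors, <int> and <dbl> for numbers) indicate which columns are categorical and which are numerical. A short sketch of checking this directly for two of the columns, using only base R functions:

```r
library(ggplot2)  # provides the diamonds dataset

# cut is an ordered factor, so it can serve as the categorical variable
is.factor(diamonds$cut)
levels(diamonds$cut)

# price is an integer column, so it can serve as the numerical variable
is.numeric(diamonds$price)
```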

Since a violin plot requires a categorical variable and a numerical variable, let’s use cut (categorical) and price (numerical) to draw a very simple violin plot.

ggplot(
  data = diamonds,
  aes(
    x = cut,
    y = price,
    fill = cut
  )
) +
  geom_violin()

Let’s break that down a bit.

data = diamonds sets the dataset that the plot draws from.

aes(...) defines the aesthetic mappings: cut goes on the x-axis, price goes on the y-axis, and each violin’s fill color is determined by its cut.
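One way to see the effect of these mappings is to store the plot in a variable and look at the data ggplot2 computes for the layer. This is a minimal sketch: p and violin_data are hypothetical names, and the mapping argument is named explicitly here only for clarity.

```r
library(ggplot2)

# Same plot as before, stored in a variable instead of drawn immediately
p <- ggplot(
  data = diamonds,
  mapping = aes(x = cut, y = price, fill = cut)
) +
  geom_violin()

# Data computed for the first (and only) layer
violin_data <- layer_data(p, 1)

# One fill color per level of cut, because fill is mapped to cut
unique(violin_data$fill)

# Printing the object draws the plot
print(p)
```

Because fill is mapped to the same variable as the x-axis, the colors repeat information the axis already shows; they are a visual aid, not an extra dimension.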