Data manipulation for ggplot2

Why format matters

ggplot2 is built around long format data. In long format, each row represents a single observation, and variables are stored in columns. This structure maps naturally onto ggplot2’s aesthetic system: one column per aesthetic (x, y, colour, fill, etc.).

We’ll use gene count data as an example.

A long format table looks like this:

sample gene count
sample1 BRCA1 142
sample1 TP53 87
sample1 MYC 93
sample2 BRCA1 203
sample2 TP53 61
sample2 MYC 64

Each gene-by-sample combination gets its own row. This makes it straightforward to map gene to colour, sample to the x-axis, and count to the y-axis.

The problem: a lot of data is naturally wide

In contrast, data such as RNA-seq count matrices are typically stored in wide format, where each sample is its own column:

gene sample1 sample2 sample3 sample4
BRCA1 142 203 178 95
TP53 87 61 104 77
MYC 93 64 401 356

This format is convenient for storage and inspection, but ggplot2 cannot directly use it. To plot this data, we need to reshape it from wide to long.

Pivoting with tidyr

The tidyr package (part of the tidyverse) provides two key functions for reshaping data:

  • pivot_longer() : wide → long (the one we use most)
  • pivot_wider() : long → wide
library(tidyverse)

Creating a small example counts matrix

counts_wide <- data.frame(
  gene    = c("BRCA1", "TP53", "MYC", "GAPDH"),
  sample1 = c(142, 87, 312, 1045),
  sample2 = c(203, 61, 289, 987),
  sample3 = c(178, 104, 401, 1123),
  sample4 = c(95,  77, 356, 1034)
)

counts_wide
   gene sample1 sample2 sample3 sample4
1 BRCA1     142     203     178      95
2  TP53      87      61     104      77
3   MYC     312     289     401     356
4 GAPDH    1045     987    1123    1034

Note: The gene column is a column, not rownames. If you read in your data with the gene names as rownames, you may first need to run counts_wide <- counts_wide |> rownames_to_column(var = "gene") to move the rownames to a column.

pivot_longer(): wide to long

pivot_longer() collapses multiple columns into two: one for the former column names (here, sample IDs) and one for the values (counts).

counts_long <- counts_wide |>
  pivot_longer(
    cols      = -gene,                  # collapse all columns except gene
    names_to  = "sample",               # new column for old column names
    values_to = "count"                 # new column for the values
  )

counts_long
# A tibble: 16 × 3
   gene  sample  count
   <chr> <chr>   <dbl>
 1 BRCA1 sample1   142
 2 BRCA1 sample2   203
 3 BRCA1 sample3   178
 4 BRCA1 sample4    95
 5 TP53  sample1    87
 6 TP53  sample2    61
 7 TP53  sample3   104
 8 TP53  sample4    77
 9 MYC   sample1   312
10 MYC   sample2   289
11 MYC   sample3   401
12 MYC   sample4   356
13 GAPDH sample1  1045
14 GAPDH sample2   987
15 GAPDH sample3  1123
16 GAPDH sample4  1034

Now each row is one gene–sample observation – exactly what ggplot2 expects.

pivot_wider(): long back to wide

If you need to go the other direction (e.g., to export a count matrix or join tables), use pivot_wider():

counts_wide_again <- counts_long |>
  pivot_wider(
    names_from  = "sample",
    values_from = "count"
  )

counts_wide_again
# A tibble: 4 × 5
  gene  sample1 sample2 sample3 sample4
  <chr>   <dbl>   <dbl>   <dbl>   <dbl>
1 BRCA1     142     203     178      95
2 TP53       87      61     104      77
3 MYC       312     289     401     356
4 GAPDH    1045     987    1123    1034

Adding sample metadata

In a real RNA-seq analysis, you’ll also have a metadata table that maps sample IDs to experimental groups (e.g., treatment, timepoint, tissue). After pivoting to long format, you can join metadata in using left_join().

# A simple metadata table
metadata <- data.frame(
  sample    = c("sample1", "sample2", "sample3", "sample4"),
  condition = c("control", "control", "treated", "treated"),
  timepoint = c("T0", "T0", "T24", "T24")
)

metadata
   sample condition timepoint
1 sample1   control        T0
2 sample2   control        T0
3 sample3   treated       T24
4 sample4   treated       T24
# Join metadata onto the long counts table
counts_annotated <- counts_long |>
  left_join(metadata, by = "sample")

counts_annotated
# A tibble: 16 × 5
   gene  sample  count condition timepoint
   <chr> <chr>   <dbl> <chr>     <chr>    
 1 BRCA1 sample1   142 control   T0       
 2 BRCA1 sample2   203 control   T0       
 3 BRCA1 sample3   178 treated   T24      
 4 BRCA1 sample4    95 treated   T24      
 5 TP53  sample1    87 control   T0       
 6 TP53  sample2    61 control   T0       
 7 TP53  sample3   104 treated   T24      
 8 TP53  sample4    77 treated   T24      
 9 MYC   sample1   312 control   T0       
10 MYC   sample2   289 control   T0       
11 MYC   sample3   401 treated   T24      
12 MYC   sample4   356 treated   T24      
13 GAPDH sample1  1045 control   T0       
14 GAPDH sample2   987 control   T0       
15 GAPDH sample3  1123 treated   T24      
16 GAPDH sample4  1034 treated   T24      

Now each row carries both the count value and the experimental context for that sample – making it easy to map condition or timepoint to a ggplot2 aesthetic.

It’s a good idea to use one of the dplyr join() functions here (left_join typically being the most popular) to join the metadata to the counts long object. Left join will keep all observations in the “x” or left dataframe (our counts dataframe) and will automatically repeat values from the “y” dataframe (the metadata) where they match multiple times – this is important here as the same sample ID is repeated multiple times in our counts long object for each unique gene. We use by = "sample" argument as the column that exists in both dataframes and for most biological datasets we can reasonably expect to have metadata recorded by sample.

Handy tip: name the column the same in each dataframe to join it as the code above. If they are named differently (e.g., counts is “sample” and metadata is “samplename”), you can instead write the function as below, where the name on the left corresponds to the column name in the first dataframe and the name on the right corresponds to the column name in the second dataframe:

counts_annotated <- left_join(counts_long, metadata, by = c("sample" = "samplename"))

Plotting the result

With the data in long format and metadata attached, we can go straight into ggplot2:

counts_annotated |>
  ggplot(aes(x = condition, y = count, fill = condition)) +
  geom_boxplot() +
  theme_bw() +
  facet_wrap( ~ gene, scales = "free_y")