Data structures¶

Learning objectives

Understand data frames and tibbles are special lists
Know when and why to use a matrix
Understand attributes and how they affect data structure
Understand recursion in functions

graph BT
  A["Vector"]
  B["Atomic (1-D)"]
  C["Matrix (2-D)"]
  D["Array (n-D)"]
  E["List"]
  F["Data frame"]
  G["Tibble"]
  B --- A
  C --- B
  D --- B
  E --- A
  F --- E
  G --- E

All data objects are vectors. The difference between the colloquial vector (atomic vector) and list is the kinds of data they can contain: atomic vectors only store data of the same type in a single dimension (e.g., c(1, 2, 3, 4, 5)); lists are generic vectors that are able to store data of variable types and can also store other lists (i.e., recursion, see below). A further generalisation of the atomic vector are matrices (for 2-dimensional data) and arrays (for n-dimensions). While these types of data structures are seemingly restrictive (only one data mode allowed), they are how the vast majority of statistics and math is actually done!

Tabular data¶

In Introduction to R, we introduced the data frame as a form tabular data storage. Here, we will extend that knowledge by exploring some of the underlying principles of tabular data structures in R. Additionally, we will also learn how to use them effectively.

Data frames are special lists¶

Technically, data frames are special lists: they contain lists of variables of equal lengths (i.e., cases-by-variables). For practical purposes, think of each column as a vector, and that each vector has the same length that makes up the rows of the data frame. The following example exposes this construct:

code

nr <- 10
df_list <- list(
  "A" = sample(asv$ASVID, nr),
  "B" = sample(asv[[sample(colnames(asv), 1)]], nr),
  "C" = sample(c(TRUE, FALSE), nr, replace = TRUE)
)

# Print data.frame
as.data.frame(df_list)

# Check equivalence and structure
is.list(as.data.frame(df_list)) == is.list(df_list)

str(as.data.frame(df_list))
str(df_list)

arguments imply differing number of rows

If the number of elements of the vectors in the list is different, we will get the above error. This reinforces the special case of data frame as a list-based structure.

Inspect source code of data.frame

If you look closely at the source code of the data.frame function, it exposes how the input is handled as a list and how it iteratively reconstructs the list into a tabular form, assigns column and row names, and performs checks on lengths.

If we wanted to, we can also add row names for coercion:

code

rn <- sample(words, nr)
df_list <- map(df_list, \(x) {
  names(x) <- rn
  x
})

str(df_list)
as.data.frame(df_list)

The list-based nature of the data frame means it is highly amenable to column-wise iteration via map() (or it's variants). However, due to the mixed data modes commonly found in data frames, the function being "mapped" will need to have some control flow statements to deal with them correctly.

code

# Shannon's index for each sample
map(asv, \(x) {
  if (is.numeric(x)) {
    calc_alpha_diversity(x, index = "shannon")
  }
}) %>%
  unlist()

The lazy cousin: tibbles¶

Tibbles are a modern implementation of data frames developed and maintained by the folks behind tidyverse. We'll use an example to showcase how they are "lazy".

code

Create a data frame and a tibble.

df_A <- data.frame(
  a = 1:3,
  b = I(matrix(rnorm(6), nrow = 3, ncol = 2)),
  c = I(list(
    matrix(rnorm(6), nrow = 3, ncol = 2),
    matrix(rnorm(6), nrow = 3, ncol = 2),
    as.data.frame(matrix(rnorm(6), nrow = 3, ncol = 2))
    )
  )
)

tbl_A <- tibble(
  a = 1:3,
  `245` = 3:5,
  `first letters` = letters[1:3],
  b = matrix(rnorm(6), nrow = 3, ncol = 2),
  c = list(
    matrix(rnorm(6), nrow = 3, ncol = 2),
    matrix(rnorm(6), nrow = 3, ncol = 2),
    as.data.frame(matrix(rnorm(6), nrow = 3, ncol = 2))
  )
)

Variables and their names are lazily evaluatedColumns are built sequentiallyEnhanced printingStrict subsetting

This means that variables are automatically interpreted "as is", meaning we can store non-atomic vectors within tibbles without extra effort (compare lines 3-9 to lines 14-19). Furthermore, column names with characters that are illegal in base R (i.e. with spaces) are returned as is.

code

tbl_A$`245`
tbl_A$`first letters`

Consider the following ways of obtaining the third column:

code

# Construct data frame
df_B <- data.frame(
  a = 1:5,
  b = 5:9
)
df_B$c <- df_B$a * df_B$b

# Construct tibble
tbl_B <- tibble(
  a = 1:5,
  b = 5:9,
  c = a * b
)

Both multiply columns a and b in the respective tables to obtain column c. However, notice how in tibble() this is done more naturally?

code

print(asv)
print(as_tibble(asv))
print(as_tibble(asv), n = 15, width = 80)

Cleaner and informative output of a printed tibble compared to the data frame. Moreover, we can control how many lines (n) and number of characters (width) we want to see.

When using single bracket notation to subset a tibble, the result is always a tibble. This is not the case with data frames. Compare the outputs below:

code

df_A[, 1] %>% str()
tbl_A[, 1] %>% str()

Row names are not a thing in tibbles!

2-dimensional superhero: Matrices¶

The matrix is another class of tabular data object. They are the 2-dimensional generalisation of an atomic vector, which makes them simpler and stricter structures compared to data frames or tibbles. We'll explore some of the properties of matrices, and then discuss a key advantage they have over data frames.

Strict subsettingIterating on vector elements, not list of vectorsMemory usageMatrix operations

When using square bracket notation for subsetting matrices, the result is always an atomic vector if there is only one row OR column, or a matrix if more then 1 row AND column is desired.

code

str(env_matrix[, 2])
str(env_matrix[2, ])
str(env_matrix[1:5, 2:4])

This property of strict subsetting and homogeneous data mode means that obtaining values from matrices is straight-forward and has predictable outputs.

The atomic vector property of matrices means that iterations using map() is applied on each element. That is not the same as iterating on columns of a data frame.

code

map(env, length)
map(env_matrix, length)

Furthermore, the code above nicely illustrates that the total length of the data object. Line 1 gives the length of each column in a data frame, which is equivalent to the length of each vector in a list. Line 2 considers the length of each element of an atomic vector, hence the output is 1 repeated in a list of length 231, the latter of which is the number of elements in the matrix.

Use apply() to iterate correctly on a matrix

If iterating along the margins (AKA rows or columns) of the matrix is required, use the base R function apply() and specify the margin desired.

code

# Iterating along rows
apply(env_matrix, MARGIN = 1, length)
# Iterating along columns
apply(env_matrix, MARGIN = 2, length)

For storing and processing large amounts of homogeneous data, matrix should be the go to data storage class. Compared to data frames (and tibbles), matrices are more compact (occupy less RAM).

code

# Create data frame equivalent of asv_matrix
asv_dataframe <- asv[, -1]
rownames(asv_dataframe) <- asv$ASVID
# Create tibble equivalent
asv_tibble <- as_tibble(asv_dataframe, rownames = NA)

map(
  mget(ls(pattern = "asv_.*), envir = .GlobalEnv),
  object.size
)        

The atomic vector property of matrices also lends itself to powerful matrix operations. These are mathematical operations that work on entire matrices (recall vectorised functions), and they are the basis of much of the underlying statistical processes in R (e.g., matrix multiplication %*%, transposition t(), eigenvector eigen(), etc.). For most statistical analyses, such as calculating correlations cor(), covariance cov(), distances dist(), principle component analysis princomp() etc., the first thing that those functions do is coerce the input data into a matrix. Beyond R, entire branches of STEM rely on matrix operations to obtain numerical solutions to complex, high-dimensional problems (e.g., quantum physics, multiple sequence alignments, the color filter applied to your Instagram posts).

Due to the fundamental nature of matrix operations, they are implemented at the C/C++ level (available in the R source code or it's GitHub mirror). Higher level R functions (i.e., functions exposed to users) typically call those as soon as possible. This also means that users can exploit highly efficient and optimised algorithms packaged in Basic Linear Algebra Subprograms (BLAS) and Linear Algebra PACKage (LAPACK). Popular options include Intel's Math Kernel Library (optimised for Intel processors) and OpenBLAS. These software libraries provide substantial improvements in terms of computational efficiency, but are difficult to install for the average user (Thankfully, NeSI's R come pre-linked with BLAS and LAPACK libraries).

Attributes¶

Think of attributes as metadata to the object. Some attributes determine the structure of the data so that R can interpret them properly, other attributes are attached as metadata or for the purposes of provenance. Every time we perform some action on an object, the object's class attribute will be examined prior to evaluation as some operations are restricted to certain classes. Besides class(), other basic attributes include names(), dimensions dim(), and dimension names dimnames(). In the context of matrices, dim() and dimnames() (list of row and column names, in that specific order) are important attributes that differentiates it from a 1-dimensional atomic vector.

To extract and modify attributes:

code

# Create example data
env_scaled <- scale(env_matrix)

# Inspect attributes
attributes(env_scaled)

# Extract attributes
attributes(env_scaled)$`scaled:scale`

As observed, calling attributes() returns a list object of the available attributes from which we can inspect. As it is a list, we can modify and add attributes by assigning it like a list-element.

code

attributes(env_scaled)$method <- "Z-score standardisation"

# Inspect attributes
attributes(env_scaled)

Recursive objects: Functions¶

In the Data Prelude lesson of Introduction to R, we were introduced to the list class of data storage and how to manipulate them. Lists are an example of a recursive data object. That means that they can store other objects of different modes and classes, thus allowing for complex and hierarchical structures through nesting (think Russian dolls). To check if something is recursive, we can use the predicate function is.recursive().

Something that may not be immediately intuitive is that functions are also recursive objects. In this context, recursion means something slightly different: a recursive function is a function that calls itself as part of it's source code/expression. For example, a factorial function applicable to integers:

\[ n! = \prod_{i=1}^{n} i \]

code

calc_factorial <- function(n) {
  if (n > 1) {
    n * calc_factorial(n - 1)
  } else {
    1
  }
}

# Test that it works
calc_factorial(4)
prod(1, 2, 3, 4)

This function is the first time calc_factorial() is defined, but it calls itself in its expression! This is one of the examples of a recursive function. The recursive nature might be more easily understood as a diagram:

graph TD
  F4("calc_factorial(4)")
  F3("calc_factorial(3)")
  F2("calc_factorial(2)")
  F1("calc_factorial(1)")

  F4 --> 4
  F4 --> F3
  F3 --> 3
  F3 --> F2
  F2 --> 2
  F2 --> F1
  F1 --> 1

At the end of the diagram, we multiply all the tips (results) to get the solution.

S3 and S4 objects¶

In R, objects obey different object-oriented programming systems (namely S3, S4, and R6) to define what the objects are. This primarily affects the object's class and how R interprets and processes the object. A large majority of the objects in R follow the S3 system.

The major difference between the two is that S4 is formally defined and stricter than S3. S4 is generally preferred when building large packages or suites of interacting packages with complex methods so that outputs and methods are consistent and that all code contributors have the same "vocabulary" for handling objects. This is the reason that all packages in Bioconductor are written using the S4 system in mind.

An additional item in S4 is slots, where slot values (think object property) are retrieved using package-specific accessor functions or the symbol @. The details that differentiates the systems are too detailed for the average R user and will not be covered here. For further reading, we refer you to the Object-Oriented Programming section of Advanced R