Iterations¶

Learning objectives

Construct for loops
Apply functions iteratively using map_*()
Make "good enough" design choices when iterating

Say again?¶

Here, we leave behind copy-pasting and really start to accelerate workflows. To iterate means to do something repetitively on different objects. All programming languages have at least one implementation of iterations (arguably the main reason to do anything on a computer). In R, this is done in one of two ways:

for loop

This is a ubiquitous method across many programming languages with slight syntax differences between languages.

Applying functions using map()

"Applies" a function along a vector/list with more concise code. In other words, the function supplied is recycled for every element in the vector/list. The iteration is happening in the background.

This episode will cover how to use both methods via a pairwise correlation analysis. The goal of the analysis is to determine what environmental factors might be correlated with sequence variant, and consequently, taxa, distribution. Finally, we will discuss some design choices with regards to iterations based on data properties.

`for` loops¶

A for loop is the most common way to iterate through objects. All for loops look like the following:

for (<iterator> in <vector>) {

  <expressions>

}

Here is what each part of the pseudocode above means:

Code	Description
`for`	Initialises the loop.
`<iterator>`	A variable that represents each element of the object being iterated on
`<vector>`	An object to iterate
`<expressions>`	Lines of code that does something per iteration

Here is a simple working example:

code

for (i in 1:10) {
  print(i * 2)
}

The above example iterated through numbers 1:10, where each number was represented by i (i.e., for (i in 1:10)), then multiplied each number by 2, then printed the result on the console (i.e., {print(i * 2)}).

The syntax of a for loop is remarkably flexible and can be used in a variety of scenarios. Powerful as it is, there are a few best practices that we need to be mindful of. We will get to them in our practical exercise section below.

Handling outputs¶

Consider the following loop:

code

for (i in 1:10) {
  Ni <- i * 2
}

We constructed a loop to multiply each number by 2, but we only have 1 number as a result. This is because we assigned Ni with a new value with each iteration. To overcome this, we need to prepare an output vector that can store the result of each iteration.

code

Ni <- numeric(10) # Produces a vector of 0s

for (i in 1:10) {
  Ni[i] <- i * 2
}

Thy shall not grow vectors

It is possible to create an empty vector and then growing it by appending values from a for loop.

code

Ni <- c()

for (i in 1:10) {
  Ni <- c(Ni, i * 2)
}

However, this kind of construct is not recommended for most use cases. If an iteration generates large number of elements, there might not be sufficient RAM to hold it and the entire R session will stall and terminate.

while loops

As seen above, for loops require a defined range for iteration. There are scenarios where we want an iteration to run for as long as required until some condition is reached. In this case, we can use the while loop.

while(<predicate>) {

  <statement>

}

As long as the expression returns a TRUE, the statement will continue to run. This is common in numerical optimisation where procedures are required for model fitting.

The following code block illustrates how while loops are used in an iterative process to obtain small number solutions to the Collatz conjecture.

code

collatz <- function(n) {
  # Output
  PATH <- numeric(10)

  # Initiate index
  i <- 1

  # Expand output stepwise
  if (i == length(PATH)) {
    PATH <- c(PATH, numeric(10))
  }

  # Exit
  if (n == 1) return(0)

  # Test
  while (n != 1) {
    n <- if_else(n %% 2 == 0, n / 2, (3 * n) + 1)
    PATH[i] <- n
    i <- i + 1
  }

  PATH[PATH > 0]
}

# Solutions for numbers 2 to 50
collatz_res <- vector(mode = "list", length = 100 - 1)
for (numbers in 2:100) {
  collatz_res[[numbers - 1]] <- collatz(numbers)
  names(collatz_res)[numbers - 1] <- numbers
}

In lines 19 to 23, we applied a while loop to ensure that collatz() continues to be applied until x == 1, which marks the end of the loop.

The `map()` family¶

While for loops are flexible and powerful, writing them can sometimes result in long expressions. This hampers others (including future self) from auditing/reading the code. To help alleviate some of this, we can use the purrr::map() family of functions. These functions aim to make code more function-oriented and makes code more concise. The table below shows some commonly used ones from both families:

Function	Description
`map()`	Iterate over a vector.
`map2()`	Iterate over 2 vectors.
`pmap()`	Iterate over multiple vectors.

The base forms above always return a list regardless of coercion within the "mapped" function. However, each function also have modifiers that can output different modes of data:

Modifier	Return values
`_int()`	Integer (numeric whole numbers)
`_dbl()`	Double (numeric with decimals)
`_chr()`	Character
`_lgl()`	Boolean (`TRUE` or `FALSE`)

These functions are designed to produce predictable outputs (or, as written in the help pages, "die trying"). They all have the following syntax:

# One vector
map(vector, <function,character,integer>)

# Two vectors
map2(vector_1, vector_2, <function,character,integer>)

# List of multiple vectors
pmap(list(vector_1, vector_2, vector_3, ...), <function,character,integer>)

If a character or integer is supplied in place of a function, it will extract the component by name or position, respectively.

We'll make some examples to get a grasp of what map() does:

code

# Something to iterate
b <- list(
  some_characters = fruit[1:10],
  some_logical = sample(c(TRUE, FALSE), 10, replace = TRUE),
  some_integers = rpois(10, 2),
  some_doubles = rcauchy(10)
)

map(b, \(x) paste(x, collapse = " "))

We wrote a simple expression that concatenates elements of vectors in list b and prints them out as one string per list element. Here's another example that uses a variation of map_*() to calculate the coefficient of variation among biological replicates:

code

example_pmap <- pmap_dbl(
  list(asv$AS1A1, asv$AS1A2, asv$AS1A3),
  \(a, b, c) {
    values <- c(a, b, c)
    sd(values) / mean(values)
  }
)

str(example_pmap)

Notice that it is a numeric vector instead of a list as we used the _dbl modifier. If we tried to use a different modifier that does not match the output mode, it will error out and will not produce an output.

Try different modifiers and see what are the outcomes. Include coercion inside the function and see if you can produce different results.

Sidestepping for side effects

The above map_*() family of functions all produce an output per element. However, if your primary interest is to generate side effects (i.e., plots, console outputs via cat(), saving files, etc.), it is better to use purrr's family of functions: walk(). Like map_*(), it comes in variations for two (walk2()) or more (pwalk()) vectors. Compare the following outputs:

fn_glue <- \(x, nm) {
  str_glue("Vector named {nm} is a {class(x)}")
}

map2(b, names(b), \(x, nm) cat(fn_glue(x, nm), "\n"))
walk2(b, names(b), \(x, nm) cat(fn_glue(x, nm), "\n"))

Let's compare and contrast use cases of for and map() using the exercise below.

Pairwise correlation analysis¶

Community data is inherently multivariate. These are counts of organisms across different habitats/environments. A useful piece of information is how some organisms are distributed depending on the characteristics of their habitat, e.g., nutrient supply, sediment permeability, salinity, elevation, etc. Here, correlation analysis can help us understand if some of these characteristics underlie an organism's distribution. At a higher level, large correlation analyses can help identify some conditions that are more suited to groups of organisms (correlations can be a precondition of cluster analysis). From there, other analyses or more robust sampling designs can be used to test the validity of our observations.

Before delving into the analyses, it is important to filter out data that may not be important. This means removing ASV that are only sparsely present. For this exercise, we will retain ASVs that are present in more than half of all samples. We will also convert the data frames of asv and env into matrices for ease of processing.

code

# Filter data
retain_threshold <- nrow(env) / 2
asv_retained <- rowSums(ifelse(asv[, -1] > 0, 1, 0)) > retain_threshold
asv_subset <- asv[asv_retained, ]

# Convert to matrices
asv_subset_matrix <- as.matrix(asv_subset[, -1])
rownames(asv_subset_matrix) <- asv_subset$ASVID

env_matrix <- as.matrix(env[, -1])
rownames(env_matrix) <- env$sample

Table-to-table using nested `for` loops¶

In this method, we will:

Create matrices for storing correlation coefficients and their P-values.
Use nested for loops to iterate through the rows of asv and columns of env.
Calculate correlation coefficients and significance per iteration
Assign the values to relevant matrices

Pre-assign outputs

code

# Assign output
rho_matrix <- matrix(
  nrow = nrow(asv_subset_matrix),
  ncol = ncol(env_matrix),
  dimnames = list(
    asv_subset$ASVID,
    colnames(env)[-1]
  )
)
p_matrix <- rho_matrix

str(rho_matrix)
str(p_matrix)

Run correlation analyses via nested loop

code

# The outer loop that loops through ASVs (rows)
for (i in 1:nrow(asv_subset_matrix)) {

  abundance <- as.numeric(asv_subset_matrix[i, ])

  # Nested loop
  for (j in 1:ncol(env_matrix)) { # Looping over environmental measures (columns)

    env_value <- as.numeric(env_matrix[, j])

    # Correlation tests
    test <- cor.test(abundance, env_value)

    # Assign values to output matrices
    rho_matrix[i, j] <- test$estimate
    p_matrix[i, j] <- test$p.value

  }

  # Print some output
  if (i %% 50 == 0) cat(str_glue("{i} ASVs done"), "\n")

  if (i == nrow(asv_subset_matrix)) cat("Completed!\n")

}

Code breakdown

Code	Description
`for (i in 1:nrow(asv_subset_matrix)) {`	Initiates the outer `for` loop to iterate over the rows of `asv_subset_matrix`
`abundance <- as.numeric(asv_subset_matrix[i, ])`	Coerces the i^th row of `asv_subset_matrix` into a numeric vector, then assigns it to an object called `abundance`. This object is temporary and is over-written with each iteration.
`for (j in 1:ncol(env_matrix)) {`	Initiates the inner loop to iterate over the columns of `env_matrix`
`env_value <- as.numeric(env_matrix[, j])`	Like for `abundance`, coerces the j^th column of `env_matrix` into a numeric vector, then assigns it to `env_value`, a temporary object.
`test <- cor.test(abundance, env_value)`	This is the workhorse function that calculates the Pearson's correlation coefficient and performs hypothesis testing (i.e. is the coefficient different from 0).
`rho_matrix[i, j] <- test$estimate` `p_matrix[i, j] <- test$p.value`	Assigns the correlation coefficient and P-value to it's relevant position in the results matrices `rho_matrix` and `p_matrix`.
`if (i %% 50 == 0) cat(str_glue("{i} ASVs done"), "\n")` `if (i == nrow(asv_subset_matrix)) cat("Completed!\n")`	Prints some informative outputs for us to track it's progress at intervals of every 50 ASVs (the outer loop) until it is complete.

Check output

code

head(rho_matrix)
head(p_matrix)

cor.test()

Find out what cor.test() returns by reading the help page

Cartesian product using `expand.grid()` and `map2()`¶

In the above method, we generated two matrices that stored different, but related, information. However, we could also create a single table that contained both coefficient and P-value. To achieve this, we will assign all outputs that come from cor.test() into one table that consists of all possible combinations between colnames(env_matrix) and rownames(asv_subset_matrix) (i.e., the Cartesian product).

Pre-assign output

code

corr_results <- expand.grid(
  "ASVID" = rownames(asv_subset_matrix),
  "env_var" = colnames(env_matrix)
)

head(corr_results)

Iterate through outputs

For this, we use the tidyverse dialect to perform the entirety of the analysis. We will use map2() to iterate over both columns simultaneously to mutate() columns that will store the results.

code

corr_results <- as_tibble(corr_results) %>% 
  mutate(
    "cor_test" = map2(ASVID, env_var, \(i, j) {
      abundance <- asv_subset_matrix[rownames(asv_subset_matrix) == i, ]
      env_value <- env_matrix[, colnames(env_matrix) == j]
      cor.test(abundance, env_value)
    }),
    "rho" = map_dbl(cor_test, "estimate"),
    "p" = map_dbl(cor_test, "p.value")
  )

Code breakdown

Specifically, we will look at the map2() and map_dbl() calls:

Code Description

map2(ASVID, env_var, \(x, y) {<expression>}) Here, map2() takes in two vectors, namely ASVID and env_var and iterates through every element in each, in order of position. The last argument is the function that will actually process elements of the vectors. map2() will only take functions that takes 2 arguments as inputs (i.e., i and j). In here, I've used an anonymous function as we do not need to keep this in our environment.

map_dbl(cor_test, "estimate") After performing the analysis, we need to extract the relevant information. Here, map_dbl() is a strict and specific version of map() where it will only accept and output data with the mode double, meaning numerics with decimals or floating points. To indicate what we wanted to extract, we have given it a string that indicates the name of the vector we wish to take. You can also provide a function if you'd like, as long as the output of the function is strictly a double.

Rules to iterate by¶

As shown, for loops and map() are incredibly powerful tools. However, there are some best practices that should be adhered to in most cases to ensure code runs and exits safely. The sections below list some of them and scenarios where exceptions may arise.

Iterate indices¶

Iterating over indices of a vector, instead of directly iterating on the vector, allows for flexibility and predictable outputs. Vector indices are always sequential, thus you can be confident of which element is being processed, especially if it involves multiple inputs. The sequential nature of indices also mean that outputs are always in the order of the inputs, allowing you to safely join/bind them to existing data used as inputs. In addition, if you have data that contain extra information stored as attributes (discussed in next episode), you can safely re-assign them to the output for downstream use.

Thy can grow vectors (with constraints)¶

As mentioned above, it is good practice to pre-assign vectors to store outputs of iterations (if desired). This ensures that R pre-allocates some amount of memory for the outputs. Whether or not it is a sufficient pre-allocation depends on the data structure of the outputs (see below). Even if the output vector is not strictly known (perhaps there are if-else statements that control which iteration is stored), an educated guess (with extra length for margin of error) is still better than nothing. Remember, NA can be filtered out later. Furthermore, a benefit of knowing (roughly) how many iterations required means that the index of the output can be used as the iterator.

That being said, there may be situations where we truly do not know an appropriate length of an output vector. If that is the case, we should carefully construct loops with judicious use control flow elements like if-else, while, next, and break. Two reasonable scenarios where an empty vector can be grown (safely) is when you have control flow elements that (1) overwrite existing values (e.g., a counter or queue of sorts) or (2) control the size of the output vector. Growing vectors necessitates approaches that are position-agnostic. If you knew certain indices will produce certain outcomes, then you should've pre-assigned a vector of known length.

If you think you will be processing a large number of iterations, need to store the results, and the output might exceed the size of available memory, consider coercing those into "flat" text formats (e.g., tab/comma separated values, strings, etc.) and writing it out to disk with a function that can open and append existing files (e.g., readr::write_csv(x, append = TRUE)). If you are going to do this, be mindful of the available disk space.

Prefer vectorised functions¶

Functions that operate on the entirety of the vector input are called vectorised functions. These functions are optimised to run on entire data objects without explicitly iterating through the input element-by-element. This means that the processing is fast and efficient with minimal overheads. For example:

code

num_1 <- c(2, 4, 6, 8, 10)
num_2 <- c(1, 3, 5, 7, 9)

# Do this:
sums_a <- num_1 + num_2

# Avoid this:
sums_b <- numeric(5)
for (i in 1:5) {
  sums_b[i] <- sum(num_1[i], num_2[i])
}

# The results are the same
all(sums_a == sums_b)

Iterations in under the hood

Technically, most vectorised functions are iterations. However, these are implemented in C. For many functions that come with base R, this C code is called immediately. This means the iterations are happening at the level closest to actual processing speeds. Explicit iterations like for loops in R require additional translation into machine code which is an overhead. While map() does implement iterations in C, there are some processes that happen at the R level before it reaches C-level code.

Less is more¶

While iterations are helpful when you have a lot of data to go through, it can be computationally expensive. Try to reduce the work done within an iteration and use vectorised functions to achieve part of the task. Consider also if only part of data requires an iteration for analysis (i.e., pre-filtering the input data). If there are many steps that must be achieved using iteration, try to break those steps up into smaller iterations (i.e., modular code). When using iterations inside a function, it is even more important to exit iterations as quickly as possible so as to not stall if things go wrong (recall "failing fast").

"Less is more" also applies to the results of iterations. If each iteration computes and produces voluminous intermediate outcomes (e.g., permutation, bootstrapping, imputation, etc.), consider if all those values are required for downstream analyses within the same session. It is understood that these intermediate values can be valuable for troubleshooting and understanding the underlying assumptions of the analysis. In that case, consider redirecting them into files.

If you must iterate through a large amount of data, also consider using combining control flow elements like if-else with loop specific statements like break or next to make sure you are computing on elements and storing outputs that meet certain conditions.

Capture errors, return results¶

There are situations where errors persist despite our best efforts to guard against them. Defensive programming applies to scripting as well! It can be a expensive to run large numbers of iterations only for it to be stopped and error out without saving other valid outputs. In that case, R has a helpful function: try(). We know our calc_alpha_diversity() will error out on non-numeric data, let's use that to demonstrate how try() can help save computation from other valid inputs:

code

map(asv, \(x) {
  try({
    calc_alpha_diversity(x, index = "shannon")
  })
})

Structural and analytical considerations¶

How data is structured is often the most important consideration when deciding between using for loops and map(). If data and subsequent analyses are easily amenable to combinatorial methods like combn() or expand.grid(), we can make life easier by using the map() method. On the other hand, it is difficult (sometimes impossible) to coerce some data structures into "flat" tables without extensive modification (and code). For example, if the data is stored in a highly nested, complex, and uneven manner (e.g., XML, JSON, graph/tree structures), a for loop may be more appropriate, maintainable (AKA easy to debug), readable, and verbose. Some analyses are not amenable to "flattening", such as those that require updating objects in the global environment or is (conditionally) dependent on results from previous iterations. If your analyses necessitates non-trivial code for intensive computation (e.g., generates dependent intermediates, has complex control flow), a for loop is certainly more readable than an abstraction using map(). Remember, computation time is cheap compared to human time. Choose the best construct that will get the job done in reasonable time within hardware constraints.

Iterations¶

Say again?¶

for loops¶

Handling outputs¶

The map() family¶

Pairwise correlation analysis¶

Table-to-table using nested for loops¶

Cartesian product using expand.grid() and map2()¶

Rules to iterate by¶

Iterate indices¶

Thy can grow vectors (with constraints)¶

Prefer vectorised functions¶

Less is more¶

Capture errors, return results¶

Structural and analytical considerations¶

`for` loops¶

The `map()` family¶

Table-to-table using nested `for` loops¶

Cartesian product using `expand.grid()` and `map2()`¶