# Iterations¶

Learning objectives

- Construct
`for`

loops - Apply functions iteratively using
`map_*()`

- Make "good enough" design choices when iterating

## Say again?¶

Here, we leave behind copy-pasting and really start to accelerate workflows. To iterate means to do something repetitively on *different* objects. All programming languages have at least one implementation of iterations (arguably the main reason to do anything on a computer). In R, this is done in one of two ways:

`for`

loop

This is a ubiquitous method across many programming languages with slight syntax differences between languages.

**Applying functions using map()**

"Applies" a function along a vector/list with more concise code. In other words, the function supplied is recycled for every element in the vector/list. The iteration is happening in the background.

This episode will cover how to use both methods via a pairwise correlation analysis. The goal of the analysis is to determine what environmental factors might be correlated with sequence variant, and consequently, taxa, distribution. Finally, we will discuss some design choices with regards to iterations based on data properties.

`for`

loops¶

A `for`

loop is the most common way to iterate through objects. All `for`

loops look like the following:

Here is what each part of the pseudocode above means:

Code | Description |
---|---|

`for` |
Initialises the loop. |

`<iterator>` |
A variable that represents each element of the object being iterated on |

`<vector>` |
An object to iterate |

`<expressions>` |
Lines of code that does something per iteration |

Here is a simple working example:

The above example iterated through numbers `1:10`

, where each number was represented by `i`

(i.e., `for (i in 1:10)`

), then multiplied each number by `2`

, then printed the result on the console (i.e., `{print(i * 2)}`

).

The syntax of a `for`

loop is remarkably flexible and can be used in a variety of scenarios. Powerful as it is, there are a few best practices that we need to be mindful of. We will get to them in our practical exercise section below.

### Handling outputs¶

Consider the following loop:

We constructed a loop to multiply each number by 2, but we only have 1 number as a result. This is because we assigned `Ni`

with a new value with each iteration. To overcome this, we need to prepare an output vector that can store the result of each iteration.

Thy shall not grow vectors

It is possible to create an empty vector and then growing it by appending values from a `for`

loop.

However, this kind of construct is not recommended for ** most** use cases. If an iteration generates large number of elements, there might not be sufficient RAM to hold it and the entire R session will stall and terminate.

`while`

loops

As seen above, `for`

loops require a defined range for iteration. There are scenarios where we want an iteration to run for as long as required until some condition is reached. In this case, we can use the `while`

loop.

As long as the expression returns a `TRUE`

, the statement will continue to run. This is common in numerical optimisation where procedures are required for model fitting.

The following code block illustrates how `while`

loops are used in an iterative process to obtain small number solutions to the Collatz conjecture.

code

In lines 19 to 23, we applied a `while`

loop to ensure that `collatz()`

continues to be applied until `x == 1`

, which marks the end of the loop.

## The `map()`

family¶

While `for`

loops are flexible and powerful, writing them can sometimes result in long expressions. This hampers others (including future self) from auditing/reading the code. To help alleviate some of this, we can use the `purrr::map()`

family of functions. These functions aim to make code more function-oriented and makes code more concise. The table below shows some commonly used ones from both families:

Function | Description |
---|---|

`map()` |
Iterate over a vector. |

`map2()` |
Iterate over 2 vectors. |

`pmap()` |
Iterate over multiple vectors. |

The base forms above always return a list regardless of coercion within the "mapped" function. However, each function also have modifiers that can output different modes of data:

Modifier | Return values |
---|---|

`_int()` |
Integer (numeric whole numbers) |

`_dbl()` |
Double (numeric with decimals) |

`_chr()` |
Character |

`_lgl()` |
Boolean (`TRUE` or `FALSE` ) |

These functions are designed to produce predictable outputs (or, as written in the help pages, "die trying"). They all have the following syntax:

```
# One vector
map(vector, <function,character,integer>)
# Two vectors
map2(vector_1, vector_2, <function,character,integer>)
# List of multiple vectors
pmap(list(vector_1, vector_2, vector_3, ...), <function,character,integer>)
```

If a character or integer is supplied in place of a function, it will extract the component by name or position, respectively.

We'll make some examples to get a grasp of what `map()`

does:

code

We wrote a simple expression that concatenates elements of vectors in list `b`

and prints them out as one string per list element. Here's another example that uses a variation of `map_*()`

to calculate the coefficient of variation among biological replicates:

code

Notice that it is a numeric vector instead of a list as we used the `_dbl`

modifier. If we tried to use a different modifier that does not match the output mode, it will error out and will not produce an output.

Try different modifiers and see what are the outcomes. Include coercion inside the function and see if you can produce different results.

Sidestepping for side effects

The above `map_*()`

family of functions all produce an output per element. However, if your primary interest is to generate *side effects* (i.e., plots, console outputs via `cat()`

, saving files, etc.), it is better to use `purrr`

's family of functions: `walk()`

. Like `map_*()`

, it comes in variations for two (`walk2()`

) or more (`pwalk()`

) vectors. Compare the following outputs:

Let's compare and contrast use cases of `for`

and `map()`

using the exercise below.

## Pairwise correlation analysis¶

Community data is inherently multivariate. These are counts of organisms across different habitats/environments. A useful piece of information is how some organisms are distributed depending on the characteristics of their habitat, e.g., nutrient supply, sediment permeability, salinity, elevation, etc. Here, correlation analysis can help us understand if some of these characteristics underlie an organism's distribution. At a higher level, large correlation analyses can help identify some conditions that are more suited to groups of organisms (correlations can be a precondition of cluster analysis). From there, other analyses or more robust sampling designs can be used to test the validity of our observations.

Before delving into the analyses, it is important to filter out data that may not be important. This means removing ASV that are only sparsely present. For this exercise, we will retain ASVs that are present in more than half of all samples. We will also convert the data frames of `asv`

and `env`

into matrices for ease of processing.

code

### Table-to-table using nested `for`

loops¶

In this method, we will:

- Create matrices for storing correlation coefficients and their P-values.
- Use nested
`for`

loops to iterate through the rows of`asv`

and columns of`env`

. - Calculate correlation coefficients and significance per iteration
- Assign the values to relevant matrices

**Pre-assign outputs**

code

**Run correlation analyses via nested loop**

code

**Code breakdown**

Code | Description |
---|---|

`for (i in 1:nrow(asv_subset_matrix)) {` |
Initiates the outer `for` loop to iterate over the rows of `asv_subset_matrix` |

`abundance <- as.numeric(asv_subset_matrix[i, ])` |
Coerces the i^{th} row of `asv_subset_matrix` into a numeric vector, then assigns it to an object called `abundance` . This object is temporary and is over-written with each iteration. |

`for (j in 1:ncol(env_matrix)) {` |
Initiates the inner loop to iterate over the columns of `env_matrix` |

`env_value <- as.numeric(env_matrix[, j])` |
Like for `abundance` , coerces the j^{th} column of `env_matrix` into a numeric vector, then assigns it to `env_value` , a temporary object. |

`test <- cor.test(abundance, env_value)` |
This is the workhorse function that calculates the Pearson's correlation coefficient and performs hypothesis testing (i.e. is the coefficient different from 0). |

`rho_matrix[i, j] <- test$estimate` `p_matrix[i, j] <- test$p.value` |
Assigns the correlation coefficient and P-value to it's relevant position in the results matrices `rho_matrix` and `p_matrix` . |

`if (i %% 50 == 0) cat(str_glue("{i} ASVs done"), "\n")` `if (i == nrow(asv_subset_matrix)) cat("Completed!\n")` |
Prints some informative outputs for us to track it's progress at intervals of every 50 ASVs (the outer loop) until it is complete. |

**Check output**

`cor.test()`

Find out what `cor.test()`

returns by reading the help page

### Cartesian product using `expand.grid()`

and `map2()`

¶

In the above method, we generated two matrices that stored different, but related, information. However, we could also create a single table that contained both coefficient and P-value. To achieve this, we will assign all outputs that come from `cor.test()`

into one table that consists of all possible combinations between `colnames(env_matrix)`

and `rownames(asv_subset_matrix)`

(i.e., the Cartesian product).

**Pre-assign output**

code

**Iterate through outputs**

For this, we use the `tidyverse`

dialect to perform the entirety of the analysis. We will use `map2()`

to iterate over both columns simultaneously to `mutate()`

columns that will store the results.

code

**Code breakdown**

Specifically, we will look at the `map2()`

and `map_dbl()`

calls:

Code | Description |
---|---|

`map2(ASVID, env_var, \(x, y) {<expression>})` |
Here, `map2()` takes in two vectors, namely `ASVID` and `env_var` and iterates through every element in each, in order of position. The last argument is the function that will actually process elements of the vectors. `map2()` will only take functions that takes 2 arguments as inputs (i.e., `i` and `j` ). In here, I've used an anonymous function as we do not need to keep this in our environment. |

`map_dbl(cor_test, "estimate")` |
After performing the analysis, we need to extract the relevant information. Here, `map_dbl()` is a strict and specific version of `map()` where it will only accept and output data with the mode `double` , meaning numerics with decimals or floating points. To indicate what we wanted to extract, we have given it a string that indicates the name of the vector we wish to take. You can also provide a function if you'd like, as long as the output of the function is strictly a `double` . |

## Rules to iterate by¶

As shown, `for`

loops and `map()`

are incredibly powerful tools. However, there are some best practices that should be adhered to in most cases to ensure code runs and exits safely. The sections below list some of them and scenarios where exceptions may arise.

### Iterate indices¶

Iterating over indices of a vector, instead of directly iterating on the vector, allows for flexibility and predictable outputs. Vector indices are always sequential, thus you can be confident of which element is being processed, especially if it involves multiple inputs. The sequential nature of indices also mean that outputs are always in the order of the inputs, allowing you to safely join/bind them to existing data used as inputs. In addition, if you have data that contain extra information stored as attributes (discussed in next episode), you can safely re-assign them to the output for downstream use.

### Thy can grow vectors (with constraints)¶

As mentioned above, it is good practice to pre-assign vectors to store outputs of iterations (if desired). This ensures that R pre-allocates some amount of memory for the outputs. Whether or not it is a sufficient pre-allocation depends on the data structure of the outputs (see below). Even if the output vector is not strictly known (perhaps there are `if-else`

statements that control which iteration is stored), an educated guess (with extra length for margin of error) is still better than nothing. Remember, `NA`

can be filtered out later. Furthermore, a benefit of knowing (roughly) how many iterations required means that the index of the output can be used as the iterator.

That being said, there may be situations where we truly do not know an appropriate length of an output vector. If that is the case, we should carefully construct loops with judicious use control flow elements like `if-else`

, `while`

, `next`

, and `break`

. Two reasonable scenarios where an empty vector can be grown (safely) is when you have control flow elements that (1) overwrite existing values (e.g., a counter or queue of sorts) or (2) control the size of the output vector. Growing vectors necessitates approaches that are position-agnostic. *If you knew certain indices will produce certain outcomes, then you should've pre-assigned a vector of known length.*

If you think you will be processing a large number of iterations, need to store the results, and the output might exceed the size of available memory, consider coercing those into "flat" text formats (e.g., tab/comma separated values, strings, etc.) and writing it out to disk with a function that can open and append existing files (e.g., `readr::write_csv(x, append = TRUE)`

). If you are going to do this, be mindful of the available disk space.

### Prefer vectorised functions¶

Functions that operate on the entirety of the vector input are called vectorised functions. These functions are optimised to run on entire data objects without explicitly iterating through the input element-by-element. This means that the processing is fast and efficient with minimal overheads. For example:

code

Iterations in under the hood

Technically, most vectorised functions are iterations. However, these are implemented in C. For many functions that come with base R, this C code is called immediately. This means the iterations are happening at the level closest to actual processing speeds. *Explicit* iterations like `for`

loops in R require additional translation into machine code which is an overhead. While `map()`

does implement iterations in C, there are some processes that happen at the R level before it reaches C-level code.

### Less is more¶

While iterations are helpful when you have a lot of data to go through, it can be computationally expensive. Try to reduce the work done within an iteration and use vectorised functions to achieve part of the task. Consider also if only part of data requires an iteration for analysis (i.e., pre-filtering the input data). If there are many steps that must be achieved using iteration, try to break those steps up into smaller iterations (i.e., modular code). When using iterations inside a function, it is even more important to exit iterations as quickly as possible so as to not stall if things go wrong (recall "failing fast").

"Less is more" also applies to the results of iterations. If each iteration computes and produces voluminous intermediate outcomes (e.g., permutation, bootstrapping, imputation, etc.), consider if all those values are required for downstream analyses within the same session. It is understood that these intermediate values can be valuable for troubleshooting and understanding the underlying assumptions of the analysis. In that case, consider redirecting them into files.

If you must iterate through a large amount of data, also consider using combining control flow elements like `if-else`

with loop specific statements like `break`

or `next`

to make sure you are computing on elements and storing outputs that meet certain conditions.

### Capture errors, return results¶

There are situations where errors persist despite our best efforts to guard against them. Defensive programming applies to scripting as well! It can be a expensive to run large numbers of iterations only for it to be stopped and error out without saving other valid outputs. In that case, R has a helpful function: `try()`

. We know our `calc_alpha_diversity()`

will error out on non-numeric data, let's use that to demonstrate how `try()`

can help save computation from other valid inputs:

### Structural and analytical considerations¶

How data is structured is often the most important consideration when deciding between using `for`

loops and `map()`

. If data and subsequent analyses are easily amenable to combinatorial methods like `combn()`

or `expand.grid()`

, we can make life easier by using the `map()`

method. On the other hand, it is difficult (sometimes impossible) to coerce some data structures into "flat" tables without extensive modification (and code). For example, if the data is stored in a highly nested, complex, and uneven manner (e.g., XML, JSON, graph/tree structures), a `for`

loop may be more appropriate, maintainable (AKA easy to debug), readable, and verbose. Some analyses are not amenable to "flattening", such as those that require updating objects in the global environment or is (conditionally) dependent on results from previous iterations. If your analyses necessitates non-trivial code for intensive computation (e.g., generates dependent intermediates, has complex control flow), a `for`

loop is certainly more readable than an abstraction using `map()`

. Remember, computation time is cheap compared to human time. Choose the best construct that will get the job done in reasonable time within hardware constraints.