Functions and conditional statements¶
Learning objectives
- Identify parts of a function
- Write custom functions
- Use conditional statements to control outputs
- Refine functions to improve efficiency
- Access function source code from other packages
Anatomy of functions¶
It's not a secret that for as long as you've used R, you've been using functions. Here, we're going to make them! Functions are, like all things in R, a type of object (implications discussed in a later episode). They are the "verbs" of the R programming language: instead of storing data, functions store code that defines how they behave and operate on other objects.
In an R script, functions are created and assigned using the following syntax:
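In pseudocode, the pattern looks like this (the angle-bracketed names are placeholders):

```r
f <- function(<arguments>) { <do_something> }
```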
Let's break down the anatomy of the pseudocode:
- `f` is the assigned name of the function.
- `function()` initialises the function, telling R that:
  - we are writing a function;
  - anything inside the round brackets is an argument of the function;
  - anything after the closing round bracket and on the same line is the code that defines the function.
- `{ <do_something> }` is the expression that defines how the function works.
Let's write a function to give some context as to what each part does.
The above is an equation for a function called \(f\) that takes a value \(x\) as input and outputs a value based on \(x\). We can translate that into R code:
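Assuming, purely for illustration, a simple equation such as \(f(x) = x + 1\), the translation could look like:

```r
# f takes a value x and outputs a value based on x
f <- function(x) {
  x + 1
}

f(2)  # returns 3
```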
Here, we've "wrapped" an arithmetic operation in a function.
Alternative syntax
We can also use a shorthand for the `function()` call by replacing `function` with a `\` (backslash). This syntax is useful for on-the-fly functions that can be applied using `map()` functions.
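A minimal sketch of the shorthand (the increment function is just an illustration; this syntax requires R 4.1 or later):

```r
f <- \(x) x + 1   # identical to: f <- function(x) x + 1

f(2)  # returns 3
```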
Notice also that the function expression is on the same line as the function call. This syntax is acceptable for short expressions but not encouraged for longer ones.
Infix functions
Most objects we think of as functions are prefix functions, where arguments are placed after the function call. However, there are also infix functions, where arguments are placed on either side of the function call. For example:
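For instance (the specific numbers are arbitrary):

```r
x <- 2 + 3   # <- and + sit between their arguments

# every infix function can also be called prefix-style using backticks
`+`(2, 3)    # 5
```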
Both `<-` and `+` are infix functions for assignment and addition, respectively. All basic arithmetic (e.g., `+`, `-`, `*`, and `/`), assignment (including `=`), logical (e.g., `|`, `&`, `&&`), and relational (e.g., `>`, `!=`, `==`) operators are primitive (i.e., a foundational part of the language) infix functions.
The pipe operator `%>%` introduced by the magrittr package is a user-generated infix function. As this is a custom infix function, the `%` that surrounds the operator is necessary. What sits between the `%` signs does not have to be a symbol or operator and can be represented by almost anything (e.g., `%in%`), even emojis!
We can define our own infix function:
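As a sketch, here is a hypothetical `%+%` operator that pastes two strings together (the name and behaviour are made up for illustration):

```r
`%+%` <- function(lhs, rhs) {
  paste(lhs, rhs)  # concatenate with a separating space
}

"Hello" %+% "world"  # "Hello world"
```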
Control flow 1: Conditional execution¶
We will take a little detour from functions to learn about control flow. These are constructs that code for decision making (this episode) and repetition (next episode). Decision making, or conditional execution, helps to direct the "flow" of operations using Boolean logic (`TRUE`/`FALSE`). In R, conditional execution is coded using `if-else` statements:
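In pseudocode (the angle-bracketed names are placeholders):

```r
if (<predicate>) {
  <expression_1>
} else {
  <expression_2>
}
```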
The pseudocode above means: if the `<predicate>` is (or returns) `TRUE`, execute `<expression_1>`; otherwise, execute `<expression_2>`.
For example:
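A small sketch (the values are arbitrary):

```r
x <- 10

if (x > 5) {
  print("x is larger than 5")
} else {
  print("x is 5 or smaller")
}
```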
Predicate and predicate functions
A predicate is a statement or function that evaluates to a single `TRUE` or `FALSE` value. For example, this is a predicate that evaluates whether `x` is larger than 5:
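The predicate is just a comparison (`x` is pre-assigned here so the line is runnable):

```r
x <- 10
x > 5  # a single logical value
```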
If we set `x <- 10`, the code will return a `TRUE`.
R has many built-in predicate functions (i.e., functions that evaluate a statement or object and return a single `TRUE`/`FALSE`). They are often prefixed using `is.<something>()`. For example:
code
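A sketch of such predicate functions at work (the example vector is arbitrary):

```r
v <- c(1, 2, 3, 4, 5)

is.numeric(v)    # TRUE: the vector as a whole is numeric
is.character(v)  # FALSE
```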
Notice that regardless of the length of the vector, the predicate functions return only one Boolean value. This is important for conditional statements: an `if` statement can only evaluate a single predicate, not a vector of Boolean values.
The `else` statement is optional (and often unnecessary). The expression will simply not be evaluated if the predicate returns `FALSE`.
code
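A sketch of an `if` without an `else`:

```r
x <- 3

if (x > 5) {
  print("x is larger than 5")
}
# the predicate is FALSE, so nothing is printed
```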
For complex predicate evaluations (e.g., intermediate thresholds) or evaluations that lead to different paths (e.g., hierarchical decision making), we can chain and/or nest multiple statements together:
if (<predicate_1>) {
  <expression_1>
} else if (<predicate_2>) { # Chained statements: an if-else after an if-else
  if (<predicate_2a>) { # Nested statements: an if-else inside an if-else
    <expression_2a>
  } else {
    <expression_2b>
  }
} else {
  <expression_3>
}
Vectorised conditional execution¶
The above conditional statements operate only on a single Boolean value. What if we wanted to generate different outcomes depending on the elements of a vector? In that case, we can use one of two functions, depending on how many outcomes we need:
If we need one outcome each for `TRUE` and `FALSE`, we can use `if_else()`, a vectorised if-else statement that evaluates a Boolean vector (or a condition that returns one) and produces an outcome depending on whether each element is `TRUE` or `FALSE`. For example:
code
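A sketch using `dplyr::if_else()` (the vector and labels are arbitrary):

```r
library(dplyr)

x <- c(2, 8, 5, 10)
if_else(x > 5, "large", "small")
# "small" "large" "small" "large"
```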
Base R equivalent: `ifelse()`
Base R also has a similar implementation: `ifelse()`. However, `dplyr::if_else()` strictly preserves the data type (e.g., the output will not be coerced to a character if the input is a factor) and can evaluate missing values; `base::ifelse()` does not enforce data type preservation and returns missing values when it encounters them.
If we require different outcomes based on multiple, hierarchical conditions, we can use `case_when()`, a general vectorised if-else analogous to chained if-else statements. For example:
code
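A sketch using `dplyr::case_when()` (thresholds and labels are arbitrary; conditions are evaluated top to bottom, so the first match wins):

```r
library(dplyr)

x <- c(2, 15, 7)
case_when(
  x > 10 ~ "high",
  x > 5  ~ "medium",
  TRUE   ~ "low"     # fallback when no condition matches
)
# "low" "high" "medium"
```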
Alpha diversity¶
Time to apply all of the above to write a function! For this section, we will be writing a function that calculates the \(\alpha\)-diversity (a measure of local diversity) of a sample. The simplest measure is richness (often denoted \(q\) or \(S\)), which represents the number of species/taxonomic units. While richness is simple and intuitive to understand, it is heavily biased by rare members of the sample community, which are ubiquitous in microbiome data. Other \(\alpha\)-diversity indices were constructed to account for the uneven distribution of taxonomic units by incorporating relative abundance information. Popular options are:
Shannon's index, \(H\)
Where \(p_{i}\) is the proportion (or relative frequency) of the \(i^{th}\) organism relative to the sample total.
Simpson's diversity, \(D\)
Where the \(\lambda\) is Simpson's concentration index, defined as
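In standard notation (assuming the inverse-Simpson form of \(D\), which matches the "number of taxonomic units" scale described below), the definitions are:

```latex
H = -\sum_{i=1}^{S} p_i \ln p_i
\qquad
\lambda = \sum_{i=1}^{S} p_i^{2}
\qquad
D = \frac{1}{\lambda}
```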
For the mathematically inclined, you might observe that these indices:
- Give more weight to abundant taxa (\(D\) more so than \(H\))
- Reach their maxima when all taxa are exactly evenly distributed in the sample
- Reach their minima when a single taxon dominates
- Are not on the same scale (\(H\) is on a log scale; \(q\) and \(D\) are on the "number of taxonomic units" scale)
For now, it is enough to know that they are mathematical formulas that we are going to translate into code.
Starting small: richness¶
We will start by writing a function to calculate richness (the number of taxonomic units in the sample). Applied in R, this means the number of non-zero elements in a numeric vector.
code
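A sketch of a richness function (the function name is our choice):

```r
calc_richness <- function(abundances) {
  # abundances > 0 gives a logical vector; sum() counts the TRUEs
  sum(abundances > 0)
}

calc_richness(c(10, 0, 3, 5, 0))  # 3
```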
Adding arguments: Shannon's and Simpson's diversity¶
Next, let's add the other indices to the function. We will use conditional execution to control the output depending on which index is requested.
code
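A sketch of this first draft of `calc_alpha_diversity()` (argument names and index labels are our choices; every index is computed up front):

```r
calc_alpha_diversity <- function(abundances, index) {
  q <- sum(abundances > 0)           # richness
  p <- abundances / sum(abundances)  # relative frequencies
  H <- -sum(p * log(p))              # Shannon's index
  D <- 1 / sum(p^2)                  # Simpson's (inverse) diversity

  if (index == "richness") {
    return(q)
  } else if (index == "shannon") {
    return(H)
  } else {
    return(D)
  }
}

calc_alpha_diversity(c(10, 0, 3, 5, 0), index = "richness")  # 3
calc_alpha_diversity(c(10, 0, 3, 5, 0), index = "shannon")   # NaN: 0 * log(0) is NaN
```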
Here, we're calculating all possible outcomes, then returning values based on the index requested.
`return()`
The `return()` function returns/outputs the value specified in it and stops evaluation of any code below it. In `calc_alpha_diversity()`, once the requested index is returned, any code after the `return()` call is not evaluated. In the case where a function's code has no `return()`, the function will output the last evaluated expression.
The function we wrote works for \(q\) and \(D\), but not for \(H\). Let's revise that.
Troubleshooting and refining code ¶
Let's troubleshoot our function first. Why did the code not work for Shannon's index?
A little math
`log(0)` is undefined. However, in R, it is given the value `-Inf` for bases greater than 1 and `Inf` for bases between 0 and 1. If any `p` is 0, the operation `0 * log(0)` will return `NaN` (Not a Number). Therefore, the summation of a vector containing `NaN` will also return `NaN`.
We will also consider other areas for refinement:
- The relative frequency `p` is not required for richness calculations, and is shared by the calculations of Shannon's and Simpson's indices.
- Assigning \(H\) and \(D\) helps us, the humans, differentiate the indices in mathematical notation. However, R does not care which name is used; the decision to return either index is already managed by conditional statements.
- Chained statements are not necessary here. Conditional execution (and `return()`) should occur earlier to prevent unnecessary code evaluation.
- Some people may have a reason to use another base for the `log()` function used in calculating Shannon's index. While the natural logarithm (log in base \(e\)) is often the default choice, there should be an option to change that if desired. To set a default value in a custom function, add the desired value to the argument within the function call (see below).
With that, let's revise the function:
code
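A sketch of the revision (argument names and index labels are our choices):

```r
calc_alpha_diversity <- function(abundances, index = "richness", log_base = exp(1)) {
  # Return richness early: no need to compute relative frequencies
  if (index == "richness") {
    return(sum(abundances > 0))
  }

  p <- abundances / sum(abundances)  # shared by both remaining indices

  if (index == "simpson") {
    return(1 / sum(p^2))
  }

  # Shannon's index; na.rm = TRUE drops the NaN produced by 0 * log(0)
  H <- -sum(p * log(p, base = log_base), na.rm = TRUE)
  H  # no return(): the last evaluated value is the output
}

calc_alpha_diversity(c(10, 0, 3, 5, 0), index = "shannon")
```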
It works! Note that without a `return()`, the function simply outputs the last evaluated code (in this case, the value `H`).
Defending our function ¶
What would happen if we used our function with non-ideal inputs? For this, we will take hints from the concepts of defensive programming. Defensive programming is more about design choices than cybersecurity: it is an approach to designing code that:
- Has predictable outputs, regardless of inputs
- "Fails fast", thus saving time and computational resources
- Reads well, so others can audit the code and understand what each line is doing
Can you identify some scenarios where our function might not do so well?
Pitfalls and food for thought
What does `NA` mean?
Let's consider how these tables were constructed. They were originally sequences that were processed by software, and reads were then counted for each taxonomic unit. If a taxonomic unit was not found in a particular sample's read file, the software would have returned a 0. Therefore, `NA` could not have arisen from processing this particular batch.
However, we can imagine a scenario where `NA` could be introduced by a joining step during data cleaning or pre-processing. This can happen when processing samples by batches.
Critically, regardless of the reason for missingness, is a missing count functionally equivalent to 0?
What is the effect of `NA`?
We used `sum()` quite often in the code. Furthermore, we even added `na.rm = TRUE` in the expression used to calculate Shannon's index, to account for the `NaN` introduced when evaluating \(0 \log 0\). Would we handle `NA` in this way in all parts where `sum()` is used?
Do negative numbers make sense?
Remember that this is a table of counts. What do negative numbers mean in that context? Does it make sense?
How does the function handle negative numbers?
\(q\) would ignore negative numbers. If the negative counts were an oversight or accidental, it would result in an undercount of richness.
\(H\) would not produce a numeric output (the log of a negative number is undefined).
\(D\) would be an underestimate (the sum would be an undercount of the whole).
Does non-numeric data make sense for this function?
These indices are meant for count data (i.e., numeric data). Suppose that human error led to the mis-typing of inputs; would we want to coerce the data into numeric type using `as.numeric()`? Are we guessing the intentions and errors of the user? Coercing characters that are not numbers will cause `as.numeric()` to return `NA`.
Is it fit for purpose?
We established that the function is meant to provide information only on numeric data. Although this is a small function that runs quickly, we can imagine a scenario where we might either (1) write a function that has many complex steps or (2) run a substantially larger data set through the diversity function. In either scenario, would we want to spend time and computational resources on non-target inputs?
Let's test our function through these non-ideal scenarios. Create vectors with non-ideal properties and try them on the current version of the function.
code
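A sketch of such test vectors (the counts themselves are made up; `set.seed()` is added so the random replacements are reproducible):

```r
set.seed(123)

counts <- c(12, 0, 5, 8, 0, 3, 22, 0, 7, 15, 1, 0, 9, 4, 6)

# Inject NA at 5 random indices using a random sampler
counts_na <- counts
counts_na[sample.int(length(counts_na), 5)] <- NA

# Replace some positive, non-zero values with their negative equivalents,
# picked using a uniform random draw
counts_neg <- counts
flip <- runif(length(counts_neg)) < 0.3 & counts_neg > 0
counts_neg[flip] <- -counts_neg[flip]

# A non-numeric vector
counts_chr <- c("twelve", "zero", "five")
```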
Random replacements
The code above uses two ways to generate replacements at random. A random sampler, `sample.int()`, indicates 5 random indices at which to inject `NA`. A logical condition, built from a numeric (double) vector drawn from a uniform distribution, then selects positive, non-zero values to replace with their negative equivalents.
With those pitfalls in mind, we will need to make some design decisions and implement them in code. Here are my choices:
- `NA` counts: functionally, they are the same as not observing a taxon in a sample (equivalent to 0). Warn the user when this happens, but continue with evaluating the input.
- Negative numbers and non-numeric data: these data types are not intended for use with the diversity indices here. Let's not spend computing time and resources on them. To achieve this, we will "error out" of the function: if negative numbers or non-numeric data are encountered, the function will stop further computation and produce an error message.
Feel free to implement your own design choices by modifying the code below!
code
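A sketch implementing those choices (the error and warning messages are our own wording):

```r
calc_alpha_diversity <- function(abundances, index = "richness", log_base = exp(1)) {
  # Fail fast: these inputs are not what the indices were designed for
  if (!is.numeric(abundances)) {
    stop("`abundances` must be a numeric vector of counts.")
  }
  if (any(abundances < 0, na.rm = TRUE)) {
    stop("Negative counts detected; all counts must be >= 0.")
  }

  # Treat missing counts as unobserved taxa, but tell the user
  if (anyNA(abundances)) {
    warning("NA counts found; treating them as 0.")
    abundances[is.na(abundances)] <- 0
  }

  if (index == "richness") {
    return(sum(abundances > 0))
  }

  p <- abundances / sum(abundances)

  if (index == "simpson") {
    return(1 / sum(p^2))
  }

  -sum(p * log(p, base = log_base), na.rm = TRUE)  # Shannon's index
}

# Trying it on non-ideal inputs:
calc_alpha_diversity(c(5, NA, 3, 0))        # warning, then a richness of 2
try(calc_alpha_diversity(c(5, -2, 3)))      # error: negative counts
try(calc_alpha_diversity(c("a", "b")))      # error: non-numeric input
```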
Try the new function on the test vectors:
code
Improve function design¶
Read other functions¶
One of the best ways to improve how we write functions is to read other people's functions! Not only do we get to see how others design their functions, we can also learn new and useful functions along the way!
We can access the source code of a function using `getAnywhere()`:
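For example (using `sd()` as the target; any function name works):

```r
getAnywhere("sd")  # prints the full definition of sd() and where it was found
```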
Some functions may not have directly accessible source code. For example:
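`anova()` is one such case:

```r
anova  # printing the function shows a body of just UseMethod("anova")
```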
Notice that the body of the function consists only of `UseMethod("anova")`. This is because `anova()` can take many different kinds of inputs and run different code depending on the input (i.e., it has different `methods()` for the same function). We can see what those methods are and inspect the source code for any of them:
code
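For example (`anova.lm` is one of the methods that should be listed):

```r
methods(anova)            # list all available anova methods
getAnywhere("anova.lm")   # inspect the source of one specific method
```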
In some cases, functions can have the same name but be implemented differently because they are part of different packages. For example:
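`filter()` is a classic example: with dplyr attached, `getAnywhere()` finds both `dplyr::filter()` and `stats::filter()` (assuming dplyr is installed):

```r
library(dplyr)

getAnywhere("filter")  # reports matches in package:dplyr and package:stats
```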
To access the `filter()` source code from the stats package, we need to pick one by its index:
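A sketch (the index depends on your search path, so check the printed order first; here we assume dplyr is attached and `stats::filter()` is listed second):

```r
library(dplyr)

getAnywhere("filter")[2]  # the assumed position of stats::filter in the match list
```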
`return()` early¶
Use `return()` early in the function body, especially if other arguments would lead to more time-consuming evaluations. This prevents wasteful use of time and resources to compute variables that will not even be used as outputs!
`switch()` evaluations depending on argument input¶
`switch()` is a useful function that evaluates different expressions depending on the argument input. It can also be given a default output for when no arguments match those specified, which can be coded to output a warning or message.
code
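A sketch of the diversity function rewritten with `switch()` (the function name and index labels are our choices; the final unnamed element is the default):

```r
calc_alpha_diversity_switch <- function(abundances, index = "richness") {
  p <- abundances / sum(abundances)
  switch(index,
    richness = sum(abundances > 0),
    shannon  = -sum(p * log(p), na.rm = TRUE),
    simpson  = 1 / sum(p^2),
    stop("Unknown index: ", index)  # default: reached when no label matches
  )
}

calc_alpha_diversity_switch(c(10, 0, 3), index = "richness")  # 2
```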
Think of the users¶
Who is the target user for your functions?
If it's just you, do whatever you want! However, the fact that you are assigning code to functions implies that you probably want to reuse it in other projects, or even share it with your colleagues or the world!
If you intend to share your functions with other people/computers/robots, it is important to think about dependencies. Other people will have different set-ups, so it is advisable to write functions using functions from base R. If you must use other packages, make sure to declare and install them early in the script.
While we write functions to be as predictable, elegant, and concise as possible, there are situations where we want to add flexibility. Keep in mind that flexible functions often involve more code. To achieve this, we can write functions that call other custom functions. This is called modularisation. Modular code also helps with troubleshooting, readability, and benchmarking as pieces of it can be tested. This is especially important for production-level code.
What stage are you at?
Assuming you are writing functions for your analyses, the syntax of your functions is likely geared towards your specific needs at a particular stage. At the exploratory analysis stage, functions are written with flexibility in mind, and may even be a little messy or less efficient than they could be.
As you progress to production level code and/or need to use complex processes (permutation tests, sub-sampling, Bayesian methods, etc.), you're likely going to need to think about:
- Code safety: predictable and non-redundant inputs and outputs, and preventing ad infinitum runs
- Efficiency: using benchmarked algorithms, reducing/removing side effects, parallel processing, reducing overheads
- Troubleshooting: informative errors and warnings, modularised functions
- Readability: commented code, verbose syntax, appropriate assignments (as opposed to long piped processes)
Break your own code!¶
The best way to learn from your own functions? Break them! How would it handle nonsensical inputs? What are its limits? How much data is too much data? Is it taking seconds, minutes, days to run? We can learn so much about how the function (and R in general) works when we stress test them.