String manipulation

Learning objectives

Construct regular expressions
Find and count patterns in strings
Replace and subset (sub-)strings
Concatenate strings

Can't I change it in Excel?

Sure, go ahead for small datasets and ad hoc analyses. However, can we be confident of being consistent for large datasets? Did we miss a few cells? Did we introduce extra punctuation? Can Excel handle the amount of data? A good understanding of string manipulation is useful for data cleaning (as mentioned before) and comes in handy in many other situations.

In this lesson, we will learn how to use functions from the stringr package to manipulate and extract strings from taxonomic lineages in the tax data frame.

Biological strings

At the fundamental level, all biological sequences (i.e., DNA, RNA, and amino acid) are strings. While the functions covered here can handle sequences, more is needed to extract biological relevance from them. When we think about biological sequences such as nucleic or amino acid sequences, we are interested in the relationships within and between strings (e.g., translations, alignments, k-mers, secondary structures, etc.). R has some packages that can handle such operations (e.g., seqinr, Biostrings, DADA2; see other packages in the Bioconductor repository). For those analyses, other programming languages offer better support and computational efficiencies.

Regular expressions

A common thread (pun intended) in all string-based operations is the regular expression (AKA regex or regexp). You probably have some experience with regular expressions if you have used sed or awk in bash. Formally, a regular expression is a sequence of characters that describes a set of strings. It is helpful to think of regular expressions as patterns.

Consider this example:

[0-9]+.*_(log|chk)\\.txt

The pattern is read from left to right. Let's break it down:

[0-9] is a set of digits from 0 to 9. The square brackets ([]) are meta-characters that match any one character from a range (as in this example) or from a list of individual characters.
+ indicates that the pattern preceding it should occur \(\ge\) 1 time(s).
.* means to match any character (.) \(\ge\) 0 times (*).
_ is just an underscore.
(log|chk) means to match either log or chk. The | functions as an OR operator and the round brackets () group the alternatives so that the OR applies only to the patterns inside them.
\\. matches a literal dot. A single backslash "escapes" the character following it so that it loses its special meaning; the backslash is doubled because R also treats a backslash as an escape character inside a string.

Based on the pattern/expression, we can reasonably assume that it is meant to match a text file with a date or process number as a prefix, followed by some description and the output type. For example, the expression would find 12345_some_log.txt but would not find file.txt because it does not contain any digits. Note that the pattern is not anchored, so the digits may appear anywhere before _log.txt or _chk.txt, not only at the start of the file name.
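A quick way to check a regular expression is to run it against a few strings whose outcome we already know. The sketch below uses str_detect() on made-up file names (they are illustrative assumptions, not files used in this lesson).

```r
library(stringr)

# The pattern from the example above
pattern <- "[0-9]+.*_(log|chk)\\.txt"

# Hypothetical file names, made up for illustration
files <- c(
  "12345_some_log.txt",      # digits, a description, then _log.txt
  "file.txt",                # no digits at all
  "2024-01-05_run_chk.txt",  # date prefix, then _chk.txt
  "12345_some_log_txt"       # no literal dot before txt
)

# One logical value per file name: TRUE where the pattern is found
matches <- str_detect(files, pattern)
matches
```

Only the first and third file names should be reported as matches. Adding deliberately "wrong" strings like the last one is a cheap way to confirm that an escaped character behaves as intended.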

In most cases, learning to construct regular expressions is based on trial and error and many Google searches. To soften the learning curve, the stringr team compiled a helpful cheat sheet we can reference.

Wrangling taxonomy

Obtaining microbial taxonomy from DNA sequences

A major aim of microbial ecology is the identification of populations across an environment. We do that by sequencing amplicons of the 16S small subunit ribosomal RNA gene, the standard taxonomic marker. The sequences are then clustered based on sequence similarity (to reduce redundancy and improve computational efficiency) and assigned a taxonomic lineage using a classifier that compares our sequence data with those in a reference database (popular options are SILVA and Greengenes2). Depending on how similar our sequences are to those in the database, and how well represented they are there, our sequences will be assigned names at ranks ranging from domain to species.

Inspecting taxonomy

Let us begin by inspecting what our taxonomy looks like.

code

head(tax$Taxon)
tail(tax$Taxon)

What's in a taxonomy

A semicolon ; separates the ranks
Ranks are given a single-letter prefix followed by __
Ranks are unevenly assigned. Some are identified down to species level, while only phylum is known in others.
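These observations can also be checked programmatically. As a minimal sketch, the toy lineages below (made up to mimic the format described above, not taken from the tax data frame) are passed to str_count() to count how many ranks each lineage carries.

```r
library(stringr)

# Made-up lineages in the same "prefix__name; prefix__name" format
toy_taxon <- c(
  "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrosococcales; f__Nitrosococcaceae; g__Nitrosococcus; s__Nitrosococcus_oceani",
  "d__Bacteria; p__Nitrospirota; c__Nitrospiria; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira",
  "d__Archaea; p__Crenarchaeota"
)

# Count rank prefixes: one lower-case letter followed by two underscores
n_ranks <- str_count(toy_taxon, "[a-z]__")
n_ranks
```

Here the three lineages carry 7, 6, and 2 ranks respectively, which illustrates how unevenly ranks can be assigned.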

Detecting and extracting patterns

Some initial questions when inspecting the above taxonomy are:

How well characterised are our sequences?
Did we manage to retrieve biologically important taxa?

We can answer those questions using pattern detection.

1. How well characterised are our sequences?

Let's apply a heuristic and answer a simpler question: How many sequences were classified at each taxonomic rank (species, genus, family, order, class, phylum)? If there are large numbers of sequences that were only identified at higher taxonomic ranks, the system we are studying may harbour lots of novel microbial populations.

code

str_detect(tax$Taxon, "s__") %>%
  sum()

In the code above, we used the function str_detect() to find the species prefix s__ in the Taxon column. The output of str_detect() is a logical/boolean vector. As R treats TRUE as 1 and FALSE as 0, we can use sum() to count the number of TRUE values.

stringr syntax

Most functions in the stringr package accept arguments in this order:

str_<name>(<vector>, <pattern>, ...)

Question

What proportion of ASVs has been assigned a lineage at the rank of genus? What about the rank of phylum?

Solution

sum(str_detect(tax$Taxon, "g__")) / nrow(tax)
sum(str_detect(tax$Taxon, "p__")) / nrow(tax)

It looks like our sequences are well characterised from the rank of genus and up.
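The same calculation can be repeated for every rank in one go. The sketch below assumes a toy data frame (tax_toy, made up for illustration) in place of tax, and uses mean() on the logical vector, which is equivalent to dividing the sum by the number of rows.

```r
library(stringr)

# Toy stand-in for the tax data frame (hypothetical lineages)
tax_toy <- data.frame(
  Taxon = c(
    "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrosococcales; f__Nitrosococcaceae; g__Nitrosococcus; s__Nitrosococcus_oceani",
    "d__Bacteria; p__Nitrospirota; c__Nitrospiria; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira",
    "d__Bacteria; p__Proteobacteria; c__Alphaproteobacteria",
    "d__Archaea; p__Crenarchaeota"
  )
)

# Rank prefixes to look for, named by rank
rank_prefix <- c(
  phylum = "p__", class = "c__", order = "o__",
  family = "f__", genus = "g__", species = "s__"
)

# Proportion of lineages that contain each prefix
prop_by_rank <- sapply(
  rank_prefix,
  function(prefix) mean(str_detect(tax_toy$Taxon, prefix))
)
prop_by_rank
```

The result is a named numeric vector with one proportion per rank, which makes it easy to see at which rank the classification starts to drop off.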

2. Did we manage to retrieve biologically important taxa?

An ecosystem service that estuaries provide is nitrogen removal (via denitrification). This is usually performed by prokaryotes spanning the Bacterial and Archaeal domains. Their metabolic activity ensures that excess nitrogen is removed in gaseous form and thus prevents eutrophication. The starting substrate for denitrification is nitrate. Thus, reduced nitrogen must first be oxidised via nitrification. Two communities are involved in the conversion from reduced to oxidised nitrogen:

Ammonia oxidisers (usually have the prefix "Nitroso" in their taxonomy)
Nitrite oxidisers (usually have the prefix "Nitro" in their taxonomy)

Let's find out if we managed to sample any of them.

We will first need to subset the vector to retain those that have "Nitro" in their names. We will do this using str_subset().

code

nitro <- str_subset(tax$Taxon, "__Nitro")
length(nitro)
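Because "Nitroso" also starts with "Nitro", the subset above contains both communities. One possible way to separate them, sketched here on made-up lineages, is to call str_subset() with and without negate = TRUE. Treat the split as a naming heuristic only: a name beginning with "Nitro" does not guarantee that an organism is a nitrifier.

```r
library(stringr)

# Made-up lineages standing in for the nitro vector
nitro_toy <- c(
  "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrosococcales; f__Nitrosococcaceae; g__Nitrosococcus",
  "d__Bacteria; p__Nitrospirota; c__Nitrospiria; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira",
  "d__Archaea; p__Crenarchaeota; c__Nitrososphaeria; o__Nitrosopumilales; f__Nitrosopumilaceae; g__Nitrosopumilus"
)

# Lineages with a "Nitroso" name at any rank (putative ammonia oxidisers)
ammonia_ox <- str_subset(nitro_toy, "__Nitroso")

# negate = TRUE keeps the elements that do NOT match the pattern
nitrite_ox <- str_subset(nitro_toy, "__Nitroso", negate = TRUE)

length(ammonia_ox)
length(nitrite_ox)
```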

We will also take a finer look at their lineages so we can get a better idea of which community they belong to.

code

str_replace(
  nitro,
  "d__([^;]+);.*(Nitro[a-z]+).*",
  "\\1, \\2"
) %>%
  unique()

The code above is quite complicated. Let's break it down.

The function str_replace() is a flexible function that helps us extract and replace substrings depending on how the regex was constructed.
d__([^;]+); looks for the sub-string d__ followed by one or more characters that are not a semicolon ([^;]+). The regex [^<some_pattern>] means to match any character that is NOT in <some_pattern>. The round brackets () "capture" or "save" the match within them for use in the replacement. This is followed by a semicolon (our rank separator), which is matched but not captured.
.*(Nitro[a-z]+).* As we do not know at which rank "Nitro" will appear, the regex .* matches any character (.) zero or more times (*). Nitro[a-z]+ then matches "Nitro" followed by one or more lower-case letters from 'a' to 'z'. Because .* is greedy (it matches as much as it can), the captured name is the last "Nitro" in the lineage, i.e., the one at the lowest assigned rank. Anything after that is matched but not captured.
The last argument in the function specifies what the replacement string should look like. \\1, \\2 replaces the matched text with the two captured patterns separated by a comma and a space. Captured groups are numbered in the order in which they appear in the pattern, and can be referenced in any order in the replacement. Therefore, if we wanted the "Nitro" part to be in front, we would reverse the order to \\2, \\1.
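To see how the capture groups behave, the sketch below runs the same pattern on a single made-up lineage in which "Nitro" appears at several ranks. It also compares the greedy .* with its lazy counterpart .*?, which matches as few characters as possible; the lazy variant is added here for comparison and is not part of the lesson's code.

```r
library(stringr)

# A made-up lineage with "Nitro" at several ranks
lineage <- "d__Bacteria; p__Nitrospirota; c__Nitrospiria; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira"

# Greedy .* : the second group lands on the last "Nitro..." in the string
greedy <- str_replace(lineage, "d__([^;]+);.*(Nitro[a-z]+).*", "\\1, \\2")

# Lazy .*? : the second group lands on the first "Nitro..." instead
lazy <- str_replace(lineage, "d__([^;]+);.*?(Nitro[a-z]+).*", "\\1, \\2")

# Swapping the back-references puts the "Nitro" name in front
swapped <- str_replace(lineage, "d__([^;]+);.*(Nitro[a-z]+).*", "\\2, \\1")

greedy   # "Bacteria, Nitrospira"
lazy     # "Bacteria, Nitrospirota"
swapped  # "Nitrospira, Bacteria"
```

Whether the greedy or the lazy form is preferable depends on the question: the greedy form reports the name at the lowest assigned rank, the lazy form the name at the highest rank where "Nitro" occurs.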

In case of failure...

If str_replace() cannot find matches for the given pattern, it will return the original string. This is a safety mechanism. We can choose to filter it out later if necessary.

Run the following and see for yourself:

code

str_replace(nitro, ".*p__([^;]+).*g__([^;]+).*", "\\1, \\2") %>% 
  unique()
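The fallback behaviour is easier to see on a small input. In this sketch, the second made-up lineage has no genus, so the pattern cannot match and the string comes back unchanged; str_detect() with the same pattern is one possible way to filter such elements out afterwards.

```r
library(stringr)

# Made-up lineages: the second one has no genus assigned
toy <- c(
  "d__Bacteria; p__Nitrospirota; c__Nitrospiria; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira",
  "d__Bacteria; p__Nitrospirota"
)
pattern <- ".*p__([^;]+).*g__([^;]+).*"

# The first element is rewritten, the second is returned as-is
replaced <- str_replace(toy, pattern, "\\1, \\2")
replaced

# Keep only the elements where the pattern actually matched
matched_only <- replaced[str_detect(toy, pattern)]
matched_only
```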

Other useful functions

The functions str_detect() and str_replace() were the focus of this lesson because they are flexible and make it easy to visualise how regex and pattern capture work. Over the years, I have also found the functions below to be useful:

Concatenation: str_c(), paste()

Concatenates any number of string vectors per element (via the sep = argument) and/or across elements (via the collapse = argument)

code

str_c(fruit[1:5], words[1:5], sep = ", ")
str_c(fruit[1:5], collapse = "||")
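One practical difference between str_c() and paste() is how they treat missing values. The sketch below uses small made-up vectors to show it: str_c() propagates NA, whereas paste() converts NA to the text "NA".

```r
library(stringr)

# Made-up input with one missing value
genus   <- c("Nitrospira", NA)
epithet <- c("marina", "oceani")

# str_c(): a missing input gives a missing output
with_str_c <- str_c(genus, epithet, sep = " ")

# paste(): NA is silently converted to the string "NA"
with_paste <- paste(genus, epithet, sep = " ")

# collapse joins the elements of one vector into a single string
collapsed <- str_c(c("a", "b", "c"), collapse = "||")

with_str_c
with_paste
collapsed
```

Keeping the NA is usually the safer behaviour during data cleaning, because a literal "NA" embedded in a name is easy to overlook.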

Interpolation: str_glue()

Evaluates expressions within {}, then interpolates and concatenates the results as strings. Very useful for programmatic use.

code

prop_species <- sum(str_detect(tax$Taxon, "s__")) / nrow(tax)

print(
  str_glue("QIIME2 classified {prop_species} of ASVs down to species level.")
)
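Since anything inside {} is evaluated as R code, formatting can be done in place. The sketch below uses a made-up proportion (prop_toy, an assumption for illustration) and rounds it to a percentage inside the braces.

```r
library(stringr)

# Made-up proportion standing in for prop_species
prop_toy <- 0.4567

# The expression inside {} is evaluated before it is inserted
msg <- str_glue(
  "QIIME2 classified {round(prop_toy * 100, 1)}% of ASVs down to species level."
)
msg
```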

Separation by delimiter: str_split()

Splits each string based on a provided delimiter and returns a list of character vectors, with one list element per input string. A single input string therefore gives a list of length 1.

code

str_split(nitro, pattern = "; ")
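The shape of the output is worth inspecting on a small input. This sketch splits two made-up lineages of different lengths and also shows the simplify = TRUE argument, which returns a character matrix padded with empty strings instead of a list.

```r
library(stringr)

# Made-up lineages with two and three ranks
lineages <- c(
  "d__Bacteria; p__Nitrospirota",
  "d__Archaea; p__Crenarchaeota; c__Nitrososphaeria"
)

# Default: a list with one character vector per input string
ranks_list <- str_split(lineages, "; ")
lengths(ranks_list)

# simplify = TRUE: a matrix with one row per input string
ranks_matrix <- str_split(lineages, "; ", simplify = TRUE)
dim(ranks_matrix)
ranks_matrix[1, 3]  # empty string used as padding
```

The list form keeps lineages of different depths intact, while the matrix form is convenient when each rank should end up in its own column.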

Whitespace trimming: str_trim()

Removes whitespace at each end of the string. Very useful during data cleaning to make sure there is no leading or trailing whitespace that interferes with downstream analyses.

code

str_trim("   this has blanks   ")
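str_trim() can also be restricted to one end of the string via its side argument, and the related str_squish() additionally collapses repeated whitespace inside the string. A short sketch on a made-up string:

```r
library(stringr)

messy <- "   this has   blanks   "

# Both ends trimmed; whitespace inside the string is kept
both_ends <- str_trim(messy)

# Only the left end trimmed
left_only <- str_trim(messy, side = "left")

# Ends trimmed and inner runs of whitespace reduced to single spaces
squished <- str_squish(messy)

both_ends
left_only
squished
```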

I highly recommend playing with the functions above to get a feel for how they work. stringr has some built-in character vectors that you can use on-the-fly as test cases: fruit, words, and sentences.