Additional stringr practise and examples¶

Can't I change it in Excel? And other reasons to learn string manipulation.¶

You receive 10 files from someone. Five of the files all share the same naming convention (e.g., all column names are in full capitals). The other five files have a mix: some column names are lower case, some have the first letter capitalised, and some have a mix of full, first, and lower. You want all your files to have the same naming convention, so you know you'll want to change the column names for those five files. It's not many columns per file, and there are only five of them... Should you change those names in Excel?

Short answer: sure; long answer: it depends. If the data is small enough, Excel (or any spreadsheet software) is sufficient to fulfil your exploratory data analysis needs. However, for larger data sets, can you ensure that you have performed adequate, consistent, and robust data cleaning before analyses? Furthermore, if you are importing tabular data into R, you likely intend to perform statistical/numerical analyses and create a near-publication-level figure. Are your categorical variables consistent in their spelling? Did you check whether your European collaborator used a dot instead of a comma to delineate decimal points? Will you extract every statistic and string them together as your axis title for a 6-panel figure by hand? This is where understanding how to manipulate strings en masse is helpful.

String manipulation is a fundamental skill in every programming language, and every language has its perks and quirks in handling strings. The common thread (pun intended) that ties them together is regular expressions which form the basis for patterns and their related operations. In bash, you may have used them to find or substitute parts of a string using grep or sed/awk. Here, we will cover some regular expressions as part of the examples and illustrate how they are interpreted in R. However, the topic of regular expression itself is vast. Please refer to the second page of the stringr cheat sheet for comprehensive coverage of most of the regex in R.

Biological strings

At the fundamental level, all biological sequences are strings. They can be handled with the functions covered here, but more is needed to extract biological relevance from them. When we think about biological sequences such as nucleic or amino acid sequences, we are interested in the relationships within and between strings (translations, alignments, k-mers, secondary structures, etc.). R has some packages that can handle such operations (e.g., seqinr, Biostrings, DADA2, see other packages in the Bioconductor repository). However, if these analyses are your primary interest, I suggest you look into other programming languages due to the inefficiencies of R in this regard.

A brief detour¶

This lesson will briefly step away from exploratory data analyses and explore the realm of string-based methods in R via the stringr package. For this lesson, we will work with the following biological context:

A characteristic of estuarine sediment (where this data originated from) is its active nitrogen-cycling members of the prokaryotic community. For those participating in ammonia or nitrite oxidation, their lineages often have the prefix "Nitroso-" for putative ammonia oxidisers and "Nitro-" for putative nitrite oxidisers. However, specific lineages within these groups (assigned at varying taxonomic levels) bear the prefix but do not harbour the necessary genetic machinery for this function.

The above forms the boundary around the data which we will use for this lesson. Here, we will explore a group of functions at an abstract level. We aim to provide awareness to a set of tools to manipulate (i.e., search, subset, substitute) strings. Alongside that, we want you to be comfortable in constructing regular expressions. This can be daunting if you are new to them. However, we like you to keep in mind that it is both an engineering problem (maximising efficiency with pattern captures) and an art form (being creative and playful to adapt to situations). We stress that there are multiple ways (even within the same framework) to obtain similar results. Some functions are Swiss-army knives when paired with the right regex patterns.

Subset a character vector¶

Lets prepare a character vector for us to work on.

code

nitros <- str_subset(tax$Taxon, "Nitro")
head(nitros)

[1] "d__Bacteria; p__Nitrospirota; c__Nitrospiria; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira; s__uncultured_Cytophaga"            
[2] "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrosococcales; f__Nitrosococcaceae; g__SZB85; s__uncultured_bacterium"   
[3] "d__Archaea; p__Crenarchaeota; c__Nitrososphaeria; o__Nitrosopumilales; f__Nitrosopumilaceae; g__Candidatus_Nitrosopumilus"            
[4] "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrosococcales; f__Nitrosococcaceae; g__SZB85; s__uncultured_bacterium"   
[5] "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrosococcales; f__Nitrosococcaceae; g__SZB85; s__uncultured_bacterium"   
[6] "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrosococcales; f__Nitrosococcaceae; g__FS142-36B-02; s__uncultured_gamma"

This block of code returns a new character object nitros, which is made up of every element (line) where the pattern "Nitro" occurrs. In this case the pattern is a sub-string - any string that contains "Nitro" i.e., both "Nitro" and "Nitros" will be captured.

Equivalent code

Vector subset (tidyverse)Pattern matching (base R)

tax$Taxon[str_detect(tax$Taxon, "Nitro")]

grep(pattern = "Nitro", x = tax$Taxon, value = TRUE)

Anatomy of pattern-based string functions in stringr

All stringr functions start with the string to operate on as the first argument. For functions that work with regex, the second argument is always the pattern. For functions that perform replacements, the pattern used for replacement is the third argument. For functions that perform searches, the third argument is usually negate = <Boolean> which when TRUE does an inverse search of the pattern along the string.

str_*(<string>, <pattern>, <replacement>|<negate>)

Extract sub-string¶

What lineages do our ammonia and nitrite oxidisers come from? We can inspect their assignment at various taxonomic levels to find out.

code

# Extract domain-level assignments
str_extract(nitros, "d__[^;]+") %>%
  unique()

# Extract phylum-level assignments
str_replace(nitros, ".+p__([^;]+);.+", "Phylum: \\1") %>%
  unique()

[1] "d__Bacteria" "d__Archaea"

[1] "Phylum: Nitrospirota"   "Phylum: Proteobacteria" "Phylum: Crenarchaeota"  "Phylum: Nitrospinota"  
[5] "Phylum: Zixibacteria"

Here we used two different ways of getting sub-strings out of long strings.

str_extract() searches the nitros vector for cases of d__ followed by anything that is NOT a ";" ([^;]) more than once (+). Note that the taxonomic levels are delineated using a semicolon. When it finds a match to a semicolon, it will stop and return the resulting sub-string. The text below illustrates the pattern matching where + represents "still matching" and ! represents "not a match".
```
string:     d__Bacteria; p__Nitrospirota; c__Nitrospiria; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira; s__uncultured_Cytophaga
extract:    d__[^;]++++!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```
str_replace looks at everything .+ before p__ then captures the pattern "anything except a semicolon more than once" [^;]+. This captured pattern ends with a semicolon and everything else in the string et cetera. We replace the pattern with Phylum: then the captured pattern represented in regex by \\1 (meaning captured pattern 1). The text below illustrates the matched string, where "everything else" is represented using *.
```
string:     d__Bacteria; p__Nitrospirota; c__Nitrospiria; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira; s__uncultured_Cytophaga
match:      *************p__[^;]++++++++;********************************************************************************************
capture:    !!!!!!!!!!!!!!!!Nitrospirota!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```

Lets attempt to capture more than one pattern:

code

# Extract phylum and genus level assignments and sort unique outputs
str_replace(nitros, ".+p__([^;]+);.+g__([^;]+).*", "Phylum: \\1, Genus: \\2") %>% 
  unique() %>% 
  str_sort()

 [1] "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrococcales; f__Nitrococcaceae"    
 [2] "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Nitrosococcales; f__Nitrosococcaceae"
 [3] "Phylum: Crenarchaeota, Genus: Candidatus_Nitrosopelagicus"                                      
 [4] "Phylum: Crenarchaeota, Genus: Candidatus_Nitrosopumilus"                                        
 [5] "Phylum: Crenarchaeota, Genus: Nitrosopumilaceae"                                                
 [6] "Phylum: Nitrospinota, Genus: P9X2b3D02"                                                         
 [7] "Phylum: Nitrospirota, Genus: 4-29-1"                                                            
 [8] "Phylum: Nitrospirota, Genus: Nitrospira"                                                        
 [9] "Phylum: Nitrospirota, Genus: uncultured"                                                        
[10] "Phylum: Proteobacteria, Genus: 3PJM14"                                                          
[11] "Phylum: Proteobacteria, Genus: AqS1"                                                            
[12] "Phylum: Proteobacteria, Genus: Cm1-21"                                                          
[13] "Phylum: Proteobacteria, Genus: FS142-36B-02"                                                    
[14] "Phylum: Proteobacteria, Genus: IS-44"                                                           
[15] "Phylum: Proteobacteria, Genus: MSB-1D1"                                                         
[16] "Phylum: Proteobacteria, Genus: Nitrosomonas"                                                    
[17] "Phylum: Proteobacteria, Genus: SZB85"                                                           
[18] "Phylum: Zixibacteria, Genus: Zixibacteria"

Let's break a few items down:

We are now capturing two patterns: \\1 represents the phylum name, and \\2 represents the genus name. The numbers represent the order of the captured pattern that occurs in the string.
Some of it did not work, resulting in the entire string being printed out (the default behaviour in many of the stringr functions, better safe than sorry!)

How would you fix the issue in point 2?

Prior to str_replace(), add a str_subset() and use the genus prefix g__ to retain all taxonomies with valid genera assignments.

code

str_subset(nitros, "g__") %>% 
  str_replace(".+p__([^;]+);.+g__([^;]+).*", "Phylum: \\1, Genus: \\2") %>% 
  unique() %>% 
  str_sort()

Notice the subtle change in the regex from .+ to .* at the end of the pattern? The combination .+ means "anything more than once", whereas .* means "anything more than zero times". Most of the time, either pattern will create the desired output. However, in this example, the result will look strange if + was used instead of *. Try it to see for yourself what the difference is.

Detecting and counting patterns¶

How many of these lineages are putative ammonia oxidisers (i.e., "Nitroso-")? Lets count them:

code

str_detect(nitros, "Nitroso") %>%
  sum()

[1] 75

If we only run the str_detect() part, it will return a Boolean logical vector which sum() uses to tally up the number of TRUE elements.

How would you count the number of lineages for putative nitrite, but not ammonia, oxidisers?

Remember that putative nitrite oxidisers carry the "Nitro-" prefix.

Solution

Given that the entirety of nitros contains sequence variants that have "Nitro" somewhere along it's assigned lineage, we need only count those that do not contain "Nitroso".

code

str_detect(nitros, "Nitroso", negate = TRUE) %>%
  sum()

Lets ask a different question:

How often does the prefix Nitroso- occur along the taxonomic lineage?

Here, we are interested in the substring frequency along each element across the vector. The difference between this and the previous question is subtle. The code below can help illuminate the distinction.

code

str_count(nitros, "Nitroso")

 [1] 0 2 4 2 2 2 0 2 4 2 2 0 2 4 4 0 2 4 0 0 2 0 0 4 0 0 2 0 0 0 2 4 2 1 4 2 4 1 4 4 1 4 2 0 5 2 0 0 2 1 2 0 0 0 0
[56] 2 0 2 0 2 2 2 2 4 4 4 0 4 0 4 2 4 2 4 0 4 2 0 4 0 1 4 4 4 2 0 4 4 2 2 0 4 2 1 2 0 2 2 2 2 0 2 2 0 2 4 44

To summarise:

str_detect() %>% sum() tallies the number of elements that contain a pattern. A Boolean vector is returned by str_detect(), and then TRUE statements are tallied by sum().
str_count() tallies the number of times a pattern is observed in an element. A numeric vector is returned.

Tap tap... Is this working?¶

How do you know if the search pattern worked? In other words, did the search pattern target the intended sub-strings? Conversely, did it also target sub-strings we did not intend to capture? When starting out in learning regular expressions, it is useful to visually inspect what the pattern supplied is matching to. Here, we will use str_view() to do "unit tests" on strings to check our patterns.

In the previous section, we extracted and printed phylum-genera pairs from the nitros vector. Notice that there are genera with odd code-like name (e.g., P9X2b3D02, Cm1-21)? How would we construct a regex that can be used to count the number of named genera we have in tax$Taxon? Before exploring the logic required to construct this regex, lets create a smaller vector of unique genus assignments. It is easier to build up a regex by working on a smaller problem.

code

genera <- str_extract(tax$Taxon, "g__[^;]+") %>%
  unique()

Now lets work through the logic:

What differentiates named genera from code-name genera?

Named genera are words in a linguistic sense. They are also "names" where they start with an upper-case letter followed by a string of lower-case letters. Notice that there are no non-alphabet characters in them.

Using our logic, we can construct this pattern [A-Z][a-z]+ which means "any capital letter from A to Z next to any lower case letters from a to z more than once". Lets check how the pattern did:

code

# Named genera starts with an upper-case letter, followed by multiple lower-case letters
str_view(genera, "[A-Z][a-z]+")

# And the tail
str_view(genera, "[A-Z][a-z]+") %>% tail()

 [1] │ g__<Chloroplast>
 [2] │ g__<Woeseia>
 [3] │ g__<Mitochondria>
 [4] │ g__<Sva>1033
 [7] │ g__<Parahaliea>
 [8] │ g__<Actibacter>
 [9] │ g__<Lutimonas>
[10] │ g__<Robiginitalea>
[11] │ g__<Anaerosolibacter>
[13] │ g__<Aquibacter>
[14] │ g__<Pseudophaeobacter>
[15] │ g__<Flavirhabdus>
[16] │ g__<Lutibacter>
[17] │ g__<Candidatus>_<Fritschea>
[18] │ g__<Sva>0081_sediment_group
[19] │ g__<Halioglobus>
[20] │ g__<Fusibacter>
[21] │ g__<Erythrobacter>
[22] │ g__<Limibacillus>
[23] │ g__<Crassaminicella>
... and 213 more

[281] │ g__<Gven>-F17
[282] │ g__<Gaetbulibacter>
[283] │ g__<Sneathiella>
[285] │ g__<Thermomarinilinea>
[286] │ g__<Dadabacteriales>
[287] │ g__<Colwellia>
[288] │ g__<Gemmatimonadaceae>
[289] │ g__<Bacteroidetes>_vadinHA17
[290] │ g__<Dechloromonas>
[291] │ g__<Vicinamibacteraceae>
[292] │ g__<Desulfobulbus>
[293] │ g__<Pelagicoccus>
[296] │ g__<Pirellula>
[297] │ g__<Hyphomicrobium>
[298] │ g__<Spirochaeta>_2
[299] │ g__[<Desulfobacterium>]_catecholicum_group
[301] │ g__<Porticoccus>
[303] │ g__<Iodidimonas>
[306] │ g__<Oceanibulbus>
[307] │ g__<Carboxylicivirga>

Notice how str_view() differentiates matches from non-matches using different colours and envelopes them in <match>. This is how it can help us build the intuition for using regex.

Our pattern looks good, but there are a few unintended matches as well:

Some have numbers suffixed immediately after the name-like genera or following a - or _
One is encased in square brackets (g__[Desulfobacterium]_catecholicum_group)

We will refine our pattern by incorporating the fact that named genera must end with a lower-case letter:

code

str_view(genera, "[A-Z][a-z]+$")
str_view(genera, "[A-Z][a-z]+$") %>% tail()

 [1] │ g__<Chloroplast>
 [2] │ g__<Woeseia>
 [3] │ g__<Mitochondria>
 [7] │ g__<Parahaliea>
 [8] │ g__<Actibacter>
 [9] │ g__<Lutimonas>
[10] │ g__<Robiginitalea>
[11] │ g__<Anaerosolibacter>
[13] │ g__<Aquibacter>
[14] │ g__<Pseudophaeobacter>
[15] │ g__<Flavirhabdus>
[16] │ g__<Lutibacter>
[17] │ g__Candidatus_<Fritschea>
[19] │ g__<Halioglobus>
[20] │ g__<Fusibacter>
[21] │ g__<Erythrobacter>
[22] │ g__<Limibacillus>
[23] │ g__<Crassaminicella>
[24] │ g__<Ulvibacter>
[25] │ g__<Gracilimonas>
... and 174 more

[296] │ g__<Pirellula>
[297] │ g__<Hyphomicrobium>
[301] │ g__<Porticoccus>
[303] │ g__<Iodidimonas>
[306] │ g__<Oceanibulbus>
[307] │ g__<Carboxylicivirga>

That looks much better! Notice also we have managed to capture Candidatus genera because our pattern matches the proposed name. This is okay as they will be on their way to being named sometime in the future. Lets now use this to count the number of named genera overall:

code

str_extract(tax$Taxon, "g__[^;]+") %>% 
  str_detect("[A-Z][a-z]+$") %>% 
  sum(na.rm = TRUE)

[1] 1490

The na.rm = TRUE argument is necessary as there is an NA generated by the code that created genera. This is from lineages that did not have valid genus-level assignments.

In summary, the code above extracted the genus-level assignments from the long lineage string, detected which of them were named, then counted those while ignoring NAs.