Gene annotation I: BLAST-like and HMM¶
Objectives
Annotation methods¶
Broadly speaking, there are two ways we perform gene annotations with protein sequences. Both compare our sequences of interest against a curated set of protein sequences for which function is known, or is strongly suspected. In each case, there are particular strengths to the approach and for particular research questions, one option may be favoured over another.
BLAST
-like annotation¶
The first of these is the BLAST
algorithm for sequence alignment. This approach performs pairwise alignment between the gene of interest (query sequence) and the sequences in the database (target sequence). BLAST
searches each potential target sequence for k-mers identified in the query sequence. Where these k-mers are found in targets, the ends are extended out to try to create longer regions of highly similar sequence spans. Across this span, the tool identifies the longest span of characters (nucleotide or amino acid) that match within a scoring framework to return the length of the region (coverage) and the sequence identity over the span (identity).
The original tool for performing this kind of analysis was the BLAST
tool. While BLAST
and its variants are still excellent tools for performing this kind of sequence annotation, they suffer from a slow runtime speed due to the need to test each query sequence against every target sequence in the database. For this reason, several tools have been published which take the basic approach of BLAST
, but augment it with methods to reduce the number of pairwise comparisons needed to identify targets with high sequence similarity to the query. Two popular pieces of software are the tools USEARCH
and DIAMOND
.
HMM-profiling of domains¶
An alternate method for attributing function to query sequences is to consider them as a collection of independently functioning protein folding domains. This is the approach used in the HMMER software, and the Pfam, TIGRfam, and PANTHER databases. In these analyses, the database consists not of individual sequences, but of Hidden Markov models built from a collection of proteins that share a common domain. These profiles build out a statistical map of the amino acid transitions (from position to position), variations (differences at a position), and insertions/deletions between positions in the domain across the different observations in the training database and apply these maps to the query data.
These exercises will take place in the 10.gene_annotation_and_coverage/
folder.
Navigate to working directory
Annotating MAGs against the UniProt database with DIAMOND
¶
For this exercise we are going to use DIAMOND
for performing our annotation. We have chosen to use this tool because it is faster than BLAST
, and USEARCH
comes with licensing restrictions that make it hard to work with in a shared computing environment like NeSI.
For this exercise we have created a diamond-compatible database from the 2024 release of the UniProt database.
For input files, the predictions/
results from the previous gene prediction exercise have been copied over to 10.gene_annotation_and_coverage/predictions/
.
In general, diamond
takes a pair of input files - the protein coding sequences we wish to annotate and the database we will use for this purpose. There are a few parameters that need to be tweaked for obtaining a useful output file, however.
Terminal output
diamond v2.0.15.153 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
Syntax: diamond COMMAND [OPTIONS]
Commands:
...
blastp Align amino acid query sequences against a protein reference database
...
General options:
--threads (-p) number of CPU threads
--db (-d) database file
--out (-o) output file
--outfmt (-f) output format
...
There are two output formats we can chose from which are useful for our analysis. We will obtain our output in the BLAST tabular format, which provides the annotation information in a simple-to-parse text file that can be viewed in any text or spreadsheet viewing tool. This will allow us to investigate and evaluate the quality of our annotations.
Awkwardly, DIAMOND
does not provide the headers for what the columns in the output table mean. This table is a handy reference for how to interpret the output.
From here we can view important statistics for each query/target pairing such as the number of identical residues between sequences and the aligned length between query and target.
Before we begin, we need to create an directory for outputs.
Now, lets set up a slurm job to annotate each of our MAGs.
Create a new script
Remember to update
code
Annotating MAGs against the Pfam database with HMMER
¶
The standard software for performing HMM-profiling annotation is HMMER. Compared to BLAST
, FASTA
, and other sequence alignment and database search tools based on older scoring methodology, HMMER
aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new HMMER3
project, HMMER
is now essentially as fast as BLAST
.
HMMER
will search one or more profiles against a sequence database for sequence hommologs, and for making sequence alignments, implementing profile hidden Markov models. In this exercise, we will perform a search using hmmsearch
. For each profile in hmmfile, HMMER
uses that query profile to search the target database of sequences indicated in seqdb, and output ranked lists of the sequences with the most significant matches to the profile. hmmsearch
accepts any fastA file as target database input. It also accepts EMBL/UniProtKB text format, and Genbank format. It will automatically determine what format your file is in so you don’t have to specify it.
As we did with diamond
, we will also have to modify some parameters to get the desired ouotput.
hmmsearch -h
hmmsearch :: search profile(s) against a sequence database
HMMER 3.3.2 (Nov 2020); http://hmmer.org/
Copyright (C) 2020 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>
Basic options:
-h : show brief help on version and usage
Options directing output:
...
--tblout <f> : save parseable table of per-sequence hits to file <f>
....
Options controlling reporting thresholds:
...
-E <x> : report sequences <= this E-value threshold in output [10.0] (x>0)
...
Other expert options:
...
--cpu <n> : number of parallel CPU workers to use for multithreads
...
We are now going to submit another slurm job to annotate our MAGs using the Pfam database. Pfam used to have a standalone website, but it has recently been integrated into InterPro maintained by the European Bioinformatics Institute (see announcement). Matching sequences to a Pfam
entry allows us to transfer the functional information from an experimentally characterised sequence to uncharacterised sequences in the same entry. Pfam
then provides comprehensive annotation for each entry.
Remember to update
code
Annotating signal peptides of predicted genes using SignalP6
¶
In addition to putative function, we can also predict whether a protein is secreted or not using SignalP
. In the 5th version, users needed to know the Gram status of the organism for the software to predict signal peptides. In 2022, the developers (Teufel et al., 2022) produced a 6th version that uses protein language models (like ChatGPT, for proteins!) to predict the presence of signal peptides across the domains of life.
SignalP6
has several models depending on required accuracy and hardware availability. Here, we are running the fast model on CPU. There are models built for GPU (they are interconvertible and the software package comes with utilities for easy conversion) that is blazingly fast (most language models require GPUs for accelerations). However, GPUs on NeSI are a coveted resource. For our small dataset, the fast model on CPU is sufficient as an example.
SignalP6
is licensed software
In terms of computational capability to detect signal peptides, there is nothing like SignalP6. It is free for academic use and other models can be requested from the developers. If you are not running this on NeSI, you will need to acquire your own license.
code
SignalP
will probably TIMEOUT
The above script will almost, inevitably, time out before completing every MAG. However, that is not critical for our purposes today. We simply need at least one of the array jobs to complete (several will) for us to explore the output.
The output of the script above are directories per bin that contain 5 files:
predicted_results.txt
the main output with gene ID and the predicted signal peptide ("Other" means none predicted), the prediction probability (i.e., confidence), and the cleavage siteoutput.gff3
A GFF3-format output of sequences with start and end positions of signal peptides relative to each amino acid sequenceoutput.json
Similar to above, with more information about run parametersprocessed_entries.fasta
Amino acid sequences with the signal peptide cleaved off (i.e., predicted mature proteins)region_output.gff3
Signal peptides have specific regions characteristic of each type of signal peptide. These usually inform us about cleavage sites for those proteins. This output provides this information.
--format none
If you choose to run the above script with other formats (e.g., txt, png, all), per-sequence data outputs and prediction plots (if requested) will be produced. For large numbers of amino acid sequences, this can be problematic as it uses up a lot of space. Unless you're needing to query specific parts of a few proteins, it is advised to stick to "none" to output summaries only.
Evaluating the quality of gene assignment¶
Determining how trustworthy a gene annotation is can be a very tricky process. How similar do protein sequences need to be to perform the same function? The answer is surprisingly low. A bioinformatic analysis performed in 1999 identified that proteins with as little as 20 - 35% sequence identity can still share the same function (Rost, 1999), but this is not a universal occurrence. When evaluating annotations, consider the following questions:
- What is the amino acid identity along the aligned region?
- What is the amino acid similarity between the aligned region?
- What is the coverage as a percentage of the query and target genes?
- If we infer a phylogeny of this query gene with references from the target family, is a stable tree resolved?
- Does the inclusion of this gene function make sense in the context of the organism's taxonomy?
- Does the gene sit on a long contig that is core to the MAG, or is it a short contig carrying only a single gene?
- If we are uncertain of a particular annotation, does the predicted gene occur in an operon? If so, are the other genes present in the annotation?
We must also remain aware of the potential for incorrectly annotated genes in the annotation database and that proteins can perform multiple functions (and may therefore be attributed multiple, inconsistent annotations). Furthermore, it is also important to consider exactly which part of the target gene the alignment is happening across. There are several catalytic centers of enzymes, such as the Fe-S cofactor, which are shared across many different proteins, and if your annotation is only spanning one of these regions then it may simply be the case that you are identifying a generic electron accepting or donating domain.
Differences in taxonomies¶
Another way to determine if an annotation 'belongs' in the MAG of interest is to consider the predicted taxonomy of the query gene with that of the MAG itself. For example, if you detect a Desulfovibrio-like dsrA sequence in a bin that has been classified as belonging to the genus Desulfovibrio then it is probably a safe bet that the annotation is correct.
However, when comparing taxonomic assignments, it is important to be aware of the differing taxonomic schemas that are circulating in the microbiological and bioinformatic literature and to know how to reconcile their differences. Similar to how the 16S rRNA gene taxonomies provided by SILVA, Greengenes, and RDP taxonomies all differ in some aspects, there are multiple competing taxonomies in protein databases.
This problem exists despite the existence of a formal Code for the naming of bacteria and archaea, because
- There are no rules governing how we define the grouping of these names together, other than for type species
- Defunct synonyms and basonyms are not correctly purged from taxonomy lists (this is quite noticeable with the NCBI taxonomy)
- Valid names cannot be assigned for uncultivated organisms, meaning there are many informal placeholder names in the literature. For example, clades like WPS-2, SAR324, and SAUL are widely cited in the literature despite having no official standing
It is therefore important to periodically sanity check your taxonomic annotations in order to avoid splitting taxa based on spelling differences or the use of historic names that have since been reclassified.