4.4 - Protein annotation with BLAST-like methods¶
time
- Teaching: 20 minutes
- Teaching: 30 minutes
Objectives and Key points
Objectives¶
- Know how to run a protein sequence classification using
BLASTp - Know how to run a protein sequence classification using
diamond - Understand the differences between these tools and when one might be more appropriate than the other.
Refresher on the BLAST algorithm¶
We have covered the major concepts behind the BLAST approach to sequence classification in level 1 training (materials here) so will not be covering this in detail today.
However, as a general refresher, remember that the BLAST algorithm approaches sequence classification by comparing small sections of a query sequence to a reference database, and extending the matches where they are found.
Review of BLAST steps
Where matches are found, BLAST then extends the ends of the seed one position at a time and assesses how well the seed continues to match the targets.
Matches and mismatches are recorded and the seed extenstion continues.
BLAST is also able to introduce insertions to preserve a match between query and target sequences.
In the level 1 training we applied this to nucleotide sequence data, and in practice there is very little difference BLAST-ing against nucleotide or protein sequences.
What's the difference?
Behind the scenes, BLAST uses different seed sizes for its initial matching, as protein sequences are typically shorter than nucleotide sequences and are more information-rich.
There is also a different weighting system applied when evaluating the match between query and target positions to account for whether or not the amino acid residues being compared are identical, or if their chemistry is similar. Similar to the case with BLASTn the scoring system is based on real-world observations of amino acid substitution rates, but is more complex to allow for the differing charge, size, and polarity of the amino acids.
Annotating proteins using BLASTp¶
As already mentioned, performing BLAST for protein sequences is very similar to the case for nucleotide sequences. The two differences are that we need to substitute the blastn exectuable with the blastp version, and point out command towards a reference database of protein sequences. We still load the same modules are for a nucleotide BLAST, there are just a few parts of the slurm script which need to change from the last time we ran this job.
Using BLAST databases
If you recall the level 1 traning, you will remember that NeSI periodically download and store NCBI BLAST databases in a common location, accessed through the BLASTDB/YYYY-MM module series.
Protein BLAST searches are significantly slower than nucleotide searches, so we will not be using this resource today as the queue and run times for the jobs we would need to run are not practical for the workshop today. We will instead use a local version of the UniProt SwissProt database which is much smaller.
Navigate to the /nesi/project/nesi03181/phel/<username>/level2/annotation_protein/ directory and prepare the following slurm script:
level2_blast.sl
#!/bin/bash -e
#SBATCH --account nesi03181
#SBATCH --job-name level2_blast
#SBATCH --time 00:20:00
#SBATCH --cpus-per-task 10
#SBATCH --mem 20G
#SBATCH --error level2_blast.%j.err
#SBATCH --output level2_blast.%j.out
#SBATCH --mail-type END
#SBATCH --mail-user YOUR_EMAIL
module purge
module load BLAST/2.13.0-GCC-11.3.0
cd /nesi/project/nesi03181/phel/<username>/level2/annotation_protein/
blastp -num_threads ${SLURM_CPUS_PER_TASK} -max_target_seqs 10 -evalue 1e-3 -outfmt 6 \
-db /nesi/project/nesi03181/phel/databases/swissprot_blastp/uniprot_sprot \
-query input/input_seqs.faa -out outputs/blastp.txt
Submit this job to slurm:
Accelerating protein searches using diamond¶
Given the time required to produce results with the BLASTp approach, a lot of work has been conducted to speed up the results of protein alignment searches. A very useful alternative to BLAST when working with protein data, or when trying to perform translated (nucleotide queries, protein targets) searches is diamond (Buchfink et al., 2021)
Differences in diamond versions
In previous releases, diamond gave faster results at the cost of sensitivity but the more recent release of the tool purports comparable sensitivity with BLASTp at a greatly reduced run time. See Figure 1 in the manuscript above for comparisons of sensitivity and speed between more recent releases of diamond and BLASTp.
Producing a diamond database
Unfortunately, diamond requires us to create our own classification database - there is no NCBI-supported release. However, diamond has a built-in method to modify an existing BLAST database to be compatible with diamond.
You will probably not ever need to perform this operation yourself, but the database we are using today was produced using the following commands:
With our database in place, the command for running diamond is very similar (by design) to BLASTp:
level2_blast.sl
#!/bin/bash -e
#SBATCH --account nesi03181
#SBATCH --job-name level2_diamond
#SBATCH --time 00:10:00
#SBATCH --cpus-per-task 16
#SBATCH --mem 20G
#SBATCH --error level2_diamond.%j.err
#SBATCH --output level2_diamond.%j.out
#SBATCH --mail-type END
#SBATCH --mail-user YOUR_EMAIL
module purge
module load DIAMOND/2.1.6-GCC-11.3.0
cd /nesi/project/nesi03181/phel/<username>/level2/annotation_protein/
diamond blastp --threads ${SLURM_CPUS_PER_TASK} --ultra-sensitive --max-target-seqs 10 --evalue 1e-3 --outfmt 6 \
--db /nesi/project/nesi03181/phel/databases/swissprot_dmnd/uniprot_sprot.dmnd \
--query input/input_seqs.faa --out outputs/diamond.txt
Submit this job to slurm:
Comparing the outputs¶
In both of the jbos above, we specified the tab-delimited BLAST6 output format, which is a table of the form;
| Column header | Meaning |
|---|---|
| qseqid | Sequence ID of the query sequence (input file) |
| sseqid | Sequence ID of the target sequence (reference database) |
| pident | Percentage of identical positions between query and target |
| length | Alignment length (sequence overlap) of the common region between query and target |
| mismatch | Number of mismatches between query and target |
| gapopen | Number of gap openings in the alignment |
| qstart | Position in the query sequence where alignment begins |
| qend | Position in the query sequence where alignment ends |
| sstart | Position in the target sequence where alignment begins |
| send | Position in the target sequence where alignment ends |
| evalue | The E-value for the query/target match, as described above |
| bitscore | The bit score for the query/target match, as described above |
As there are around 3,000 proteins in our input file which were classified, we are just going to focus on a single sequence and how it's annotation differs between BLASTp and diamond.
Exercise
Use your knowledge the command line to print all results for the query sequence 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 in both output files. You can either keep the results on the command line, or redirect them into files for analysis.
Solution
grep "46ee54c6-57d2-2d22-f454-9a5737db31e7_1" outputs/blastp.txt
grep "46ee54c6-57d2-2d22-f454-9a5737db31e7_1" outputs/diamond.txt
outputs/blastp.txt
| qseqid | sseqid | pident | length | mismatch | gapopen | qstart | qend | sstart | send | evalue | bitscore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P17779|RDRP_PVXX3 | 100.000 | 719 | 0 | 0 | 1 | 719 | 240 | 958 | 0.0 | 1499 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P09395|RDRP_PVX | 97.218 | 719 | 20 | 0 | 1 | 719 | 240 | 958 | 0.0 | 1462 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|Q07630|RDRP_PVXHB | 83.449 | 719 | 119 | 0 | 1 | 719 | 240 | 958 | 0.0 | 1256 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P22591|RDRP_PVXCP | 82.476 | 719 | 126 | 0 | 1 | 719 | 240 | 958 | 0.0 | 1237 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P20951|RDRP_PMV | 48.563 | 348 | 174 | 4 | 370 | 717 | 699 | 1041 | 2.79e-97 | 332 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P20951|RDRP_PMV | 41.489 | 188 | 110 | 0 | 1 | 188 | 230 | 417 | 7.28e-39 | 159 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P09498|RDRP_WCMVM | 50.140 | 357 | 168 | 6 | 364 | 719 | 446 | 793 | 3.73e-96 | 327 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P09498|RDRP_WCMVM | 40.556 | 180 | 104 | 2 | 14 | 193 | 235 | 411 | 4.56e-30 | 131 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P15402|RDRP_WCMVO | 49.860 | 357 | 169 | 6 | 364 | 719 | 446 | 793 | 2.67e-94 | 322 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P15402|RDRP_WCMVO | 38.776 | 196 | 116 | 3 | 15 | 210 | 236 | 427 | 1.67e-29 | 129 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|Q07518|RDRP_P1AMV | 47.238 | 362 | 180 | 8 | 363 | 719 | 531 | 886 | 1.71e-88 | 306 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|Q07518|RDRP_P1AMV | 41.765 | 170 | 96 | 3 | 14 | 182 | 236 | 403 | 7.82e-31 | 134 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|Q918W3|RDRP_ICRSV | 48.986 | 345 | 171 | 5 | 375 | 719 | 821 | 1160 | 1.13e-87 | 305 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|Q918W3|RDRP_ICRSV | 26.070 | 257 | 175 | 4 | 14 | 268 | 242 | 485 | 1.60e-20 | 100 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P28897|RDRP_SMYEA | 48.387 | 341 | 170 | 6 | 380 | 719 | 493 | 828 | 1.08e-85 | 298 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P28897|RDRP_SMYEA | 32.524 | 206 | 129 | 3 | 15 | 211 | 254 | 458 | 1.05e-24 | 114 |
outputs/diamond.txt
| qseqid | sseqid | pident | length | mismatch | gapopen | qstart | qend | sstart | send | evalue | bitscore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P17779|RDRP_PVXX3 | 100 | 719 | 0 | 0 | 1 | 719 | 240 | 958 | 0.0 | 1425 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P09395|RDRP_PVX | 97.2 | 719 | 20 | 0 | 1 | 719 | 240 | 958 | 0.0 | 1390 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|Q07630|RDRP_PVXHB | 83.4 | 719 | 119 | 0 | 1 | 719 | 240 | 958 | 0.0 | 1197 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P22591|RDRP_PVXCP | 82.5 | 719 | 126 | 0 | 1 | 719 | 240 | 958 | 0.0 | 1179 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P20951|RDRP_PMV | 34.8 | 819 | 425 | 15 | 1 | 717 | 230 | 1041 | 8.77e-125 | 409 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P09498|RDRP_WCMVM | 37.3 | 707 | 294 | 12 | 14 | 719 | 235 | 793 | 5.69e-117 | 384 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P15402|RDRP_WCMVO | 36.7 | 706 | 298 | 12 | 15 | 719 | 236 | 793 | 7.69e-115 | 378 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|Q07518|RDRP_P1AMV | 38.5 | 716 | 365 | 23 | 14 | 719 | 236 | 886 | 4.72e-114 | 377 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P28897|RDRP_SMYEA | 34.2 | 710 | 323 | 13 | 15 | 717 | 254 | 826 | 5.27e-103 | 346 |
| 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 | sp|P15095|RDRP_NMV | 32.2 | 860 | 419 | 23 | 15 | 719 | 243 | 1093 | 5.58e-100 | 340 |
Comparing the outputs, you can see that diamond produced slightly fewer hits than BLASTp, but the top four results for both methods are the same:
| Target | BLASTpIdentity |
E-value |
Bitscore |
diamondIdentity |
E-value |
Bitscore |
|---|---|---|---|---|---|---|
| sp|P17779|RDRP_PVXX3 | 100.000 | 0 | 1,499 | 100.0 | 0 | 1,425 |
| sp|P09395|RDRP_PVX | 97.218 | 0 | 1,462 | 97.2 | 0 | 1,390 |
| sp|Q07630|RDRP_PVXHB | 83.449 | 0 | 1,256 | 83.4 | 0 | 1,197 |
| sp|P22591|RDRP_PVXCP | 82.476 | 0 | 1,237 | 82.5 | 0 | 1,179 |
We can check the complete annotation of this protein to determine its function and organism from which the reference sequence originated.
Exercise
Search the top result(s) through the UniProt search tool and report the most likely function of this sequence, and the organism that it was obtained from.
Solution
| Sequence | Function | Organism |
|---|---|---|
| sp|P17779|RDRP_PVXX3 | RNA-directed RNA polymerase | Potato virus X (strain X3) |
| sp|P09395|RDRP_PVX | RNA-directed RNA polymerase | Potato virus X |
| sp|Q07630|RDRP_PVXHB | RNA-directed RNA polymerase | Potato virus X |
| sp|P22591|RDRP_PVXCP | RNA-directed RNA polymerase | Potato virus X (strain CP) |



