4.4 - Protein annotation with `BLAST`-like methods¶

time

Teaching: 20 minutes
Teaching: 30 minutes

Objectives and Key points

Objectives¶

Know how to run a protein sequence classification using BLASTp
Know how to run a protein sequence classification using diamond
Understand the differences between these tools and when one might be more appropriate than the other.

Refresher on the `BLAST` algorithm¶

We have covered the major concepts behind the BLAST approach to sequence classification in level 1 training (materials here) so will not be covering this in detail today.

However, as a general refresher, remember that the BLAST algorithm approaches sequence classification by comparing small sections of a query sequence to a reference database, and extending the matches where they are found.

Review of BLAST steps

Where matches are found, BLAST then extends the ends of the seed one position at a time and assesses how well the seed continues to match the targets.

Matches and mismatches are recorded and the seed extenstion continues.

BLAST is also able to introduce insertions to preserve a match between query and target sequences.

In the level 1 training we applied this to nucleotide sequence data, and in practice there is very little difference BLAST-ing against nucleotide or protein sequences.

What's the difference?

Behind the scenes, BLAST uses different seed sizes for its initial matching, as protein sequences are typically shorter than nucleotide sequences and are more information-rich.

There is also a different weighting system applied when evaluating the match between query and target positions to account for whether or not the amino acid residues being compared are identical, or if their chemistry is similar. Similar to the case with BLASTn the scoring system is based on real-world observations of amino acid substitution rates, but is more complex to allow for the differing charge, size, and polarity of the amino acids.

Annotating proteins using `BLASTp`¶

As already mentioned, performing BLAST for protein sequences is very similar to the case for nucleotide sequences. The two differences are that we need to substitute the blastn exectuable with the blastp version, and point out command towards a reference database of protein sequences. We still load the same modules are for a nucleotide BLAST, there are just a few parts of the slurm script which need to change from the last time we ran this job.

Using BLAST databases

If you recall the level 1 traning, you will remember that NeSI periodically download and store NCBI BLAST databases in a common location, accessed through the BLASTDB/YYYY-MM module series.

Protein BLAST searches are significantly slower than nucleotide searches, so we will not be using this resource today as the queue and run times for the jobs we would need to run are not practical for the workshop today. We will instead use a local version of the UniProt SwissProt database which is much smaller.

Navigate to the /nesi/project/nesi03181/phel/<username>/level2/annotation_protein/ directory and prepare the following slurm script:

level2_blast.sl

#!/bin/bash -e
#SBATCH --account       nesi03181
#SBATCH --job-name      level2_blast
#SBATCH --time          00:20:00
#SBATCH --cpus-per-task 10
#SBATCH --mem           20G
#SBATCH --error         level2_blast.%j.err
#SBATCH --output        level2_blast.%j.out
#SBATCH --mail-type     END
#SBATCH --mail-user     YOUR_EMAIL

module purge
module load BLAST/2.13.0-GCC-11.3.0

cd /nesi/project/nesi03181/phel/<username>/level2/annotation_protein/

blastp -num_threads ${SLURM_CPUS_PER_TASK} -max_target_seqs 10 -evalue 1e-3 -outfmt 6 \
    -db /nesi/project/nesi03181/phel/databases/swissprot_blastp/uniprot_sprot \
    -query input/input_seqs.faa -out outputs/blastp.txt

Submit this job to slurm:

code

sbatch level2_blast.sl

Output

Submitted batch job XXXXXXXX

Accelerating protein searches using `diamond`¶

Given the time required to produce results with the BLASTp approach, a lot of work has been conducted to speed up the results of protein alignment searches. A very useful alternative to BLAST when working with protein data, or when trying to perform translated (nucleotide queries, protein targets) searches is diamond (Buchfink et al., 2021)

Differences in diamond versions

In previous releases, diamond gave faster results at the cost of sensitivity but the more recent release of the tool purports comparable sensitivity with BLASTp at a greatly reduced run time. See Figure 1 in the manuscript above for comparisons of sensitivity and speed between more recent releases of diamond and BLASTp.

Producing a diamond database

Unfortunately, diamond requires us to create our own classification database - there is no NCBI-supported release. However, diamond has a built-in method to modify an existing BLAST database to be compatible with diamond.

You will probably not ever need to perform this operation yourself, but the database we are using today was produced using the following commands:

code

module purge
module load BLASTDB/2023-10 DIAMOND/2.1.6-GCC-11.3.0

mkdir /nesi/project/nesi03181/phel/databases/diamond_nr/
cd /nesi/project/nesi03181/phel/databases/diamond_nr/

cp ${BLASTDB}/nr* diamond_nr/
diamond prepdb -d diamond_nr/nr

With our database in place, the command for running diamond is very similar (by design) to BLASTp:

level2_blast.sl

#!/bin/bash -e
#SBATCH --account       nesi03181
#SBATCH --job-name      level2_diamond
#SBATCH --time          00:10:00
#SBATCH --cpus-per-task 16
#SBATCH --mem           20G
#SBATCH --error         level2_diamond.%j.err
#SBATCH --output        level2_diamond.%j.out
#SBATCH --mail-type     END
#SBATCH --mail-user     YOUR_EMAIL

module purge
module load DIAMOND/2.1.6-GCC-11.3.0

cd /nesi/project/nesi03181/phel/<username>/level2/annotation_protein/

diamond blastp --threads ${SLURM_CPUS_PER_TASK} --ultra-sensitive --max-target-seqs 10 --evalue 1e-3 --outfmt 6 \
    --db /nesi/project/nesi03181/phel/databases/swissprot_dmnd/uniprot_sprot.dmnd \
    --query input/input_seqs.faa --out outputs/diamond.txt

Submit this job to slurm:

code

sbatch level2_blast.sl

Output

Submitted batch job XXXXXXXX

Comparing the outputs¶

In both of the jbos above, we specified the tab-delimited BLAST6 output format, which is a table of the form;

Column header	Meaning
qseqid	Sequence ID of the query sequence (input file)
sseqid	Sequence ID of the target sequence (reference database)
pident	Percentage of identical positions between query and target
length	Alignment length (sequence overlap) of the common region between query and target
mismatch	Number of mismatches between query and target
gapopen	Number of gap openings in the alignment
qstart	Position in the query sequence where alignment begins
qend	Position in the query sequence where alignment ends
sstart	Position in the target sequence where alignment begins
send	Position in the target sequence where alignment ends
evalue	The E-value for the query/target match, as described above
bitscore	The bit score for the query/target match, as described above

As there are around 3,000 proteins in our input file which were classified, we are just going to focus on a single sequence and how it's annotation differs between BLASTp and diamond.

Exercise

Use your knowledge the command line to print all results for the query sequence 46ee54c6-57d2-2d22-f454-9a5737db31e7_1 in both output files. You can either keep the results on the command line, or redirect them into files for analysis.

Solution

grep "46ee54c6-57d2-2d22-f454-9a5737db31e7_1" outputs/blastp.txt 
grep "46ee54c6-57d2-2d22-f454-9a5737db31e7_1" outputs/diamond.txt

outputs/blastp.txt

qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P17779\|RDRP_PVXX3	100.000	719	0	0	1	719	240	958	0.0	1499
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P09395\|RDRP_PVX	97.218	719	20	0	1	719	240	958	0.0	1462
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|Q07630\|RDRP_PVXHB	83.449	719	119	0	1	719	240	958	0.0	1256
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P22591\|RDRP_PVXCP	82.476	719	126	0	1	719	240	958	0.0	1237
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P20951\|RDRP_PMV	48.563	348	174	4	370	717	699	1041	2.79e-97	332
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P20951\|RDRP_PMV	41.489	188	110	0	1	188	230	417	7.28e-39	159
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P09498\|RDRP_WCMVM	50.140	357	168	6	364	719	446	793	3.73e-96	327
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P09498\|RDRP_WCMVM	40.556	180	104	2	14	193	235	411	4.56e-30	131
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P15402\|RDRP_WCMVO	49.860	357	169	6	364	719	446	793	2.67e-94	322
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P15402\|RDRP_WCMVO	38.776	196	116	3	15	210	236	427	1.67e-29	129
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|Q07518\|RDRP_P1AMV	47.238	362	180	8	363	719	531	886	1.71e-88	306
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|Q07518\|RDRP_P1AMV	41.765	170	96	3	14	182	236	403	7.82e-31	134
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|Q918W3\|RDRP_ICRSV	48.986	345	171	5	375	719	821	1160	1.13e-87	305
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|Q918W3\|RDRP_ICRSV	26.070	257	175	4	14	268	242	485	1.60e-20	100
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P28897\|RDRP_SMYEA	48.387	341	170	6	380	719	493	828	1.08e-85	298
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P28897\|RDRP_SMYEA	32.524	206	129	3	15	211	254	458	1.05e-24	114

outputs/diamond.txt

qseqid	sseqid	pident	length	mismatch	gapopen	qstart	qend	sstart	send	evalue	bitscore
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P17779\|RDRP_PVXX3	100	719	0	0	1	719	240	958	0.0	1425
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P09395\|RDRP_PVX	97.2	719	20	0	1	719	240	958	0.0	1390
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|Q07630\|RDRP_PVXHB	83.4	719	119	0	1	719	240	958	0.0	1197
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P22591\|RDRP_PVXCP	82.5	719	126	0	1	719	240	958	0.0	1179
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P20951\|RDRP_PMV	34.8	819	425	15	1	717	230	1041	8.77e-125	409
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P09498\|RDRP_WCMVM	37.3	707	294	12	14	719	235	793	5.69e-117	384
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P15402\|RDRP_WCMVO	36.7	706	298	12	15	719	236	793	7.69e-115	378
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|Q07518\|RDRP_P1AMV	38.5	716	365	23	14	719	236	886	4.72e-114	377
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P28897\|RDRP_SMYEA	34.2	710	323	13	15	717	254	826	5.27e-103	346
46ee54c6-57d2-2d22-f454-9a5737db31e7_1	sp\|P15095\|RDRP_NMV	32.2	860	419	23	15	719	243	1093	5.58e-100	340

Comparing the outputs, you can see that diamond produced slightly fewer hits than BLASTp, but the top four results for both methods are the same:

Target	`BLASTp` Identity	Bitscore	`diamond` Identity	Bitscore
sp\|P17779\|RDRP_PVXX3	100.000	1,499	100.0	1,425
sp\|P09395\|RDRP_PVX	97.218	1,462	97.2	1,390
sp\|Q07630\|RDRP_PVXHB	83.449	1,256	83.4	1,197
sp\|P22591\|RDRP_PVXCP	82.476	1,237	82.5	1,179

We can check the complete annotation of this protein to determine its function and organism from which the reference sequence originated.

Exercise

Search the top result(s) through the UniProt search tool and report the most likely function of this sequence, and the organism that it was obtained from.

Solution

Sequence	Function	Organism
sp\|P17779\|RDRP_PVXX3	RNA-directed RNA polymerase	Potato virus X (strain X3)
sp\|P09395\|RDRP_PVX	RNA-directed RNA polymerase	Potato virus X
sp\|Q07630\|RDRP_PVXHB	RNA-directed RNA polymerase	Potato virus X
sp\|P22591\|RDRP_PVXCP	RNA-directed RNA polymerase	Potato virus X (strain CP)

4.4 - Protein annotation with BLAST-like methods¶

Objectives¶

Refresher on the BLAST algorithm¶

Annotating proteins using BLASTp¶

Accelerating protein searches using diamond¶

Comparing the outputs¶

4.4 - Protein annotation with `BLAST`-like methods¶

Refresher on the `BLAST` algorithm¶

Annotating proteins using `BLASTp`¶

Accelerating protein searches using `diamond`¶