Skip to content

APPENDIX (ex11) : Viral taxonomy prediction via vContact2

Prior to running this, you need to have run VirSorter2 and CheckV as indicated here.

1. Predict genes via prodigal-gv

Prodigal-gv is a variant of prodigal that has been modified to improve gene calling for giant viruses and viruses that use alternative genetic codes. More information is available here.

Note

If you have already run DRAM-v to annotate viral genes, DRAM-v runs prodigal-gv in the background, so you can also use the genes.faa file generated by DRAM-v as input for vConTACT2 here.

code

# Navigate to working directory
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses

# Create output directory
mkdir -p viral_taxonomy

# Create combined nucleotide sequence file
cat checkv_out/*.fna > checkv_combined.fna

Example slurm script:

code

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      prodigal
#SBATCH --partition     milan
#SBATCH --time          00:05:00
#SBATCH --mem           1GB
#SBATCH --cpus-per-task 2
#SBATCH --error         %x_%j.err
#SBATCH --output        %x_%j.out


# Load dependencies
module purge
module load prodigal-gv/2.9.0-GCC-11.3.0

# Set up working directories
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses

# Run Prodigal-gv to predict viral genes 
prodigal-gv -p meta -q \
         -i checkv_combined.fna \
         -a viral_taxonomy/checkv_combined.faa 

Deprecation warning

When running the above, you might get a warning:

/opt/conda/bin/vcontact2_gene2genome:174: DeprecationWarning: 'U' mode is deprecated

This is okay and will not impact the outcome.

2. Generate required mapping file for vContact2

Use vContact2's vcontact2_gene2genome script to generate the required mapping file from the output of prodigal.

code

# Ensure you are in the correct working directory
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses

# Load modules
module purge
module unload XALT
module load Apptainer/1.2.5 \
        MCL/14.137-gimkl-2020a \
        DIAMOND/2.1.9-GCC-11.3.0

# All NeSI Filesystems are pre-bound to APPTAINER_BIND
container=/opt/nesi/containers/vContact2

# Run script
apptainer run $container/vcontact2.simg \
vcontact2_gene2genome --proteins viral_taxonomy/checkv_combined.faa \
                      --output viral_taxonomy/viral_genomes_g2g.csv \
                      -s 'Prodigal-FAA'

3. Run vContact2

code

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      vConTACT2
#SBATCH --partition     milan
#SBATCH --time          02:00:00
#SBATCH --mem           20GB
#SBATCH --cpus-per-task 16
#SBATCH --error         %x_%j.err
#SBATCH --output        %x_%j.out

# Working directory
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses

# Load modules
module purge
module unload XALT
module load Apptainer/1.2.5 \
        MCL/14.137-gimkl-2020a \
        DIAMOND/2.1.9-GCC-11.3.0

# Bind paths
container=/opt/nesi/containers/vContact2

# Run vConTACT2
apptainer run $container/vcontact2.simg \
vcontact2 --raw-proteins viral_taxonomy/checkv_combined.faa \
          --rel-mode Diamond \
          --threads $SLURM_CPUS_PER_TASK \
          --proteins-fp viral_taxonomy/viral_genomes_g2g.csv \
          --db 'ProkaryoticViralRefSeq201-Merged' \
          --c1-bin /opt/conda/bin/cluster_one-1.0.jar \
          --output-dir vConTACT2_Results

4. Predict taxonomy of viral contigs based on output of vContact2

vContact2 doesn't actually assign taxonomy to your input viral contigs. It instead provides an output outlining which reference viral genomes your viral contigs clustered with (if they clustered with any at all). Based on how closely they clustered with any reference genome(s), you can then use this to predict the likely taxonomy of the contig.

From the vContact2 online docs:

One important note is that the taxonomic information is not included for user sequences. This means that each user will need to find their genome(s) of interest and check to see if reference genomes are located in the same VC. If the user genome is within the same VC sub-cluster as a reference genome, then there's a very high probability that the user genome is part of the same genus. If the user genome is in the same VC but not the same sub-cluster as a reference, then it's highly likely the two genomes are related at roughly genus-subfamily level. If there are no reference genomes in the same VC or VC sub-cluster, then it's likely that they are not related at the genus level at all.

The summary output of vContact2 is the file vConTACT2_Results/genome_by_genome_overview.csv. As the comment above notes, one approach would be to search this file for particular contigs of interest, and see if any reference genomes fall into the same viral cluster (VC), using this reference to predict the taxonomy of the contig of interest.

The following python script is effectively an automated version of this for all input contigs (Note: this script has not been widely tested, and so should be used with some degree of caution). This script groups together contigs (and reference genomes) that fall into each VC, and then for each, outputs a list of all taxonomies (at the ranks of 'Order', 'Family', and 'Genus', separately) that were found in that cluster. The predictions (i.e. the list of all taxonomies found in the same VC) for each rank and each contig is output to the table tax_predict_table.txt.

Note

The taxonomies are deliberately enclosed in square brackets ([ ]) to highlight the fact that these are predictions, rather than definitive taxonomy assignments.

For future reference, a copy of this script is available for download here

Note

This script was written for vConTACT2_0.9.19 and vConTACT2_0.11.3, and has not been tested with later versions.

A comparable script for earlier versions of vConTACT2 is available here

code

module purge
module load Python/3.11.6-foss-2023a

cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses/

./tax_predict_vConTACT2_0.9.19.py \
-i vConTACT2_Results/genome_by_genome_overview.csv \
-o vConTACT2_Results/