Identifying viral contigs in metagenomic data

Objectives

Identifying viral contigs
Identifying viral contigs using VirSorter2

Identifying viral contigs

Viral metagenomics is a rapidly progressing field, and new software is constantly being developed and released to better identify and characterise viral genomic sequences from assembled metagenomic sequence reads.

Currently, the most commonly used tools are VirSorter2, VIBRANT, and VirFinder (or its deep learning successor, DeepVirFinder). A number of recent studies use one of these tools or a combination of several at once.

VirSorter2

Uses a reference database approach based on predicted protein homology, combined with a number of pre-defined metrics derived from known viral genomic features. VirSorter2 covers dsDNA phages, ssDNA viruses, and RNA viruses, as well as the viral groups Nucleocytoviricota and Lavidaviridae.

More info

VIBRANT

Uses a machine learning approach based on protein similarity (non-reference-based similarity searches with multiple HMM sets), and is in principle applicable to bacterial and archaeal DNA and RNA viruses, integrated proviruses (which are excised from contigs by VIBRANT), and eukaryotic viruses.

More info

DeepVirFinder

Uses a machine learning approach based on k-mer frequencies. Having developed a database of the differences in k-mer frequencies between prokaryote and viral genomes, VirFinder examines assembled contigs and identifies whether their k-mer frequencies are comparable to those of known viruses in the database, using this to predict viral genomic sequence. This method has some limitations arising from the viruses that were included when building the database (bacterial DNA viruses, but very few archaeal viruses, and, at least in some versions of the software, no eukaryotic viruses). However, tools are also provided to build your own database should you wish to develop an expanded one. Due to its distinctive k-mer frequency-based approach, VirFinder may also be able to identify some novel viruses overlooked by tools such as VIBRANT or VirSorter2.

More info

DeepVirFinder GitHub
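To make the idea of k-mer frequencies concrete, here is a minimal sketch that tallies overlapping k-mers in a single DNA string. It is illustrative only and is not how VirFinder or DeepVirFinder are implemented; `kmer_counts` is a hypothetical helper name introduced for this example.

```shell
# Illustrative only: tally overlapping k-mers in one DNA string.
# kmer_counts is a hypothetical helper, not part of any tool named above.
# Usage: kmer_counts SEQUENCE K
# Output: k-mer, count, and relative frequency (tab-separated, sorted by k-mer).
kmer_counts() {
  awk -v s="$1" -v k="$2" 'BEGIN {
    s = toupper(s)
    n = length(s) - k + 1          # number of overlapping windows
    for (i = 1; i <= n; i++) count[substr(s, i, k)]++
    for (kmer in count)
      printf "%s\t%d\t%.3f\n", kmer, count[kmer], count[kmer] / n
  }' | sort
}
```

For example, `kmer_counts ACGTACGT 2` reports the 2-mers AC, CG, and GT twice each and TA once. A classifier based on this idea compares such frequency profiles between contigs and reference genomes rather than looking for specific genes.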

Identifying viral contigs using `VirSorter2`

For this exercise, we will use VirSorter2 to identify viral contigs from our assembled contigs. We can also use VirSorter2 to prepare files for later use with the gene annotation tool DRAM-v, which we'll run later in the day.

Checking quality and estimating completeness of the viral contigs via `CheckV`

CheckV was developed as an analogue to CheckM. CheckV first performs a 'contaminating sequence' trim, removing any retained (prokaryote) host sequence from the ends of contigs that contain integrated prophage, and then assesses the quality and completeness of the assembled viral contigs. The quality of the contigs is also categorised based on the recently developed Minimum Information about an Uncultivated Virus Genome (MIUViG) standards for reporting sequences of uncultivated virus genomes (such as those recovered from metagenomic sequencing data). The MIUViG standards were developed as an extension of the Minimum Information about any (x) Sequence (MIxS) standards, which include, among others, standards for Metagenome-Assembled Genomes (MIMAG).
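Once CheckV has been run, the quality tiers described above can be tallied from its summary table. The sketch below assumes the `end_to_end` workflow has written a tab-separated `quality_summary.tsv` containing `checkv_quality` and `miuvig_quality` columns; `tally_column` is a hypothetical helper that looks the column up by its header name, so it does not depend on column order.

```shell
# Hypothetical helper: count rows per value of a named column
# in a tab-separated file that has a header line.
# Usage: tally_column FILE COLUMN_NAME
tally_column() {
  awk -F '\t' -v col="$2" '
    NR == 1 {
      # Find the requested column by its header name.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    { n[$idx]++ }
    END { for (v in n) printf "%s\t%d\n", v, n[v] }
  ' "$1" | sort
}
```

With the output directory used in this exercise, `tally_column checkv_out/quality_summary.tsv checkv_quality` would list how many contigs fall into each CheckV quality tier, and passing `miuvig_quality` instead would give the MIUViG categories.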

Run `VirSorter2` and `CheckV`

These exercises will take place in the 7.viruses/ folder.

Navigate to working directory

cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses/

For VirSorter2, we will input the assembled contigs from the SPAdes assembly we performed earlier. These assembly files have been copied to 7.viruses/spades_assembly/ for this exercise.

We will then run CheckV in the same script, providing the FASTA file of viral contigs output by VirSorter2 as input (mgss-final-viral-combined.fa; the mgss- prefix comes from the -l mgss label set in the VirSorter2 command).

Create a script named VirSorter2_and_checkv.sl

nano VirSorter2_and_checkv.sl

Remember to update <YOUR FOLDER> to your own folder.

code

#!/bin/bash -e
#SBATCH --account       nesi02659
#SBATCH --job-name      VirSorter2_and_checkv
#SBATCH --partition     milan
#SBATCH --time          05:00:00
#SBATCH --mem           2GB
#SBATCH --cpus-per-task 28
#SBATCH --error         %x_%j.err
#SBATCH --output        %x_%j.out

# Working directory
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/7.viruses/

# VirSorter2
## Load modules
module purge
module unload XALT
module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2

## Output directory
mkdir -p VirSorter2

## Run VirSorter2
virsorter run -j $SLURM_CPUS_PER_TASK \
  -i spades_assembly/spades_assembly.m1000.fna \
  -d /nesi/project/nesi02659/MGSS_2024/resources/databases/virsorter2_20210909 \
  --min-score 0.7 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae \
  --prep-for-dramv \
  -l mgss -w VirSorter2 \
  --tmpdir ${SLURM_JOB_ID}.tmp --rm-tmpdir \
  all \
  --config LOCAL_SCRATCH=${TMPDIR:-/tmp}

# CheckV
## Load module
module purge
module load CheckV/1.0.1-gimkl-2022a-Python-3.10.5

## Output directory
mkdir -p checkv_out

## Run CheckV
checkv end_to_end VirSorter2/mgss-final-viral-combined.fa checkv_out/ -t $SLURM_CPUS_PER_TASK

module unload XALT

For the VirSorter/2.2.3-gimkl-2020a-Python-3.8.2 NeSI module to work properly, we must also include module unload XALT in the script above.

VirSorter2 parameters

The key parameters you may want to consider altering for your own work are --min-score and --include-groups. For today's exercise we will include all available groups (--include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae) and set --min-score to 0.7. You can experiment with this value for your own data (see the VirSorter2 GitHub page for more information).
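One way to explore the effect of a stricter score cutoff without re-running VirSorter2 is to filter its score table after the fact. The sketch below is an assumption-laden example: it assumes the run produced a tab-separated score table (here expected at VirSorter2/mgss-final-viral-score.tsv) with a `max_score` column, and `filter_by_score` is a hypothetical helper name.

```shell
# Hypothetical helper: print the header plus every row whose max_score
# is at least MIN_SCORE. The max_score column name is an assumption
# about the VirSorter2 score table; it is located by header name.
# Usage: filter_by_score FILE MIN_SCORE
filter_by_score() {
  awk -F '\t' -v min="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "max_score") idx = i
      print
      next
    }
    idx && ($idx + 0) >= (min + 0)
  ' "$1"
}
```

For instance, `filter_by_score VirSorter2/mgss-final-viral-score.tsv 0.9 | tail -n +2 | wc -l` would count how many sequences survive a cutoff of 0.9, which can be compared against the count at the 0.7 used here.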

Submit the script as a Slurm job

sbatch VirSorter2_and_checkv.sl
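After the job has finished, a quick sanity check is to compare how many contigs went in with how many were reported as viral. The following is a minimal sketch; `count_fasta_records` is a hypothetical helper that simply counts FASTA header lines.

```shell
# Hypothetical helper: count sequences in a FASTA file
# by counting lines that start with '>'.
# Usage: count_fasta_records FILE
count_fasta_records() {
  grep -c '^>' "$1"
}
```

Using the paths from the script above, `count_fasta_records spades_assembly/spades_assembly.m1000.fna` gives the number of input contigs and `count_fasta_records VirSorter2/mgss-final-viral-combined.fa` gives the number of sequences VirSorter2 reported as viral. Note that `grep -c` prints 0 but returns a non-zero exit status when there are no matches, which matters in scripts run with `bash -e`.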

VirSorter2 for your own work

The required databases for VirSorter2 are not loaded with the NeSI module. For your own work, you will first need to download these databases and pass their location via the -d flag. For today's workshop this is already set up.
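As a rough sketch of that first step, the VirSorter2 documentation describes a `virsorter setup` subcommand for downloading the databases. The wrapper below is hypothetical (`setup_virsorter2_db` and the `DRY_RUN` switch are names introduced here for illustration); the database directory and thread count are placeholders to adapt to your own system.

```shell
# Hypothetical wrapper around `virsorter setup`.
# Set DRY_RUN=1 to print the command instead of running it.
# Usage: setup_virsorter2_db DB_DIR [THREADS]
setup_virsorter2_db() {
  db_dir="$1"
  threads="${2:-4}"    # default thread count is an arbitrary placeholder
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "virsorter setup -d $db_dir -j $threads"
  else
    virsorter setup -d "$db_dir" -j "$threads"
  fi
}
```

Running `DRY_RUN=1 setup_virsorter2_db /path/to/virsorter2_db 8` prints the command that would be executed. The same directory is then what you would supply to `-d` in the `virsorter run` command.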
