Identifying viral contigs in metagenomic data¶
Identifying viral contigs¶
Viral metagenomics is a rapidly progressing field, and new software is constantly being developed and released that aims to better identify and characterise viral genomic sequences from assembled metagenomic sequence reads.
Currently, the most commonly used methods are VirSorter2, VIBRANT, and VirFinder (or the deep learning implementation of this, DeepVirFinder). A number of recent studies use one of these tools or a combination of several at once.
VirSorter2 uses a reference database-based approach built on predicted protein homology, together with a search for a number of pre-defined metrics based on known viral genomic features. VirSorter2 includes dsDNA phages, ssDNA viruses, and RNA viruses, and the viral groups Nucleocytoviricota and Lavidaviridae.
VIBRANT uses a machine learning approach based on protein similarity (non-reference-based similarity searches with multiple HMM sets), and is in principle applicable to bacterial and archaeal DNA and RNA viruses, integrated proviruses (which VIBRANT excises from contigs), and eukaryotic viruses.
VirFinder uses a machine learning approach based on k-mer frequencies. Having built a database of the differences in k-mer frequencies between prokaryote and viral genomes, VirFinder examines assembled contigs and assesses whether their k-mer frequencies resemble those of known viruses in the database, using this to predict viral genomic sequence. This method is limited by the viruses that were included when building the database (bacterial DNA viruses, but very few archaeal viruses, and, at least in some versions of the software, no eukaryotic viruses). However, tools are also provided to build your own database should you wish to develop an expanded one. Because of its distinctive k-mer frequency-based approach, VirFinder may also be able to identify some novel viruses overlooked by tools such as VIBRANT or VirSorter2.
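To make the idea of a k-mer profile concrete, the sketch below counts the k-mers in a single short sequence. It is a toy illustration only: count_kmers is a hypothetical helper written for this page, the sequence is made up, and this is not how VirFinder itself is implemented or run.

```shell
#!/usr/bin/env bash
# Toy illustration: tabulate the k-mers found in one sequence string.
# count_kmers is a hypothetical helper, not part of VirFinder.
count_kmers() {
    local seq=$1 k=$2 i
    # Slide a window of width k along the sequence, one base at a time
    for ((i = 0; i <= ${#seq} - k; i++)); do
        echo "${seq:i:k}"
    done | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Made-up sequence: prints each distinct 4-mer and the number of times it occurs
count_kmers "ATGCGATGCA" 4
```

Dividing each count by the total number of k-mers would turn these counts into frequencies, which is the kind of profile a k-mer-based classifier compares between viral and prokaryote sequences.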
Identifying viral contigs using VirSorter2¶
For this exercise, we will use VirSorter2 to identify viral contigs from our assembled contigs. We can also use VirSorter2 to prepare files for later use with the gene annotation tool DRAM-v, which we'll run later in the day.
Checking quality and estimating completeness of the viral contigs via CheckV¶
CheckV was developed as an analogue to CheckM. CheckV first performs a 'contaminating sequence' trim, removing any retained (prokaryote) host sequence on the ends of contigs with integrated prophage, and then assesses the quality and completeness of the assembled viral contigs. The quality of the contigs is also categorised based on the recently developed Minimum Information about an Uncultivated Virus Genome (MIUViG) standards for reporting sequences of uncultivated virus genomes (such as those recovered from metagenomic sequencing data). The MIUViG standards were developed as an extension of the Minimum Information about any (x) Sequence (MIxS) standards, which include, among others, standards for Metagenome-Assembled Genomes (MIMAG).
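Once CheckV has finished, a quick way to see how the contigs are distributed across quality tiers is to tally one column of its summary table. The sketch below assumes a tab-separated table with a header row containing a miuvig_quality column (as in CheckV's quality_summary.tsv, to the best of our knowledge); tally_column is a hypothetical helper, and the three rows are made-up example data rather than real CheckV output.

```shell
#!/usr/bin/env bash
# Tally the values in one named column of a tab-separated table.
# tally_column is a hypothetical helper; it finds the column by its header
# name, so the exact column order in the table does not matter.
tally_column() {
    local table=$1 column=$2
    awk -F'\t' -v col="$column" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        c       { n[$c]++ }
        END     { for (q in n) print q "\t" n[q] }
    ' "$table" | sort
}

# Made-up example rows (NOT real CheckV output), reduced to three columns
toy=$(mktemp)
printf 'contig_id\tcheckv_quality\tmiuvig_quality\n'  > "$toy"
printf 'contig_1\tComplete\tHigh-quality\n'          >> "$toy"
printf 'contig_2\tLow-quality\tGenome-fragment\n'    >> "$toy"
printf 'contig_3\tLow-quality\tGenome-fragment\n'    >> "$toy"

# Number of contigs in each MIUViG quality category
tally_column "$toy" miuvig_quality
```

For real data, point tally_column at the summary table in your CheckV output folder; passing checkv_quality instead gives the tally for CheckV's own quality tiers.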
Run VirSorter2 and CheckV¶
These exercises will take place in the 7.viruses/ folder.
For VirSorter2, we will input the assembled contigs from the SPAdes assembly we performed earlier. These assembly files have been copied to 7.viruses/spades_assembly/ for this exercise.
We will then run CheckV in the same script, providing the FASTA file of viral contigs output by VirSorter2 (final-viral-combined.fa) as input.
Remember to update <YOUR FOLDER> to your own folder.
code
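A minimal sketch of what such a job script might look like is given here. It is not the workshop's own script: the Slurm resource values, the CheckV module name, the output folder names, and every path in angle brackets are assumptions that must be replaced with real values. Only the VirSorter2 module name, the --min-score and --include-groups settings, and the final-viral-combined.fa file name come from this page. The run_or_print helper is hypothetical; it prints each command instead of running it when the tool is not installed, so the sketch can be inspected on any machine.

```shell
#!/bin/bash -e
#SBATCH --job-name      virsorter2_checkv
#SBATCH --time          05:00:00        # placeholder value
#SBATCH --mem           4GB             # placeholder value
#SBATCH --cpus-per-task 16              # placeholder value

# Sketch only: every path, file name and resource value here is an assumption.
workdir="<YOUR FOLDER>/7.viruses"                 # edit to your own folder
contigs="spades_assembly/<ASSEMBLY FASTA>"        # hypothetical file name
vs2_db="<PATH TO VIRSORTER2 DATABASE>"            # hypothetical path
checkv_db="<PATH TO CHECKV DATABASE>"             # hypothetical path
threads="${SLURM_CPUS_PER_TASK:-16}"

# Load software only where an environment-modules system is present (e.g. NeSI)
if command -v module > /dev/null 2>&1; then
    module purge
    module unload XALT
    module load VirSorter/2.2.3-gimkl-2020a-Python-3.8.2
    module load CheckV                            # hypothetical module name
fi

# Move into the working folder if it exists
if [ -d "$workdir" ]; then cd "$workdir"; fi

# Run a command if its executable is on PATH; otherwise print it for inspection
run_or_print() {
    if command -v "$1" > /dev/null 2>&1; then "$@"; else echo "[not run] $*"; fi
}

# Step 1: identify viral contigs with VirSorter2 (output folder name is assumed)
vs2_cmd=(virsorter run -j "$threads" -i "$contigs" -d "$vs2_db" -w VirSorter2
         --min-score 0.7 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae all)
run_or_print "${vs2_cmd[@]}"

# Step 2: assess quality and completeness of those contigs with CheckV
checkv_cmd=(checkv end_to_end VirSorter2/final-viral-combined.fa checkv_out
            -t "$threads" -d "$checkv_db")
run_or_print "${checkv_cmd[@]}"
```

Running both steps in one script means CheckV only starts once VirSorter2 has written its combined FASTA file, which is why the second command reads from the first command's output folder.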
module unload XALT
For the VirSorter/2.2.3-gimkl-2020a-Python-3.8.2 NeSI module to work properly, we must also include module unload XALT in the script above.
VirSorter2 parameters
The key parameters you may want to consider altering for your own work are --min-score and --include-groups. For today's exercise we will include all available groups (--include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae), and will set --min-score to 0.7. You can experiment with this value for your own data (see the VirSorter2 GitHub page for more information).
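One way to explore stricter score cut-offs without rerunning VirSorter2 is to filter its score table afterwards. The sketch below assumes a tab-separated table with a header row containing seqname and max_score columns (as we understand VirSorter2's final-viral-score.tsv to have); filter_by_score is a hypothetical helper and the rows are made-up example data.

```shell
#!/usr/bin/env bash
# List sequences whose max_score is at or above a chosen threshold.
# filter_by_score is a hypothetical helper; columns are located by header name.
filter_by_score() {
    local table=$1 threshold=$2
    awk -F'\t' -v t="$threshold" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "seqname")   s = i
                if ($i == "max_score") m = i
            }
            next
        }
        s && m && $m + 0 >= t + 0 { print $s "\t" $m }
    ' "$table"
}

# Made-up example rows (NOT real VirSorter2 output), reduced to three columns
scores=$(mktemp)
printf 'seqname\tmax_score\tmax_score_group\n'  > "$scores"
printf 'contig_1\t0.98\tdsDNAphage\n'          >> "$scores"
printf 'contig_2\t0.72\tssDNA\n'               >> "$scores"
printf 'contig_3\t0.91\tRNA\n'                 >> "$scores"

# Keep only sequences scoring 0.9 or higher
filter_by_score "$scores" 0.9
```

Comparing the number of sequences retained at a few thresholds gives a feel for how sensitive your results are to the --min-score setting.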
VirSorter2 for your own work
The required databases for VirSorter2 are not loaded with the NeSI module. For your own work, you will need to first download these databases and provide their path via the -d flag. For today's workshop this is already set up.