2.2 - Assembling short-read Illumina data¶
Overview¶
time
- Teaching: 10 minutes
- Exercises: 10 minutes
Objectives and Key points
Objectives¶
- Perform an assemble of a bacterial genome using the
SPAdes
assembly tool.
Keypoints¶
- The
SPAdes
genome assembler is a powerful tool for assemnling genoms and contigs from a wide range of sampel types.
Introduction to the SPAdes assembly tool¶
The SPAdes
assembler is a very powerful and popular assembler which utilises de Bruijn graphs to assemble short read sequence data into larger contigs.
SPAdes is particularly good at...
- De novo assembly of short read sequences (i.e. Illumina or IonTorrent)
- Assembling small genomes (ideally <100 Mb, such as bacterial, viral, fungal, mitochondrial genomes)
Despite this, it is possible to apply SPAdes
to larger genomes and obtain very good results.
In order to begin, we must first find the versions of SPAdes
installed on NeSI and load the module of interest.
Remember to filter your data prior to assembly!
Before beginning to work with SPAdes
we need to ensure that our data is free from adapter sequences. The main reason for doing this is when we assemble sequences to form a genome, the assembler is looking for spans of nucleic acid sequence which are common to multiple reads, so that those reads can be joined together to create longer contigs. As the adapter sequence is an identical tag added to every read, these create regions of artificial homology between sequences which have no real connection to each other.
Before we attempt assembly it is critical to remove these from our data but as this was covered in the level 1 training we will not be revisiting the process here.
Performing an assembly using SPAdes¶
The test data is a set of Illumina MiSeq sequencing reads from an M. bovis genome. To save time, the reads have already been quality filtered with fastp
and the number of reads reduced to speed up analysis.
Navigate to the assembly_illumina/
folder to begin.
Exercise
Use your knowledge of the module system on NeSI to find the most recent release of SPAdes
.
Solution
Output
----------------------------------------------- /opt/nesi/CS400_centos7_bdw/modules/all -----------------------------------------------
SPAdes/3.13.1-gimkl-2018b SPAdes/3.15.0-gimkl-2020a SPAdes/3.15.3-gimkl-2020a
SPAdes/3.14.0-gimkl-2020a SPAdes/3.15.2-gimkl-2020a SPAdes/3.15.4-gimkl-2022a-Python-3.10.5 (D)
Create a slurm
script with the following contents. Be sure to replace the YOUR_EMAIL
and USERNAME
values with your details.
spades_asm.sl
#!/bin/bash -e
#SBATCH --account nesi03181
#SBATCH --job-name spades_asm
#SBATCH --time 00:40:00
#SBATCH --cpus-per-task 16
#SBATCH --mem 20G
#SBATCH --error spades_asm.%j.err
#SBATCH --output spades_asm.%j.out
#SBATCH --mail-type END
#SBATCH --mail-user YOUR_EMAIL
module purge
module load SPAdes/3.15.4-gimkl-2022a-Python-3.10.5
# Set the path from which the script will execute SPAdes
cd /nesi/project/nesi03181/phel/USERNAME/level2/assembly_illumina/
# Execute SPAdes
spades.py --isolate --threads ${SLURM_CPUS_PER_TASK} \
-1 reads/Mbovis_87900.miseq_R1.fq.gz \
-2 reads/Mbovis_87900.miseq_R2.fq.gz \
-o assembly/
Submit this job to slurm
:
When this job is complete, we will have a folder named assembly/
which contains a lot of different files and pieces of information. A lot of the contents are files generated internally by SPAdes
and we do not need to pay attention to them. The most important files for us are:
contigs.fasta
- the assembled contigs, as a fasta file.scaffolds.fasta
- the scaffolded contigs, as a fasta file.- This is similar to the contigs file, except it will contain sequences where contigs have been joined by an indeterminate piece of sequence.
- This occurs when
SPAdes
can tell from the pairing information in our library that two contigs belong adjacent to each other, but it has no information for filling the gap between them. spades.log
- the log file of the stepsSPAdes
performed and any warnings which occurred during assembly.assembly_graph.fastg
- a map of how well the assembly is resolved.- This can be useful if we are trying to obtain a complete genome, as it shows us areas which have assembled cleaning and areas which were difficult to resolve.
Different modes for running SPAdes
If you checked the help command for SPAdes
, or were examining the contents of the slurm
script carefully, you might have noted the flag --isolate
which we passed to the command.
SPAdes
was originally created as a tool for a very specific use case - the assembly of single-cell amplified genomes, which were problematic to assembly due to the highly uneven coverage of the nucleic acid content in the input libraries (Bankevich et al., 2012). It was also noted at the time that despite this fairly niche scope for use, the tool outperformed many traditional short read assemblers of the time and so over the years the team behind SPAdes
have expanded the tools and added a number of alternate assembly modes into successive iterations of the tool.
There are now specific subroutines in the SPAdes
assembler for working with the following data types:
- metaSPAdes - assembly of metagenomic samples (sequence libraries from multiple organisms)
- plasmidSPAdes - selective assembly of plasmids from genomic libraries
- metaplasmidSPAdes - selective assembly of plasmids from metagenomic
- rnaSPAdes - de novo assembly of transcriptome libraries
- biosyntheticSPAdes - selective assembly of biosynthetic gene clusters (BCG) from a library
- rnaviralSPAdes - selective de novo assembly of RNA viruses from transcriptome, metatranscriptome, and metavirome libraries
It will take a while for SPAdes
to complete, so we will move to the next exercise and return to this folder later.