Genome Assembly
Objective: Using quality filtered reads to assemble bacterial genomes, and assessing how different assembler parameters can impact assembly quality.
1. De novo genome assembly with SPAdes
SPAdes is a popular de novo assembler for microbial genome assembly. It uses a multi-kmer strategy, contains assembly pipelines for isolated prokaryotic genome data and supports Illumina short reads, single-cell sequencing, and hybrid assembly (Illumina + Nanopore/PacBio).
1. Create slurm script
nano spades_assembly_v1.sh
Copy in spades_assembly_v1.sh
#!/bin/bash -e
#SBATCH --account nesi02659
#SBATCH --job-name spades_assembly_v1
#SBATCH --time 00:45:00
#SBATCH --mem 10GB
#SBATCH --array 0-9
#SBATCH --cpus-per-task 8
#SBATCH --error slurm_spades_assembly_v1_%A-%a.err
#SBATCH --output slurm_spades_assembly_v1_%A-%a.out
#SBATCH --partition milan
module purge >/dev/null 2>&1
module load SPAdes/4.0.0-foss-2023a-Python-3.11.6
declare -a array=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10")
#Move to home directory
cd ~/mgsr
#Create directory for SPAdes output files
mkdir -p 5.spades_assembly/v1/${array[$SLURM_ARRAY_TASK_ID]}
#Run SPAdes
srun spades.py -k auto -t 8 \
-1 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R1.fastq \
-2 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R2.fastq \
-s 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_single.fastq \
-o 5.spades_assembly/v1/${array[$SLURM_ARRAY_TASK_ID]}/
Submit slurm job
sbatch spades_assembly_v1.sh
Explanation of SPAdes parameters | |
---|---|
-k | K-mer sizes |
-t | Number of threads |
-1 | File with forward reads |
-2 | File with reverse reads |
-s | File with unpaired reads |
-o | Output directory |
Output files
SPAdes generates several important output files:
- contigs.fasta: Contains the assembled contigs (contiguous sequences without gaps).
- scaffolds.fasta: Contains scaffolds, which are contigs linked together by paired-end information, with gaps represented as “N” bases. This file is useful for gene prediction and other genome-wide analyses.
- assembly_graph: Represents the de Bruijn graph used during assembly. This is useful for advanced users who want to investigate assembly ambiguities or potential misassemblies.
2. Optimizing SPAdes
Objective: Explore how different SPAdes parameters affect assembly quality and contiguity
SPAdes provides several options to adjust and optimize the assembly process:
- The k-mer size affects how reads overlap and form contigs.
- The –careful mode (mismatch correction) may improve accuracy but reduce contiguity.
- Adjusting coverage cutoffs can help remove low quality contigs.
1. Assemble an isolate using different k-mer settings
By default, SPAdes automatically selects k-mer sizes, but we can manually specify them. The recommended k-mer sizes for different sequencing technologies can be found in the SPAdes documentation.
–> Repeat steps in 3.1 and change kmer sizes.
nano spades_assembly_v2.sh
Copy in spades_assembly_v2.sh
#!/bin/bash -e
#SBATCH --account nesi02659
#SBATCH --job-name spades_assembly_v2
#SBATCH --time 00:45:00
#SBATCH --mem 10GB
#SBATCH --array 0-9
#SBATCH --cpus-per-task 8
#SBATCH --error slurm_spades_assembly_v2_%A-%a.err
#SBATCH --output slurm_spades_assembly_v2_%A-%a.out
#SBATCH --partition milan
module purge >/dev/null 2>&1
module load SPAdes/4.0.0-foss-2023a-Python-3.11.6
declare -a array=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10")
#Move to home directory
cd ~/mgsr
#Create directory for SPAdes output files
mkdir -p 5.spades_assembly/v2/${array[$SLURM_ARRAY_TASK_ID]}
#Run SPAdes
srun spades.py -k 33,55,77,99,121 -t 8 \
-1 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R1.fastq \
-2 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R2.fastq \
-s 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_single.fastq \
-o 5.spades_assembly/v2/${array[$SLURM_ARRAY_TASK_ID]}/
Submit slurm job
sbatch spades_assembly_v2.sh
2. Experiment with --careful
The –careful mode aims to reduce the number of mismatches and short indels. This option is recommended only for assembly of small genomes such as bacterial genomes, but not for larger genomes.
–> Repeat steps in 3.1 and add the –careful flag.
nano spades_assembly_v3.sh
Copy in 4.spades_assembly_v3.sh
#!/bin/bash -e
#SBATCH --account nesi02659
#SBATCH --job-name spades_assembly_v3
#SBATCH --time 00:45:00
#SBATCH --mem 10GB
#SBATCH --array 0-9
#SBATCH --cpus-per-task 8
#SBATCH --error slurm_spades_assembly_v3_%A-%a.err
#SBATCH --output slurm_spades_assembly_v3_%A-%a.out
#SBATCH --partition milan
module purge >/dev/null 2>&1
module load SPAdes/4.0.0-foss-2023a-Python-3.11.6
declare -a array=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10")
#Move to home directory
cd ~/mgsr
#Create directory for SPAdes output files
mkdir -p 5.spades_assembly/v3/${array[$SLURM_ARRAY_TASK_ID]}
#Run SPAdes
srun spades.py -k auto --careful -t 8 \
-1 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R1.fastq \
-2 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R2.fastq \
-s 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_single.fastq \
-o 5.spades_assembly/v3/${array[$SLURM_ARRAY_TASK_ID]}/
Submit slurm job
sbatch spades_assembly_v3.sh
3. Comparing assembly statistics
1. Filter out short sequences
Before looking at the assemblies in more detail, we will first remove scaffolds that are shorter than 1000 bp. It is common practice to do so - there are several reasons why, such as minimizing assembly artefacts and improving downstream analyses, especially gene prediction (that is partly because the average bacterial gene length is aroud 900 to 1000 bp).
To filter out sequences < 1000 bp, we will use seqmagick. Clear out previous modules to avoid conflicts and load seqmagick.
module purge
module load seqmagick/0.8.4-gimkl-2020a-Python-3.8.2
Run seqmagick on all scaffold files using a loop.
for i in {01..10}; do
for j in v1 v2 v3; do
seqmagick convert --min-length 1000 5.spades_assembly/"$j"/"$i"/scaffolds.fasta 5.spades_assembly/"$j"/"$i"/scaffolds_"$i"_"$j".m1000.fasta
done
done
2. Gather all assembled genomes into one folder
for i in {01..10}; do
for j in v1 v2 v3; do
mkdir -p 6.assembly_evaluation/"$j" | cp 5.spades_assembly/"$j"/"$i"/scaffolds_"$i"_"$j".m1000.fasta 6.assembly_evaluation/"$j"/scaffolds_"$i"_"$j".m1000.fasta
done
done
3.Extract stats and evaluate assembly quality
To evaluate the quality of genome assemblies, we will use BBMap’s statswrapper.sh tool. This tool provides comprehensive assembly statistics, allowing us to assess key metrics crucial for determining assembly quality.
- Number of contigs/scaffolds: Fewer contigs/scaffolds generally indicate a more contiguous assembly, which is desirable. However, an extremely low number might indicate chimeric assemblies or misassemblies.
- N50/L50: Length of the shortest contig at 50% of the total genome length. A higher N50 indicates more contiguous assemblies / Number of contigs representing 50% of the genome. A lower L50 is typically better, as it suggests fewer but longer contigs.
- Total number of basepairs in assembly: This should roughly match the expected genome size. Significant deviations may suggest incomplete assembly or contamination.
- GC Content: Check for abnormal GC content, which could indicate contamination.
- Maximum scaffold/contig length: Provides an idea of the longest assembled sequence.
Now let’s run BBMap’s statswrapper.sh on the different variations of assemblies perfomed using a loop.
module purge
module load BBMap/39.01-GCC-11.3.0
for j in v1 v2 v3; do
statswrapper.sh 6.assembly_evaluation/"$j"/* > 6.assembly_evaluation/assembly_stats_"$j".txt
done
Let’s have a look at the assembly quality metrics assessed by BBMap statswrapper.sh tool.
cd 6.assembly_evaluation/
head -n 3 assembly_stats_v1.txt
n_scaffolds n_contigs scaf_bp contig_bp gap_pct scaf_N50 scaf_L50 ctg_N50 ctg_L50 scaf_N90 scaf_L90 ctg_N90 ctg_L90 scaf_max ctg_max scaf_n_gt50K scaf_pct_gt50K gc_avg gc_std filename
108 125 6451046 6449346 0.026 15 120978 19 100947 52 35705 62 31253 401116 377043 40 82.328 0.44562 0.01855 /scale_wlg_nobackup/filesets/nobackup/uoa00626/emilie_working/test2-SR-workshop/5.assembly_evaluation/v1/scaffolds_01_v1.m1000.fasta
1049 1086 7872491 7870141 0.030 102 21818 109 20218 550 2148 572 2109 84541 84541 18 15.058 0.44444 0.02782 /scale_wlg_nobackup/filesets/nobackup/uoa00626/emilie_working/test2-SR-workshop/5.assembly_evaluation/v1/scaffolds_02_v1.m1000.fasta
Subset files to view 3 of the key metrics mentioned above (number of scaffolds, total number of basepairs in assembly and N50).
echo "SPAdes assembly v1" && cut -f1,3,7 assembly_stats_v1.txt
echo "SPAdes assembly v2" && cut -f1,3,7 assembly_stats_v2.txt
echo "SPAdes assembly v3" && cut -f1,3,7 assembly_stats_v3.txt
Discussion points:
- How do N50 values differ between default vs. custom k-mer assembly?
- Does manually setting k-mers improve contiguity?
- Does
--cov-cutoff auto
remove unwanted small contigs? - What are the trade-offs between accuracy (–careful) and contiguity (N50, fewer contigs)?