Genome Assembly

Objective: Using quality filtered reads to assemble bacterial genomes, and assessing how different assembler parameters can impact assembly quality.

1. De novo genome assembly with SPAdes

SPAdes is a popular de novo assembler for microbial genome assembly. It uses a multi-kmer strategy, contains assembly pipelines for isolated prokaryotic genome data and supports Illumina short reads, single-cell sequencing, and hybrid assembly (Illumina + Nanopore/PacBio).

1. Create slurm script

nano spades_assembly_v1.sh

Copy in spades_assembly_v1.sh

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      spades_assembly_v1
#SBATCH --time          00:45:00
#SBATCH --mem           10GB
#SBATCH --array         0-9
#SBATCH --cpus-per-task 8
#SBATCH --error         slurm_spades_assembly_v1_%A-%a.err
#SBATCH --output        slurm_spades_assembly_v1_%A-%a.out
#SBATCH --partition     milan

module purge >/dev/null 2>&1
module load SPAdes/4.0.0-foss-2023a-Python-3.11.6

declare -a array=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10") 

#Move to home directory
cd ~/mgsr

#Create directory for SPAdes output files
mkdir -p 5.spades_assembly/v1/${array[$SLURM_ARRAY_TASK_ID]}

#Run SPAdes
srun spades.py -k auto -t 8 \
  -1 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R1.fastq \
  -2 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R2.fastq \
  -s 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_single.fastq \
  -o 5.spades_assembly/v1/${array[$SLURM_ARRAY_TASK_ID]}/

Submit slurm job

sbatch spades_assembly_v1.sh

Explanation of SPAdes parameters
-k	K-mer sizes
-t	Number of threads
-1	File with forward reads
-2	File with reverse reads
-s	File with unpaired reads
-o	Output directory

Output files

SPAdes generates several important output files:

contigs.fasta: Contains the assembled contigs (contiguous sequences without gaps).
scaffolds.fasta: Contains scaffolds, which are contigs linked together by paired-end information, with gaps represented as “N” bases. This file is useful for gene prediction and other genome-wide analyses.
assembly_graph: Represents the de Bruijn graph used during assembly. This is useful for advanced users who want to investigate assembly ambiguities or potential misassemblies.

2. Optimizing SPAdes

Objective: Explore how different SPAdes parameters affect assembly quality and contiguity

SPAdes provides several options to adjust and optimize the assembly process:

The k-mer size affects how reads overlap and form contigs.
The –careful mode (mismatch correction) may improve accuracy but reduce contiguity.
Adjusting coverage cutoffs can help remove low quality contigs.

1. Assemble an isolate using different k-mer settings

By default, SPAdes automatically selects k-mer sizes, but we can manually specify them. The recommended k-mer sizes for different sequencing technologies can be found in the SPAdes documentation.

–> Repeat steps in 3.1 and change kmer sizes.

nano spades_assembly_v2.sh

Copy in spades_assembly_v2.sh

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      spades_assembly_v2
#SBATCH --time          00:45:00
#SBATCH --mem           10GB
#SBATCH --array         0-9
#SBATCH --cpus-per-task 8
#SBATCH --error         slurm_spades_assembly_v2_%A-%a.err
#SBATCH --output        slurm_spades_assembly_v2_%A-%a.out
#SBATCH --partition     milan

module purge >/dev/null 2>&1
module load SPAdes/4.0.0-foss-2023a-Python-3.11.6

declare -a array=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10") 

#Move to home directory
cd ~/mgsr

#Create directory for SPAdes output files
mkdir -p 5.spades_assembly/v2/${array[$SLURM_ARRAY_TASK_ID]}

#Run SPAdes
srun spades.py -k 33,55,77,99,121 -t 8 \
  -1 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R1.fastq \
  -2 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R2.fastq \
  -s 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_single.fastq \
  -o 5.spades_assembly/v2/${array[$SLURM_ARRAY_TASK_ID]}/

Submit slurm job

sbatch spades_assembly_v2.sh

2. Experiment with --careful

The –careful mode aims to reduce the number of mismatches and short indels. This option is recommended only for assembly of small genomes such as bacterial genomes, but not for larger genomes.

–> Repeat steps in 3.1 and add the –careful flag.

nano spades_assembly_v3.sh

Copy in 4.spades_assembly_v3.sh

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      spades_assembly_v3
#SBATCH --time          00:45:00
#SBATCH --mem           10GB
#SBATCH --array         0-9
#SBATCH --cpus-per-task 8
#SBATCH --error         slurm_spades_assembly_v3_%A-%a.err
#SBATCH --output        slurm_spades_assembly_v3_%A-%a.out
#SBATCH --partition     milan

module purge >/dev/null 2>&1
module load SPAdes/4.0.0-foss-2023a-Python-3.11.6

declare -a array=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10") 

#Move to home directory
cd ~/mgsr

#Create directory for SPAdes output files
mkdir -p 5.spades_assembly/v3/${array[$SLURM_ARRAY_TASK_ID]}

#Run SPAdes
srun spades.py -k auto --careful -t 8 \
  -1 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R1.fastq \
  -2 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_R2.fastq \
  -s 2.clean_reads/clean.${array[$SLURM_ARRAY_TASK_ID]}_single.fastq \
  -o 5.spades_assembly/v3/${array[$SLURM_ARRAY_TASK_ID]}/

Submit slurm job

sbatch spades_assembly_v3.sh

3. Comparing assembly statistics

1. Filter out short sequences

Before looking at the assemblies in more detail, we will first remove scaffolds that are shorter than 1000 bp. It is common practice to do so - there are several reasons why, such as minimizing assembly artefacts and improving downstream analyses, especially gene prediction (that is partly because the average bacterial gene length is aroud 900 to 1000 bp).

To filter out sequences < 1000 bp, we will use seqmagick. Clear out previous modules to avoid conflicts and load seqmagick.

module purge
module load seqmagick/0.8.4-gimkl-2020a-Python-3.8.2

Run seqmagick on all scaffold files using a loop.

for i in {01..10}; do
  for j in v1 v2 v3; do
    seqmagick convert --min-length 1000 5.spades_assembly/"$j"/"$i"/scaffolds.fasta 5.spades_assembly/"$j"/"$i"/scaffolds_"$i"_"$j".m1000.fasta
  done
done

2. Gather all assembled genomes into one folder

for i in {01..10}; do
  for j in v1 v2 v3; do
    mkdir -p 6.assembly_evaluation/"$j" | cp 5.spades_assembly/"$j"/"$i"/scaffolds_"$i"_"$j".m1000.fasta 6.assembly_evaluation/"$j"/scaffolds_"$i"_"$j".m1000.fasta
  done
done

3.Extract stats and evaluate assembly quality

To evaluate the quality of genome assemblies, we will use BBMap’s statswrapper.sh tool. This tool provides comprehensive assembly statistics, allowing us to assess key metrics crucial for determining assembly quality.

Number of contigs/scaffolds: Fewer contigs/scaffolds generally indicate a more contiguous assembly, which is desirable. However, an extremely low number might indicate chimeric assemblies or misassemblies.
N50/L50: Length of the shortest contig at 50% of the total genome length. A higher N50 indicates more contiguous assemblies / Number of contigs representing 50% of the genome. A lower L50 is typically better, as it suggests fewer but longer contigs.
Total number of basepairs in assembly: This should roughly match the expected genome size. Significant deviations may suggest incomplete assembly or contamination.
GC Content: Check for abnormal GC content, which could indicate contamination.
Maximum scaffold/contig length: Provides an idea of the longest assembled sequence.

Now let’s run BBMap’s statswrapper.sh on the different variations of assemblies perfomed using a loop.

module purge
module load BBMap/39.01-GCC-11.3.0 
for j in v1 v2 v3; do
    statswrapper.sh 6.assembly_evaluation/"$j"/* > 6.assembly_evaluation/assembly_stats_"$j".txt 
done

Let’s have a look at the assembly quality metrics assessed by BBMap statswrapper.sh tool.

cd 6.assembly_evaluation/

head -n 3 assembly_stats_v1.txt

n_scaffolds     n_contigs       scaf_bp contig_bp       gap_pct scaf_N50        scaf_L50        ctg_N50 ctg_L50 scaf_N90        scaf_L90        ctg_N90 ctg_L90 scaf_max        ctg_max scaf_n_gt50K scaf_pct_gt50K   gc_avg  gc_std  filename
108     125     6451046 6449346 0.026   15      120978  19      100947  52      35705   62      31253   401116  377043  40      82.328  0.44562 0.01855 /scale_wlg_nobackup/filesets/nobackup/uoa00626/emilie_working/test2-SR-workshop/5.assembly_evaluation/v1/scaffolds_01_v1.m1000.fasta
1049    1086    7872491 7870141 0.030   102     21818   109     20218   550     2148    572     2109    84541   84541   18      15.058  0.44444 0.02782 /scale_wlg_nobackup/filesets/nobackup/uoa00626/emilie_working/test2-SR-workshop/5.assembly_evaluation/v1/scaffolds_02_v1.m1000.fasta

Subset files to view 3 of the key metrics mentioned above (number of scaffolds, total number of basepairs in assembly and N50).

echo "SPAdes assembly v1" && cut -f1,3,7 assembly_stats_v1.txt 
echo "SPAdes assembly v2" && cut -f1,3,7 assembly_stats_v2.txt 
echo "SPAdes assembly v3" && cut -f1,3,7 assembly_stats_v3.txt

Discussion points:

How do N50 values differ between default vs. custom k-mer assembly?
Does manually setting k-mers improve contiguity?
Does --cov-cutoff auto remove unwanted small contigs?
What are the trade-offs between accuracy (–careful) and contiguity (N50, fewer contigs)?