2.3 - Assembling Oxford Nanopore data¶
Overview¶
time
- Teaching: 15 minutes
- Exercises: 10 minutes
Objectives and Key points
Objectives¶
- Perform an assemble of a bacterial genome using the
Flye
assembly tool.
Keypoints¶
- The
Flye
assembler is one of several very powerful tools for assembling genomes from long read data. - Different tools may be better suited for different data, and long read assembly is still a developing field so if working with these data, make sure to experiment with multiple tools.
Introduction to Flye¶
Although the gap is closing rapidly, Oxford Nanopore sequences are fundamentally more error prone than the sequences we obtain through Illumina sequencing and a considerable amount of assembly is spent identifying and correcting errors to produce high-quality contigs from a comparably low-quality set of reads.
The median sequence quality for the Nanopore data produced using the (now retired 9.4 chemistry) data sits around Q20 for most of the sequence. This corresponds to 99% accuracy which might sound good, but by definition half of the sequences have lower quality than this. At the low end of this plot the sequences are slightly above Q10, which denotes 90% accuracy.
Finding consensus regions between pairs of reads, when one of them might differ by up to 10% of it's composition just due to sequencing error alone makes assembly a complicated process and assembly tools which are aware of the error profiles of our long read data are essential.
Why use Flye
?
The complete workflow of Flye
(Kolmogorov et al., 2019) is quite complicated process. The novel aspect of assembly with Flye
when compared with other asssembly tools was the developers observation that when working with noisy reads (as mentioned above) mapping sequences against each other is confounded by highly similar repeat regions, which create hotspots of local alignment between reads with very different flanking sites.
For our purposes, the main points of the assembly process to understand are:
- Repetitive regions of the genome are identified
- Repeat regions are clustered together to form deliberately misassembled disjointigs
- Disjointigs are concatenated and a repeat graph (similar to a de Bruijn graph) is created
- The input reads are mapped to the repeat graph
- Using the reads that map to the repeat region and it's 5' and 3' flanking regions, the loops in the assembly graph are unwound to create long contigs
To run Flye
, navigate to your assembly_nanopore/
directory, and prepare the following slurm
script:
flye_asm.sl
#!/bin/bash -e
#SBATCH --account nesi03181
#SBATCH --job-name flye_asm
#SBATCH --time 00:10:00
#SBATCH --cpus-per-task 16
#SBATCH --mem 20G
#SBATCH --error flye_asm.%j.err
#SBATCH --output flye_asm.%j.out
#SBATCH --mail-type END
#SBATCH --mail-user YOUR_EMAIL
module purge
module load Flye/2.9.1-gimkl-2022a-Python-3.10.5
# Set the path from which the script will execute Flye
cd /nesi/project/nesi03181/phel/USERNAME/level2/assembly_nanopore/
# Execute Flye
flye --threads ${SLURM_CPUS_PER_TASK} \
--nano-raw reads/Mbovis_87900.nanopore.fq.gz \
--out-dir assembly/
When you are ready, submit the job to slurm
:
As with the SPAdes
session, we will end up with a folder named assembly/
which contains a number of files, only some of which we care about. The key files for us are:
assembly.fasta
- the assembled contigs, as a fasta file.flye.log
- the log file of the stepsSPAdes
performed and any warnings which occured during assembly.assembly_graph.gfa
- a map of how well the assembly is resolved.
A note on assembly tools¶
For the exercise today are using the Flye
assembler with one of the M. bovis genomes. Like with other areas of genomics, there are many good options for assembly tools and our usage of Flye
today is in no way an endorsement that we consider this tool to be the 'best' long read assembler. Flye
is a very good tool and will give us good results with the data we process today, but when working with real data there are many other good options to try, including:
UniCycler
(andTriCycler
) (Wick et al., 2017) - https://github.com/rrwick/UnicyclerCanu
(Koren et al., 2017)raven
(Vaser et al., 2021)
A recent comparison of assembly tools was published by Wick & Holt (2021) tests some of the options listed above along with several other tools. Their manuscript is a 'living paper', which has been updated several times as new versions of each tool are released. Different assemblers have risen to the top at different time points, so this is very much still an evolving field, and it is difficult to say which assembler is the 'best'.
In practice, there are sometimes particular cases where a tool will not be compatible with your data, so it is helpful to be aware of several tools so that you have options if assembly proves problematic for a particular sample.