2.3 - Evaluating the results of an assembly
Overview
time
- Teaching: 20 minutes
- Exercises: 15 minutes
Objectives and Key points
Objectives
- Use QUAST to assess the assembly status.
- (Optional) Use Bandage to view how well the assembly resolved.
Keypoints
- Tools like QUAST can be used to perform quick and easy comparisons between an assembly and a trusted reference genome.
- It is important to make sure that your genome is sufficiently resolved to address your need, but we often do not need to go further than what an assembler provides.
Assessing the results of an assembly
Once assembly is complete, we face the complex task of determining its quality. How 'good' a genome is can be difficult to measure, but as we are mostly working with well characterised pathogens, a good starting place is to compare our assembled genome with previously characterised members of the same species. This shows how well the conserved genomic features have been reconstructed by our assembly tool.
Navigate to the assembly_evaluation/ folder and we will begin.
You have been provided with a copy of a reference Mycoplasmopsis bovis genome in the reference/ folder, but we will need some draft assemblies to test as part of this module.
Exercise
Create a new directory in the assembly_evaluation/ folder and copy in your SPAdes and Flye fasta and fastg assembly files.
Help, my assembly failed!
If your assembly did not complete, don't worry about it. There is a training set of assemblies we can provide for you if required.
Once you have a local copy of your assemblies, we will compare them to the reference genome using a tool called QUAST.
Running QUAST is quite simple:
code
Output
Version: 5.2.0
System information:
OS: Linux-3.10.0-693.2.2.el7.x86_64-x86_64-with-glibc2.17 (linux_64)
Python version: 3.10.5
CPUs number: 2
Started: 2023-09-28 15:02:30
# Text omitted...
Finished: 2023-09-28 15:02:40
Elapsed time: 0:00:10.422580
NOTICEs: 4; WARNINGs: 1; non-fatal ERRORs: 0
Thank you for using QUAST!
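For readers who want to see the general shape of such a call, the following is a minimal sketch and not necessarily the command used in this workshop. It assumes QUAST is installed as `quast.py` and relies on its `-r` (reference genome) and `-o` (output directory) options. The file names and the helper `build_quast_command` are hypothetical placeholders for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_quast_command(reference, assemblies, out_dir="quast"):
    """Return the argument list for a reference-based QUAST run."""
    # -r names the reference genome and -o the output directory; the
    # assemblies to compare are passed as positional arguments.
    return ["quast.py", "-r", str(reference), "-o", str(out_dir),
            *[str(a) for a in assemblies]]


# Hypothetical file names: substitute the ones in your own directory.
reference = Path("reference/reference.fna")
assemblies = [Path("spades_assembly.fasta"), Path("flye_assembly.fasta")]

cmd = build_quast_command(reference, assemblies)
print(" ".join(cmd))

# Only attempt the run when QUAST is on the PATH and every input exists.
inputs_present = all(p.exists() for p in [reference, *assemblies])
if shutil.which("quast.py") and inputs_present:
    subprocess.run(cmd, check=True)
```

Passing both assemblies in a single call lets QUAST place them side by side in one report, which makes the comparison in the next step easier.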
Open the resulting quast/report.pdf file in Jupyter using the file browser. Take a look through the report and see if you can get a feel for how well your assemblies compare to the reference.
How do the Illumina and Nanopore assemblies differ, if at all?
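To build some intuition for the contiguity figures that QUAST reports (number of contigs, total length, largest contig, N50), the sketch below computes them directly from FASTA-formatted lines. It is an illustration rather than a replacement for QUAST. The function names are hypothetical, and the values may not match the report exactly because QUAST applies its own minimum contig length filter.

```python
def read_fasta_lengths(lines):
    """Return the length of every sequence in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; store the previous one first.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total of contig lengths,
    taken from longest to shortest, first reaches half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarise(lengths):
    """Collect a few basic contiguity statistics in one dictionary."""
    return {
        "contigs": len(lengths),
        "total_length": sum(lengths),
        "largest_contig": max(lengths, default=0),
        "N50": n50(lengths),
    }


# Tiny made-up example: three contigs of 10, 6 and 4 bases.
example = [">contig_1", "ACGTACGTAC", ">contig_2", "ACGT", "AC", ">contig_3", "ACGT"]
print(summarise(read_fasta_lengths(example)))
```

To apply this to a real assembly, pass an open file handle in place of `example`, for instance `with open("my_assembly.fasta") as handle: print(summarise(read_fasta_lengths(handle)))`, where the file name is a placeholder.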
(Optional) Visualising assemblies with Bandage
We can also visualise the assemblies by looking at how well the loops and fragments of the assembly graph were resolved. For this, we require a different set of files from the assembly output folders.
Exercise
Copy the *.fastg (SPAdes) and *.gfa (Flye) files from your previous output folders into your current assembly directory, ready for analysis.
Running the tool is then a matter of:
code
You can then open the assembly_bandage.svg file in the Jupyter browser. Unfortunately, we cannot filter out the short contigs from this result. However, it should be clear that there is one long contig which has been assembled, and then a large number of short fragments.
Exercise
Repeat the Bandage command for your Nanopore assembly, then contrast the result with what you obtained with SPAdes. How do they differ?
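If you would like numbers to go with the picture, the sketch below tabulates segment lengths and counts links in a GFA file such as the one Flye produces. It assumes GFA version 1 conventions: tab-separated lines, `S` records holding a segment name and sequence, an optional `LN:i:` tag when the sequence is given as `*`, and `L` records for links. The function names and the example records are made up for illustration.

```python
def gfa_segment_lengths(lines):
    """Map segment name to length for the S (segment) records of a GFA file."""
    lengths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue
        name, sequence = fields[1], fields[2]
        if sequence != "*":
            lengths[name] = len(sequence)
            continue
        # Sequence omitted: fall back to the optional LN:i:<length> tag.
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                lengths[name] = int(tag[5:])
                break
    return lengths


def count_links(lines):
    """Count L (link) records, i.e. the connections drawn between segments."""
    return sum(1 for line in lines if line.startswith("L\t"))


# Tiny made-up graph: three segments and one link.
example = [
    "H\tVN:Z:1.0",
    "S\tedge_1\tACGTACGTACGT",
    "S\tedge_2\t*\tLN:i:5",
    "S\tedge_3\tACG",
    "L\tedge_1\t+\tedge_2\t+\t0M",
]

lengths = gfa_segment_lengths(example)
# Print segments from longest to shortest.
for name, length in sorted(lengths.items(), key=lambda item: item[1], reverse=True):
    print(name, length)
print("links:", count_links(example))
```

A graph with a single long segment and few links suggests a well resolved assembly, whereas many short segments joined by many links correspond to the tangled regions that Bandage draws.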
Concluding comments
As you can see from this exercise, getting a reasonably good genome assembly is not particularly difficult with the right tools. However, moving from a draft assembly, which is what we have produced, to a final completed genome is a very long process. It involves multiple rounds of assembly refinement and scaffolding, and often requires the creation of custom primers to perform PCRs that close sequence gaps which were not covered in your HTS library.
It can be hard knowing when the assembly is good enough to move out of the assembly stage and into annotation. In research groups working with genomic data, the yardstick for working with these kinds of data is typically to ask whether the current assembly is sufficient to answer the research question which led to its sequencing in the first place.
We can apply the same logic and ask: what was the purpose of sequencing this genome, and can we achieve that purpose with the current data? Most often, we are looking to perform a species identification. If we find that the genome assembly contains the right marker genes or operons to perform the identification then, regardless of whether the genome is officially completed or not, it has served its purpose.