1. Introduction to pangenome graphs¶
What is a pangenome?¶
A pangenome is defined as the comprehensive collection of whole-genome sequences from multiple individuals within a clade, a population or a species. This collective genomic dataset can be further divided into two distinct components: the core genome, which includes regions present in all individuals at the time of analysis, and the accessory genome, consisting of regions only found in a subset of individuals.
What is a pangenome graph?¶
Pangenome graphs represent pangenomes using graph models, effectively capturing the complete genetic variation across the input genomes. These graphs consist of three components: nodes, edges, and paths.
Nodes¶
- DNA segments, which can be any length
Edges¶
- Describe the possible ways of walking through the nodes
- Connect pairs of node strands
- Can represent inversions
Paths¶
- Paths are routes through the nodes of the graph
- Genomes
- Haplotypes
- Alleles/variants
Overview of a pangenome graph construction pipeline¶
Pangenome concstruction by Pangenome Graph Builder (PGGB)¶
The PGGB pipeline is a reference-free method. It builds pangenome graphs using an all-to-all whole genome alignment approach with wfmash. Seqwish is employed to induce the graph, followed by progressive normalization with smoothxg and gfaffix.
graph manipulation using ODGI and multiQC report¶
- The Optimized Dynamic Genome/graph Implementation (ODGI) is used for various graph manipulation tasks, including visualization.
- MultiQC is used to generate a report, which includes statistics of the seqwish-induced graph, the final graph, and various visualizations of the final graph.
Obtain distance for phylogenetic analysis¶
We use ODGi to extract distances between paths within the graph, enabling further phylogenetic analysis.
Varaint calling¶
By using the pangenome graph created with PGGB, it is possible to concurrently identify a variety of genetic variations. These include structural variations (SVs), rearrangements, and smaller variants such as single nucleotide polymorphisms (SNPs) and insertions/deletions. These can be identified through the process of vg deconstruction.
NGS data analysis against graph¶
The VG toolkit is utilized for NGS data analysis against the graph, including tasks such as read mapping and variant calling
Key components of this pipeline¶
- Graph construction using the PGGB
- Graph manipulation using ODGI
- Variant calling for NGS data using the VG toolkit
It provides an efficient and integrated approach for pangenome analysis.