2. Tools & setup¶

Tools used for the pangenome graph pipeline

The PanGenome Graph Builder (PGGB) is used for graph construction (https://github.com/pangenome/pggb).
The Optimized Dynamic Genome/Graph Implementation (ODGI) is employed for graph manipulation (https://github.com/pangenome/odgi).
The VG Toolkit is utilized for variant calling with NGS against a pangenome graph reference (https://github.com/vgteam/vg).
Mash is used to estimate pairwise distances among genomes (https://github.com/marbl/Mash).
ProgressiveMauve is utilized to verify the alignment of genomes (https://darlinglab.org/mauve/user-guide/progressivemauve.html)
Circlator is employed to fix the starting point of genomes (https://sanger-pathogens.github.io/circlator/).
SAMtools is used to build an index of input FASTA files (https://github.com/samtools/samtools).
BCFtools is used to check the statistics of VCF files (https://github.com/samtools/bcftools).

Other Tools might be useful for pangenome graphs

The Minigraph-Cactus Pangenome Pipeline (https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md).
gfaestus, which can be used for graph visualization (https://github.com/chfi/gfaestus)
SequenceTubeMap for tube map visualization of genomic sequence graphs (https://github.com/vgteam/sequenceTubeMap)
......

Software stack for pggb pipeline

All of the required software is pre-installed and configured as modules. We will load them as needed

pggb/0.5.3-Miniconda3
SAMtools
Mash/2.3-GCC-11.3.0
Circlator/1.5.5-gimkl-2022a-Python-3.10.5
vg/1.46.0

Running the pggb workflow locally

The following code DOES NOT need to be run if you working on NeSI, but is provided here to show how to work through the workshop material on a non-NeSI system (e.g., if you are running the code on your own computer).

Obtaining pggb¶

From https://github.com/pangenome/pggb, you can find the details about installing pggb with Docker, Singularity, bioconda, guix, or by manually building its dependencies.

Using pggb via Singularity¶

Many managed HPCs utilize Singularity as a secure alternative to docker. Fortunately, docker images can be run through Singularity seamlessly. First pull the docker file and create a Singularity SIF image from the dockerfile. This might take a few minutes.

singularity pull docker://ghcr.io/pangenome/pggb:latest

Next clone the pggb repo and cd into it

git clone --recursive https://github.com/pangenome/pggb.git
cd pggb

Finally, run pggb from the Singularity image. For Singularity to be able to read and write files to a directory on the host operating system, we need to 'bind' that directory using the -B option and pass the pggb command as an argument.

singularity run -B ${PWD}/data:/data ../pggb_latest.sif pggb -i /data/HLA/DRB1-3123.fa.gz -p 70 -s 3000 -G 2000 -n 10 -t 16 -v -V 'gi|568815561:#' -o /data/out -M -m

Neisseria meningitidis data set¶

In this workshop, we use the bacterium Neisseria (N.) meningitidis as our model organism.

Bacteria called N. meningitidis cause meningococcal disease(IMD). About 1 in 10 people have these bacteria in the back of their nose and throat without being ill. This is called being ‘a carrier.’ Sometimes the bacteria invade the body and cause IMD.
Different capsular polysaccharides divide the meningococci into at least 12 serogroups, of which six groups (A, B, C, W, X, and Y) are responsible for most disease.
The genome of this bacterium spans approximately 2.1 to 2.4 Mb and possesses a GC content ranging from 51-52%.
A notable trait of N. meningitidis genomes is their high recombination rate, which contributes significantly to the substantial genetic diversity found in this species.
Throughout this workshop, we use five genome assemblies of N. meningitidis to evaluate the pangenome pipeline, from the construction of the pangenome graph to the calling of variants.

genomes	ASM IDs	GCF IDs	SEROGROUP	Sequence type	Clonal Complex
NC_017518.1 Neisseria meningitidis NZ-05/33	ASM19152v1	GCF_000191525.1	B	42	ST-41/44
NC_003112.2 Neisseria meningitidis MC58	ASM880v1	GCF_000008805.1	-	74	ST-32
NZ_CP007668.1 Neisseria meningitidis M0579	ASM102983v1	GCF_001029835.1	B	-	ST-41/44
NZ_CP016880.1 Neisseria meningitidis strain M07165	ASM170367v1	GCF_001703675.1	W	11	ST-11
NZ_CP020423.2 Neisseria meningitidis strain FDAARGOS_212	ASM207367v2	GCF_002073675.2	C	-	ST16521

How the Neisseria meningitidis genomes were formatted for this workshop

The following code DOES NOT need to be run, but is provided here to show how the Neisseria meningitidis genomes were downloaded and prepared for analysis.

# Create a new directory called nm_genomes and change to that directory
mkdir nm_genomes
cd nm_genomes

Download the genome assemblies from NCBI and uncompress. In the Unix environment you can use the curl command.

#NC_017518.1
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000191525.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_000191525.1.zip" -H "Accept: application/zip"

# ACC NUM?
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_001029835.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_001029835.1.zip" -H "Accept: application/zip"

# ACC NUM?
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_001703675.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_001703675.1.zip" -H "Accept: application/zip"

# ACC NUM?
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_002073675.2/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_002073675.2.zip" -H "Accept: application/zip"

# ACC NUM?
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCF_000008805.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,RNA_FASTA,CDS_FASTA,PROT_FASTA,SEQUENCE_REPORT&filename=GCF_000008805.1.zip" -H "Accept: application/zip"

On NeSI, a slurm job can be run to process the .fna genomes. The contents of the slurm job file (unzip_genomes.sl) are as follows:

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      extract_fna
#SBATCH --cpus-per-task 8
#SBATCH --mem           4G
#SBATCH --time          1:00:00

data=$HOME/nm_genomes/*.zip

for f in $data
do
    x=$(basename $f .zip)
    echo ${x}

    unzip $x.zip

    cp $HOME/nm_genomes/ncbi_dataset/data/${x}/*_genomic.fna /$HOME/nm_genomes/

    rm -rf ncbi_dataset
done

The slurm job can be run via:

sbatch unzip_genomes.sl

Remove unneeded files:

rm cds_from_genomic.fna
rm *.zip
rm README.md
rm slurm*.out

Use the cat command to combine genomes into a single fasta file:

cat *_genomic.fna > 5NM.fa

Fix the start point of the input genomes?

Circlator¶

Bacterial genomes are typically circular, so establishing a fixed starting point for each input genome during pangenome graph construction could reduce unnecessary complexity within the graph. Once the genomes are aligned with the same starting point, we can build their index using samtools faidx and proceed with constructing the pangenome graph. I encourage you to experiment with this approach after this workshop to see how effectively it functions. Please note that it may be necessary to try different regions as a starting point to ensure that all genomes are fixed with the same initial region.

let's fix the start for all genome using circlator, submit a slurm job. It takes less than one minute for each sample.

#!/bin/bash -e 

#SBATCH --account       nesi02659
#SBATCH --job-name      restart_fna
#SBATCH --cpus-per-task 8
#SBATCH --mem           4G
#SBATCH --time          1:00:00

module load Circlator/1.5.5-gimkl-2022a-Python-3.10.5

cd /home/zyang/pg_test
data=/home/zyang/pg_test/*.fna

#For some dataset, we may have to try different start point to make sure all genomes being fixed with the same region. 
start_region=start_region.fa


for f in $data
do

x=$(basename $f .fna)
echo ${x}

circlator fixstart --genes_fa  start_region  ${x}.fna  ${x}.restart

done

Use the cat command to combine genomes into a single fasta file:

cat *_genomic.restart.fasta > 5NMfs.fa

Fix the start point of the 5NM genomes

One tricky aspect is finding a region that is present in all the genomes and is oriented in the same direction The mauve alignment of the 5NM genomes

The Mauve alignment after attempting to fix the start point using the first 50 lines of genome one.

The Mauve alignment after fixing the start point using the first 50 lines of the second genome, which has been aligned with the first 50 lines of genome one.

Setting up your project directory and downloading the dataset¶

code

Create a new directory under your home folder and change to that directory.
```
mkdir ~/pg_workshop
```
```
cd ~/pg_workshop
```
~/pg_workshop will be your working directory
Keep a note of the absolute path of your working directory
```
pwd
```

Output

/home/<YOUR_USER_ID>/pg_workshop

Please download the dataset.

git clone https://github.com/ZoeYang2020/dataset_for_pg_workshop

Copy the 5NM.fa and dataset to your working directory (To keep the filenames simple, we have renamed the 5NMfs.fa to 5NM.fa for this workshop)
```
cp ./dataset_for_pg_workshop/5NM.fa ./
```