Genome Annotation

Objectives

  • Assessing genome quality
  • Assigning taxonomy
  • Predicting gene sequences

1. Genome quality

To assess the quality of the assembled genomes, we will use CheckM2, which assesses the degree of completess and contamination of microbial genomes (whether they are from isolates, single cell or metagenome data) using machine learning.

First, let’s gather all assembled genomes into one directory.

mkdir -p 7.genome_annotation/all_assembled_genomes

for i in {01..10}; do
  cp 6.assembly_evaluation/v3/scaffolds_"$i"_v3.m1000.fasta 7.genome_annotation/all_assembled_genomes/genome_"$i".fasta
done

1. Create a directory for CheckM2 files

mkdir 7.genome_annotation/genome_quality

2. Run CheckM2

Create SLURM script to run CheckM2.

nano checkm2.sh

and copy the following:

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      checkm2
#SBATCH --time          00:30:00
#SBATCH --mem           50GB
#SBATCH --cpus-per-task 10
#SBATCH --error         slurm_checkm2_%j.err
#SBATCH --output        slurm_checkm2_%j.out
#SBATCH --partition     milan

module purge >/dev/null 2>&1
module load CheckM2/1.0.1-Miniconda3

cd ~/mgsr 

srun checkm2 predict --threads 10  \
  -x fasta \
  --input 7.genome_annotation/all_assembled_genomes/ \
  --output-directory 7.genome_annotation/genome_quality/

Submit SLURM job.

sbatch checkm2.sh

Output files

CheckM2 generates several output files and directories that provide detailed metrics on genome quality.

predicted_quality.tsv: this table contains the estimated completeness and contamination for each genome, along with other quality metrics. Key columns include:

  • Completeness: Percentage of expected single-copy genes present in the genome. High completeness (>90%) indicates a near-complete genome.

  • Contamination: Percentage of duplicated single-copy genes, indicating contamination or strain heterogeneity. Low contamination (<5%) is ideal.

  • Strain Heterogeneity: A measure of strain-level variation. High values may indicate the presence of multiple closely related strains.

3. Create a subset of CheckM2 results for later use (visualisation)

cd 7.genome_annotation/genome_quality

# Define input and output files
input_file="quality_report.tsv"
output_file="genome_quality.txt"

# Process the file
awk 'BEGIN {OFS="\t"} 
     NR==1 {print "ID", $2, $3, $9; next} 
     {print $1, $2, $3, $12}' "$input_file" > "$output_file"

# Display the result
cat "$output_file"

2. Taxonomic classification

To assign taxonomy to the assembled genomes, we are using GTDB-Tk (Genome Taxonomy Database Toolkit), a bioinformatics tool designed for the taxonomic classification of bacterial and archaeal genomes. It uses the Genome Taxonomy Database (GTDB), which provides a standardized taxonomy based on genome phylogeny based on a set of conserved single-copy proteins.

1. Create a directory for GTDB-Tk files

cd ~/mgsr 

mkdir 7.genome_annotation/taxonomy

2. Run GTDB-Tk

Create SLURM script

nano taxonomy.sh

and copy the following:

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      gtdbtk
#SBATCH --time          02:00:00
#SBATCH --mem           140GB
#SBATCH --cpus-per-task 10
#SBATCH --error         slurm_gtdbtk_%j.err
#SBATCH --output        slurm_gtdbtk_%j.out
#SBATCH --partition     milan

module purge >/dev/null 2>&1
module load GTDB-Tk/2.4.0-foss-2023a-Python-3.11.6

cd ~/mgsr 

srun gtdbtk classify_wf \
  -x fasta \
  --cpus 10 \
  --genome_dir 7.genome_annotation/all_assembled_genomes/ \
  --skip_ani_screen \
  --out_dir 7.genome_annotation/taxonomy/

Submit SLURM job.

sbatch taxonomy.sh

Output files

GTDB-Tk generates several output files, including:

classification_summary.tsv: a summary of taxonomic classifications, including the most specific taxonomic rank achieved (e.g., species or genus). Key columns include:

  • user_genome: input genome filename.
  • classification: full taxonomic classification in GTDB format (e.g., d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; …).
  • fastani_reference: GTDB reference genome with the highest average nucleotide identity (ANI).
  • classification_method: method used for classification, typically based on ANI or phylogenetic placement. gtdbtk_out/classify/gtdbtk.bac120.summary.tsv: Detailed classification information for bacterial genomes, including ANI values and confidence scores. gtdbtk_out/classify/gtdbtk.ar122.summary.tsv: Detailed classification information for archaeal genomes.

3. Gene prediction and annotation

Here, we aim to identify protein-coding sequences, translate them into protein sequences, and generate functional annotations for each gene.

This step uses Prokka, a tool specifically desgined to find and annotate gene coding sequences and other genomic features from prokaryotes and viruses. It first uses Prodigal which is a very popular and robust protein-coding gene prediction tool that can handle draft genomes, then uses a variety of databases when trying to assign function to the predicted gene sequences.

1. Create a directory for gene prediction

Create a new directory for predicted genes files.

cd ~/mgsr 

mkdir 7.genome_annotation/gene_prediction

2. Reformat scaffold headers An issue that can occur when working with sequences assembled by SPAdes is the length of headers. Some tools used post-assembly will complain about SPAdes long headers and will not run successfully. We will use some of the files Prokka generated in the next section (as input for GenoVi specifically), where we would run into issues if we didn’t shorten headers before running Prokka.

cd 7.genome_annotation/all_assembled_genomes/

for file in genome_*.fasta; do
    base=$(basename "$file" .fasta)
    
    # Create a temporary file with updated headers
    sed -E "s/^>NODE_([0-9]+)_.*$/>${base}_NODE_\1/" "$file" > tmpfile && mv tmpfile "$file"
done

Now if we look at genome_01 for example, the first sequence header ‘>NODE_1_length_401218_cov_15.417427’ has been transformed to ‘>genome_01_NODE_1’.

3. Run Prokka

Write a SLURM script to automate Prokka for processing all assembled genomes.

cd ~/mgsr

nano gene_prediction.sh

Copy the following script in gene_prediction.sh

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      prokka
#SBATCH --time          00:15:00
#SBATCH --mem           1GB
#SBATCH --array         0-9
#SBATCH --cpus-per-task 10
#SBATCH --error         slurm_prokka_%j.err
#SBATCH --output        slurm_prokka_%j.out
#SBATCH --partition     milan

module purge >/dev/null 2>&1
module load prokka/1.14.5-GCC-9.2.0

declare -a array=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10") 

cd ~/mgsr 

srun prokka 7.genome_annotation/all_assembled_genomes/genome_${array[$SLURM_ARRAY_TASK_ID]}.fasta \
  --outdir 7.genome_annotation/gene_prediction \
  --prefix genome_${array[$SLURM_ARRAY_TASK_ID]} \
  --cpus 10

Submit SLURM job.

sbatch gene_prediction.sh

Output files

Prokka generates an array of new files per genome:

ls 7.genome_annotation/gene_prediction
    genome_01.err
    genome_01.faa
    genome_01.ffn
    genome_01.fna
    genome_01.fsa
    genome_01.gbk
    genome_01.gff
    genome_01.log
    genome_01.sqn
    genome_01.tbl
    genome_01.tsv
    genome_01.txt
    ...
Extension Description
.gff This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna Nucleotide FASTA file of the input contig sequences.
.faa Protein FASTA file of the translated CDS sequences.
.ffn Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.sqn An ASN1 format “Sequin” file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl Feature Table file, used by “tbl2asn” to create the .sqn file.
.err Unacceptable annotations - the NCBI discrepancy report.
.log Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the –quiet option was enabled.
.txt Statistics relating to the annotated features found.
.tsv Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product