Genome Annotation
Objectives
- Assessing genome quality
- Assigning taxonomy
- Predicting gene sequences
1. Genome quality
To assess the quality of the assembled genomes, we will use CheckM2, which uses machine learning to estimate the completeness and contamination of microbial genomes (whether they come from isolates, single cells, or metagenome data).
First, let’s gather all assembled genomes into one directory.
mkdir -p 7.genome_annotation/all_assembled_genomes
for i in {01..10}; do
cp 6.assembly_evaluation/v3/scaffolds_"$i"_v3.m1000.fasta 7.genome_annotation/all_assembled_genomes/genome_"$i".fasta
done
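As an optional sanity check, we can confirm that all 10 genome files were copied (the command below assumes we are still in ~/mgsr):
# Count the copied genomes; this should return 10
ls 7.genome_annotation/all_assembled_genomes/genome_*.fasta | wc -l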
1. Create a directory for CheckM2 files
mkdir 7.genome_annotation/genome_quality
2. Run CheckM2
Create SLURM script to run CheckM2.
nano checkm2.sh
and copy the following:
#!/bin/bash -e
#SBATCH --account nesi02659
#SBATCH --job-name checkm2
#SBATCH --time 00:30:00
#SBATCH --mem 50GB
#SBATCH --cpus-per-task 10
#SBATCH --error slurm_checkm2_%j.err
#SBATCH --output slurm_checkm2_%j.out
#SBATCH --partition milan
module purge >/dev/null 2>&1
module load CheckM2/1.0.1-Miniconda3
cd ~/mgsr
srun checkm2 predict --threads 10 \
-x fasta \
--input 7.genome_annotation/all_assembled_genomes/ \
--output-directory 7.genome_annotation/genome_quality/
Submit SLURM job.
sbatch checkm2.sh
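While the job is running, you can optionally keep an eye on it in the queue, and list the output directory once it has finished:
# Check the status of your running jobs
squeue -u $USER
# Once the checkm2 job has completed, list the CheckM2 outputs
ls 7.genome_annotation/genome_quality/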
Output files
CheckM2 generates several output files and directories that provide detailed metrics on genome quality.
quality_report.tsv: this table contains the estimated completeness and contamination for each genome, along with other quality metrics. Key columns include:
- Completeness: percentage of expected single-copy genes present in the genome. High completeness (>90%) indicates a near-complete genome.
- Contamination: percentage of duplicated single-copy genes, indicating contamination or strain heterogeneity. Low contamination (<5%) is ideal. A quick way to apply these two thresholds is shown after this list.
- Genome_Size, Contig_N50, GC_Content: additional assembly statistics that provide useful context when judging genome quality.
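As an optional quick check, the two thresholds above can be applied directly to the report with awk. This is a minimal sketch that assumes the report is tab-separated with Completeness and Contamination in columns 2 and 3 (the same columns used in the subset step below):
# List genomes with >90% completeness and <5% contamination
awk -F'\t' 'NR>1 && $2 > 90 && $3 < 5 {print $1, $2, $3}' 7.genome_annotation/genome_quality/quality_report.tsv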
3. Create a subset of CheckM2 results for later use (visualisation)
cd 7.genome_annotation/genome_quality
# Define input and output files
input_file="quality_report.tsv"
output_file="genome_quality.txt"
# Extract genome ID, completeness, contamination and genome size.
# Note: awk splits on any whitespace by default, and the Completeness_Model_Used
# value in the data rows contains spaces (e.g. "Neural Network (Specific Model)"),
# so Genome_Size is field 9 in the header line but field 12 in the data lines.
awk 'BEGIN {OFS="\t"}
NR==1 {print "ID", $2, $3, $9; next}
{print $1, $2, $3, $12}' "$input_file" > "$output_file"
# Display the result
cat "$output_file"
2. Taxonomic classification
To assign taxonomy to the assembled genomes, we are using GTDB-Tk (Genome Taxonomy Database Toolkit), a bioinformatics tool designed for the taxonomic classification of bacterial and archaeal genomes. It uses the Genome Taxonomy Database (GTDB), which provides a standardized taxonomy based on genome phylogeny inferred from a set of conserved single-copy proteins.
1. Create a directory for GTDB-Tk files
cd ~/mgsr
mkdir 7.genome_annotation/taxonomy
2. Run GTDB-Tk
Create SLURM script
nano taxonomy.sh
and copy the following:
#!/bin/bash -e
#SBATCH --account nesi02659
#SBATCH --job-name gtdbtk
#SBATCH --time 02:00:00
#SBATCH --mem 140GB
#SBATCH --cpus-per-task 10
#SBATCH --error slurm_gtdbtk_%j.err
#SBATCH --output slurm_gtdbtk_%j.out
#SBATCH --partition milan
module purge >/dev/null 2>&1
module load GTDB-Tk/2.4.0-foss-2023a-Python-3.11.6
cd ~/mgsr
srun gtdbtk classify_wf \
-x fasta \
--cpus 10 \
--genome_dir 7.genome_annotation/all_assembled_genomes/ \
--skip_ani_screen \
--out_dir 7.genome_annotation/taxonomy/
Submit SLURM job.
sbatch taxonomy.sh
Output files
GTDB-Tk generates several output files, including:
gtdbtk.bac120.summary.tsv and gtdbtk.ar53.summary.tsv: summaries of the taxonomic classifications for bacterial and archaeal genomes respectively, including the most specific taxonomic rank achieved (e.g., species or genus) along with ANI values to the closest reference genomes. These files are written to the classify/ subdirectory of the output directory (7.genome_annotation/taxonomy/classify/ in our case). Key columns include:
- user_genome: input genome filename.
- classification: full taxonomic classification in GTDB format (e.g., d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; …).
- fastani_reference: GTDB reference genome with the highest average nucleotide identity (ANI).
- classification_method: method used for classification, typically based on ANI or phylogenetic placement.
A quick way to inspect the classifications is shown after this list.
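For a quick per-genome overview, we can pull out just the genome name and its classification, which are the first two columns of the summary table. A minimal sketch, assuming the bacterial summary file sits in the classify/ subdirectory of the output directory used above:
# Show each genome name and its GTDB classification
cut -f1,2 7.genome_annotation/taxonomy/classify/gtdbtk.bac120.summary.tsv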
3. Gene prediction and annotation
Here, we aim to identify protein-coding sequences, translate them into protein sequences, and generate functional annotations for each gene.
This step uses Prokka, a tool specifically designed to find and annotate gene coding sequences and other genomic features in prokaryotes and viruses. It first runs Prodigal, a popular and robust protein-coding gene prediction tool that can handle draft genomes, then searches a variety of databases to assign a function to each predicted gene sequence.
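If you are curious which annotation databases the installed Prokka will search, it can list them for you (an optional check; it assumes the prokka module below is loaded):
module load prokka/1.14.5-GCC-9.2.0
prokka --listdb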
1. Create a directory for gene prediction
Create a new directory for predicted genes files.
cd ~/mgsr
mkdir 7.genome_annotation/gene_prediction
2. Reformat scaffold headers
An issue that can occur when working with sequences assembled by SPAdes is the length of the headers. Some tools used post-assembly will complain about SPAdes' long headers and will not run successfully. We will use some of the files Prokka generates in the next section (specifically as input for GenoVi), and we would run into issues there if we didn't shorten the headers before running Prokka.
cd 7.genome_annotation/all_assembled_genomes/
for file in genome_*.fasta; do
base=$(basename "$file" .fasta)
# Create a temporary file with updated headers
sed -E "s/^>NODE_([0-9]+)_.*$/>${base}_NODE_\1/" "$file" > tmpfile && mv tmpfile "$file"
done
Now, if we look at genome_01 for example, the first sequence header ‘>NODE_1_length_401218_cov_15.417427’ has been transformed into ‘>genome_01_NODE_1’.
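As a quick optional check that the renaming worked across the file, we can print the first few headers of one of the genomes (we are still inside all_assembled_genomes/ at this point):
# Print the first three sequence headers of genome_01
grep "^>" genome_01.fasta | head -n 3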
3. Run Prokka
Write a SLURM script to automate Prokka for processing all assembled genomes.
cd ~/mgsr
nano gene_prediction.sh
Copy the following script in gene_prediction.sh
#!/bin/bash -e
#SBATCH --account nesi02659
#SBATCH --job-name prokka
#SBATCH --time 00:15:00
#SBATCH --mem 1GB
#SBATCH --array 0-9
#SBATCH --cpus-per-task 10
#SBATCH --error slurm_prokka_%j.err
#SBATCH --output slurm_prokka_%j.out
#SBATCH --partition milan
module purge >/dev/null 2>&1
module load prokka/1.14.5-GCC-9.2.0
declare -a array=("01" "02" "03" "04" "05" "06" "07" "08" "09" "10")
cd ~/mgsr
srun prokka 7.genome_annotation/all_assembled_genomes/genome_${array[$SLURM_ARRAY_TASK_ID]}.fasta \
--outdir 7.genome_annotation/gene_prediction \
--prefix genome_${array[$SLURM_ARRAY_TASK_ID]} \
--force \
--cpus 10
Submit SLURM job.
sbatch gene_prediction.sh
Output files
Prokka generates an array of new files per genome:
ls 7.genome_annotation/gene_prediction
genome_01.err
genome_01.faa
genome_01.ffn
genome_01.fna
genome_01.fsa
genome_01.gbk
genome_01.gff
genome_01.log
genome_01.sqn
genome_01.tbl
genome_01.tsv
genome_01.txt
...
| Extension | Description |
| --- | --- |
| .gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
| .gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. |
| .fna | Nucleotide FASTA file of the input contig sequences. |
| .faa | Protein FASTA file of the translated CDS sequences. |
| .ffn | Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA). |
| .sqn | An ASN1 format “Sequin” file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
| .fsa | Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
| .tbl | Feature Table file, used by “tbl2asn” to create the .sqn file. |
| .err | Unacceptable annotations - the NCBI discrepancy report. |
| .log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. |
| .txt | Statistics relating to the annotated features found. |
| .tsv | Tab-separated file of all features: locus_tag, ftype, len_bp, gene, EC_number, COG, product |
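Because the .tsv file lists one feature per line with its type (ftype) in the second column, it is handy for quick summaries. A minimal sketch, assuming we are back in ~/mgsr, that tallies the feature types predicted for genome_01:
# Count each feature type (CDS, tRNA, rRNA, ...), skipping the header line
cut -f2 7.genome_annotation/gene_prediction/genome_01.tsv | tail -n +2 | sort | uniq -c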