APPENDIX (ex11): Normalise per-sample coverage values by average library size (example)

Having generated per-sample coverage values, it is usually necessary to also normalise these values across samples of differing sequencing depth. Commonly, this is done by normalising to minimum or average library size alone.

In this case, the mock metagenome data we have been working with are already of equal depth, and so this is an unnecessary step for the purposes of this workshop. The steps below are provided for future reference as an example of one way in which the cov_table.txt output generated by jgi_summarize_bam_contig_depths above could then be normalised based on average library size.

Normalise to average read depth via the python script norm_jgi_cov_table.py

The script norm_jgi_cov_table.py is available in the folder 10.gene_annotation_and_coverage/, and is also available for download for future reference at this link.

Note

This script was developed as a simple example for this workshop. It has not yet been widely tested: it is recommended in early usage to manually check a few values to ensure the conversions in the output file are as expected.

In brief, this python script leverages the fact that the standard error output from bowtie2 includes read counts for each sample. These were saved in mapping_filtered_bins_<ADD JOB ID>.err by the slurm script submitted for the read mapping step. Since we know the order in which bowtie2 processed the samples (based on the loop we provided: for i in sample1 sample2 sample3 sample4), we know that the read count lines will appear in the same order in the error file. The script therefore iterates through these lines, extracting each sample's read count, and uses these values to calculate the average read depth across all samples. Coverage values (in each of the *.bam columns) are then normalised for each sample as: (coverage / sample_read_depth) * average_read_depth. Finally, the result is saved to a new file with the prefix 'normalised_' (e.g. normalised_bins_cov_table.txt).
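As a rough sketch of the extraction and averaging steps the script performs (assuming bowtie2's standard summary lines of the form "<N> reads; of these:" appear once per sample in the error file; the file name and counts below are made up for illustration):

```shell
# Hypothetical bowtie2 error log with two samples, for illustration only
err_file=mapping_filtered_bins_example.err
printf '500000 reads; of these:\n  ...\n400000 reads; of these:\n  ...\n' > "$err_file"

# bowtie2 prints one summary per run, so the counts appear in the same
# order as the samples in the mapping loop
grep ' reads; of these:' "$err_file" | cut -f 1 -d ' ' > read_counts.txt

# Average read depth across samples
avg_depth=$(awk '{ sum += $1 } END { print sum / NR }' read_counts.txt)
echo "$avg_depth"    # (500000 + 400000) / 2 = 450000

# Each coverage value would then be rescaled as:
#   normalised = (coverage / sample_read_depth) * avg_depth
```
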

Note

In your own work, if you instead chose to use BBMap (and BBMap's covstats output) for the previous coverage calculation step, read counts can similarly be extracted from the scafstats output by searching for the line "Reads Used: ...".
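For example (assuming the stats file contains a line of the form "Reads Used: <N>"; check the exact spacing in your own BBMap output, and note the file name below is made up for illustration):

```shell
# Hypothetical scafstats-style file, for illustration only
printf 'Reads Used: 1099984\n' > sample1_scafstats.txt

# Pull out the read count following "Reads Used:"
reads_used=$(grep 'Reads Used' sample1_scafstats.txt | grep -o '[0-9][0-9]*')
echo "$reads_used"    # 1099984
```
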

norm_jgi_cov_table.py requires two input arguments and takes an optional output path argument.

  • -c: Coverage table generated by jgi_summarize_bam_contig_depths
  • -e: Standard error file created by read mapping via bowtie2
  • -o: (Optional) Path to output directory (must already exist; defaults to the current directory)

Run norm_jgi_cov_table.py for microbial bins data and viral contigs, inputting:

  • A. the bins_cov_table.txt and mapping_filtered_bins_<ADD JOB ID>.err files
  • B. the viruses_cov_table.txt and mapping_filtered_viruses_<ADD JOB ID>.err files.

This will generate the outputs normalised_bins_cov_table.txt and normalised_viruses_cov_table.txt.

Note

If this python script is in your current directory, you can call it simply by adding ./ in front of the script name. If you have saved the script elsewhere, you will need to provide the absolute path to the script, or add the script's location to your PATH.

code

module purge
module load Python/3.8.2-gimkl-2020a

./norm_jgi_cov_table.py -c bins_cov_table.txt -e mapping_filtered_bins_<ADD JOB ID>.err
./norm_jgi_cov_table.py -c viruses_cov_table.txt -e mapping_filtered_viruses_<ADD JOB ID>.err

What if I only have BAM files?

You can still normalise your data with BAM files alone, but this involves an additional step: obtaining library size/read depth information from the BAM files using SAMtools.

1. Obtain library size information

We can extract this information from the output of the samtools flagstat command.

code

module purge
module load SAMtools/1.15.1-GCC-11.3.0

for i in bin_coverage/*.bam; do
  filename=$(basename "$i")
  # Library size = total reads (first field of flagstat's first line) / 2
  libsize=$(($(samtools flagstat "$i" | head -n 1 | cut -f 1 -d ' ') / 2))
  printf "%s\t%d\n" "$filename" "$libsize" >> libsize.txt
done

Most of the commands in the above code block should be familiar to you. Here, we use a loop to go through all the BAM files (mapping outputs from bowtie2). Something that might be new here is $(($(samtools flagstat sample1.bam | head -n 1 | cut -f 1 -d ' ')/2)). Let's go through the code chunk by chunk, starting inside and working our way outwards:

  • samtools flagstat $i: This calls the flagstat subcommand, which provides some stats about our mapping. If you were to run it on sample1.bam, the results would look like the following (notice in the first line that the first number is double that of read 1 or read 2):

    flagstat output

    1099984 + 0 in total (QC-passed reads + QC-failed reads)
    1099984 + 0 primary
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    0 + 0 primary duplicates
    1002053 + 0 mapped (91.10% : N/A)
    1002053 + 0 primary mapped (91.10% : N/A)
    1099984 + 0 paired in sequencing
    549992 + 0 read1
    549992 + 0 read2
    549784 + 0 properly paired (49.98% : N/A)
    998676 + 0 with itself and mate mapped
    3377 + 0 singletons (0.31% : N/A)
    6466 + 0 with mate mapped to a different chr
    6267 + 0 with mate mapped to a different chr (mapQ>=5)
    
  • head -n 1: We only want the first line of this output.

  • cut -f 1 -d ' ': We only want the first column (i.e. the first number) of this space-delimited line.

  • $(( )): Anything enclosed in these double round brackets will be parsed arithmetically. If you were to run echo $((1+2)), it would print 3. Here, we divide the output of samtools flagstat ... by two (notice the /2 at the end).

Altogether, for each BAM file we take the first number from the flagstat output and divide it by two to obtain the library size, which is written out to libsize.txt.

2. Normalise and scale contig coverage

We then use the coverage table and library size files as inputs for a custom R script, normalise_jgi_cov.r (you can also find it here). We can call this script via the command line like so:

code

module purge
module load R/4.2.1-gimkl-2022a

./normalise_jgi_cov.r bins_cov_table.txt libsize.txt

This script will generate an output file in the current directory, with the normalised_ prefix added to the coverage table file name. It will also warn you if sample names in the two input files are unequal in number or do not match.
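Conceptually, the rescaling is the same as before: (coverage / library_size) * mean_library_size. The following is a minimal awk sketch of that idea using made-up numbers, not the R script's actual logic:

```shell
# Toy libsize file: two samples with library sizes 100 and 200 (mean = 150)
printf 'sample1.bam 100\nsample2.bam 200\n' > toy_libsize.txt

# First pass reads the library sizes; second pass rescales each coverage
# value as (coverage / library_size) * mean_library_size
awk 'NR == FNR { size[$1] = $2; total += $2; n++; next }
     { mean = total / n; print $1, ($2 / size[$1]) * mean }' \
    toy_libsize.txt - > toy_normalised.txt <<'EOF'
sample1.bam 30
sample2.bam 30
EOF

cat toy_normalised.txt
# sample1.bam 45
# sample2.bam 22.5
```

The same raw coverage of 30 becomes 45 in the shallower sample and 22.5 in the deeper one, reflecting that equal raw coverage means relatively more signal in a smaller library.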