Introduction to binning¶
Objectives
Remove short contigs from the data set¶
Ideally, we do not want to be creating bins from all of the assembled contigs, as there is often a long tail of contigs which are only several k-mers long. These have little biological meaning, as they are too short for robust gene annotation, and they can introduce a significant degree of noise in the clustering algorithms used for binning. We therefore identify a suitable threshold for a minimum length of contigs to be considered for binning.
We have already done this in the previous exercise so we could either use the existing filtering at 1,000 bp in length, or move to something stricter. Most binning tools have a default cut-off for minimum contig size - MetaBAT
uses a default minimum of 2,500 bp, and recommends at least 1,500 bp. By contrast, MaxBin
sets the minimum length at 1,000 bp.
Obtain coverage profiles for assembled contigs via read mapping¶
Binning is done using a combination of information encoded in the composition and coverage of the assembled contigs. Composition refers to k-mer (usually tetranucleotide) frequency profiles of the contigs, which are generally conserved within a genome. By contrast, coverage is a reflection of the abundance of the contigs in the assembly. Organisms which are more abundant will contribute more genomic material to the metagenome, and hence their DNA will be, on average, more abundant in the sample. When binning, we can look for pieces of DNA which are not assembled together, but have similar composition and occur at approximately equal abundances in the sample to identify contigs which likely originate in the same genome.
The composition of the contigs is calculated by the binning tool at run time, but to obtain coverage information we must map our unassembled reads from each sample against the assembly to generate the differential abundance profiles for each contig. This is achieved using bowtie2
to map the reads against the assembly, then samtools
to sort and compress the resulting file.
Create a mapping index¶
Before we can map reads, we need to create a bowtie2
index file from the assembly, for use in read mapping. Navigate into the 5.binning/
folder to begin.
code
If you look inside the spades_assembly/
folder you will now see the following:
Terminal output
These files ending in .bt2 are the index files for bowtie2
, and are specific to this tool. If you wish to map using an alternate tool (for example bowtie
or BBMap
) you will need to create index/database files using these programs.
Generally speaking, we don't need to know the names of the index files, as they are simply referred to be the output name we specified (bw_spades) when running bowtie2
.
Mapping the reads¶
We will create a slurm script to perform the mapping steps, as these benefit greatly from the multithreaded capacity of NeSI and we will use a for loop to iterate over each set of reads to simplify our script.
The full script is provided here, and we will discuss it below.
Open a new script using nano:
Remember to update <YOUR FOLDER>
to your own folder.
code
Step 1 - Loop through the sample files¶
Since we just want to perform the same set of operations on each file, we can use a for loop to repeat each operation on a set of files. The structure of the loop, and use of variables was covered on the first day.
For large sets of files, it can be beneficial to use a slurm array to send the jobs out to different nodes and distribute the process across many independent jobs. An example of how we could modify the above script is given at the bottom of this exercise, but is not necessary for the purposes of this workshop.
Step 2 - Map the reads using bowtie2¶
This is performed using the following parameters
Parameter |
Function |
---|---|
--minins |
Minimum insert size, determines the minimum distance between the start of each read pair |
--maxins |
Maximum insert size, determines the maximum distance between the start of each read pair |
--threads |
Number of threads to use in read mapping |
--sensitive |
Specifies where we want to be positioned in the trade-off between speed and sensitivity. See the manual for more information |
-x |
The base name of our assembly index. Should be exactly the same as what was specified when running bowtie2-build |
-1 / -2 |
The forward and reverse read pairs to map to the assembly |
-S |
Name of the output file, to be written in sam format |
Step 3 - Sort and compress results¶
The default output format for most mapping tools is the Sequence Alignment/Map (sam) format. This is a compact text representation of where each short read sits in the contigs. You can view this file using any text viewer, although owing to the file size less
is a good idea.
Generally I wouldn't bother with this - there is a lot of information in here and unless you are looking to extract specific information from the alignment directly, this is just an intermediate file in our workflow. In order to save disk space, and prepare the file for downstream analysis we now perform two final steps:
- Sort the mapping information
- Compress the sam file into its binary equivalent, bam
Which is achieved with the following parameters
Parameter | Function |
---|---|
sort |
Subcommand for samtools to invoke the sort operation |
-@ |
Number of threads to use for sorting and compressing |
-o |
Output file name. When we specify the bam extension samtools automatically compresses the output |
Do not open a .bam
file, your terminal might crash!
Compressing the sam files to bam format is an important step as sam files can be massive. It is also helpful to sort the mapping information so that reads mapped to a contig are listed in order of their start position. For example
Reads will initially be mapped in an unsorted order, as they are added to the sam file in more or less the same order as they are encountered in the original fastQ files.
Sorting the mapping information is an important prerequisite for performing certain downstream processes. Not every tool we use requires reads to be sorted, but it can be frustrating having to debug the instances where read sorting matters, so we typically just get it done as soon as possible and then we don't have to worry about it again.
(Optional) Read mapping using an array¶
If you have a large number of files to process, it might be worth using a slurm array to distribute your individual mapping jobs across many separate nodes. An example script for how to perform this is given below. We do not need to use an array for read mapping in this workshop, but we will revisit array jobs in further lessons.
Open a new script using nano:
Remember to update <YOUR FOLDER>
to your own folder.
code