Manual bin refinement¶
Objectives
- Preparing input files for VizBin
- Projecting a t-SNE using VizBin and examining bin clusters
- Picking refined bins
- Exporting VizBin clusters
- Exporting potentially problematic contigs
- (Optional) Refining and filtering problematic contigs from bins
- (Optional) Comparing pre- and post-filtered bins via CheckM
- (Optional) Scripts for processing data with ESOMana
Preparing input files for VizBin¶
What is VizBin?
VizBin is a handy, GUI-based tool for creating ordinations of our binning data. It uses the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm to project high-dimensional data down into a 2D plot that preserves clustering information. There's a really good video on YouTube that explains how the algorithm works in high-level terms, but for our purposes you can think of it as an approach similar to a PCA or NMDS.
On its own, VizBin takes a set of contigs and performs the t-SNE projection using compositional data. We can optionally provide files that annotate contigs as belonging to particular bins, and a file that adds coverage data to be considered when clustering. Unfortunately, at this stage VizBin only allows a single coverage value per contig, which is not ideal. This is because VizBin only uses coverage as a means to modify the visualisation, not the ordination itself. It is possible to create your own t-SNE projection using multiple coverage values, but this is beyond the scope of today's exercise; here we will provide VizBin with coverage values for sample1 only.
The only required input file for VizBin is a single .fna file of the concatenated bins. An additional annotation file containing per-contig coverage values and bin IDs can also be provided. Colouring contigs by bin is a really effective way to spot areas that might need refinement.
In the interests of time today, the input files have been generated and are provided in the 6.bin_refinement/ folder:
- all_bins.fna is a concatenation of the bins of fragmented sub-contigs (fragmented to 20k)
- all_bins.sample1.vizbin.ann is the annotation file containing per-subcontig coverage, label (bin ID), and length values.
Contig fragments as input for VizBin
When running VizBin, it is often preferable to split long contigs into smaller pieces in order to increase the density of clustering in the t-SNE. The data we are working with today are based on our bins output by DAS_Tool in the last binning exercise, but have been further processed using the cut_up_fasta.py script that comes with the binning tool CONCOCT to cut long contigs into 20k fragments. When reviewing our VizBin plots and outputs, it is important to remember that here we are looking at the fragmented sub-contigs, rather than the full complete contigs (the importance of this will become clear when we review vb_count_table.txt later in this exercise).
For future reference, and for working with your own data, a step-by-step process for generating these files from the dereplicated bins produced by DAS_Tool has been provided as an Appendix.
For this section, we will be working within 6.bin_refinement/. Let's first have a quick look at the annotation file.
This file is a comma-delimited table (CSV) that presents the information in the way that VizBin expects it. The order of rows in this file corresponds to the order of contigs in the concatenated FASTA file of our fragmented bins, all_bins.fna.
Create a few variations of the .ann file with various columns removed, in order to examine the different outputs they can generate.
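As a self-contained sketch, these variants can be made with cut. A toy file stands in for all_bins.sample1.vizbin.ann here, and the column order (coverage, label, length, as described above) and header names are assumptions:

```bash
# Toy stand-in for all_bins.sample1.vizbin.ann (column order as described above)
printf 'coverage,label,length\n12.5,bin_0,20000\n8.2,bin_1,20000\n' > example.vizbin.ann

# 'Bin only' variant: keep just the label column
cut -d ',' -f 2 example.vizbin.ann > example.label_only.ann

# Coverage and label variant: drop the length column
cut -d ',' -f 1,2 example.vizbin.ann > example.cov_label.ann
```

Run the same cut commands on all_bins.sample1.vizbin.ann itself to generate the variants to load into VizBin.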
Projecting a t-SNE and examining bin clusters¶
We can now use these files in VizBin to curate the contigs in our bins. We will load and view the data in a few different steps.
For this exercise, we will be using the Virtual Desktop on NeSI which allows us to use programmes with graphical user interfaces (GUI) within the NeSI computing environment.
Initiate the Virtual Desktop
- Click on File on the top left corner of the Jupyter Hub, and then select New Launcher. You can also click on the small + on the tabbed bar next to your terminal tab.
- Look for the Virtual Desktop icon and then click on the centre of the icon.
- A new browser tab named 'TurboVNC:...' should appear and the Virtual Desktop should load.
- A successful instance of the Virtual Desktop should look like a desktop environment.
Running VizBin: local vs remote
Running VizBin remotely (e.g. within NeSI) can be slow with full data sets. Running a GUI program such as VizBin remotely can also require additional set-up on some PCs. For day-to-day work, we recommend installing VizBin on your local machine and downloading the relevant input files (e.g. via scp ...) to run locally.
Initiate VizBin within the Virtual Desktop environment¶
- In the Virtual Desktop, click on the terminal icon.
- VizBin is a Java programme, so we will need to load the Java module for it to work. In the terminal, type the following to load the Java module.
Copy/paste in the Virtual Desktop
You will not be able to copy text from outside the Virtual Desktop and paste it into the Virtual Desktop, so you may need to type these commands manually.
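The exact module name and version depend on the NeSI environment at the time, so treat this as a sketch (you can check available versions with `module spider Java`):

```bash
# Load Java so the VizBin jar can run (module name/version is an assumption)
module purge
module load Java
```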
- In the terminal, navigate to the directory where the VizBin Java file resides. Remember to replace YOUR FOLDER with your user name.
- Type the following into your Virtual Desktop terminal to initiate VizBin.
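A sketch of the navigation and launch commands; the working directory follows the path used earlier in this exercise, and the jar file name VizBin-dist.jar is an assumption (use the actual name of the VizBin jar on your system):

```bash
# Move to the folder containing the VizBin jar; replace <YOUR FOLDER> with your user name
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/6.bin_refinement/

# Launch the VizBin GUI (jar file name is an assumption)
java -jar VizBin-dist.jar
```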
- A successful launch of VizBin will look like the following:
Load input files¶
Once VizBin is open, to get started, click the 'Choose...' button, then navigate to the FASTA file all_bins.fna.
VizBin directory
It is likely that when we click on the 'Choose...' button, the directory that opens will be our NeSI home directory. In that case, we can type the path to the input FASTA file into the 'File Name' bar: /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/6.bin_refinement/all_bins.fna
Once this is imported, use the 'Show additional options' button to expose the advanced options, and add your 'bin only' .ann file into the 'Annotation file (optional)' field.
Executing the t-SNE¶
For now leave all other parameters as default. Click the 'Start' button to begin building the ordination. When it completes, you should see an output similar to the following:
Contigs coloured by bin¶
Additional annotations by length and coverage
If you input all_bins.sample1.vizbin.ann as your annotation file, you can see that the visualisation takes contig length (represented by point size) and coverage (represented by opacity) into account.
Similar to other projection techniques, we interpret the closeness of points as a proxy for how similar they are, and because of our .ann file we can see which contigs belong to the same bin.
Picking refined bins¶
We can use the interactive GUI to pick the boundaries of new bins, or to identify contigs which we do not believe should be retained in the data. Have a play around with the interface, testing out the following commands:
- Left-click and drag: Highlight an area of the ordination to zoom into
- Right-click, 'Zoom Out', 'Both Axes': Reset the view
- Left-click several points: Create a selection of contigs to extract from the data
- Right-click, 'Selection', 'Export': Save the selected contigs into a new file
- Right-click, 'Selection', 'Clear selection': Clear the current selection
How you proceed in this stage is up to you. You can either select bins based on their boundary, and call these the refined bins. Alternatively, you could select outlier contigs and examine these in more detail to determine whether or not they were correctly placed into the bin. Which way you proceed really depends on how well the ordination resolves your bins, and it might be that both approaches are needed.
Today, we will run through an example of selecting potentially problematic (sub)contigs, and then deciding whether or not we want to filter these contigs out of our refined bins. We can use a combination of VizBin and seqmagick to remove contigs from bins where we do not trust the placement of the contig. We are aiming to reduce each bin to a trusted set of contigs.
1. Export VizBin clusters¶
First, for each VizBin cluster, select the area around the cluster (via multiple left-clicks around the cluster), then right-click, 'Selection', 'Export'. Save this output as cluster_1.fna.
Try this for one or two clusters. In practice, we would do this for each VizBin cluster, saving each as a new cluster_n.fna file.
Highlight a cluster to zoom into¶
Select the cluster to export¶
Left-click several points around the cluster
Export the cluster¶
Right-click, 'Selection', 'Export'. Save the output as cluster_1.fna.
2. Export potentially problematic contigs¶
Select problematic contigs to examine¶
Zoom in, make a selection of potentially problematic contigs, and export as above.
Try this for one or two problematic contigs (or subsets of contigs). In practice, you could repeat this for all potentially problematic contigs, saving each export as a new contigs_n.fna file.
Note
For the subsequent step using vizbin_count_table_2022.sh, all exported cluster files must share a common prefix (e.g. cluster...fna), and all files of problematic contigs must also share a common prefix (e.g. contigs...fna).
(Optional) Refining and filtering problematic contigs from bins¶
Create a count table of our problematic contigs across each VizBin cluster¶
You'll recall that, prior to running VizBin, the contigs in our bins were first cut into fragments to improve the density of the clusters in the t-SNE projection. As such, the problematic contigs we have exported from VizBin are sub-contig fragments, rather than full contigs from our bins. It is entirely possible that different fragments of the same original contig have been placed in different clusters during our VizBin analysis, including cases where most sub-contigs have clustered with the bin we expect and a small number have been identified as "problematic" (i.e. clustered with other bins). Based on the information from these extracted problematic sub-contigs, we now have to carefully consider whether or not we want to remove the full contig from our bin data.
To do this, we will generate a table containing each exported "problematic" sub-contig, with counts of how many of its sister sub-contigs (the other sub-contig fragments derived from the same original parent contig) fall into each VizBin cluster.
For this exercise, a folder of the exported files from VizBin for all clusters (cluster_[1-n].fna) and problematic sub-contigs (contigs_[1-n].fna) has been provided at vizbin_example_exports/.
We will input these files to the shell script vizbin_count_table_2022.sh to generate a count table of the exported sub-contigs across each VizBin cluster (vb_count_table.txt), as well as a working list of contigs to potentially remove from our final bin data (vb_omit_contigs_tmp.txt).
For future reference, a copy of this script is available for download here.
The only required input to vizbin_count_table_2022.sh is the path to the cluster and contigs files exported from VizBin. By default, the script looks for the prefix cluster... for the cluster file names, contig... for the files of problematic sub-contigs, and the file extension .fna for each. The arguments -s <contig_file_prefix> -c <cluster_file_prefix> -e <fasta_file_extension> can optionally be provided if your file name formats differ from the defaults.
View the output count table:
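One plausible invocation is sketched below; how the path is passed (positional vs. flagged) is an assumption, so check the script's own usage message if it differs:

```bash
# Generate vb_count_table.txt and vb_omit_contigs_tmp.txt from the VizBin exports
bash vizbin_count_table_2022.sh vizbin_example_exports/

# View the resulting count table
less vb_count_table.txt
```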
Example excerpt:
| Subcontig_ID | Subcontig_vb_cluster | cluster_1_count | cluster_2_count | cluster_3_count | cluster_4_count | cluster_5_count | Total_count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| >bin_3_NODE_81_length_109410_cov_1.136244.concoct_part_1 | cluster_5 | 0 | 0 | 0 | 4 | 1 | 5 |
| >bin_3_NODE_289_length_18049_cov_1.596107.concoct_part_0 | cluster_5 | 0 | 0 | 0 | 0 | 1 | 1 |
| >bin_3_NODE_349_length_12681_cov_1.204936.concoct_part_0 | cluster_5 | 0 | 0 | 0 | 0 | 1 | 1 |
Note that in the case of the first contig from the excerpt above, the 'problematic' sub-contig is only one of five sub-contigs, and the other four all fall in the expected cluster. In this case, we likely do not want to remove this contig from the bin.
Generate a list of contigs to exclude from filtering¶
Create a list of contigs, identified from vb_count_table.txt, that should not be filtered out by seqmagick in the next step. For example, contigs that have sub-contigs split across multiple VizBin clusters but that are reasonable to keep: such as when a flagged sub-contig exported from VizBin falls in one unexpected cluster, while all other sub-contigs from the same parent contig are in the expected cluster. In that case, you likely don't want to filter the parent contig out of the data set moving forward.
Below is an example. Simply replace the contig IDs between the quotes for as many lines as necessary for your data.
Write/append outputs and extracting contig IDs
- The first line below must always have only one > character, while all subsequent lines must have two (i.e. >>) to append correctly to the list.
- We want the original contig ID here, not the sub-contig ID, so make sure to remove the .concoct_part_n fragment number at the end if there is one.
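A sketch of building such a list (the file name vb_keep_contigs.txt is an assumption; the first ID is the parent contig of the flagged sub-contig in the table excerpt above, with its .concoct_part_1 suffix removed, and the second is a placeholder to replace with your own):

```bash
# First line: create the list with a single '>'
echo "bin_3_NODE_81_length_109410_cov_1.136244" > vb_keep_contigs.txt

# Subsequent lines: append with '>>'
echo "another_parent_contig_ID" >> vb_keep_contigs.txt
```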
Create the final vb_omit_contigs_filtered.txt list of contigs to filter from bins¶
Using grep, filter the contigs we wish to keep (after assessing vb_count_table.txt) out of the working vb_omit_contigs_tmp.txt list.
This creates vb_omit_contigs_filtered.txt, which we will then pass to seqmagick to filter these contigs out of our actual bin FASTA files.
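A self-contained sketch of the grep step, using toy stand-in lists so it can be run anywhere:

```bash
# Toy stand-ins: a working omit list and a list of contigs to keep
printf 'contig_A\ncontig_B\ncontig_C\n' > toy_omit_tmp.txt
printf 'contig_B\n' > toy_keep.txt

# Drop the 'keep' contigs from the omit list
# -v: invert match; -F: fixed strings; -x: whole-line match; -f: read patterns from file
grep -v -F -x -f toy_keep.txt toy_omit_tmp.txt > toy_omit_filtered.txt
```

On the real data the same pattern applies, with your keep list, vb_omit_contigs_tmp.txt as input, and vb_omit_contigs_filtered.txt as output.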
Filter suspect contigs (based on the VizBin analysis) from the bin data¶
Use seqmagick convert --exclude-from-file ... to filter problematic contigs (those listed in vb_omit_contigs_filtered.txt) out of the initial unchopped bin FASTA files, generating final bins for downstream processing.
```bash
mkdir filtered_bins/

# Load seqmagick
module purge
module load seqmagick/0.8.4-gimkl-2020a-Python-3.8.2

# Filter problematic contigs out of the original bin files
for bin_file in example_data_unchopped/*.fna; do
    bin_name=$(basename ${bin_file} .fna)
    seqmagick convert --exclude-from-file vb_omit_contigs_filtered.txt ${bin_file} filtered_bins/${bin_name}.filtered.fna
done
```
Our filtered bins for downstream use are now in filtered_bins/.
(Optional) Comparing pre- and post-filtered bins via CheckM¶
The end goal of this process is the generation of a final set of refined bins. Following this, the CheckM procedure should be re-run, this time on the refined filtered_bins/. This provides CheckM metrics for the final (filtered) bin set, and also an opportunity to compare pre- and post-filtering results to see whether the VizBin bin refinement steps have, for example, reduced the degree of contamination in the bins.
For this exercise, a copy of the output from running CheckM on the filtered_bins/ is available at 6.bin_refinement/filtered_bins_checkm.txt. View the previous CheckM output and the filtered bins output via cat to compare them.
Content of filtered_bins_checkm.txt
| Bin Id | Marker lineage | # genomes | # markers | # marker sets | 0 | 1 | 2 | 3 | 4 | 5+ | Completeness | Contamination | Strain heterogeneity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bin_0.filtered | k__Bacteria (UID3060) | 138 | 338 | 246 | 1 | 327 | 9 | 1 | 0 | 0 | 99.59 | 3.79 | 0.00 |
| bin_1.filtered | k__Bacteria (UID3060) | 138 | 338 | 246 | 1 | 336 | 1 | 0 | 0 | 0 | 99.59 | 0.41 | 0.00 |
| bin_2.filtered | g__Staphylococcus (UID301) | 45 | 940 | 178 | 20 | 918 | 2 | 0 | 0 | 0 | 98.32 | 0.11 | 0.00 |
| bin_3.filtered | c__Betaproteobacteria (UID3959) | 235 | 414 | 211 | 1 | 408 | 5 | 0 | 0 | 0 | 99.97 | 0.90 | 0.00 |
| bin_4.filtered | c__Deltaproteobacteria (UID3218) | 61 | 284 | 169 | 10 | 274 | 0 | 0 | 0 | 0 | 94.08 | 0.00 | 0.00 |
| bin_5.filtered | o__Pseudomonadales (UID4488) | 185 | 813 | 308 | 25 | 787 | 1 | 0 | 0 | 0 | 96.87 | 0.11 | 0.00 |
| bin_6.filtered | k__Bacteria (UID2565) | 2921 | 149 | 91 | 11 | 136 | 2 | 0 | 0 | 0 | 90.66 | 0.61 | 0.00 |
| bin_7.filtered | p__Cyanobacteria (UID2143) | 129 | 471 | 367 | 0 | 470 | 1 | 0 | 0 | 0 | 100.00 | 0.14 | 0.00 |
| bin_8.filtered | f__Bradyrhizobiaceae (UID3695) | 47 | 693 | 296 | 3 | 690 | 0 | 0 | 0 | 0 | 99.47 | 0.00 | 0.00 |
| bin_9.filtered | g__Vibrio (UID4878) | 67 | 1130 | 369 | 4 | 1125 | 1 | 0 | 0 | 0 | 99.46 | 0.03 | 0.00 |
Content of checkm.txt
| Bin Id | Marker lineage | # genomes | # markers | # marker sets | 0 | 1 | 2 | 3 | 4 | 5+ | Completeness | Contamination | Strain heterogeneity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| maxbin.001.fasta | k__Bacteria (UID3060) | 138 | 338 | 246 | 1 | 327 | 9 | 1 | 0 | 0 | 99.59 | 3.79 | 0.00 |
| maxbin.002.fasta | k__Bacteria (UID3060) | 138 | 338 | 246 | 1 | 336 | 1 | 0 | 0 | 0 | 99.59 | 0.41 | 0.00 |
| metabat.10_sub | g__Staphylococcus (UID301) | 45 | 940 | 178 | 20 | 918 | 2 | 0 | 0 | 0 | 98.32 | 0.11 | 0.00 |
| metabat.11 | c__Betaproteobacteria (UID3959) | 235 | 414 | 211 | 1 | 408 | 5 | 0 | 0 | 0 | 99.97 | 0.90 | 0.00 |
| metabat.12 | c__Deltaproteobacteria (UID3218) | 61 | 284 | 169 | 10 | 274 | 0 | 0 | 0 | 0 | 94.08 | 0.00 | 0.00 |
| metabat.2 | o__Pseudomonadales (UID4488) | 185 | 813 | 308 | 25 | 787 | 1 | 0 | 0 | 0 | 96.87 | 0.11 | 0.00 |
| metabat.3_sub | k__Bacteria (UID2565) | 2921 | 149 | 91 | 11 | 136 | 2 | 0 | 0 | 0 | 90.66 | 0.61 | 0.00 |
| metabat.4 | p__Cyanobacteria (UID2143) | 129 | 471 | 367 | 0 | 470 | 1 | 0 | 0 | 0 | 100.00 | 0.14 | 0.00 |
| metabat.5 | f__Bradyrhizobiaceae (UID3695) | 47 | 693 | 296 | 3 | 690 | 0 | 0 | 0 | 0 | 99.47 | 0.00 | 0.00 |
| metabat.7 | g__Vibrio (UID4878) | 67 | 1130 | 369 | 4 | 1125 | 1 | 0 | 0 | 0 | 99.46 | 0.03 | 0.00 |
An example of an updated slurm script to run CheckM on the filtered_bins/ is as follows:
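A sketch of such a script; the module version, time/memory requests, and output directory name are assumptions to adjust for your own runs:

```bash
#!/bin/bash -e
#SBATCH --account       nesi02659
#SBATCH --job-name      checkm_filtered_bins
#SBATCH --time          01:00:00
#SBATCH --mem           50GB
#SBATCH --cpus-per-task 10
#SBATCH --error         checkm_filtered_bins.err
#SBATCH --output        checkm_filtered_bins.out

# Load CheckM (module version is an assumption)
module purge
module load CheckM/1.0.13-gimkl-2018b-Python-2.7.16

cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/6.bin_refinement/

# Re-run CheckM, this time on the filtered bins
checkm lineage_wf -t 10 --pplacer_threads 10 -x fna --tab_table \
       -f filtered_bins_checkm.txt filtered_bins/ checkm_filtered_out/
```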
The data you have been working with were created using the cut_up_fasta.py script that comes with the binning tool CONCOCT. It was run to cut contigs into 20k fragments, to add density to the clusters. If you would like to visualise the data using different contig fragment sizes, you can create these using the following commands (replace YOUR_CONTIG_SIZE with the size of interest, e.g. 10000):
```bash
module purge
module load CONCOCT/1.0.0-gimkl-2018b-Python-2.7.16

mkdir custom_chop/

# Fragment contigs within each bin, outputting to custom_chop/
for bin_file in example_data_unchopped/*; do
    bin_name=$(basename ${bin_file} .fna)
    cut_up_fasta.py -c YOUR_CONTIG_SIZE -o 0 --merge_last ${bin_file} > custom_chop/${bin_name}.chopped.fna
done

# Concatenate the chopped bins into a single .fna
cat custom_chop/*.fna > all_bins_custom_chop.fna
```
You can open all_bins_custom_chop.fna in VizBin to view the clustering with this new fragmentation threshold.
If you wish to also provide an annotation file to colour by bin, this can be generated with the following:
```bash
# Set up the annotation file header
echo "label" > custom_chop.vizbin.ann

# Loop through the custom_chop .fna files
for bin_file in custom_chop/*.fna; do
    # Extract the bin ID
    binID=$(basename ${bin_file} .fna)
    # For each sequence header in bin_file, add the bin ID to custom_chop.vizbin.ann
    for header in $(grep ">" ${bin_file}); do
        echo "${binID}" >> custom_chop.vizbin.ann
    done
done
```
If you wish to generate the full annotation file, including coverage and length values, you will need to go through the process outlined in the Appendix for this exercise.
(Optional) Scripts for processing data with ESOMana¶
A suite of tools for creating input files for ESOMana can be found on GitHub here.
The tool ESOMana can be downloaded from SourceForge.