
Filter raw reads by quality

Visualising raw reads

FastQC is an extremely popular tool for checking your sequencing libraries, as the visual interface makes it easy to identify the following issues:

  1. Adapter/barcode sequences
  2. Low quality regions of sequence
  3. Quality drop-off towards the ends of reads

Loading FastQC

These exercises will take place in the 2.fastqc/ directory. First, navigate to this directory: copy the command below into your terminal (logged in to NeSI), replace <YOUR FOLDER> with your own folder name, and run it.

Navigate to our working directory

cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/2.fastqc/

To run FastQC on NeSI, you first need to load the module using the following commands:

module purge
module load FastQC/0.11.9
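
You can quickly confirm that the module loaded correctly before proceeding:

code

# Should report the version we just loaded (FastQC v0.11.9)
fastqc --version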

Running FastQC

We will run FastQC from the command line as follows:

fastqc mock_R1.good.fastq.gz mock_R2.good.fastq.gz

Viewing the outputs from FastQC

FastQC generates output reports in .html files that can be viewed in a standard web browser.
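
For example, after running the command above you should find one report per input file in the working directory. As a rough sketch of what to expect (FastQC names each report after its input file, dropping the .fastq.gz suffix):

code

# Each input FASTQ gets a matching .html report and .zip archive
ls
# mock_R1.good_fastqc.html  mock_R1.good_fastqc.zip
# mock_R2.good_fastqc.html  mock_R2.good_fastqc.zip  ...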

Fortunately, if you're using the terminal within JupyterHub for today's session, we can open the .html files directly from here:

  • Click on the directory icon in the top left to open the directory navigator pane (if not already open).
  • The default location shown will be the overall project that you have logged in to (in this case, the 'Genomics Aotearoa Virtual Lab Training Access (nesi02659)' project).
  • Click through MGSS_U, into your directory, and then into the 2.fastqc/ directory.
  • Double click on the output ...fastqc.html files to open them in a new tab within JupyterHub.

Examples of the output files are also available for download here.

Note that FastQC does not load the forward and reverse pairs of a library in the same window, so you need to be mindful of how your samples relate to each other.


At first glance at the report, we can see the following statistics:

  1. The data is stored in Sanger / Illumina 1.9 encoding. This will be important to remember for read trimming.
  2. There are 100,000 reads in the file.
  3. The maximum sequence length is 251 base pairs. This is good to check, since the data were generated using Illumina 2x250 bp sequencing.

Have a quick look through the menu sidebar on the left. As you can see, the data pass most of the basic checks.


The only aspect of the data that FastQC flags as potentially problematic is the GC content of the data set. This is expected, as we are dealing with a mixed community of organisms, so it is unlikely that the GC content will follow a perfect normal distribution around an average value. For example, a community comprised of low- and high-GC organisms would produce a bimodal distribution: problematic in terms of FastQC's expectations, but completely consistent with the biology of the system.

FastQC outputs for libraries with errors

Let's take a look at a library with significant errors. Process the sequence file mock_R1.adapter_decay.fastq.gz with FastQC.

code

fastqc mock_R1.adapter_decay.fastq.gz

Compare the results with the mock_R1.good.fastq.gz file.

Which of the fields we examined previously are now flagged as problematic? How does this compare with your expectations? Are there any that should be flagged but are not?



Read trimming and adapter removal with Trimmomatic


There are a multitude of programs that can be used to quality-trim sequence data and remove adapter sequences. For this exercise we are going to use Trimmomatic, but this should in no way be interpreted as an endorsement of Trimmomatic over equivalent tools like BBMap, sickle, cutadapt, or any other.

For a first run with Trimmomatic, type the following commands into your console:

module load Trimmomatic/0.39-Java-1.8.0_144

trimmomatic PE -threads 2 -phred33 \
               mock_R1.adapter_decay.fastq.gz mock_R2.adapter_decay.fastq.gz \
               mock_R1.qc.fastq.gz mock_s1.qc.fastq.gz mock_R2.qc.fastq.gz mock_s2.qc.fastq.gz \
               HEADCROP:10 SLIDINGWINDOW:4:30 MINLEN:80

There is a lot going on in this command, so here is a breakdown of the parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| PE | positional | Specifies whether we are analysing single- or paired-end reads |
| -threads 2 | keyword | Specifies the number of threads to use when processing |
| -phred33 | keyword | Specifies the FASTQ quality encoding used |
| mock_R1.adapter_decay.fastq.gz / mock_R2.adapter_decay.fastq.gz | positional | The paired forward and reverse reads to trim |
| mock_R1.qc.fastq.gz | positional | The file to write forward reads that passed quality trimming, if their reverse partner also passed |
| mock_s1.qc.fastq.gz | positional | The file to write forward reads that passed quality trimming, if their reverse partner failed (orphan reads) |
| mock_R2.qc.fastq.gz / mock_s2.qc.fastq.gz | positional | The reverse-read equivalents of the above |
| HEADCROP:10 | positional | Cut the specified number of bases from the start of each read |
| SLIDINGWINDOW:4:30 | positional | Quality filtering: assess each read in a 4 base pair sliding window and truncate if the average quality drops below Q30 |
| MINLEN:80 | positional | Length filtering: discard reads shorter than 80 base pairs after trimming |

HEADCROP is not adapter trimming

We used HEADCROP to remove the first ten bases at the 5' end, where we observed poorer quality scores. For adapter trimming, we need to use ILLUMINACLIP (see below).

Terminal output

TrimmomaticPE: Started with arguments:
 -threads 2 -phred33 mock_R1.adapter_decay.fastq.gz mock_R2.adapter_decay.fastq.gz mock_R1.qc.fastq.gz mock_s1.qc.fastq.gz mock_R2.qc.fastq.gz mock_s2.qc.fastq.gz HEADCROP:10 SLIDINGWINDOW:4:30 MINLEN:80
Input Read Pairs: 100000 Both Surviving: 75684 (75.68%) Forward Only Surviving: 12783 (12.78%) Reverse Only Surviving: 2311 (2.31%) Dropped: 9222 (9.22%)
TrimmomaticPE: Completed successfully

Running the trimmed files back through FastQC, we can see that this significantly improves the output.



Considerations when working with trimmomatic

Order of operations

The basic format for a Trimmomatic command is:

code

trimmomatic PE <keyword flags> <sequence input> <sequence output> <trimming parameters>

The trimming parameters are processed in the order you specify them. This is deliberate behaviour, but it can have unexpected consequences for new users.

For example, consider these two scenarios:

code

trimmomatic PE <keyword flags> <sequence input> <sequence output> SLIDINGWINDOW:4:30 MINLEN:80

trimmomatic PE <keyword flags> <sequence input> <sequence output> MINLEN:80 SLIDINGWINDOW:4:30

In the first run, we would not expect any sequence shorter than 80 base pairs to exist in the output files. However, we might encounter them after the second command. This is because the second command removes sequences shorter than 80 base pairs first, and only then performs quality trimming: if a read is trimmed below 80 base pairs during that quality-trimming step, the MINLEN filter does not execute a second time, so the short read survives. In the first command, quality trimming happens before size selection, so any reads that start longer than 80 base pairs but are trimmed under 80 base pairs during quality trimming are caught by the MINLEN step.
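
If in doubt, you can verify this behaviour directly on the output files. As a minimal sanity check (assuming the mock_R1.qc.fastq.gz output from the earlier run), count the reads shorter than 80 base pairs; this should print 0 when MINLEN runs after SLIDINGWINDOW:

code

# Sequence lines are every 2nd line of each 4-line FASTQ record
zcat mock_R1.qc.fastq.gz | awk 'NR % 4 == 2 && length($0) < 80' | wc -l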


(Optional) Working with the ILLUMINACLIP command

Trimmomatic also provides the ILLUMINACLIP command, which can be used to pass a FASTA file of adapter and barcode sequences to be found and removed from your data. For further information about the ILLUMINACLIP command, see its use here in the Trimmomatic manual.

code

ILLUMINACLIP:iua.fna:1:25:7

ILLUMINACLIP parameters in order of appearance

| Parameter | Description |
| --- | --- |
| iua.fna | File of expected adapter/barcode sequences |
| 1 | Seed mismatches |
| 25 | Palindrome clip threshold |
| 7 | Simple clip threshold |

There is always some subjectivity in how sensitive you want your adapter (and barcode) searching to be. If the settings are too strict, you might discard real sequence data that only partially overlaps the Illumina adapters. If the settings are not strict enough, you might leave partial adapters in the sequence. For short inserts (DNA fragments), the sequencing process can read through into the adapter at the other (3') end, so adapters need to be removed prior to quality-based trimming: the chance of hitting adapters (and thus removing them) is highest while the reads are still at full length.
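
Because trimming parameters are processed in order, ILLUMINACLIP should generally be placed ahead of the quality-trimming steps. Here is a sketch of the earlier command with adapter clipping added (assuming the adapter file iua.fna is present in the working directory):

code

trimmomatic PE -threads 2 -phred33 \
               mock_R1.adapter_decay.fastq.gz mock_R2.adapter_decay.fastq.gz \
               mock_R1.qc.fastq.gz mock_s1.qc.fastq.gz mock_R2.qc.fastq.gz mock_s2.qc.fastq.gz \
               ILLUMINACLIP:iua.fna:1:25:7 HEADCROP:10 SLIDINGWINDOW:4:30 MINLEN:80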


Diagnosing poor libraries

Whether a library is of 'poor' quality can be somewhat subjective. These are some aspects of the library to look for when evaluating FastQC reports:

  1. Does the sequencing length match what you ordered from the facility?
  2. If the sequences are shorter than expected, is adapter read-through a concern?
  3. What does the sequence quality look like for the whole length of the run? Are there any expected/unexpected regions of quality degradation?
  4. Are adapters and/or barcodes removed?
    • Look at the Per base sequence content to diagnose this.
  5. Is there unexpected sequence duplication?
    • This can occur when low-input library preparations are used.
  6. Are over-represented k-mers present?
    • This can be a sign of adapter and barcode contamination.

Understanding common issues and best practices

  1. Do I need to remove (rare) adapters?
  2. I don’t know if adapters have been removed or not
  3. How do I identify and remove adapter read-through?
  4. Identifying incomplete barcode/adapter removal
  5. Over-aggressive trimming
  6. GC skew is outside of expected range

(Optional) Filtering out host DNA

Metagenome data derived from host-associated microbial communities should ideally be filtered to remove any reads originating from host DNA. This can improve the quality and efficiency of downstream data processing (since we are no longer processing a bunch of data that we are not interested in). It is also an important consideration when working with metagenomes that may include data of a sensitive nature, which may need to be removed before the data are made publicly available. This is especially important for any studies involving human subjects or samples derived from Taonga species.

There are several approaches that can be used to achieve this. The general principle is to map your reads to a reference genome (e.g. human genome) and remove those reads that map to the reference from the dataset.

Note

This process may be more complicated if a reference genome for your host taxon is not readily available. In this case, an alternative method would need to be employed (for example, predicting taxonomy via Kraken2 and then filtering out all reads assigned to the phylum or kingdom of your host taxon).
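
As a minimal sketch of that alternative (not part of this workshop's data): if you build a Kraken2 database containing only the host genome, the reads left unclassified can be kept as putatively non-host. The database path, sample file names, and module name here are assumptions.

code

# Classify read pairs against a host-only Kraken2 database and keep only
# the unclassified (non-host) reads; '#' becomes _1/_2 in the output names
module purge
module load Kraken2   # exact module name/version depends on your system

kraken2 --db /path/to/host_only_db --threads 8 --paired \
        --unclassified-out sample1_hostFilt_R#.fastq \
        --report sample1_host.kreport --output sample1_host.kraken \
        sample1_R1.fastq.gz sample1_R2.fastq.gz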

This exercise provides an example using BBMap to map against a masked human reference genome and retain only those reads that do not map to the reference. Here we are mapping the quality-filtered reads against a pre-prepared human genome in which sections have been masked, including those that:

  • are presumed microbial contaminants in the reference
  • have high homology to microbial genes/genomes (e.g. ribosomes)
  • are of low complexity

This ensures that reads that would normally map to these sections of the human genome are not removed from the dataset (as genuine microbial reads that we wish to retain might also map to these regions), while all reads mapping to the rest of the human genome are removed.

Note

The same process can be used to remove DNA matching other hosts (e.g. mouse), however you would first need to check whether anyone has prepared (and made available) a masked version of the reference genome, or create a masked version yourself using bbmask. The creator of BBMap has made available masked human, mouse, cat, and dog genomes. More information, including links to these references and instructions on how to generate a masked genome for other taxa, can be found within this thread.

The masked human reference genome

The masked reference genome is available via Google Drive and Zenodo. For this workshop, we have provided you with the file within the 4.evaluation/BBMask_human_reference/ directory.

Downloading your own copy of the masked genome

We can use gdown to download this file from Google Drive via the command line.

To install gdown, we can use pip.

Install gdown

# Install gdown (for downloading from google drive)
module purge
module load Python/3.10.5-gimkl-2022a
pip install --user gdown

Next, download the reference. You will first need to add your local bin directory to the PATH variable via the export PATH=... command, as this is where gdown is installed (modify <your_username> before running the code below).

Download reference

mkdir BBMask_human_reference/
cd BBMask_human_reference/

export PATH="/home/<your_username>/.local/bin:$PATH"

gdown https://drive.google.com/uc?id=0B3llHR93L14wd0pSSnFULUlhcUk
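
If the download completed successfully, the gzipped masked reference should now be present in this directory:

code

ls -lh hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz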

Indexing the reference genome and read mapping with BBMap

We will cover more about read mapping in later exercises. For now, it is important to know that it is first necessary to build an index of the reference using the read mapping tool of choice. Here, we will first build a BBMap index, and then use BBMap to map the quality-filtered reads to that index, ultimately retaining only those reads that do not map to the index.

Build the BBMap index for the reference. We will do this by submitting the job via Slurm.

Create a new script named host_filt_bbmap_index.sl using nano:

code

nano host_filt_bbmap_index.sl

Remember to update <YOUR FOLDER> to your own folder

code

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      host_filt_bbmap_index
#SBATCH --partition     milan
#SBATCH --time          00:20:00
#SBATCH --mem           32GB
#SBATCH --cpus-per-task 12
#SBATCH --error         %x_%j.err
#SBATCH --output        %x_%j.out

# Working directory
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/2.fastqc/BBMask_human_reference/

# Load BBMap module
module purge
module load BBMap/39.01-GCC-11.3.0

# Build indexed reference file via BBMap
bbmap.sh ref=hg19_main_mask_ribo_animal_allplant_allfungus.fa.gz    

Save your script by pressing Ctrl + O, then exit the editor by pressing Ctrl + X.

Submit your newly created script to the scheduler as follows:

code

sbatch host_filt_bbmap_index.sl
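
Once the job completes, BBMap will have written its index into a ref/ subdirectory of the directory the script ran in, which you can confirm with:

code

ls BBMask_human_reference/ref/
# genome/  index/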

Finally, map the quality-filtered reads to the reference via BBMap. Here we will submit the job as a Slurm array, with one array task per sample.

Again, we will create a script using nano:

code

nano host_filt_bbmap_map.sl

code

#!/bin/bash -e

#SBATCH --account       nesi02659
#SBATCH --job-name      host_filt_bbmap_map
#SBATCH --partition     milan
#SBATCH --time          01:00:00
#SBATCH --mem           27GB
#SBATCH --array         1-4
#SBATCH --cpus-per-task 20
#SBATCH --error         %x_%A_%a.err
#SBATCH --output        %x_%A_%a.out

# Set up working directories
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/2.fastqc/
mkdir -p host_filtered_reads/

# Load BBMap module
module purge
module load BBMap/39.01-GCC-11.3.0

# Run bbmap
bbmap.sh -Xmx27g -t=$SLURM_CPUS_PER_TASK \
  minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 qtrim=rl trimq=10 untrim \
  in1=../3.assembly/sample${SLURM_ARRAY_TASK_ID}_R1.fastq.gz \
  in2=../3.assembly/sample${SLURM_ARRAY_TASK_ID}_R2.fastq.gz \
  path=BBMask_human_reference/ \
  outu1=host_filtered_reads/sample${SLURM_ARRAY_TASK_ID}_R1_hostFilt.fastq \
  outu2=host_filtered_reads/sample${SLURM_ARRAY_TASK_ID}_R2_hostFilt.fastq

BBMap parameters

| Parameter | Description |
| --- | --- |
| -Xmx27g | Maximum Java memory allocation; set to match #SBATCH --mem |
| -t | Number of threads (CPUs); set to match #SBATCH --cpus-per-task |
| minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 qtrim=rl trimq=10 untrim | Recommended host-removal parameters, found within this thread about processing data to remove host reads |
| in1 / in2 | Input paired-end reads 1 and 2 for each sample, stored in 3.assembly/. These are the same files we will use in later lessons |
| path | The parent directory where ref/ (our indexed, masked reference) exists |
| outu1 / outu2 | Reads that did not map to the masked reference, written to host_filtered_reads/ |
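
Note that the outu1/outu2 files above are written uncompressed. BBMap infers compression from the output file extension, so you could either give the outputs a .gz suffix in the script, or compress them afterwards:

code

gzip host_filtered_reads/*.fastq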

We'll submit the mapping script:

code

sbatch host_filt_bbmap_map.sl

Monitoring job progress

We can monitor our job progress using squeue --me or sacct -j <job_id>. This will be covered in detail as part of the main content when we evaluate assemblies.
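
For example:

code

squeue --me          # jobs currently queued or running
sacct -j <job_id>    # state and accounting for a specific job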

Array jobs

Slurm array jobs automatically create a variable SLURM_ARRAY_TASK_ID for each task, which contains that task's array index (between 1 and 4 in the case above). We use this to run the command on the matching sample: array task 3 will run the commands on "sample3" (sample${SLURM_ARRAY_TASK_ID} expands to sample3), as shown below.
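
To see how this expansion works, you can simulate it in a plain shell session:

code

# Inside an array task, Slurm sets this variable automatically
SLURM_ARRAY_TASK_ID=3
echo sample${SLURM_ARRAY_TASK_ID}_R1.fastq.gz
# sample3_R1.fastq.gz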

The filtered reads are now available in host_filtered_reads/ for downstream use.
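
As a quick sanity check on the outputs (a FASTQ record is four lines), you can count how many reads survived host filtering, e.g. for sample 1:

code

echo $(( $(wc -l < host_filtered_reads/sample1_R1_hostFilt.fastq) / 4 ))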