Quantifying Gene Expression

In order to draw conclusions about gene expression from our reads, we need to first identify which region of the genome reads originated from and then quantify the number of reads from that region. There are many tools available to carry out quantification, and we will not cover the arguments for and against each tool. We encourage you to carry out a literature search, and discuss with experienced individuals in your field, before committing to a particular quantification procedure. Here we will give a brief overview of two approaches with considerably different underlying methodology, give an example tool for each, and then demonstrate one tool for the purpose of progressing to the next stage of our workflow.

The first approach to read quantification involves aligning reads to the genome first, and then counting the number of reads that are aligned to certain genomic features (e.g., exons). This method may generally be referred to as the ‘count’ method, but it is still a form of quantification, and the output is by default at the gene level.

The second approach, which we will not do today, involves creating transcript expression estimates via pseudoalignment, often referred to informally as ‘pseudocounts’. In this method, reads are not aligned to the genome. Instead, they are pseudoaligned to kmers within the transcriptome and their abundance is estimated. While the output for these tools (e.g., Kallisto, Salmon) are by default at a transcript level, they can be converted to a gene level expression estimate.

In today’s workflow we will demonstrate genome alignment and counting using HISAT2 and featureCounts.

Alignment with HISAT2

Preparation of the genome

When carrying out alignment, the first requirement is a genome which has been indexed. Indexing is a process to organise the genome so that our alignment algorithms can match reads to the genome easily, without having to scan the entire genome.

Navigate to the Genome directory and view the contents. You should see two files, a .gtf and a .fa file. We will use the .gtf file towards the end of this episode.

cd ~/RNA_seq/Genome

ls

The .gtf file (General Transfer Format) contains information about the genome, such as gene names etc., all stored in a one-line-per-feature format, while the .fa (FASTA) file contains sequence.

We will now use hisat2 to index the genome using the hisat2-build command. We will specify the number of threads (-p 4), and specify the input file (-f for FASTA file type). The last argument is the prefix for the name of our .ht2 output file. You can use any name for this, but we recommend using something that specifies which version of the genome you used, for good scientific reproducibility.

# index file:
hisat2-build -p 4 -f Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa Saccharomyces_cerevisiae.R64-1-1.dna.toplevel

ls

We can now see many new .ht2 files, which make up the indexed version of the genome.

Alignment on the genome

Now that we have an indexed genome, we can align our sequences. To do this we will need to know:

Where the sequence information is stored (e.g., fastq files).
What kind of sequencing file we have (e.g., Single end or Paired end).
Where the indexes and genome are stored.
Where the mapping files will be stored.

Once we have that information we are ready to align our sequences. First, navigate to the RNA_seq directory, and run hisat2 on the first sample. We will use the -x flag to specify the basename of the index, the -U flag to specify the comma-separated list of files containing unpaired reads to be aligned, and the -S flag to write SAM alignment output files.

cd ~/RNA_seq

hisat2 -x Genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel -U Trimmed/SRR014335-chr1.fastq -S SRR014335.sam

Output should look something like this:

125090 reads; of these:

  125090 (100.00%) were unpaired; of these:

    15108 (12.08%) aligned 0 times

    88991 (71.14%) aligned exactly 1 time

    20991 (16.78%) aligned >1 times

87.92% overall alignment rate

Now that we have confirmed this code works for our first sample, we will use a loop to align the rest of the samples. We will make a new directory called Mapping, which we will use to store all of our mapping output files. Then navigate to the Trimmed directory and execute the for loop to align all samples.

mkdir Mapping

cd ~/RNA_seq/Trimmed

for filename in *
do
base=$(basename ${filename} .trimmed.fastq)
hisat2 -p 4 -x ../Genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel -U $filename -S ../Mapping/${base}.sam --summary-file ../Mapping/${base}_summary.txt
done

Check the output directory (Mapping) and see the different files that have been produced. You should see both .sam and _summary.txt files for each sample.

Let’s look at the SAM file format.

less SRR014335-chr1.sam

The file begins with an optional header which is used to include human-readable metadata such as the source of the data, reference sequence, method of alignment etc.,. Following the header is the alignment section. Each line contains information corresponding to the alignment of a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner-specific information.

Converting SAM to BAM

SAM files are tab-delimited text files containing information for each individual read and its alignment to the genome. SAM files are large, so the first thing we do with them is convert them to BAM files: compressed, binary versions of the file which reduce time and can themselves be indexed for efficiency when we interact with them.

We will convert the SAM file to the BAM format using the samtools program with the view command. Use the -S flag to specify the input is a SAM file, and the -b flag to specify BAM as the output.

Note that you will see a warning “fail to read the header from…”, this is normal.

for filename in *.sam
do
base=$(basename ${filename} .sam)
samtools view -S -b ${filename} -o ${base}.bam
done

ls

Next we will sort the BAM files, which will mean our future steps are more efficient. From the samtools program we will use the sort command, with the -o flag to specify where the output file should go.

for filename in *.bam
do
base=$(basename ${filename} .bam)
samtools sort -o ${base}_sorted.bam ${filename}
done

Mapping statistics

Using the sorted BAM file, you can now easily calculate some basic mapping statistics using the samtools flagstat command.

samtools flagstat SRR014335-chr1_sorted.bam

The output will look something like this:

161251 + 0 in total (QC-passed reads + QC-failed reads)
125090 + 0 primary
36161 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
146362 + 0 mapped (90.77% : N/A)
110201 + 0 primary mapped (88.10% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Note: the many “0s” here are normal for single end sequencing. The second column designated by x + 0 is always 0, as we do not have a pair. As there are no pairs or mates in SE sequencing, the outputs specific to those parameters are shown as 0 + 0.

Basic statistics shown by flagstat will be slightly different from those in the summary file generated by HISAT2 due to different “totals” that are used for comparisons. flagstat compares the number of alignments while HISAT2 compares the number of reads mapped. This is because reads can be mapped/aligned to more than one reference location, and these reads have a “primary” and “secondary” alignment (see section 1.2 of the SAM specifications). For example, the percent overall alignment in the HISAT2 summary will be equivalent to the percent primary mapped evaluated by flagstat. To get the number of reads that aligned 0 times (summary file), the equivalent statistic from flagstat would be subtracting the number of mapped reads from the number of total alignments.

MultiQC

HISAT2 output data can be incorporated into the MultiQC report. Copy the summary files generated by HISAT2 into the MultiQC directory and re-run multiqc.

Navigate to the updated report and view it in your browser.

cd ~/RNA_seq/MultiQC
cp ../Mapping/*summary* ./
multiqc .

You might notice that the report labels the summary files as “Bowtie2”. This is because the outputs are identical to those generated by Bowtie2, and MultiQC recognises them as Bowtie2.

Read summarisation

Sequencing reads often need to be assigned to genomic features of interest after they are mapped to the reference genome. This process is often called read summarisation or read quantification. Read summarisation is required by a number of downstream analyses such as gene expression analysis and histone modification analysis. The output of read summarisation is a count table, in which the number of reads assigned to each feature in each library is recorded. In our case each feature is an exon. To carry out counting we will use the featureCounts tool from the Subread package.

Navigate to the RNA_seq directory, create a new directory called Counts and cd into that directory. From there run the featureCounts command which will require five different flags. We will use the -a flag to specify the annotation file, in this case the .gtf file we noticed earlier. The -o flag names the output file (which we will call yeast_counts.txt), while the -T flag specifies the number of threads/CPUs used for mapping. The -t flag is used to control the feature type in the GTF annotation file, we will use exon (which is also the default). Finally, the -g flag specifies the attribute type in the GTF file. We will use the gene_id option (again, the default).

cd ~/RNA_seq

mkdir Counts

cd Counts

featureCounts -a ../Genome/Saccharomyces_cerevisiae.R64-1-1.99.gtf -o ./yeast_counts.txt -T 2 -t exon -g gene_id ../Mapping/*sorted.bam

Update the MultiQC report by copying over the output data in Counts.

cd ~/RNA_seq/MultiQC

cp ../Counts/* .

multiqc .

We have now generated counts! We can now take these through to the next stage of our analysis: preparing to identify differentially expressed genes.

Extra: Automating and optimising your workflow

Here we have been running each step of fastqc, trimming, indexing, aligning and sorting directly on the command line. This is not always best practise, for a few reasons, and what you will likely need to do is translate this code into a script. For one, it is easier to reproduce and automate your workflow when everything you do is in a script. But perhaps more importantly, when working on a HPC environment, your direct command line inputs are operating on the head node or log in node. Many of these software used require more CPUs/memory than can be allocated on the head node and therefore you need to submit your job to the HPC job scheduler (e.g., SLURM or PBS) to run the job more efficiently on allocated compute resources.

See here our GA workshop on Bash Scripting and HPC Job Scheduler.

As a quick example, this below is what a slurm script may look like for running the index and align steps. Note the use of things like relative paths – you need to adjust the paths to work from the directory in which your script is submitted from.

Slurm scripts are text files, and use the .sl extension by convention. You can use nano to create a script.

nano index-align.sl

#!/bin/bash 

#SBATCH --job-name      index-and-align
#SBATCH --account       nesi02659
#SBATCH --time          00:10:00                 #Format is DD-HH:MM:SS
#SBATCH --cpus-per-task 8
#SBATCH --mem           10G                      
#SBATCH --output        slurmjob.%j.out
#SBATCH --error         slurmjob.%j.err 

# Load required modules 
module load HISAT2/2.2.0-gimkl-2020a


# Index genome
hisat2-build \
-p 2 \
-f ref_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa \
ref_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel

# Align to indexed genome
for filename in trimmed_reads/*.fastq
do
      # Extract base name
      base=$(basename ${filename} .fastq)

      # Align to the reference genome
      hisat2 \
      -p 2 \
      -x ref_genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel \
      -U $filename \
      -S results/sam/${base}.sam \
      --summary-file results/sam/${base}.summary.txt

done

This script only includes indexing and aligning, but you could include every step in the workflow in one script if you wish. Note that you can use a backslash (\) to separate options/flags in your code across multiple lines to improve readability.

To submit the script, from your working directory you can type:

sbatch index-align.sl

To monitor the progress of the job, you can type:

squeue --me