Skip to content

1. Alignment and feature counting with Cell Ranger

The first step in the analysis of single cell RNAseq data is to align the sequenced reads against a genomic reference and then use a transcriptome annotation to generate read counts for each feature of interest. Typically for scRNAseq the features of interest are genes.

Due to time constraints, this will not be covered in today's workshop. We will begin this workshop with the filtered feature count output files generated by the Cell Ranger count function.

All the information relating to alignment and feature counting is contained below for you to read in your own time.

Indexing the reference genome and running CellRanger count

There are a variety of tools for doing alighment and feature counting and your choice will depend in part on the method by which the library was generated. For data generated using the 10x-Chromium method data the most common approach is to use the Cell Ranger tool provided by 10x. This not only carries out the alignment and feature counting, but will also:

  • Call cells, i.e. filter the raw matrix to remove droplets that do not contain cells
  • Generate a very useful report in html format, which will provide some QC metrics and an initial look at the data
  • Generate a “cloupe” file, which can be opened using the 10x Loupe Browser software to further explore the data.

Cell Ranger is computationally very intensive, you will not be able to run it on a laptop or standard desktop computer. You will need access to, for example, a high performance computing (HPC) cluster, a server or other cloud-based computational resource with sufficient power - talk to your local IT support.

Alternative methods include:

  • STAR solo - this tool is built into the general purpose STAR aligner (Cell Ranger actually uses STAR under the hood). This will generate outputs very similar to Cell Ranger minus the cloupe file and the QC report. The advantage over Cell Ranger is that it is much less computationally intensive and will run with lower memory requirements in a shorter time.
  • Alevin - This tool is based on the popular Salmon tool for bulk RNAseq feature counting. Alevin supports both 10x-Chromium and Drop-seq derived data.

For the purposes of this course, seeing as we are working with 10x-Chromium derived data, we will use Cell Ranger. As detailed instructions are available on the Cell Ranger pages of the 10x website, this chapter will not be comprehensive in terms of all options, but should provide a brief overview.

Index the reference genome

To index our genome, we can use the cellranger mkref tool with the following arguments (replacing {...} with the relevant piece of information):

cellranger mkref \
  --fasta={GENOME FASTA} \
  --genes={ANNOTATION GTF} \
  --genome={OUTPUT FOLDER FOR INDEX} \
  --nthreads={CPUS}
where:

  • GENOME FASTA is a file containing the reference genome in FASTA format.
  • ANNOTATION GTF is a file containing the transcript annotation file in GTF format.
  • OUTPUT FOLDER FOR INDEXis a name for the output folder containing the new reference package (you do not need to create this folder).
  • CPUS - Is the number of CPUs we would like CellRanger to use. The more CPUs CellRanger can use, the faster the job (up to a point).

code

cd Data/references

module purge && module load CellRanger/7.1.0

cellranger mkref \
  --fasta=Homo_sapiens.GRCh38.dna.chromosome.21.fa \
  --genes=gencode.v41.primary_assembly.annotation.chr21.gtf \
  --genome=cellranger_index \
  --nthreads 4
output
['/scale_wlg_persistent/filesets/opt_nesi/CS400_centos7_bdw/CellRanger/7.1.0/bin/rna/mkref', '--fasta=Homo_sapiens.GRCh38.dna.chromosome.21.fa', '--genes=gencode.v41.primary_assembly.annotation.chr21.gtf', '--genome=cellranger_index', '--nthreads=7']
Creating new reference folder at /scale_wlg_persistent/filesets/project/nesi02659/sc-rna/Data/references/cellranger_index
...done

Writing genome FASTA file into reference folder...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
Aug 05 19:27:22 ..... started STAR run
Aug 05 19:27:22 ... starting to generate Genome files
Aug 05 19:27:23 ... starting to sort Suffix Array. This may take a long time...
Aug 05 19:27:24 ... sorting Suffix Array chunks and saving them to disk...
Aug 05 19:27:52 ... loading chunks from disk, packing SA...
Aug 05 19:27:53 ... finished generating suffix array
Aug 05 19:27:53 ... generating Suffix Array index
Aug 05 19:27:55 ... completed Suffix Array index
Aug 05 19:27:55 ..... processing annotations GTF
Aug 05 19:27:55 ..... inserting junctions into the genome indices
Aug 05 19:27:58 ... writing Genome to disk ...
Aug 05 19:27:58 ... writing Suffix Array to disk ...
Aug 05 19:27:58 ... writing SAindex to disk
Aug 05 19:27:58 ..... finished successfully
...done.

Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done

Computing hash of genes GTF file...
...done

...done

>>> Reference successfully created! <<<

You can now specify this reference on the command line:
cellranger --transcriptome=/scale_wlg_persistent/filesets/project/nesi02659/sc-rna/Data/references/cellranger_index ...

Cell Ranger count

Preparing the raw fastq files

To run the Cell Ranger count function, the fastq files for all samples to be processed should be placed in a single directory. Cell Ranger will be run separately on each sample. You will need to provide the software with the name of the sample to be processed (e.g. SRR9264343). Cell Ranger expects the file names of the fastq files to follow a specific convention so that it can correctly identify the files for the specific sample. If the files names of your fastqs do not match this convention you will need to rename them.

The convention is the default convention used by bcl2fastq (the tool that converts the raw sequencing data in bcl format into fastq files):

<SampleName>_S<SampleNumber>_L00<Lane>_<Read>_001.fastq.gz
  • <SampleName> - An identifier for the sample, this is what Cell Ranger uses to determine which fastq files to combine into a single sample.
  • <SampleNumber> - This is the sample number based on the order that samples were listed in the sample sheet used when running bcl2fastq. This is not important for Cell Ranger, other than it indicates the end of the Sample Name, you can set all your samples to S1.
  • <Lane> - The lane number. If your sequencing was run across multiple lanes, then you may have multiple sets of fastqs for a single sample with different lane numbers.
  • <Read> - The read type: R1 for Read 1, R2 for Read 2, and index reads are I1 and I2. 001 - The last segment is always 001.

e.g. for a single sample in the Caron data set we have:

    SRR9264343_S0_L001_I1_001.fastq.gz
    SRR9264343_S0_L001_R1_001.fastq.gz
    SRR9264343_S0_L001_R2_001.fastq.gz

Running cellranger count ( NOTE: this takes time ---)

The minimum information require to run cellranger count is:

  • --id - A sample ID. This is used for naming the outputs
  • --transcriptome- the directory containing the Cell Ranger reference
  • --fastqs - the directory containing the fastq files This will process all fastq files in the --fastqs directory into a single sample. If you have multiple samples in a single directory then you need to add:

  • --sample - the SampleName from the fastq files as above. In addition, Cell Ranger is very computationally intensive, you will usually be wanting to run it on a high performance cluster or server. It will greedily attempt to use all resources it can find available, and so it is advisable to set limits to the resources it will use:

  • --localcores - the number of processors Cell Ranger should use

  • --localmem - the amount of memory, in Gigabytes, Cell Ranger should use.

A complete command for processing on of the Caron samples might be:

cellranger count --id={OUTPUT_SAMPLE_NAME} \
                 --transcriptome={DIRECTORY_WITH_REFERENCE} \
                 --fastqs={DIRECTORY_WITH_FASTQ_FILES} \
                 --sample={NAME_OF_SAMPLE_IN_FASTQ_FILES} \
                 --localcores={NUMBER_OF_CPUS} \
                 --localmem={RAM_MEMORY}

code

cellranger count \
  --id=ETV6_RUNX1_rep1 \
   --transcriptome=Data/references/cellranger_index \
   --fastqs=Data/fastq/ \
   --sample=SRR9264343 \
   --localcores=8 \
   --localmem=24
Slurm-script - This will run for about 9 minutes
#!/bin/bash -e 

#SBATCH --account nesi02659
#SBATCH --job-name cell-ranger-count
#SBATCH --cpus-per-task 16
#SBATCH --mem 24G
#SBATCH --time 00:20:00 
#SBATCH --output slurmlog/%j.out

module purge
module load CellRanger/7.1.0

cellranger count \
            --id=ETV6_RUNX1_rep1 \
            --transcriptome=Data/references/cellranger_index \
            --fastqs=Data/fastq/ \
            --sample=SRR9264343 \
            --localcores=${SLURM_CPUS_PER_TASK} \
            --localmem=22
ls -F ETV6_RUNX1_rep1/outs/
analysis/                    filtered_feature_bc_matrix.h5  possorted_genome_bam.bam      raw_feature_bc_matrix.h5
cloupe.cloupe                metrics_summary.csv            possorted_genome_bam.bam.bai  web_summary.html
filtered_feature_bc_matrix/  molecule_info.h5               raw_feature_bc_matrix/