The Sarek pipeline
!!! clipboard-list “Objectives”
- Understand the Sarek pipeline structure and default usage
- Understand the levels of customisation available for nf-core pipelines
- Use the nf-core documentation to select appropriate parameters for a run command
- Write and run a nf-core sarek command on the command line
The Sarek pipeline
nf-core/sarek is a pipeline designed to detect variants on whole genome or targeted sequencing data. Initially designed for Human, and Mouse, it can work on any species with a reference genome. Sarek can also handle tumour/normal pairs and could include additional relapses.
The pipeline makes use of Docker/Singularity containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
Depending on the options and samples provided, the pipeline can currently perform the following:
- Form consensus reads from UMI sequences (
fgbio
) - Sequencing quality control and trimming (
FastQC
,fastp
) - Map Reads to Reference (
BWA-mem
orBWA-mem2
ordragmap
) - Process BAM file (
GATK MarkDuplicates
,GATK BaseRecalibrator
,GATK ApplyBQSR
) - Summarise alignment statistics (
samtools stats
,mosdepth
) - Variant calling (enabled by
--tools
, see compatibility):HaplotypeCaller
,freebayes
,mpileup
,Strelka2
,DeepVariant
,Mutect2
,Manta
,TIDDIT
,ASCAT
,Control-FREEC
,CNVkit
, and / orMSIsensor-pro
- Variant filtering and annotation (
SnpEff
,Ensembl VEP
) - Summarise and represent QC (
MultiQC
)
nf-core pipelines are frequently represented as subway maps. The nf-core/sarek subway map is shown below and is a good place to start when first understanding how the pipeline works.
Download the Sarek pipeline
There are multiple ways you can download and store a copy of a nf-core pipeline.
Firstly, you could use the nextflow pull
command. By default, if you you the nextflow run
command to execute a pipeline from github it will also pull the pipeline. In both of these cases the pipeline will be stored in a hidden directory in your home directory.
Secondly, you could clone a copy of the pipeline using the standard git clone
command, e.g., git clone https://github.com/nf-core/sarek.git
. This will download the pipeline to your current working directory.
Finally, you could use the nf-core download
utility to download a copy of the pipeline. This will give the the option to download the pipeline code, the required singularity images, and the institutional configs from the nf-core github repository. This method can be especially helpful if you are working offline and want to move all of the pipeline code and tooling to a different machine.
Getting started
All nf-core pipelines are provided with comprehensive documentation that explain what the default pipeline structure entails and options for customising this based on your needs. It is important to remember that nf-core pipelines typically do not include all possible tool parameters. Instead, they provide a sensible set of parameters that are suitable for most use cases.
The number and type of parameters an nf-core pipeline accepts differ between pipelines. The recommended (typical) run command and all the parameters available for the nf-core/sarek pipeline can be viewed using the --help
flag:
nextflow run nf-core/sarek -r 3.2.3 --help
!!! note “Revision 3.2.3”
The Sarek pipeline is always improving but we want to ensure that the results of this workshop are reproducible. To ensure this, we will use a specific version (3.2.3) of the pipeline and the revision flag (`-r`).
At the top of the help output, you will see the recommended run command:
nextflow run nf-core/sarek --input samplesheet.csv --genome GATK.GRCh38 -profile docker
It outlines the requirement for three things:
- An input samplesheet (
--input
) - A reference genome (
--genome
) - A software management profile (
--profile
)
!!! tip “Hyphens matter”
Nextflow-specific parameters use one (`-`) hyphen, whereas pipeline-specific parameters use two (`--`).
More information about Sarek
There is extensive information about nf-core pipelines on the nf-core website. The dedicated Sarek pipeline page is the best resource for information about the pipeline and how to execute it.
If you have specific questions that are not included in the documentation you can join the nf-core Slack workspace and ask in the #sarek
channel.
Testing a pipeline
Before running a pipeline on your own data, it is a good idea to test the pipeline on a small dataset. This allows you to check that the pipeline is working as expected without having to wait for a long time for the pipeline to complete.
The test
profile will run the pipeline on a small test dataset that is included with the pipeline code.
nextflow run nf-core/sarek -profile test,singularity --outdir test_sarek -r 3.2.3
!!! question “Exercise”
Check that Sarek is working by running the pipeline with the `test` profile.
??? success "Solution"
Run the test profile:
```bash
nextflow run nf-core/sarek -profile test,singularity --outdir test_sarek -r 3.2.3
```
!!! square-pen “Test data”
The `--input` and `--genome` parameters are not required when using the [`test` profile](https://github.com/nf-core/sarek/blob/3.2.3/conf/test.config). The test data and small reference files are included with the pipeline code and are automatically used when the `test` profile is specified.
!!! circle-info “”
!!! cboard-list-2 “Key points”
- nf-core pipelines are provided with sensible default settings and required inputs.
- The `--help` flag can be used to view the recommended run command and all available parameters.
- The `test` profile can be used to show that a pipeline is working as expected.