Genomic Data Carpentry (Aotearoa edition)

Getting started with genomics
This is a beginner-friendly workshop, designed to get you started in the world of genomics. Whatever you’re doing (transcriptomics, genome assembly, variant calling, metagenomics, or something else), if you will be using genomic data, this workshop is for you!
What’s covered in this workshop
- Organisation: from messy lab books and Excel spreadsheets to tidy, computer-friendly data
- Working with sequencing facilities and understanding genomic data types
- Data storage repositories, public services and facilities, and principles of FAIR and CARE
- Quality control, wrangling of raw reads and an introduction to genomic terminology
What’s NOT covered in this workshop
- Basic descriptions of biological and genetic concepts (i.e., we assume the learner is already familiar with DNA/RNA, PCR, transcription, etc. to an undergraduate level).
- Non-NGS sequencing and services (e.g., Sanger sequencing, qPCR, genotyping, probe-based applications such as microarrays and NanoString nCounter).
- Genomic analysis workflows (beyond the basics of initial quality checks of raw reads)
- The basics of cluster or HPC resourcing and specialised software (e.g., we do not cover schedulers such as SLURM, partitions/CPUs/GPUs, or choosing a compute allocation). See our workshop on Introduction to Bash Scripting and HPC Job Scheduler for this.
- Using the shell or other bioinformatics tools, beyond the very basics (e.g., we do not cover writing/submitting bash scripts, modules, or accessing the cluster using ssh).
Glossary
| Term | Definition |
|---|---|
| Adapter | Short synthetic DNA sequence ligated to the DNA molecule during library prep which allows the molecule to bind to the flow cell during sequencing and also provides a primer binding site |
| bp | Base pair |
| HCS | High Capacity Storage |
| HPC | High Performance Computing |
| Index | Also known as a barcode. Short unique sequence added to each DNA molecule in one library, allowing the identification of that library/sample. This makes it possible to pool several libraries and sequence them in a single run, which lowers the cost. |
| Mb | Megabase pair (1,000,000 bp) |
| MB | Megabyte |
| Multiplexing | Sequencing multiple samples simultaneously in one run by combining libraries into one pool. Samples (i.e., libraries) are de-multiplexed (separated) in silico, usually by the technician, based on their unique indices. |
| Gb | Gigabase pair (1,000,000,000 bp) |
| GB | Gigabyte |
| NGS | Next-generation sequencing |
| SE / PE | Single-end / Paired-end |
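
The Index and Multiplexing entries above can be illustrated with a minimal Python sketch. It is not part of the workshop material and is not how a sequencing facility does this in practice: the index sequences, sample names and reads are all made up, and the hypothetical `demultiplex` helper matches indices exactly, whereas real de-multiplexing software typically also handles sequencing errors in the index.

```python
from collections import defaultdict

# Hypothetical index (barcode) -> sample mapping, invented for illustration.
INDEX_TO_SAMPLE = {
    "ACGT": "sample_A",
    "TGCA": "sample_B",
}


def demultiplex(reads, index_to_sample):
    """Group (index, sequence) pairs by sample using exact index matches.

    Reads whose index is not in the mapping are collected under 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = index_to_sample.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


def total_bases(sequences):
    """Total number of bases (bp) across a list of read sequences."""
    return sum(len(sequence) for sequence in sequences)


# A made-up pool of reads from one run: (index, read sequence).
pooled_reads = [
    ("ACGT", "GATTACA"),
    ("TGCA", "CCGGTTAA"),
    ("ACGT", "TTAGGC"),
    ("GGGG", "ACAC"),  # index not in the mapping
]

demultiplexed = demultiplex(pooled_reads, INDEX_TO_SAMPLE)
for sample, sequences in sorted(demultiplexed.items()):
    print(sample, len(sequences), "reads,", total_bases(sequences), "bp")
```

The same base counts scale up to the units in the glossary: 1,000,000 bp is 1 Mb and 1,000,000,000 bp is 1 Gb, which are distinct from the file sizes MB and GB.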
Attribution
Parts of this workshop were adapted from and inspired by content from The Carpentries Data Carpentry lessons on Genomics.
All Carpentries instructional material is made available under the Creative Commons Attribution license CC BY 4.0. The material in this workshop is not endorsed by The Carpentries and has been adapted by Genomics Aotearoa for our own teaching purposes.
In this workshop, the following lessons from The Carpentries Data Carpentry were adapted as stated below:
- Project Organization and Management for Genomics: material is re-used in the Organisation and tidy data section.
- Data Wrangling and Processing for Genomics: not currently used in this workshop, but likely to be re-used.
Material is used in this workshop from our other Genomics Aotearoa workshops, as below:
- Material from our Introduction to Shell workshop is re-used in the Genomic data wrangling and processing section.
- Material from the section on working with spreadsheets in our Introduction to R workshop is re-used in Organisation and tidy data.
Other material used in this workshop:
- Illumina library prep molecular workflow diagram in Planning for submission from Illumina Stranded mRNA Prep, Ligation Data Sheet. M-GL-02143 v1.0
- Bioanalyzer trace of final mRNA library in Planning for submission from Illumina Stranded mRNA Prep, Ligation Protocol Document # 1000000124518 v04.
- PacBio SMRTbell adapters image in Planning for submission from Template Preparation and Sequencing Guide P/N 000-710-821-13.
- Adapter dimer image from Illumina general knowledge base used in Planning for submission.
Definitions:
- Re-used material: Almost word-for-word, including images, with minor wording or styling modifications.
- Minimally-adapted material: Inspired by stylistic choices and general workflow, but material is primarily developed by Genomics Aotearoa.
