Genomic Data Carpentry (Aotearoa edition)

Getting started with genomics
This is a beginner-friendly workshop, designed to get you started with the world of genomics. Whether you’re doing transcriptomics, genome assembly, variant calling, metagenomics, or something else, if you will be using genomic data, this workshop is for you!
Prerequisites
Learners are expected to have a basic (undergraduate-level) understanding of biological and genetic concepts, but no familiarity with genomics and no bioinformatic or computational skills are required.
What’s covered in this workshop
- Ethical data collection from a New Zealand perspective
- Organisation: from messy lab books and Excel spreadsheets to tidy, computer-friendly data
- Working with sequencing facilities and understanding genomic data types
- Data storage repositories and public services and facilities
- Quality control, wrangling of raw reads and an introduction to genomic terminology
What’s NOT covered in this workshop
- Basic descriptions of biological and genetic concepts.
- Traditional sequencing and services (e.g., Sanger sequencing, qPCR, genotyping, probe-based applications such as microarrays and NanoString nCounter).
- Genomic analysis workflows. We have multiple dedicated workshops for genomic pipelines; see our portfolio here.
- Using shell or other bioinformatic tools. See our workshops on Introduction to shell and Introduction to R to get you started on this.
- Understanding the cluster, HPC resourcing and specialised software (e.g., we do not cover schedulers such as SLURM, partitions/CPUs/GPUs, choosing compute allocation allowance). See our workshop on Introduction to Bash Scripting and HPC Job Scheduler for this.
Glossary
| Term | Definition |
|---|---|
| Adapter | Short synthetic DNA sequence ligated to the DNA molecule during library prep which allows the molecule to bind to the flow cell during sequencing and also provides a primer binding site |
| bp | Base pair |
| DGE | Differential Gene Expression (analysis) |
| HCS | High Capacity Storage |
| HPC | High Performance Computing |
| HiFi | High fidelity (PacBio) |
| Index | Also known as a barcode. Short unique sequence added to each DNA molecule in one library, allowing the identification of that library/sample and thereby enabling pooling of the libraries for sequencing in a single run, which reduces cost. |
| Gb | Gigabase pair (1,000,000,000 bp) |
| GB | Gigabyte (file size / storage size) |
| Mb | Megabase pair (1,000,000 bp) |
| MB | Megabyte (file size / storage size) |
| Multiplexing | Sequencing multiple samples simultaneously in one run by combining libraries into one pool. Samples (i.e., libraries) are de-multiplexed (separated) in silico, usually by the technician, based on their unique indices. |
| NGS | Next-generation sequencing |
| ONT | Oxford Nanopore Technologies (sequencing company). Often referred to as “Nanopore”. |
| PacBio | Pacific Biosciences (sequencing company) |
| Resequencing | Sequencing all or part of an individual’s genome in order to detect sequence differences between the individual and the reference genome of the species. Often performed to detect SNPs and other variants, or to determine genotypes. |
| SE / PE | Single-end / Paired-end |
| SMRT | Single molecule real time (PacBio) |
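
The Index and Multiplexing entries above describe how pooled libraries are separated again after sequencing. As a rough illustration only, the sketch below groups reads by their index using exact matching. The index sequences, sample names, and the `demultiplex` function are all hypothetical, and real de-multiplexing software typically also tolerates sequencing errors in the index, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical index-to-sample sheet. Real index sequences come from the
# library prep kit and are recorded in the sample sheet for the run.
SAMPLE_INDICES = {
    "ACGTAC": "sample_A",
    "TGCATG": "sample_B",
}


def demultiplex(reads, sample_indices):
    """Group (index, sequence) pairs by sample name.

    Reads whose index is not in the sample sheet are collected
    under the key 'undetermined'.
    """
    bins = defaultdict(list)
    for index, sequence in reads:
        sample = sample_indices.get(index, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


# A toy pool of reads: each read carries the index of its library.
pooled_reads = [
    ("ACGTAC", "GATTACAGATTACA"),
    ("TGCATG", "CCGGTTAACCGGTT"),
    ("ACGTAC", "TTGACCATTGACCA"),
    ("GGGGGG", "AAAAAAAAAAAAAA"),  # index not in the sample sheet
]

demultiplexed = demultiplex(pooled_reads, SAMPLE_INDICES)
for sample, sequences in sorted(demultiplexed.items()):
    print(sample, len(sequences))
```

In this toy pool, two reads are assigned to `sample_A`, one to `sample_B`, and the read with an unrecognised index ends up in `undetermined`.
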
Attribution
This workshop was developed by Dr Chloé van der Burg for the Genomics Aotearoa Bioinformatics Training Programme.
Parts of this workshop were re-used or adapted from The Carpentries Data Carpentry lessons on Genomics.
All Carpentries instructional material is made available under the Creative Commons Attribution license CC BY 4.0. The material in this workshop is not endorsed by The Carpentries and has been adapted by Genomics Aotearoa for our own teaching purposes.
In this workshop, the following lessons were adapted from The Carpentries Data Carpentry in the manner stated below:
- The Organisation and tidy data section has re-used material from Data Carpentry: Project Organization and Management for Genomics.
- XXX section here has re-used material from Data Carpentry: Data Wrangling and Processing for Genomics. Currently not using this workshop but likely to re-use.
- This workshop has been adapted from the general workflow of the Data Carpentry: Genomics Workshop Overview workshop, which includes the above lessons.
Material in this workshop was also re-used from our other Genomics Aotearoa workshops, as listed below.
NOTE: Some of these workshops include attribution to other source materials; see the attribution notices enclosed within them.
- Material from our Introduction to Shell workshop is re-used in the Genomic data wrangling and processing section.
- Material from the section on working with spreadsheets in our Introduction to R workshop is re-used in Organisation and tidy data.
Diagrams and images were also re-used in this workshop from online reference material, as follows:
- Illumina library prep molecular workflow diagram in Preparing for sequencing from Illumina Stranded mRNA Prep, Ligation Data Sheet. M-GL-02143 v1.0
- Bioanalyzer trace of final mRNA library in Preparing for sequencing from Illumina Stranded mRNA Prep, Ligation Protocol Document # 1000000124518 v04.
- PacBio SMRTbell adapters image in Preparing for sequencing from Template Preparation and Sequencing Guide P/N 000-710-821-13.
- Adapter dimer image from Illumina general knowledge base used in Preparing for sequencing.
Definitions:
- Re-used material: Almost word-for-word, including images, with minor wording or styling modifications.
- Minimally-adapted material: Inspired by stylistic choices and general workflow, but material is primarily developed by Genomics Aotearoa.
