Genomic Data Carpentry (Aotearoa edition)

Getting started with genomics

This is a beginner-friendly workshop, designed to get you started with the world of genomics. Whatever you’re doing (transcriptomics, genome assembly, variant calling, metagenomics, or something else), if you will be using genomic data, this workshop is for you!

What’s covered in this workshop

Organisation: from messy lab books and Excel spreadsheets to tidy, computer-friendly data
Working with sequencing facilities and understanding genomic data types
Data storage repositories, public services and facilities, and principles of FAIR and CARE
Quality control, wrangling of raw reads and an introduction to genomic terminology

What’s NOT covered in this workshop

Basic descriptions of biological and genetic concepts (i.e., we assume the learner is already familiar with DNA/RNA, PCR, transcription, etc. to an undergraduate level).
Non-NGS sequencing and services (e.g., Sanger sequencing, qPCR, genotyping, probe-based applications such as microarrays and NanoString nCounter).
Genomic analysis workflows (beyond the basics of initial quality checks of raw reads)
The basics of cluster or HPC resourcing and specialised software (e.g., we do not cover schedulers such as SLURM, partitions/CPUs/GPUs, or choosing a compute allocation). See our workshop on Introduction to Bash Scripting and HPC Job Scheduler for this.
Using shell or other bioinformatic tools, beyond the very basics (e.g., we do not cover writing/submitting bash scripts, modules, accessing the cluster using ssh).

Glossary

Term	Definition
Adapter	Short synthetic DNA sequence ligated to the DNA molecule during library prep which allows the molecule to bind to the flow cell during sequencing and also provides a primer binding site
bp	Base pair
HCS	High Capacity Storage
HPC	High Performance Computing
Index	Also known as a barcode. Short unique sequence added to each DNA molecule in one library, allowing the identification of that library/sample and thereby enabling pooling of the libraries (sequencing several libraries in one run lowers the cost per sample).
Mb	Megabase pair (1,000,000 bp)
MB	Megabyte
Multiplexing	Sequencing multiple samples simultaneously in one run by combining libraries into one pool. Samples (i.e., libraries) are de-multiplexed (separated) in silico, usually by the technician, based on their unique indices.
Gb	Gigabase pair (1,000,000,000 bp)
GB	Gigabyte
NGS	Next-generation sequencing
SE / PE	Single-end / Paired-end
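
To make the Index and Multiplexing entries above concrete, here is a toy sketch of de-multiplexing in Python. It is an illustration only, not how a sequencing facility does it: the index sequences, sample names, and the `demultiplex` function are all hypothetical, and it assumes each read has already been paired with its index and that indices match exactly (production tools work on instrument output files and typically tolerate mismatches).

```python
from collections import defaultdict

# Hypothetical index-to-sample sheet; real indices come from the library prep kit.
SAMPLE_INDEX = {
    "ACGTAC": "sample_A",
    "TGCATG": "sample_B",
}


def demultiplex(reads, sample_index):
    """Group (index, sequence) pairs by sample.

    Reads whose index is not in the sheet are placed in 'undetermined'.
    Matching is exact; no mismatches are tolerated in this sketch.
    """
    bins = defaultdict(list)
    for index, sequence in reads:
        sample = sample_index.get(index, "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


# A made-up pool of reads from one multiplexed run: (index, read sequence).
pooled_reads = [
    ("ACGTAC", "GGCTTAAC"),
    ("TGCATG", "TTAGGCAT"),
    ("ACGTAC", "CCGATTAG"),
    ("NNNNNN", "ATATATAT"),
]

result = demultiplex(pooled_reads, SAMPLE_INDEX)
for sample, sequences in sorted(result.items()):
    print(sample, len(sequences))
```

Each read ends up in the bin of the sample whose index it carries, which is why every library in a pool needs its own unique index.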

Attribution

Parts of this workshop were adapted from and inspired by content from The Carpentries Data Carpentry lessons on Genomics.

All Carpentries instructional material is made available under the Creative Commons Attribution license CC BY 4.0. The material in this workshop is not endorsed by The Carpentries and has been adapted by Genomics Aotearoa for our own teaching purposes.

In this workshop, the following lessons were adapted from The Carpentries Data Carpentry in the manner stated below:

Organisation and tidy data section has re-used material from Project Organization and Management for Genomics.
Data Wrangling and Processing for Genomics: not currently used in this workshop, but likely to be re-used.

Material is used in this workshop from our other Genomics Aotearoa workshops, as below:

Material from our Introduction to Shell workshop is re-used in the Genomic data wrangling and processing section.
Material from the section on working with spreadsheets in our Introduction to R workshop is re-used in Organisation and tidy data.

Other material used in this workshop:

Illumina library prep molecular workflow diagram in Planning for submission from Illumina Stranded mRNA Prep, Ligation Data Sheet. M-GL-02143 v1.0
Bioanalyzer trace of final mRNA library in Planning for submission from Illumina Stranded mRNA Prep, Ligation Protocol Document # 1000000124518 v04.
PacBio SMRTbell adapters image in Planning for submission from Template Preparation and Sequencing Guide P/N 000-710-821-13.
Adapter dimer image from Illumina general knowledge base used in Planning for submission.

Definitions:

Re-used material: Almost word-for-word, including images, with minor wording or styling modifications.
Minimally-adapted material: Inspired by stylistic choices and general workflow, but material is primarily developed by Genomics Aotearoa.
