Organisation and tidy data
Good data organisation is the foundation of any research project. It not only sets you up well for an analysis, but it also makes it easier to come back to the project later and share with collaborators, including your most important collaborator - future you.
Organising a project that includes sequencing involves many components. There’s the experimental setup and conditions metadata, measurements of experimental parameters, sequencing preparation and sample information, the sequences themselves and the files and workflow of any bioinformatics analysis. So much of the information of a sequencing project is digital, and we need to keep track of our digital records in the same way we have a lab notebook and sample freezer. In this lesson, we’ll go through the project organisation and documentation that will make an efficient bioinformatics workflow possible. Not only will this make you a more effective bioinformatics researcher, it also prepares your data and project for publication, as grant agencies and publishers increasingly require this information.
Metadata
When we think about the data for a sequencing project, we often start by thinking about the sequencing data that we get back from the sequencing centre. However, equally or more important is the data you’ve generated about the sequences before it ever goes to the sequencing centre. This is the data about the data, often called the metadata. Without the information about what you sequenced, the sequence data itself is useless.
Structuring data in spreadsheets
Regardless of the type of data you’re collecting, there are standard ways to enter that data into the spreadsheet to make it easier to analyse later. We often enter data in a way that makes it easy for us as humans to read and work with it, because we’re human! Computers need data structured in a way that they can use it. So to use this data in a computational workflow, we need to think like computers when we use spreadsheets.
The cardinal rules of using spreadsheet programs for data:
- Leave the raw data raw - do not change it!
- Put each observation or sample in its own row.
- Put all your variables in columns - the thing that vary between samples, like ‘strain’ or ‘DNA-concentration’.
- Have column names be explanatory, but without spaces. Use ‘-’, ’_’ or camel case instead of a space. For instance ‘library-prep-method’ or ‘LibraryPrep’is better than ’library preparation method’ or ‘prep’, because computers interpret spaces in particular ways.
- Do not combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data. For example, instead of having a column with species and strain name (e.g. E. coli K12) you would have one column with the species name (E. coli) and another with the strain name (K12). Depending on the type of analysis you want to do, you may even separate the genus and species names into distinct columns.
- Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.
Tidy data principles
The simplest principle of Tidy data is that we have one row in our spreadsheet for each observation or sample, and one column for every variable that we measure or report on. As simple as this sounds, it’s very easily violated. Most data scientists agree that significant amounts of their time is spent tidying data for analysis.
The R package Tidyverse was specifically developed to deal with data cleaning and manipulation for analysis. Sidenote: a Genomics Aotearoa workshop on data manipulation with tidyverse tools is in development!
There are lots of different conventions people use to represent missing values. You should especially avoid representing a missing value as a 0 (zero), as zero implies that a measurement was taken and the result was zero.
The best option is to leave missing values as blank or designate them as an NA in your spreadsheet.
See below a few common values used for indicating a missing value:



