Module 07 - The what, why, and how of metadata management

What is metadata?

Metadata, in its broadest sense, is 'the data about the data'. Metadata provides the spatio-temporal context for digital sequence information (DSI) and can be vital for interpreting and contextualising results. For biodiversity genomic data, the core metadata are defined by community standards from bodies such as the Genomic Standards Consortium and Biodiversity Information Standards (TDWG), including the MIGS and MIxS specifications.

[Figure: The Data Lifecycle. An overview of the minimum metadata for genomic data.]

For processed data, metadata will also include information such as the software, software versions, and parameters used. Additional metadata may include associated keywords, downstream publications, funding sources, and data access and licensing details. To get an idea of metadata beyond the minimum, check out the Darwin Core terms for an extensive list.
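As a minimal sketch of what recording processing metadata could look like, the Python snippet below builds a small record of the software, version, and parameters used in one processing step and serialises it as JSON. The field names, the tool name `example-trimmer`, and the file names are all hypothetical and chosen for illustration only; they are not drawn from any particular standard.

```python
import json
import platform
from datetime import datetime, timezone


def build_processing_record(tool, version, parameters, input_files):
    """Describe one processing step; the field names here are illustrative."""
    return {
        "tool": tool,
        "tool_version": version,
        "parameters": dict(parameters),
        "input_files": list(input_files),
        "python_version": platform.python_version(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_sidecar(record, path):
    """Write the record next to the processed data as a JSON 'sidecar' file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, sort_keys=True)


# Hypothetical tool, parameters, and input files.
record = build_processing_record(
    tool="example-trimmer",
    version="1.2.3",
    parameters={"min_quality": 20, "min_length": 50},
    input_files=["sample01_R1.fastq.gz", "sample01_R2.fastq.gz"],
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Calling `write_sidecar(record, "sample01.processing.json")` would save the same record alongside the output files, so the processing context travels with the data.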

Why should we be collecting and managing metadata?

Metadata should be recorded and managed alongside DSI so that results produced from these data can be placed in the appropriate context. Collation and stewardship of metadata are also essential for data to meet the requirements of the FAIR Principles, which in turn facilitates the traceability and future use of data. In addition, metadata describing the spatio-temporal context of data enables DSI to be connected to associated Indigenous communities, facilitating benefit-sharing into the future as described by the CARE Principles for Indigenous Data Governance.

How can we best manage metadata?

At its core, metadata collation and stewardship come down to thorough and consistent record-taking and record-keeping throughout the research life cycle, from sample collection through to dissemination of results. Starting early will save you from headaches down the track!

Portals such as the Genomic Observatories MetaDatabase (GEOME) and Collaborative Open Plant Omics (COPO) allow users to generate templates to populate with the metadata associated with DSI. By using existing templates, users can ensure that metadata are recorded in ways that are consistent with biodiversity genomics community standards.
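To illustrate why a consistent template helps, here is a minimal Python sketch that checks a tabular metadata sheet for empty required fields. The column names are an assumed, illustrative subset written in the style of community terms; they are not an authoritative checklist, and this code does not reflect how GEOME or COPO validate submissions.

```python
import csv
import io

# Assumed, illustrative set of required columns (not an official checklist).
REQUIRED_FIELDS = ("sample_id", "collection_date", "lat_lon", "geo_loc_name")


def find_missing_fields(row):
    """Return the required fields that are absent or blank in one row."""
    return [f for f in REQUIRED_FIELDS if not (row.get(f) or "").strip()]


def validate_table(text):
    """Map each problematic line number of a CSV sheet to its missing fields."""
    reader = csv.DictReader(io.StringIO(text))
    problems = {}
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        missing = find_missing_fields(row)
        if missing:
            problems[line_number] = missing
    return problems


# Hypothetical metadata sheet; the second sample lacks a collection date.
EXAMPLE = """sample_id,collection_date,lat_lon,geo_loc_name
S001,2023-01-15,-43.53 172.63,New Zealand
S002,,-41.29 174.78,New Zealand
"""

problems = validate_table(EXAMPLE)
print(problems)  # {3: ['collection_date']}
```

Running a check like this at the time of data entry, instead of at submission, makes gaps visible while the missing information can still be recovered.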

Tools such as version control, software containers, and workflow management systems can be extremely helpful in tracking metadata during data processing and analysis. These tools can be particularly useful when shared across the research group, along with guidelines for directory structure and file naming conventions, ensuring team-wide consistency. For more on these tools, see Module 08.
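A shared file naming convention is easier to follow when it can be checked automatically. The sketch below assumes a hypothetical convention of `<project>_<sample>_<YYYYMMDD>_<stage>.<extension>`; the pattern, the stage names, and the example file names are invented for illustration and should be replaced by whatever your research group agrees on.

```python
import re

# Hypothetical convention: <project>_<sample>_<YYYYMMDD>_<stage>.<extension>
FILENAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)"
    r"_(?P<sample>[A-Za-z0-9-]+)"
    r"_(?P<date>\d{8})"
    r"_(?P<stage>raw|trimmed|aligned)"
    r"\.(?P<ext>[a-z0-9.]+)$"
)


def parse_filename(name):
    """Return the metadata encoded in a file name, or None if it does not conform."""
    match = FILENAME_PATTERN.match(name)
    return match.groupdict() if match else None


for name in ["kaka_S001_20230115_raw.fastq.gz", "final version 2.fastq"]:
    parsed = parse_filename(name)
    print(name, "->", parsed if parsed else "does not follow the convention")
```

A check like this could be run before files are committed to version control or passed to a workflow, so that non-conforming names are caught early.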

Further reading

  • Crandall, E. D., et al. (2023). Importance of timely metadata curation to the global surveillance of genetic diversity. Conservation Biology, e14061. https://doi.org/10.1111/cobi.14061
  • Field, D., et al. (2008). The minimum information about a genome sequence (MIGS) specification. Nature Biotechnology, 26(5). https://doi.org/10.1038/nbt1360
  • Yilmaz, P., et al. (2011). Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology, 29(5). https://doi.org/10.1038/nbt.1823