Chapter 3 Guidelines for Good Metadata

3.1 Learning Objectives

Learning objectives This chapter will demonstrate how to: Understand what metadata are and why they are so critical. Learn the basics of creating crystal clear, readable metadata

3.2 What are metadata?

Metadata are critically important descriptive information about your data.

Without metadata, the data themselves are useless or at best vastly limited.

Metadata describe how your data came to be, what organism or patient the data are from and include any and every relevant piece of information about the samples in your data set.

Question: What are metadata? Answer: Anything and everything that should be known about your samples! Samples labeled A-H are in test tubes. A corresponding spreadsheet has metadata such as mouse id, processing date, treatment and etc. The researcher says ‘I know everything I need to know about these samples from their metadata!’

Metadata includes but isn’t limited to, the following example categories:

At this time it’s important to note that if you work with human data or samples, your metadata will likely contain personal identifiable information (PII) and protected health information (PHI). It’s critical that you protect this information! For more details on this, we encourage you to see our course about data management.

3.3 How to create metadata?

Where do these metadata come from? The notes and experimental design from anyone who played a part in collecting or processing the data and its original samples. If this includes you (meaning you have collected data and need to create metadata) let’s discuss how metadata can be made in the most useful and reproducible manner.

3.3.1 The goals in creating your metadata:

3.3.1.1 Goal A: Make it crystal clear and easily readable by both humans and computers!

Some examples of how to make your data crystal clear:

Look out for typos and spelling errors!
Don’t use acronyms unless you need to and then if you do need to make sure to explain what the acronym means.
Don’t add extraneous information – perhaps items that are relevant to your lab internally but not meaningful to people outside of your lab. Either explain the significance of such information or leave it out.
Make your data tidy.

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

Every column is a variable.

Every row is an observation.

Every cell is a single value.

3.3.1.2 Goal B: Avoid introducing errors into your metadata in the future!

Toward these two goals, this excellent article by Broman & Woo discusses metadata design rules. We will very briefly cover the major points here but highly suggest you read the original article.

Be Consistent - Whatever labels and systems you choose, use it universally. This not only means in your metadata spreadsheet but also anywhere you are discussing your metadata variables.
Choose good names for things - avoid spaces, special characters, or within the lab jargon.
Write Dates as YYYY-MM-DD - this is a global standard and less likely to be messed up by Microsoft Excel.
No Empty Cells - If a particular field is not applicable to a sample, you can put NA but empty cells can lead to formatting errors or just general confusion.
Put Just One Thing in a Cell - resist the urge to combine variables into one, you have no limit on the number of metadata variables you can make!
Make it a Rectangle - This is the easiest way to read data, for a computer and a human. Have your samples be the rows and variables be columns.
Create a Data Dictionary - Have somewhere that you describe what your metadata mean in detailed paragraphs.
No Calculations in the Raw Data Files - To avoid mishaps, you should always keep a clean, original, raw version of your metadata that you do not add extra calculations or notes to.
Do Not Use Font Color or Highlighting as Data - This only adds to confusion to others if they don’t understand your color coding scheme. Instead create a new variable for anything you might be tempted to color code.
Make Backups - Metadata are critical, you never want to lose them because of spilled coffee on a computer. Keep the original backed up in a multiple places. We recommend keeping writing your metadata in something like GoogleSheets because it is both free and also saved online so that it is safe from computer crashes.
Use Data Validation to Avoid Errors - set data types to have googlesheets or excel check that the data in the columns is the type of data it expects for a given variable.

Note that it is very dangerous to open gene data with Excel. According to Ziemann, Eren, and El-Osta (2016), approximately one-fifth of papers with Excel gene lists have errors. This happens because Excel wants to interpret everything as a date. We strongly caution against opening (and saving afterward) gene data in Excel. ‘Approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions’ Ziemann, Eren, El-Osta, 2016. On the left, a meme that shows Excel asking ‘is this a date?’ in response to seeing ‘any data at all’.

3.3.2 To recap:

Rules for creating metadata (from Broman & Woo, 2017) Be Consistent. Choose good names for things. Write Dates as YYYY-MM-DD.No Empty Cells. Put Just One Thing in a Cell. Make it a Rectangle

Rules for creating metadata continued (from Broman & Woo, 2017). Create a Data Dictionary. No Calculations in the Raw Data Files. Do Not Use Font Color or Highlighting as Data. Make Backups. Use Data Validation to Avoid Errors

If you are not the person who has the information needed to create metadata, or you believe that another individual already has this information, make sure you get ahold of the metadata that correspond to your data. It will be critical for you to have to do any sort of meaningful analysis!

References

Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17 (1). https://doi.org/10.1186/s13059-016-1044-7.