Chapter 18 Data Submission Tips
Uploading a dataset to a data repository is a great step toward sharing your data! But if the uploaded dataset is unclear and unusable, it might as well not have been uploaded in the first place.
Keep in mind that although you may understand the ins and outs of your dataset and project, it's likely that others who look at your data will not understand your notation.
To make your data truly shared, you need to take the time to make sure it is well-organized and well-described! There are two files you should make sure to include to help describe and organize your data project:
- A main README file that orients others to what is included in your data.
- A metadata file that describes the samples that are included, how they are connected, and, when appropriate and consistent with privacy ethics, their clinical features.
18.0.1 Use consistent and clear names
- Make sure that the sample and data IDs used are consistent across the project, and include a metadata file that describes your samples in detail, in a way that is clear without any prior knowledge of the project.
- Sample and data IDs should follow the standard formatting used in the field.
- Feature names should avoid using genomic coordinates, as these may change with new genome versions.
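One way to check ID consistency is to compare the IDs in the metadata file against the IDs in a data file. The following is a minimal Python sketch of that idea, not a prescribed method. The column name `sample_id`, the file contents, and the helper names `read_ids` and `compare_ids` are all hypothetical and chosen only for illustration.

```python
import csv
import io


def read_ids(csv_text, id_column):
    """Return the set of IDs found in one column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column].strip() for row in reader}


def compare_ids(metadata_ids, data_ids):
    """Report IDs that appear in only one of the two files."""
    return {
        "missing_from_data": sorted(metadata_ids - data_ids),
        "missing_from_metadata": sorted(data_ids - metadata_ids),
    }


# Made-up example content; real files would be read from disk instead.
metadata_csv = "sample_id,tissue\nS001,liver\nS002,lung\nS003,lung\n"
data_csv = "sample_id,read_count\nS001,1200\nS002,950\ns003,870\n"

report = compare_ids(
    read_ids(metadata_csv, "sample_id"),
    read_ids(data_csv, "sample_id"),
)
print(report)
# {'missing_from_data': ['S003'], 'missing_from_metadata': ['s003']}
```

In this made-up example, the only mismatch is a change in letter case (`S003` versus `s003`), which is the kind of small inconsistency that is easy to miss by eye.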
18.0.2 Make your project reproducible
Reproducible projects are able to be re-run by others to obtain the same results.
The main requirements for a reproducible project are:
- The data can be freely obtained from a repository (this may be summarized data for the purposes of data privacy).
- The code can be freely obtained from GitHub (or another similar repository).
- The software versions used to obtain the results are made clear by documentation or providing a Docker container.
- The code and data are well described and organized with a system that is consistent.
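As one possible way to document software versions, the sketch below records the Python version, the platform, and the versions of a few named packages so they can be saved alongside the results. It assumes a Python-based project. The helper name `collect_versions` and the package names passed to it are placeholders, not part of any particular workflow. A Docker container, as mentioned above, is an alternative that captures the whole environment.

```python
import json
import platform
from importlib import metadata


def collect_versions(package_names):
    """Return Python, platform, and package versions as a plain dictionary."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": packages,
    }


# Placeholder package names; list the packages your analysis actually uses.
versions = collect_versions(["pip", "a-package-that-is-not-installed"])
print(json.dumps(versions, indent=2))
```

The printed JSON could be written to a file and committed with the code, so that others can see which versions produced the results.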
18.0.3 Have someone else review your code and data!
The best way to find out if your data are usable by others is to have someone else look them over! There are so many little details that go into your data and projects. Those details can easily lead to typos and errors upon data submission, and can also cause confusion when others (or your future self) attempt to use the data. Ask someone who has not prepared the data to check whether they are able to make sense of it.
For more tips on how to make data and code reproducible, see our Intro to Reproducibility course.
18.1 Health care data sharing tools
18.1.1 REDCap (Research Electronic Data Capture)
REDCap is a very widely used browser-based software application for managing surveys and databases. It is very often used for clinical data. In fact, it is so widely used that there is a conference dedicated to it.
REDCap allows for multi-institutional work, as well as compliance with HIPAA, 21 CFR Part 11 for data for the FDA, FISMA for government data, and GDPR for data for the European Union. It was developed by a team at Vanderbilt University in 2004. It is not open source; however, it is free to use for non-commercial research (redcap_2022?).
You can find out more about how to use REDCap at the REDCap website which includes instructional videos and other resources.
There are several things to keep in mind when using REDCap from an ethical standpoint.
- Roles
REDCap allows for various roles to be established for users on a project. Thus access to certain data and tasks can be restricted to certain individuals. As described previously, it is a good idea to restrict access to the smallest number of individuals necessary.
You can modify these roles using the User Rights menu.
This will first show you who has what role on the project and their rights. You can click on an individual role to modify it.
These roles should be verified by your institutional review board (IRB) before beginning a study. Changes to roles should also be reviewed by your IRB.
- Reports
Reports that are exported can be customized to only show data that should be shared with the individual that you plan to share with. Please see the section on de-identification to better understand what data you might want to be restrictive about sharing. Again, the way you intend to share your data should be reviewed by your IRB before you begin your study.
For example, you might remove the dates from the following report:
- Auditing
REDCap keeps track of all data modifications, as well as data exports or report generations, in addition to keeping track of who performs those actions. This can be helpful for checking what has happened and when, in case anything happens that is unexpected or unintended. This is also great from a reproducibility or transparency standpoint - you have a record of any modifications to the data. This information can be obtained from the Logging menu.
- Keep instruments short
If your instruments are too long, this can result in accidentally sharing data that you don’t intend to, simply because you have more data to sift through. Keeping instruments short also makes it easier to generate reports only on the specific data that you would like to share.
- Data can be locked
You can protect your data from accidentally being modified by locking specific data. Furthermore, at later stages of the project, data can be locked so that it can no longer be modified.
Keep in mind that your institution likely has its own guidelines for how to use REDCap, should you decide to use it. Also remember to verify what you plan to do with your institutional review board (IRB) before you begin the study.
Disclaimer: The thoughts and ideas presented in this course are not official NIH guidance and are not a substitute for legal or ethical advice. They are only meant to give you a starting point for gathering information about data management.