Data File Size Details

Here you can find more specific information about the file sizes for the types of data commonly generated at Fred Hutch. If you’d like to learn more about the basics of data sizes and computing capacity, please take a look at this class on Computing for Cancer Informatics from the Informatics Technology for Cancer Research (ITCR) Training Network (ITN).

Genomics Data

Matt Fitzgibbon and Andy Marty from Genomics Shared Resources have put together a table of file sizes generated by common genomics assays done at Fred Hutch. These estimates are only approximate, as the actual file sizes can vary considerably. Per-sample sizes are averaged over samples from at least three representative runs of the given type (except for 10x Multiome, where only two runs were checked).

| Assay | File Type | Per-sample Size | Per-run Size | Public Repository | Private Repository | Notes |
|---|---|---|---|---|---|---|
| Bulk RNA-seq | Paired Fastq | 2-4G | highly variable | GEO/SRA | dbGaP/SRA | Depends on library prep & goals |
| RNA Exome | Paired Fastq | 3G | highly variable | GEO/SRA | dbGaP/SRA | |
| Whole Exome | Paired Fastq | 3G | highly variable | GEO/SRA | dbGaP/SRA | HS platform dependent |
| CRISPR | Single Fastq | ≥500M | highly variable | GEO/SRA | dbGaP/SRA | sgRNA library dependent |
| CUT&RUN | Paired Fastq | ≥500M | highly variable | GEO/SRA | dbGaP/SRA | Ab dependent |
| CUT&Tag | Paired Fastq | ≥500M | highly variable | GEO/SRA | dbGaP/SRA | Ab dependent |
| ChIP-seq | Fastq | 0.5-5G | highly variable | GEO/SRA | dbGaP/SRA | Ab dependent |
| ATAC-seq | Fastq | 3-5G | highly variable | GEO/SRA | dbGaP/SRA | |
| 10x scRNA-seq | Paired Fastq | 10G | highly variable | GEO/SRA | dbGaP/SRA | Target cell number dependent |
| 10x Multiome | Paired Fastq | ≥20G | highly variable | GEO/SRA | dbGaP/SRA | Target nuclei number dependent |
| 10x Visium | Paired Fastq | ≥5G | highly variable | GEO/SRA | dbGaP/SRA | Spots under tissue dependent |
| Small Genome | Paired Fastq | ≥2G | highly variable | GEO/SRA | N/A | Genome size dependent |
| PacBio Amplicon | CCS BAM | 0.5-20G | highly variable | GEO/SRA | N/A | Amplicon size & target depth dependent |
| PacBio Small Genome | CCS BAM | highly variable | highly variable | GEO/SRA | N/A | Genome size dependent |
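
As a rough illustration of how the per-sample figures above translate into a storage request, here is a minimal Python sketch. It assumes that "G" in the table means gigabytes, copies only the assays that list a closed range, and uses a hypothetical project of 48 samples; the function name and the sample count are illustrative, not part of any Fred Hutch tooling.

```python
# Per-sample size ranges in gigabytes, copied from the table above.
# Assumption: "G" in the table means gigabytes. Open-ended entries
# (for example ">=500M") are left out because they have no upper bound.
PER_SAMPLE_GB = {
    "Bulk RNA-seq": (2.0, 4.0),
    "ChIP-seq": (0.5, 5.0),
    "ATAC-seq": (3.0, 5.0),
}


def estimate_fastq_storage_gb(assay, n_samples):
    """Return (low, high) estimated raw-data storage in GB for n_samples."""
    if n_samples < 0:
        raise ValueError("n_samples must be non-negative")
    low, high = PER_SAMPLE_GB[assay]
    return low * n_samples, high * n_samples


# Hypothetical project: 48 bulk RNA-seq samples.
low, high = estimate_fastq_storage_gb("Bulk RNA-seq", 48)
print(f"Bulk RNA-seq, 48 samples: {low:.0f}-{high:.0f} GB")
# prints "Bulk RNA-seq, 48 samples: 96-192 GB"
```

This covers raw sequencing files only. Aligned or processed outputs would need to be added separately, and since actual sizes vary considerably, the result is best treated as a planning range rather than a precise figure.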

Imaging Data

File sizes for medical imaging data vary greatly depending on both the technology used and the organ being imaged. These are some general estimates you can use as a guideline when considering your data management and storage needs. These tables are borrowed from the ITN Computing for Cancer Informatics Course.

Here is a table of average file sizes for various medical imaging modalities from Liu et al. (2017):

Table of file types for imaging data. Most modalities have files in the range of MB to GB. Note that these are approximate values. [source]

Note that depending on the study requirements, several images may be needed for each sample. Thus data storage needs can add up quickly.

Example table of overall file storage needs for samples in imaging studies. [source]
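
To make the "adds up quickly" point concrete, the following Python sketch multiplies the number of samples by the images per sample and the size per image. All of the numbers are hypothetical placeholders, not values from Liu et al. (2017) or the ITN course, and the conversion assumes the decimal convention of 1,000 MB per GB.

```python
MB_PER_GB = 1000  # decimal convention; use 1024 if you prefer binary units


def imaging_storage_gb(n_samples, images_per_sample, mb_per_image):
    """Estimate total imaging storage in GB for a study."""
    total_mb = n_samples * images_per_sample * mb_per_image
    return total_mb / MB_PER_GB


# Hypothetical study: 200 samples, 4 images each, 150 MB per image.
total_gb = imaging_storage_gb(200, 4, 150)
print(f"Estimated imaging storage: {total_gb:.0f} GB")
# prints "Estimated imaging storage: 120 GB"
```

Substituting the per-image size for your own modality and the number of images your protocol requires gives a first estimate to bring to a storage planning conversation.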

Clinical Data

This information is borrowed from the ITN Computing for Cancer Informatics Course.

Very large clinical datasets can also produce sizable files. For example, the Healthcare Cost and Utilization Project (HCUP) National (Nationwide) Inpatient Sample (NIS) contains data on more than seven million hospital stays in the United States with regional information.

According to the NIS website it “enables analyses of rare conditions, uncommon treatments, and special populations” (NIS Database Documentation n.d.).

Looking at the file sizes for the NIS data for different states across years, you can see that files for some states, such as California, are as large as 24,000 MB, or about 24 GB (NIS Database Documentation n.d.). You can see how this could add up quite quickly across years and states.

Table of file sizes for the Healthcare Cost and Utilization Project (HCUP) National (Nationwide) Inpatient Sample (NIS) of data from different years and states.