Making Code Ready for Publication

Reminder

This workshop adheres to the DaSL Learning Community Participation Guidelines:

Participation Guidelines

Please be respectful of your fellow learners and help each other learn.

Remember, it’s dangerous to learn alone! So partner up with someone; it’s fun to learn together.

Introduction (you)

Introduce yourself live or in chat:

  • Your name + pronouns
  • Your group
  • How do you use spreadsheets in your work?
  • Favorite winter activity

Hit Record in Teams

TL;DR

  • Reproducibility is a spectrum
  • Use separate folders for each of your projects
  • Organize code and data
  • Generate lockfiles
  • Get Help from DaSL / Open Sci Organizations
  • Deposit data where required by papers

Outline

  • Why make your code ready for publication?
  • What do you need to make it ready?
  • How and where do you make it available?

Who this is for

  • Most of us want to share analyses, not software
  • Leverage some principles from software packaging to share scripts and notebooks
  • Software packaging is its own topic

“It worked on my machine”

  • What are some issues that you’ve encountered in sharing your analyses with other people?

Why?

Why: Documentation is Important

  • Do it for Future You
  • Others in your lab
  • Others in your field

Why: Reproducibility Matters

Why: Reproducibility is a Spectrum

  • Do what you can
  • Provide a good framework for running analyses
  • Who is going to look at your code?
  • Where are you going to share it?

Why: Reproducible Analyses are not perfect

  • They only need to be able to run on another machine
  • Don’t let perfect be the enemy of good

Why: Languages are Moving Targets

  • Packages may depend on certain versions of Python/R
  • Dependency Hell
  • Need a way to “freeze” or “pin” versions used in analysis
    • Language Versions
    • Package Versions

Why: Reproducibility is an iterative process

  • When possible, start from the beginning
  • Use package management and environments from the start
    • rv / uv (in Package management session)
  • Test out running scripts and notebooks as you go

What: Parts of a Reproducible Project

What: Minimum Information for Analyses

  • Focus on Data Analysis in R / Python
  • Organize your analysis in a folder and share in a repository

What: Separate Folders

  • Ensures portability across platforms
  • If there is repeated code from another project, consider packaging that code

What: Project Example:

my_project/                 ## Top level
├── data/                   ## Data directory
│   └── my_data.vcf
├── output/                 ## Share output
├── 01_preprocessing.R      ## Scripts in order
├── 02_deseq2_analysis.qmd
├── 03_visualization.ipynb
├── renv.lock               ## R Packages
├── requirements.txt        ## Python Packages
└── README.md

https://rstats.wtf/projects

More project examples

What: README

my_project/                ## Top level
└── README.md

workflow.png

What: README

  • First thing that people will see
    • https://github.com/biodev/HNSCC_Notebook
  • Document the basic workflow of processing
    • How does the data come together in the analysis?

Exercise: Look at a README (5 min)

Pick one of these studies:

A. Integrative Pharmacogenomics Analysis of Patient Derived Xenografts (R)

B. BeatAML2 Manuscript Workflow (R)

C. An open RNA-Seq data analysis pipeline tutorial (Python)

Try and answer this question in the Google Doc

  1. How was the README? Was it well organized?

What: Notebooks / Analysis Files

my_project/                ## Top level    
├── 01_preprocessing.R     ## Scripts in order
├── 02_deseq2_analysis.qmd
└── 03_visualization.ipynb
  • Easiest: place in your top folder
  • Number in order
    • 01_preprocessing.R
    • 02_deseq2_analysis.qmd
  • Be sure to include a random seed for reproducibility
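
As a minimal sketch of the random seed advice in Python (the seed value and the function name `draw_subsample` are illustrative assumptions, not part of the workshop materials), a fixed seed makes a random draw repeat exactly from run to run:

```python
import random

SEED = 20240101  # arbitrary illustrative value; record whichever seed you actually use


def draw_subsample(ids, k, seed=SEED):
    """Return k items chosen reproducibly from ids."""
    rng = random.Random(seed)  # a local generator avoids changing global random state
    return rng.sample(list(ids), k)
```

Calling `draw_subsample(range(100), 5)` twice returns the same five values. Libraries such as NumPy keep their own random generators and need to be seeded separately.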

What: Notebooks / Relative Paths

  • Everything should be runnable from the top folder of the project. Put data in data/ folder. Use relative paths from the top project folder:
my_data <- readr::read_csv("data/datafile.csv")
  • Ensures portability of project
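
A rough Python equivalent of the relative path idea, assuming the same hypothetical `data/datafile.csv` and that the script is started from the top project folder (the helper name `read_rows` is made up for illustration):

```python
import csv
from pathlib import Path

# Relative to the top project folder, so it works wherever the project is copied
DATA_FILE = Path("data") / "datafile.csv"


def read_rows(path=DATA_FILE):
    """Read a CSV file into a list of dictionaries, one per row."""
    with Path(path).open(newline="") as handle:
        return list(csv.DictReader(handle))
```

Because `DATA_FILE` contains no user name or drive letter, the same code runs unchanged on a collaborator's machine.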

What: {targets} and Workflow Builders

  • Not 100% Necessary!
  • Another layer of complexity, but can be helpful
  • Lets you chain together notebooks
  • A lot like snakemake

Targets example: https://github.com/biodev/hnscc_manuscript

Exercise: Look at How Notebooks are Organized (5 min)

Pick one of these studies:

A. Integrative Pharmacogenomics Analysis of Patient Derived Xenografts (R)

B. BeatAML2 Manuscript Workflow (R)

C. An open RNA-Seq data analysis pipeline tutorial (Python)

Try and answer this question in the Google Doc

  1. How are the Notebooks organized?

What: Know the status of your data

What: Data in a Project

my_project/                ## Top level
├── data                   ## Data directory
│   └── my_data.vcf  
  • Genomic and omics data is large
    • Raw data is not practical for GitHub (100 MB per-file limit)
    • Store raw files in the required repositories for your field
    • Track what files were processed (manifest)
  • Supply intermediate formats used to do the analysis, if possible:
    • MAF/VCF/CSV
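
One possible way to track which files were processed is a small manifest with sizes and checksums. This is a sketch only: the column names and the helper names `build_manifest` and `write_manifest` are assumptions, not a required format.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """List every file under data_dir with its relative path, size, and checksum."""
    data_dir = Path(data_dir)
    rows = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        rows.append({
            "file": path.relative_to(data_dir).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return rows


def write_manifest(rows, out_path):
    """Write the manifest rows to a CSV file that can be shared with the code."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", "bytes", "sha256"])
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_manifest(build_manifest("data"), "manifest.csv")` run from the top project folder produces a file small enough for GitHub even when the data itself is not.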

Reproducibility in the Genomics Era

What: Data

  • With code, share metadata - list the files you processed

    • File manifest - point towards data repositories
    • JSON files from workflows
    • Metadata / Experimental Design
      • Where does each sample fit into Experimental Design?
  • Stay tuned - we may offer a data management workshop this summer

  • A Realistic Guide to Making Data Available Alongside Code

Exercise: Examine how data is stored (5 minutes)

Pick one of these studies:

A. Integrative Pharmacogenomics Analysis of Patient Derived Xenografts (R)

B. BeatAML2 Manuscript Workflow (R)

C. An open RNA-Seq data analysis pipeline tutorial (Python)

Try and answer this question in the Google Doc

  1. How is the data stored (or not stored)?

What: Reproducible Environment

A computational environment is the system where a program is run. A reproducible environment is one that someone else can recreate, including:

  • Operating System/Platform
  • Language Version
  • Package Versions
  • Software Dependencies
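
To make these parts concrete, here is a small Python sketch that records the platform, the language version, and the versions of a few named packages. The function name `describe_environment` is hypothetical, and system-level software dependencies are not captured by this approach.

```python
import platform
from importlib import metadata


def describe_environment(packages=()):
    """Collect the OS/platform, Python version, and selected package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "platform": platform.platform(),
        "python": platform.python_version(),
        "packages": versions,
    }
```

Saving the returned dictionary next to your results gives readers a quick record of where the analysis ran.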

What: Reproducible Environments

In order of complexity:

graph LR
A[Lockfile] --> B
B[Binder Ready] --> C
C[Dockerfile]

There is a tradeoff between:

  • Effort on your side (Lockfile is least effort)
  • Ease of use on the user end (Dockerfile is most effort)

What: Lockfiles and Environments

  • Lockfiles are a recipe for recreating an environment
  • We use tools to recreate them:
    • R: renv and rv
    • Python: venv and uv
  • Environments live in our folder isolated from everything else
    • Versions of R/Python
    • Libraries of packages

What: Lockfile

my_project/                ## Top level
├── renv.lock              ## R
├── requirements.txt       ## Python
  • List of packages and versions that you used in analysis
  • We will talk about rv and uv in the Package management session

Lockfile Examples

renv.lock (R):

{
  "R": {
    "Version": "4.2.3",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "markdown": {
      "Package": "markdown",
      "Version": "1.0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "4584a57f565dd7987d59dda3a02cfb41"
    },
    "mime": {
      "Package": "mime",
      "Version": "0.7",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "908d95ccbfd1dd274073ef07a7c93934"
    }
  }
}

requirements.txt (Python):

anyio==4.11.0
appnope==0.1.4
argon2-cffi==25.1.0
argon2-cffi-bindings==25.1.0
arrow==1.3.0
asttokens==3.0.0
async-lru==2.0.5
attrs==25.3.0
babel==2.17.0
beautifulsoup4==4.13.5
bleach==6.2.0
certifi==2025.8.3
cffi==2.0.0
[...]
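
To show how a pinned requirements.txt can be used as a check, the following sketch parses `name==version` lines and compares them with what is installed. It handles only plain exact pins (no extras, hashes, or URLs), and the helper names `parse_pins` and `find_mismatches` are made up for illustration.

```python
from importlib import metadata


def parse_pins(text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for raw in text.splitlines():
        # Drop comments and environment markers, then surrounding whitespace
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # skip blanks and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed)} where the installed version differs."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is missing from this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

An empty result from `find_mismatches(parse_pins(open("requirements.txt").read()))` suggests the current environment matches the lockfile for the listed packages.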

Exercise: Look at a lockfile (5 minutes)

Why not Conda?

Anaconda now charges institutions for using its package channels. Be aware that you will need to either pay those charges or switch to the Fred Hutch conda-forge channel.

For more info: https://conda-forge.fredhutch.org/

Exercise: Examine Reproducible Environments (5 minutes)

Pick one of these studies:

A. Integrative Pharmacogenomics Analysis of Patient Derived Xenografts (R)

B. BeatAML2 Manuscript Workflow (R)

C. An open RNA-Seq data analysis pipeline tutorial (Python)

Try and answer this question in the Google Doc

  1. How did they reproduce the software environment, or did they?

What: Reproducing environment from lockfile

  • Download folder from GitHub
  • Install language version (such as R 3.4.3 or Python 3.13) to your machine
  • Make sure that uv is installed
# Python (shell); for a plain requirements.txt use: uv pip install -r requirements.txt
uv sync
# R
install.packages("renv")
renv::restore()

What: Making a Lockfile from your current project

  • Usually done after you execute a notebook
  • Takes the packages you have loaded into memory and then writes them to lockfile with versions.

Put this code at the bottom of your notebook.

# Python: put this at the end of your notebook
# pip install session-info
# uv add session-info
import session_info as si

# write_req_file=True writes the loaded packages and versions to requirements.txt
si.show(na=True, os=True, cpu=False,
    jupyter=None, dependencies=None,
    write_req_file=True,
    req_file_name="requirements.txt")

# R: run in your project
# install.packages("renv")
renv::snapshot()
  • renv will scan your project and note package versions
  • Can use the renv package to generate a current list of packages
  • Does not require a virtual environment to be initialized
  • Generates renv.lock

How

How and Where: Testing your shared project

  • Try downloading and installing on a different computer to make sure that you can rerun analyses
  • Take someone else through the process and test out the notebooks
  • If making it Binder ready: test the repository on Binder

How: Review Opportunities and Resources

How and Where: Sharing Your Analyses

  • GitHub (for code)
  • Open Science Framework (for code + data)
  • Field specific databases
  • Social Media: LinkedIn, Bluesky, etc.

Where should you share code?

Share in a public repository:

  • GitHub
  • Code Ocean
  • Read the Docs

Be aware of file size limitations!

How: Data Repositories

  • Data repositories
    • Open Science Framework
    • Required databases (dbGaP)
    • Be aware that you will need to provide metadata
      • Experimental design
    • Be careful when sharing human subjects data
      • If unsure, schedule a Data Governance House Call

https://journals.plos.org/plosgenetics/s/recommended-repositories#loc-omics

TL;DR

  • Reproducibility is a spectrum
  • Use separate folders for each of your projects
  • Organize code and data
  • Generate lockfiles
  • Get Help from DaSL / Open Sci Organizations
  • Deposit data where required by papers

What’s Next?

References

Advanced Topics

What: Binder Ready Repository

A special way to share your analysis

  • Put your project on GitHub
  • Can plug your repository link into mybinder.org
  • Generates JupyterLab / RStudio / Shiny Server instance

What: mybinder.org

How does Binder work?

  • Launches your analysis in a container (Dockerizes your analysis)
  • Uses requirements.txt (Python), environment.yml (Conda) or install.R (R) or Dockerfiles in your repository
  • Installs relevant packages and dependencies into a docker image

install.R using renv (put in your top directory)

install.packages("renv")
renv::restore()

Exercise: Launch a Binder Repository

Some Cons about Binder

  • Currently limited to 2 GB of memory for an instance
  • If your image files aren’t used at least once a week, they get deleted

Dockerfile

  • A precise list of instructions to install your computational environment
  • More useful if you are distributing software.
  • More portable across systems other than Binder
  • Takes a lot of work, builds on the work of others

Dockerfile example

FROM debian:bookworm-slim AS builder

RUN Rscript 

There will be a lot of crying

  • Dockerfiles are a lot of work

Dockerfile Tips

  • Don’t try to create Dockerfiles from scratch
    • Community Images: BioC, Rocker Project, WILDS Docker Library
    • Use https://repo2docker.readthedocs.io/en/latest/
      • https://repo2docker.readthedocs.io/en/latest/configuration/#config-files

Nix

  • Currently investigating Nix as a language-agnostic reproducibility framework