Package Management for R and Python

Reminder

This workshop adheres to the DaSL Learning Community Participation Guidelines:

Participation Guidelines

Please be respectful of your fellow learners and help each other learn.

Remember, it’s dangerous to learn alone! Partner up with someone; it’s fun to learn together.

Introduction (us)

  • Ted Laderas, Director of Training / Community, OCDO
  • Taylor Firman, Research Informatics Manager, OCDO
  • Emma Bishop, Research Informatics Data Scientist, OCDO

Introduction (you)

Introduce yourself live or in chat:

  • Your name + pronouns / Your group
  • What about package management do you find confusing?
  • What will you do with the extra evening hours after Daylight Saving Time?

Hit Record in Teams

Announcements

TL;DR

  • Software environments are a spectrum
  • Package Managers make your work reproducible
  • uv/rv require you to be proactive (start from scratch)
  • If you’re in the middle of an analysis, look at renv::snapshot()/requirements.txt as good starting points
  • Binder lets you make a preconfigured Docker container with JupyterLab/RStudio
  • Dockerfiles are the gold standard, but seek help making them

Learning Objectives

  • Explain why package management is complicated for both Python and R
  • Define a software environment and how it enables reproducible and portable analysis
  • Manage software packages reproducibly for an analysis using uv (Python)
  • Manage software packages reproducibly for an analysis using rv (R)
  • Explain what Binder and Dockerfiles are

Keep in Mind

  • The tools themselves are not necessarily easy to use
    • Require some command-line knowledge
  • I spent about 20 hours trying to get exercises up and running on Posit Cloud
    • Had issues with R (restoring took 20 minutes)
  • I will mostly demo on my own machine
  • At the end is a list of software you need to install

What is a reproducible environment?

  • Install packages with annotations, run anywhere
  • Fixed to a particular version of Python or R
  • Packages are fixed/pinned to specific version numbers
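
To make "pinned" concrete, here is a minimal sketch that flags requirement lines without an exact version. The function name and the sample package versions are illustrative assumptions, not part of the workshop materials.

```python
def unpinned_requirements(lines):
    """Return requirement lines that do not pin an exact version with '=='."""
    loose = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" not in line:
            loose.append(line)
    return loose


# Illustrative requirements; the version numbers are examples only
requirements = [
    "pandas==2.2.3",   # pinned: the same version on every machine
    "numpy>=1.26",     # range: may resolve to something newer next year
    "branca",          # unpinned: whatever is newest at install time
]
print(unpinned_requirements(requirements))
```

Only the first line would install identically on a different laptop a year from now; the other two depend on when the install happens.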

What: Reproducible Environments

In order of complexity:

Lockfile → Binder Ready → Dockerfile

The tradeoff: the more work you do on your side, the easier it is for others to reproduce your analysis.

  1. Lockfile (uv/rv)
  2. Binder Ready
  3. Dockerfile (The most work)

Why use package managers?

  • Installing the right set of packages can be a pain to reproduce (Dependency Hell)
    • Two different packages can require conflicting versions of another package
  • Future you, different laptop
  • Collaboration with others
  • Dissemination and sharing (see Making Code Ready for Publication)
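
A minimal sketch of how such a conflict arises, assuming two hypothetical packages that both depend on a shared library. The constraint parser is deliberately simplified (only `>=` and `<`); real resolvers such as uv handle far more cases.

```python
def parse(version):
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraint):
    """Check one constraint such as '>=2.0' or '<2.0' (only these two operators)."""
    if constraint.startswith(">="):
        return parse(version) >= parse(constraint[2:])
    if constraint.startswith("<"):
        return parse(version) < parse(constraint[1:])
    raise ValueError(f"unsupported constraint: {constraint}")


def compatible_versions(available, constraints):
    """Versions of the shared dependency that every requesting package accepts."""
    return [v for v in available if all(satisfies(v, c) for c in constraints)]


# Hypothetical: one package needs shared>=2.0, another needs shared<2.0
available = ["1.8.0", "1.9.1", "2.0.0", "2.1.0"]
print(compatible_versions(available, [">=2.0", "<2.0"]))  # no version fits both
print(compatible_versions(available, [">=1.9", "<2.0"]))  # one version fits both
```

When the first list comes back empty, no single install can satisfy both packages. A package manager detects this up front instead of leaving you with a broken library.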

What: Lockfiles and Environments

  • Lockfiles are a recipe for recreating an environment
  • We use tools to recreate them:
    • R: renv and rv
    • Python: venv and uv
  • Environments live in our folder isolated from everything else
    • Versions of R/Python
    • Libraries of packages
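
One way to see the isolation from inside Python is to compare the running interpreter's prefix with its base installation. This is a sketch using only the standard library; the function name is an assumption, and the check covers venv-style environments (which is what uv creates), not every kind of environment.

```python
import sys


def environment_summary():
    """Describe which Python is running and whether it is a virtual environment."""
    # In a venv-style environment, sys.prefix points at the environment folder
    # while sys.base_prefix still points at the underlying Python installation.
    in_venv = sys.prefix != sys.base_prefix
    return {
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "interpreter": sys.executable,
        "environment_root": sys.prefix,
        "isolated": in_venv,
    }


for key, value in environment_summary().items():
    print(f"{key}: {value}")
```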

uv/rv: State of the art package management

  • Both work on the command line
  • Much faster installs, using prebuilt binaries where available

What: Lockfile

my_project/              ## Top level
├── rv.lock              ## R
├── rproject.toml
├── uv.lock              ## Python
├── pyproject.toml
  • List of packages and versions that you used in analysis

Lockfile Examples

renv.lock (R):

{
  "R": {
    "Version": "4.2.3",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Packages": {
    "markdown": {
      "Package": "markdown",
      "Version": "1.0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "4584a57f565dd7987d59dda3a02cfb41"
    },
    "mime": {
      "Package": "mime",
      "Version": "0.7",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "908d95ccbfd1dd274073ef07a7c93934"
    }
  }
}

uv.lock (Python):

version = 1
revision = 1
requires-python = ">=3.13.2"
resolution-markers = [
    "python_full_version >= '3.14' and sys_platform == 'win32'",
    "python_full_version >= '3.14' and sys_platform == 'emscripten'",
    "python_full_version >= '3.14' and sys_platform != 'emscripten' and sys_platform != 'win32'",
    "python_full_version < '3.14' and sys_platform == 'win32'",
    "python_full_version < '3.14' and sys_platform == 'emscripten'",
    "python_full_version < '3.14' and sys_platform != 'emscripten' and sys_platform != 'win32'",
]

[[package]]
name = "branca"
version = "0.8.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
    { name = "jinja2" },
]
sdist = { url = "https://files.pythonhosted.org/packages/32/14/9d409124bda3f4ab7af3802aba07181d1fd56aa96cc4b999faea6a27a0d2/branca-0.8.2.tar.gz", hash = "sha256:e5040f4c286e973658c27de9225c1a5a7356dd0702a7c8d84c0f0dfbde388fe7", size = 27890 }
wheels = [
    { url = "https://files.pythonhosted.org/packages/7e/50/fc9680058e63161f2f63165b84c957a0df1415431104c408e8104a3a18ef/branca-0.8.2-py3-none-any.whl", hash = "sha256:2ebaef3983e3312733c1ae2b793b0a8ba3e1c4edeb7598e10328505280cf2f7c", size = 26193 },
]
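
Because lockfiles are structured text, they are easy to inspect programmatically. The sketch below reads a trimmed copy of the JSON example above and lists each package with its pinned version; the function name is an assumption for illustration.

```python
import json

# Trimmed copy of the JSON lockfile example (hashes omitted for brevity)
LOCK_TEXT = """
{
  "R": {"Version": "4.2.3"},
  "Packages": {
    "markdown": {"Package": "markdown", "Version": "1.0", "Repository": "CRAN"},
    "mime": {"Package": "mime", "Version": "0.7", "Repository": "CRAN"}
  }
}
"""


def pinned_packages(lock_text):
    """Map each package name in the lockfile to its pinned version."""
    lock = json.loads(lock_text)
    return {name: entry["Version"] for name, entry in lock["Packages"].items()}


print(pinned_packages(LOCK_TEXT))
```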

Why not Conda?

Anaconda charges institutions for using its default package channels. Be aware that you will need to pay those charges or change your channel to the Fred Hutch version.

For more info: https://conda-forge.fredhutch.org/

Exercise: Explore a repository set up to be reproducible

  • R (GitHub Link)
  • R users: Look at both the rproject.toml and rv.lock file
  • Python (GitHub Link)
  • Python users: Look at both the pyproject.toml and uv.lock file
  • Why are there two different files (a .toml file and a .lock file)? Any guesses?

rproject.toml / pyproject.toml

  • Human readable and writable
  • Get translated to rv.lock and uv.lock

rproject.toml:

[project]
name = "package-management-r-python"
r_version = "4.4"

repositories = [
  {alias = "PPM", url = "https://packagemanager.posit.co/cran/latest"},
]

dependencies = [
    "tidyverse",
    "sf",
]

pyproject.toml:

[project]
name = "package-management-r-python"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "pandas>=3.0.0",
]

rv.lock / uv.lock

  • Generated by rv and uv
  • Not meant to be modified by user
  • Contains the dependency tree for installing your packages
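
To illustrate what "dependency tree" means, here is a small sketch that walks a lockfile-like structure and collects everything needed to install one package. Only the branca entry comes from the example earlier; the jinja2 and markupsafe entries and their versions are illustrative assumptions.

```python
def install_closure(packages, roots):
    """Every package needed to install `roots`, following dependencies transitively."""
    needed, stack = set(), list(roots)
    while stack:
        name = stack.pop()
        if name in needed:
            continue  # already visited, avoids loops on circular dependencies
        needed.add(name)
        stack.extend(packages[name]["dependencies"])
    return sorted(needed)


# Lockfile-like data; versions other than branca's are illustrative
LOCK = {
    "branca": {"version": "0.8.2", "dependencies": ["jinja2"]},
    "jinja2": {"version": "3.1.6", "dependencies": ["markupsafe"]},
    "markupsafe": {"version": "3.0.2", "dependencies": []},
}
print(install_closure(LOCK, ["branca"]))
```

You only asked for one package, but the lock file has to pin all three so the install is identical everywhere.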

How do you restore an environment?

  • Run uv sync (Python) or rv sync (R) in your project folder
  • Requires the matching version of R/Python to be installed
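
A sketch of how the restore step could be scripted, assuming uv and rv are on your PATH. The helper names are hypothetical; the only commands used are uv sync and rv sync, as described above.

```python
import os
import shutil
import subprocess

# Which sync command belongs to which lockfile
SYNC_COMMANDS = {"uv.lock": ["uv", "sync"], "rv.lock": ["rv", "sync"]}


def restore_commands(filenames):
    """Pick the sync commands that apply, given the files in a project folder."""
    present = set(filenames)
    return [cmd for lockfile, cmd in SYNC_COMMANDS.items() if lockfile in present]


def restore(project_dir="."):
    """Run each applicable sync command, skipping tools that are not installed."""
    for command in restore_commands(os.listdir(project_dir)):
        if shutil.which(command[0]) is None:
            print(f"skipping {' '.join(command)}: {command[0]} not found on PATH")
            continue
        subprocess.run(command, cwd=project_dir, check=True)


print(restore_commands(["uv.lock", "pyproject.toml", "analysis.ipynb"]))
```

Calling `restore("my_project")` after cloning would then rebuild whichever environments the repository defines.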

Demo: Clone and restore an environment

How do we get started?

Two Ways

  • Start from scratch with rv/uv
  • Make a requirements.txt/renv.lock file and migrate

uv and rv: start from scratch

  • Both uv and rv expect you to start from scratch
  • Initialize your folder with uv init/rv init
  • Packages are installed into a project-local folder (.venv for uv, rv/ for rv)
  • See Making Code Ready for Publication if you are in the middle of an analysis project

Adding packages using uv add/rv add

  • You need to declare the packages you want to use
    • uv add pandas
    • rv add tidyverse
  • uv / rv will solve the dependency tree and update rproject.toml or pyproject.toml and the lock files

Example: Making a lockfile

  1. rv init
  2. rv add dplyr
  3. (Optional) git add . and git commit everything
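
The same steps work for either tool. Here is a minimal sketch that builds the command sequence without running anything, so you can review it first; the function name and the commit message are assumptions.

```python
def setup_commands(tool, packages, commit=False):
    """Build the init/add command sequence for 'uv' or 'rv' without running it."""
    if tool not in ("uv", "rv"):
        raise ValueError("tool must be 'uv' or 'rv'")
    commands = [[tool, "init"]]
    # One add per package keeps the sketch independent of multi-package syntax
    commands += [[tool, "add", package] for package in packages]
    if commit:
        commands += [["git", "add", "."],
                     ["git", "commit", "-m", "Add lockfile"]]  # hypothetical message
    return commands


for command in setup_commands("rv", ["dplyr"], commit=True):
    print(" ".join(command))
```

Swapping in `setup_commands("uv", ["pandas"])` gives the Python equivalent of the R example.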

What if you’re already in the middle of analysis?

  • Here’s some guidance on how to migrate your project

Making a Lockfile from your current project

  • Usually done after you execute a notebook
  • Takes the packages you have loaded into memory and writes them to a lockfile with versions

Put this code at the bottom of your notebook.

# Python: run this in the last cell of your notebook
# Install first with: pip install session-info (or: uv add session-info)
import session_info as si

si.show(na=True, os=True, cpu=False,
    jupyter=None, dependencies=None,
    write_req_file=True,
    req_file_name="requirements.txt")

# R: run this after your analysis code
# install.packages("renv")
renv::snapshot()
  • renv will scan your project and note package versions
  • Can use the renv package to generate a current list of packages
  • Does not require a virtual environment to be initialized
  • Generates renv.lock
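
On the Python side, if you cannot install session-info, a standard-library alternative is sketched below. Note the difference: it lists every package installed in the current environment, not just the ones your notebook imported. The function name is an assumption.

```python
from importlib import metadata


def pinned_lines():
    """Return 'name==version' for every distribution installed in this environment."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


# Print rather than write, so nothing is overwritten by accident
print("\n".join(pinned_lines()))
```

Redirecting that output to requirements.txt gives a starting point that you can trim down to the packages the analysis really uses.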

Advanced Topics

What: Binder Ready Repository

A special way to share your analysis

  • Put your project on GitHub
  • Can plug your repository link into mybinder.org
  • Generates JupyterLab / RStudio / Shiny Server instance

What: mybinder.org

How does Binder work?

  • Launches your analysis in a container (Dockerizes your analysis)
  • Uses requirements.txt (Python), environment.yml (Conda), install.R (R), or a Dockerfile in your repository
  • Installs relevant packages and dependencies into a Docker image
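
A quick way to check whether a repository is "Binder ready" is to look for the files listed above. This sketch only checks file names at the top level; it does not model how Binder chooses between them, and the function name is an assumption.

```python
# Configuration files named above and what each one describes
BINDER_CONFIG_FILES = {
    "requirements.txt": "Python packages",
    "environment.yml": "Conda environment",
    "install.R": "R packages",
    "Dockerfile": "custom container image",
}


def binder_configs(filenames):
    """Return the Binder configuration files present among `filenames`."""
    present = set(filenames)
    return [name for name in BINDER_CONFIG_FILES if name in present]


repo_files = ["README.md", "analysis.Rmd", "install.R", "renv.lock"]
found = binder_configs(repo_files)
print(found if found else "No Binder configuration files found")
```

On a real checkout you could pass `os.listdir(".")` instead of the hard-coded list.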

install.R using renv (put in your top directory)

install.packages("renv")
renv::restore()

Dockerfile

  • A precise list of instructions to install your computational environment
  • More useful if you are distributing software.
  • More portable across systems other than Binder
  • Takes a lot of work, builds on the work of others

Dockerfile Tips

  • Don’t try to create Dockerfiles from scratch!
    • WILDS Docker Library - Dockerfiles for a number of genomics/bioinformatics packages
    • Research Computing Data House Calls!
    • Other Sources: BioC, Rocker Project
    • Use repo2docker: https://repo2docker.readthedocs.io/en/latest/
      • https://repo2docker.readthedocs.io/en/latest/configuration/#config-files

Dockerfile example

FROM rocker/geospatial:4.5.0

LABEL org.opencontainers.image.licenses="GPL-2.0-or-later" \
      org.opencontainers.image.source="https://github.com/achubaty/rocker-files" \
      org.opencontainers.image.vendor="FOR-CAST Research & Analytics" \
      org.opencontainers.image.authors="achubaty@for-cast.ca"

COPY scripts/* /rocker-files_scripts/

RUN /rocker-files_scripts/install_additional_libs.sh
RUN /rocker-files_scripts/install_geospatial_extras.sh
RUN /rocker-files_scripts/install_geospatial_R.sh

https://github.com/achubaty/rocker-files/blob/main/dockerfiles/r-spatial-base_4.5.Dockerfile

WILDS Docker Library Example

https://github.com/getwilds/wilds-docker-library/blob/main/bcftools/Dockerfile_latest

Using Containers on HPC

TL;DR

  • Software environments are a spectrum
  • Package Managers make your work reproducible
  • uv/rv require you to be proactive (start from scratch)
  • If you’re in the middle of an analysis, look at renv::snapshot()/requirements.txt as good starting points
  • Binder lets you make a preconfigured Docker container with JupyterLab/RStudio
  • Dockerfiles are the gold standard, but seek help making them

What you need to reproduce the examples

For R:

  1. Install rv
  2. Install rig to manage different versions of R
    1. Mac installer: https://github.com/r-lib/rig/releases/download/v0.7.1/rig-0.7.1-macOS-x86_64.pkg
    2. PC installer: https://github.com/r-lib/rig/releases/download/v0.7.1/rig-windows-0.7.1.exe
  3. Run rig add 4.5.2 to install R 4.5.2 on your system
  4. Install Positron
  5. Clone or download http://github.com/fhdsl/mcr_example_r
  6. Open the folder in Positron (File >> Open Folder)

For Python:

  1. Install Python 3.13
  2. Install uv
    1. Mac Instructions (I use homebrew)
    2. PC Instructions
  3. Install Positron
  4. Clone or download http://github.com/fhdsl/mcr_example_python
  5. Open the folder in Positron (File >> Open Folder)