9 Week 3: Containers

9.1 Containers

We already learned about software modules (Section 3.5) on the gizmo cluster. There is an alternative way to use software: pulling and running a software container.

9.1.1 What is a Container?

A container is a self-contained unit of software. It contains everything needed to run the software on a variety of machines. If you have the container software installed on your machine, it doesn’t matter whether the machine runs macOS, Linux, or Windows: the container will behave consistently across different operating systems and architectures.

The container has the following contents:

Software - The software we want to run in a container. For bioinformatics work, this is usually an aligner such as bwa, or utilities such as samtools.
Software Dependencies - various software packages needed to run the software. For example, if we wanted to run tidyverse in a container, we need to have R installed in the container as well.
Filesystem - containers have their own isolated filesystem that can be connected to the “outside world” - everything outside of the container. We’ll learn more about customizing these with bind paths (Section 9.3.3).

In short, the container has everything needed to run the software. It is not a full operating system, but a smaller mini-version that cuts out a lot of cruft.

Containers leverage the file system of their host to manage files. These connections are called both Volumes (the Docker term) and Bind Paths (the Apptainer term).

9.1.2 Docker vs. Apptainer

There are two basic ways to run Docker containers:

Using the Docker software
Using the Apptainer software (for HPC systems)

In general, Docker is used on systems where you have a high level of access to the system. This is because Docker uses a special user group called docker that has essentially root-level privileges. This is not something to be taken lightly.

This is not the case for HPC systems, which are shared, so granting this level of access to many people is not practical. This is when we use Apptainer (which used to be called Singularity), which requires a much lower level of user privileges to execute tasks. For more info, see Section 9.3.

Be Secure

Before we get started, security is always a concern when running containers. The docker group has elevated status on a system, so we need to be careful that the containers we run aren’t introducing any system vulnerabilities. Note that on HPC systems, the main mechanism for running containers is Apptainer, which is designed to be more secure.

These concerns matter most when running containers that are web servers or part of a web stack, but they are also worth thinking about when running jobs on HPC.

Here are some guidelines to think about when you are working with a container.

Use vendor-specific Docker Images when possible.
Use container scanners to spot potential vulnerabilities. DockerHub has a vulnerability scanner that scans your Docker images for potential vulnerabilities. For example, the WILDS Docker library employs a vulnerability scanner and the containers are regularly patched to prevent vulnerabilities.
Avoid kitchen-sink images. One issue is when an image is built on top of many other images. It makes it really difficult to plug vulnerabilities. When in doubt, use images from trusted people and organizations. At the very least, look at the Dockerfile to see that suspicious software isn’t being installed.

9.2 Pulling a Docker Container and Running It

Let’s pull a docker container from the Docker registry. Note we have to specify docker:// when we pull the container, because Apptainer has its own internal format called SIF.

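module load Apptainer/1.1.6
apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1
# Bind the external data folder to /mydata inside the container.
# The > redirect is handled by the host shell, so it takes the external path.
apptainer exec \
    --bind /path/to/data:/mydata \
    docker://biocontainers/samtools:v1.9-4-deb_cv1 \
    samtools view -c "/mydata/$1" > "/path/to/data/$1.counts.txt"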

9.2.1 Using a Container

In Section 9.1, we learned a little bit about using Apptainer to run a Docker container. Let’s try to pull a common container, samtools, and run things inside the container.

The first thing we need to do is load Apptainer:

module load Apptainer/1.1.6

Then we can pull the docker container:

apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1

We can check if we have pulled the docker image by using

apptainer cache list

Okay, now we have confirmed that we downloaded the apptainer image. Now we can try to execute things with it.

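apptainer exec \
    --bind /path/to/data:/mydata \
    docker://biocontainers/samtools:v1.9-4-deb_cv1 \
    samtools view -c "/mydata/$1" > "/path/to/data/$1.counts.txt"

Line 2 (--bind): Bind path (see Section 9.3.3)
Line 3 (docker://): Docker image we have downloaded
Line 4 (samtools view): samtools command to run. The > redirect is handled by the shell outside the container, so the output path is the external one.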

It’s worth trying this once to make sure you understand how all of the pieces are connected. In general, I do recommend using a workflow runner (Section 12.3) instead, because it helps manage all of these details, and it makes reading files easier.
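If you expect to run the same image many times, one option is to pull it into a named SIF file once and point apptainer exec at that file instead of the docker:// address. The sketch below assumes a hypothetical helper name (ensure_sif) and a hypothetical file name (samtools.sif); apptainer pull does accept an output file name before the image address.

```shell
# Hypothetical helper: pull an image into a named SIF file only if that
# file is not already present.
ensure_sif() {
    sif_file="$1"     # where to store the image, e.g. samtools.sif
    image_uri="$2"    # e.g. docker://biocontainers/samtools:v1.9-4-deb_cv1
    if [ -f "$sif_file" ]; then
        echo "Using existing $sif_file"
    else
        apptainer pull "$sif_file" "$image_uri"
    fi
}

# Example use (commented out because it needs Apptainer and network access):
# module load Apptainer/1.1.6
# ensure_sif samtools.sif docker://biocontainers/samtools:v1.9-4-deb_cv1
# apptainer exec --bind /path/to/data:/mydata samtools.sif samtools --version
```

Keeping the SIF file next to your scripts makes it obvious which image version a project depends on.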

9.2.2 Bind Paths

One thing to keep in mind is that containers have their own filesystem. They can only read and write to folders in the external filesystem that you give them access to with bind paths. The one exception is the current working directory.

For more info about bind paths see Section 9.3.3.

9.3 Testing code in a container

I think the hardest thing about working with containers is wrapping your head around their indirectness. You are running software with its own internal filesystem, and the challenges are getting the container to read files in folders/paths outside of its own filesystem, as well as writing files into those outside folders.

The best way to understand containers is to open a shell in a container. Remember, containers are self-contained mini operating systems, and the most important thing to understand is what they isolate from the rest of the system, and how to get files into and out of the container.

In this section, we talk about testing scripts in a container using apptainer. We use apptainer (formerly Singularity) in order to run Docker containers on a shared HPC system. This is because Docker itself requires root-level privileges, which is not secure on shared systems.

In order to do our testing, we’ll first pull the Docker container, map our bind point (so our container can access files outside of its file system), and then run scripts in the container.

Even if you aren’t going to frequently use Apptainer in your work, I recommend trying an interactive shell in a container at least once or twice to learn about the container filesystem and conceptually understand how you connect it to the external filesystem.

9.3.1 Pulling a Docker Container

Let’s pull a docker container from the Docker registry. Note we have to specify docker:// when we pull the container, because Apptainer has its own internal format called SIF.

module load Apptainer/1.1.6
apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1

9.3.2 Opening a Shell in a Container with `apptainer shell`

When you’re getting started, opening a shell using Apptainer can help you test out things like filepaths and how they’re accessed in the container. It’s hard to get an intuition for how file I/O works with containers until you can see the limited view from the container.

By default, Apptainer containers can see your current directory and navigate to the files in it.

You can open an Apptainer shell in a container using apptainer shell. Remember to use docker:// before the container name. For example:

module load Apptainer/1.1.6
apptainer shell docker://biocontainers/samtools:v1.9-4-deb_cv1

This will load the Apptainer module, and then open a Bash shell in the container using apptainer shell. Once you’re in the container, you can test code, especially seeing whether your files can be seen by the container (see Section 9.3.3). 90% of the issues with using Docker containers have to do with bind paths, so we’ll talk about that next.

Once you’re in the shell, you can take a look at where samtools is installed:

which samtools

Note that the container filesystem is isolated, and we need to explicitly build connections to it (called bind paths) to get files in and out. We’ll talk more about this in the next section.

Once we’re done testing scripts in our containers, we can exit the shell and get back into the node.

exit

9.3.3 Using bind paths in containers

One thing to keep in mind is that every container has its own filesystem. One of the hardest things to wrap your head around for containers is how their filesystems work, and how to access files that are outside of the container filesystem. We’ll call any filesystems outside of the container external filesystems to make the discussion a little easier.

By default, the containers have access to your current working directory. We could make this where our scripts live (such as /home/tladera2/), but because our data is elsewhere, we’ll need to specify that location (/fh/fast/mydata/) as well.

The main mechanism we have in Apptainer to access the external filesystem is bind paths. Much like mounting a drive, we can bind directories from the external filesystem using these bind paths.

flowchart LR
   B["External Directory - /fh/fast/mydata/"] 
   B --read--> C
   C --write--> B
   A["Container Filesystem - /mydata/"]--write-->C("--bind /fh/fast/mydata/:/mydata/")
   C --read--> A

I think of bind paths as “tunnels” that give access to particular folders in the external filesystem. Once the tunnel is open, we can access data files, process them, and save them using the bind path.

Say my data lives in /fh/fast/mydata/. Then I can specify a bind point called /mydata/ in my apptainer shell and apptainer run commands so my container can access the files in that directory. Then in the container, we can access the files through the /mydata/ bind path.

We can do this with the --bind option:

apptainer shell --bind /fh/fast/mydata:/mydata docker://biocontainers/samtools:v1.9-4-deb_cv1

Note that the bind syntax doesn’t have the trailing slash (/). That is, note that it is:

--bind /fh/fast/mydata: ....

Rather than

--bind /fh/fast/mydata/: ....

Now our /fh/fast/mydata/ folder will be available as /mydata/ in the container. We can read and write files to this bind point. For example, while in the container’s shell, I’d refer to the .bam file /fh/fast/mydata/my_bam_file.bam as:

samtools view -c /mydata/my_bam_file.bam
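To make the mapping concrete, here is a small sketch of how an external path translates into the path the container sees. The function name to_container_path is hypothetical and is not part of Apptainer; it only performs the same string substitution you would do by hand when writing paths for a bound directory.

```shell
# Hypothetical helper: translate an external path into the path the
# container sees, given the two halves of a --bind argument.
to_container_path() {
    host_path="$1"
    bind_src="${2%/}"     # external directory, trailing slash removed
    bind_dest="${3%/}"    # directory inside the container
    case "$host_path" in
        "$bind_src"/*)
            # Swap the external prefix for the container prefix.
            echo "${bind_dest}${host_path#"$bind_src"}"
            ;;
        *)
            echo "not under $bind_src: $host_path" >&2
            return 1
            ;;
    esac
}

to_container_path /fh/fast/mydata/my_bam_file.bam /fh/fast/mydata/ /mydata/
# prints /mydata/my_bam_file.bam
```

Any file that is not under a bound directory (or the current working directory) has no container path at all, which is why the helper returns an error in that case.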

Opening a Shell in a Docker Container with Docker

For the most part, due to security reasons, we don’t use Docker on HPC systems. In short, the docker group essentially has root-level access to the machine, and that’s not good for security on a shared resource like an HPC.

However, if you have admin level access (for example, on your own laptop), you can open up an interactive shell with docker run -it:

docker run -it biocontainers/samtools:v1.9-4-deb_cv1 /bin/bash

This will open a bash shell much like apptainer shell. Note that volumes (the docker equivalent of bind paths) are specified differently in Docker compared to Apptainer.
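For comparison, Docker attaches volumes with the -v (or --volume) flag, written as external_path:container_path, and the external path needs to be absolute. The sketch below uses a hypothetical helper (docker_shell) that prints the command by default so you can inspect it first; the ~/mydata folder is also only an example.

```shell
# Hypothetical helper: print (DRY_RUN=1, the default here) or run a
# docker command that opens a shell with one volume attached.
docker_shell() {
    host_dir="${1%/}"      # absolute external directory, trailing slash removed
    container_dir="$2"     # where it appears inside the container
    image="$3"
    set -- docker run -it -v "${host_dir}:${container_dir}" "$image" /bin/bash
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

docker_shell "$HOME/mydata/" /mydata biocontainers/samtools:v1.9-4-deb_cv1
```

Setting DRY_RUN=0 before the call runs the command instead of printing it, which only makes sense on a machine where you are allowed to use Docker.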

WDL makes this way easier

A major point of failure with Apptainer scripting is when our scripts aren’t using the right bind paths. It becomes even more complicated when you are running multiple steps.

This is one reason we recommend writing WDL Workflows and using a workflow manager (such as Sprocket) to run your workflows. You don’t have to worry that your bind points are set up correctly, because they are handled by the workflow manager.

9.3.4 Executing in the Apptainer Shell

Ok, now we have a bind point, so now we can test our script in the shell. For example, we can see if we are invoking samtools in the correct way and that our bind points work.

samtools view -c /mydata/my_bam_file.bam > /mydata/bam_counts.txt

Again, trying out scripts in the container is the best way to understand what the container can and can’t see.

9.3.5 Exiting the container when you’re done

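You can exit, like any shell you open. You should then be out of the container. Confirm by checking that your prompt has changed back from Apptainer> to your usual prompt. Running hostname is not a reliable check here, because by default an Apptainer container reports the same hostname as its host.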
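A scripted way to check, as a sketch: Apptainer is assumed here to set the APPTAINER_CONTAINER variable inside a running container (older Singularity releases used SINGULARITY_CONTAINER), so an empty value suggests you are back on the host. The function name where_am_i is hypothetical; verify the variable on your own system by running echo $APPTAINER_CONTAINER inside apptainer shell.

```shell
# Hypothetical helper: report whether this shell appears to be running
# inside an Apptainer (or older Singularity) container.
# Assumption: the runtime sets one of these variables inside the container.
where_am_i() {
    if [ -n "${APPTAINER_CONTAINER:-}${SINGULARITY_CONTAINER:-}" ]; then
        echo "inside a container"
    else
        echo "on the host"
    fi
}

where_am_i
```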

9.3.6 Testing outside of the container

Let’s take everything that we learned and put it in a script that we can run on the HPC:

#!/bin/bash
# Script to samtools view -c an input file:
# Usage: ./run_sam.sh <my_bam_file.bam>
# Outputs a count file: my_bam_file.bam.counts.txt
module load Apptainer/1.1.6
# The > redirect is handled by the host shell, so it uses the external path.
apptainer exec --bind /fh/fast/mydata:/mydata \
    docker://biocontainers/samtools:v1.9-4-deb_cv1 \
    samtools view -c "/mydata/$1" > "/fh/fast/mydata/$1.counts.txt"
#apptainer cache clean
module purge

We can run this script with the following command:

./run_sam.sh chr1.bam

And it will output a file called chr1.bam.counts.txt.
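A script like this fails in confusing ways if the file name is missing or misspelled, because the error then comes from inside the container. One possible guard is sketched below; check_input is a hypothetical helper name, and the check runs on the host, so it uses the external path rather than the bind path.

```shell
# Hypothetical pre-flight check for run_sam.sh: make sure a file name was
# given and that the file exists in the external data directory.
check_input() {
    data_dir="${1%/}"    # external data directory, e.g. /fh/fast/mydata
    bam_name="$2"        # file name passed to the script
    if [ -z "$bam_name" ]; then
        echo "Usage: ./run_sam.sh <my_bam_file.bam>" >&2
        return 1
    fi
    if [ ! -f "$data_dir/$bam_name" ]; then
        echo "Cannot find $data_dir/$bam_name" >&2
        return 1
    fi
}

# In run_sam.sh this could go before the apptainer command:
# check_input /fh/fast/mydata "$1" || exit 1
```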

Apptainer Cache

The apptainer cache is where your docker images live. They are translated to the native apptainer .sif format.

You can see what’s in your cache by using

apptainer cache list

By default the cache lives at ~/.apptainer/cache.

If you need to clear out the cache, you can run

apptainer cache clean

There are a number of environment variables (Section 15.3) that can be set, including login tokens for pulling from a private registry. More information is available in the Apptainer documentation.
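One of those variables is APPTAINER_CACHEDIR, which moves the cache away from its default location; this can help when your home directory has a small quota. A minimal sketch, where the directory under the temporary folder is only a placeholder for a location with enough free space:

```shell
# Hypothetical cache location; substitute a directory that suits your system.
cache_dir="${TMPDIR:-/tmp}/apptainer_cache"
mkdir -p "$cache_dir"

# Apptainer reads this variable when it pulls and caches images.
export APPTAINER_CACHEDIR="$cache_dir"
echo "Apptainer cache directory: $APPTAINER_CACHEDIR"
```

The setting only lasts for the current shell session, so it would need to go into a job script or shell startup file to persist.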

9.3.7 The WILDS Docker Library

The Data Science Lab has a set of Docker containers for common Bioinformatics tasks available in the WILDS Docker Library. These include:

bwa mem
samtools
gatk
bcftools
manta
cnvkit
deseq2

Among many others. Be sure to check it out before you start building your own containers.

9.4 What’s Next?

Next week we will discuss using MiniWDL (a workflow manager) to process files through multi-step workflows, and how to run WDL files using PROOF.