Training Resources for Data Science

In-Person Training Opportunities Overview

Season 2: Winter 2024

                     Beginner                        Intermediate                    Advanced
Programming          Intro to R
Rigorous Science     Introduction to Git             Collaborative Git and GitHub
Scalable Computing   Introduction to Command Line    Cluster 101

Self-service Training Materials Overview

  • Cluster 101 Guide: Intro to using the Fred Hutch HPC cluster for new or experienced users; provides a certification option. Links: With Certification or Without Certification
  • WDL 101: Intro to using Cromwell to run WDL workflows at Fred Hutch. Link: Course link
  • Code Review: Leading a lab with novice or experienced code writers and users? Either way, our Code Review guidance materials include helpful suggestions for lab members of varying expertise and for different group dynamics. Link: Course Link
  • NIH Data Sharing: A guide, under active development, that walks you through complying with the new 2023 NIH Data Sharing Policy. Link: Guide Link

Course Descriptions and Details

Introduction to R

You will learn the fundamentals of R, a statistical programming language, and use it to wrangle data for analysis and visualization. The programming skills you learn will transfer to further independent study of R and to other high-level languages such as Python. By the end of the class, you will be reproducing an analysis from a scientific publication!

Targeted audience: Researchers who want to do more with their data analyses and visualizations. This course is appropriate for those who want to learn coding for the first time, or have explored programming and want to focus on fundamentals in R.

Commitment: 6 weekly 1.5-hour classes, with an encouraged 1-2 hours of practice each week. If desired, the course can be taken in a modular way, picking and choosing topics of interest.

Course dates: Noon - 1:30pm PT on January 22 and 29, February 5, 12, and 26, and March 4. Register here.

Introduction to Command Line

Fluency in programming and data science requires using computer software from the Command Line, a text-based way of controlling the computer. You will go on a guided under-the-hood tour behind the graphical interface we typically use: you will learn how to interact with and manipulate files, folders, and software via the Command Line.
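For a flavor of what this looks like, here is a short, generic Unix session; the folder and file names are made up for illustration, not part of the workshop's curriculum:

```shell
# Make a project folder and move into it
mkdir my-project
cd my-project

# Create an empty file, then list the folder's contents in detail
touch notes.txt
ls -l

# Copy a file, rename (move) it, and print its contents
cp notes.txt draft.txt
mv draft.txt notes-backup.txt
cat notes.txt
```

Each line is a command you type instead of a click, which is what makes these steps scriptable and repeatable.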

Targeted audience: Researchers who want to use scientific software launched from the command line, want to use a high-performance cluster computing environment, or want to use a cloud computing environment.

Commitment: A 1.5 hour workshop.

Workshop date: Noon - 1:30pm PT on January 24. Register here.

Cluster 101

Many scientific computing tasks cannot be done locally on a personal computer due to constraints in computation, data, and memory. In this workshop, you will learn how to connect to the Fred Hutch SLURM high-performance computing cluster to transfer files, load scientific software, compute interactively, and launch jobs!

Targeted audience: Researchers who want to use Fred Hutch’s SLURM high-performance computing cluster to run software and analysis at scale.

Prerequisites: Completion of the Intro to Command Line workshop or demonstrated competency.

Commitment: A 1.5 hour workshop.

Workshop date: Noon - 1:30pm PT on January 31. Register here.
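The job-launching piece of this workflow can be sketched as a minimal SLURM batch script. This is a generic sketch, not Fred Hutch's exact configuration: the module name, resource requests, and file names below are illustrative placeholders.

```shell
#!/bin/bash
#SBATCH --job-name=demo-analysis   # name shown in the queue (squeue)
#SBATCH --time=00:30:00            # wall-clock limit (30 minutes)
#SBATCH --mem=4G                   # memory request
#SBATCH --cpus-per-task=1          # CPU cores for the task

# Load scientific software via environment modules, then run the analysis.
# "fhR" stands in for whatever module your cluster actually provides.
module load fhR
Rscript analysis.R
```

You would submit a script like this with `sbatch demo-analysis.sh`, monitor it with `squeue -u $USER`, and cancel it with `scancel <jobid>`.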

Introduction to Git

You will learn how to use Git, a version control system that is the primary means of doing reproducible and collaborative research. You will use Git from the command line to document the history of your code, create different versions of your code, and share your code with an audience via GitHub!
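The core moves of that workflow look roughly like this; the repository name, file, and identity here are placeholders for illustration, not the workshop's exact exercises:

```shell
# Create a repository and record who you are (placeholder identity)
git init my-analysis
cd my-analysis
git config user.name "Ada Lovelace"
git config user.email "ada@example.org"

# Stage a file and record a snapshot of it in the project's history
echo 'x <- 1:10' > analysis.R
git add analysis.R
git commit -m "Add first analysis script"

# Inspect the recorded history
git log --oneline
```

Sharing the result on GitHub then amounts to pointing the repository at a remote (`git remote add`) and pushing to it (`git push`), which the workshop walks through.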

Targeted audience: Researchers who want to keep track of the history of their code to a professional standard, and share it with an audience.

Prerequisites: Completion of the Intro to Command Line workshop or demonstrated competency.

Commitment: A 1.5 hour workshop.

Workshop date: Noon - 1:30pm PT on February 7. Register here.

Collaborative Git and GitHub

You will expand your current knowledge of Git and GitHub to help your research be more collaborative, reproducible, and transparent. You will learn how to develop your work independently on a “branch” before “merging” it back to a shared repository, and resolve any conflicts along the way. Then, you will learn about the pull request model of collaboration on GitHub and how to conduct code reviews.
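In command-line terms, the branch-and-merge part of that workflow looks roughly like this (a self-contained sketch with placeholder names; `git switch` needs Git 2.23+, and older versions use `git checkout -b` instead):

```shell
# Set up a small repository to practice on (placeholder identity and file)
git init demo
cd demo
git config user.name "Ada Lovelace"
git config user.email "ada@example.org"
echo 'x <- 1:10' > analysis.R
git add analysis.R
git commit -m "Initial commit"

# Develop independently on a branch...
git switch -c feature-plot          # create and move to a new branch
echo 'plot(x)' >> analysis.R
git add analysis.R
git commit -m "Add plot"

# ...then merge it back into the shared branch
git switch -                        # return to the branch you started on
git merge feature-plot              # no conflicting edits here, so this fast-forwards
```

On GitHub, that final merge typically happens through a pull request instead, so collaborators can review the branch's changes before they land.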

Targeted audience: Researchers who want to work on a code base collaboratively in a version-controlled manner.

Prerequisites: Completion of the Intro to Git workshop or demonstrated competency.

Commitment: A 1.5 hour workshop.

Workshop date: Noon - 1:30pm PT on February 14. Register here.

Cluster 101 Guide

We collaborated with SciComp to create a course called Cluster 101, which introduces new and experienced high performance computing users to the Fred Hutch cluster. The course can be taken anytime, anywhere, for free (just set the Leanpub price to zero). It will help you check that your account is set up to use the cluster, and will either confirm that you understand the basics of using a cluster or get you up to speed. It can even provide a certification for reporting between staff and lab leaders, if need be.

  • If you need/want to get a certification, please take the course through Leanpub at this link.

  • If you do not need the certification or want to bookmark the course for future reference, you can find the material at this link.

If you take this course and want to give us feedback or would like to learn more about it, you can share your thoughts in Slack in the #ask-dasl channel, or you can file an issue on the course’s GitHub repository.

WDL 101

Our next emerging guide is WDL 101: Using Cromwell at the Fred Hutch, which helps you leverage our preconfigured software to run and manage all of your computing jobs, including workflows written in WDL (a widely used open specification for workflow description), even if they use Docker containers! This is an excellent tool for researchers who do not use large-scale computing resources often, or who want a simple way to upload a file and submit jobs to the cluster through a website. For researchers who do use a large amount of cluster resources, this approach can make your computing more efficient, help you scale up your pleasantly parallel work, and shift you toward using containers and structuring tasks appropriately, which eases a move to the cloud when needed. This guide is under active development as of Nov 2022.

If you use this guide and want to give us feedback or would like to learn more about it, you can share your thoughts in Slack in the #ask-dasl channel, or you can file an issue on the guide’s GitHub repository.

Code Review

Leading a lab with novice or experienced code writers and users? Either way, see our Code Review materials, which include helpful suggestions for lab members of varying expertise and for different group dynamics.

If you take this course and want to give us feedback or would like to learn more about it, you can share your thoughts in Slack in the #ask-dasl channel, or you can file an issue on the course’s GitHub repository.

NIH Data Sharing

We have created and are actively developing a guide, which you can find here, that walks you through the process of complying with the new 2023 NIH Data Sharing Policy. We have also created the DMS Helper App to make filling in and downloading your data sharing plan easier.

If you use this guide and want to give us feedback or would like to learn more about it, you can share your thoughts in Slack in the #ask-dasl channel, or you can fill out our Google Feedback Form.

Full List of Self-Service Training Resources

FH DaSL staff have developed many training resources as part of various collaborations and initiatives. Resources spanning a wide range of data science and tool-specific topics are available from the sources listed below.

(in alphabetical order)

AnVIL

AnVIL is a computing platform that enables researchers to analyze controlled access data sets in a cloud computing environment. It has loads of training materials to support those using it!

Code Review Guidance for Research Labs

Leading a lab with novice or experienced code writers and users? Either way, see our Code Review materials, which include helpful suggestions for lab members of varying expertise and for different group dynamics.

DataTrail

The DataTrail courses are free and designed to help those with less familiarity with computers and technology become savvy data scientists. The program covers the technical fundamentals of data science, but also how to network and other accompanying skills necessary for jobs in data science.

ITCR Training Network

The ITCR Training Network is an effort to catalyze cancer informatics research through training opportunities. It has online courses that are available for free and/or for certification, and it also hosts synchronous training events and workshops related to data science in cancer research. Links to all the current ITCR courses can be found here.

Johns Hopkins Data Science Courses

There are a lot of helpful resources for data science that we made as part of our work at Johns Hopkins. These courses cover various applications and tools of data science, mostly focused on using R and the Tidyverse.

Open Case Studies

The Open Case Studies project can be used by educators and learners alike to help people learn how to apply data science to real-life data.