Chapter 2 DaSEH Infrastructure

2.1 Learning Objectives

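In this chapter we will describe the overall infrastructure of the DaSEH Project, which includes the DaSEH website, our slide structure, the DaSEH GitHub organization, codeathon resources, and data.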

2.2 DaSEH Website

The DaSEH website describes the mission of the DaSEH project, information for learners to apply to participate, a summary of how the course works, links to our learning content, and links to additional resources. It is built using R Markdown on GitHub at this repository: https://github.com/fhdsl/daseh. It includes Google Forms to gather information or feedback from potential participants. It has several relevant pages:

  1. Home: explains what the course is for and our mission: https://daseh.org/index.html

  2. Apply: links to a form to apply to DaSEH. The Google Forms link changes periodically, but you can view a static version here

  3. Course Content

  4. Support

    • Learning Resources: links to extra learning and help, including external vignettes, cheatsheets, and videos of our previous lectures: https://daseh.org/resources.html

2.2.1 Recordings

We record our sessions using Zoom and share the recordings with registered learners immediately afterward to:

  • Accommodate learners with other commitments
  • Allow learners to revisit materials for review after live sessions
  • Give learners access to content in the future as a refresher

Recordings also:

  • Accommodate independent learners more effectively
  • Support instructors (and assistants) who need to refresh their knowledge or are less familiar with the topics

2.2.2 Contact Info

The contact page on the website contains a contact email that may be used to send a message to DaSEH to ask a question or provide suggestions.

2.2.3 Feedback

We are continually striving to make our content better. Please contact us through our feedback form or create an issue on GitHub if you have suggestions for the project, notice typos or errors, or are interested in getting involved.

2.2.4 Survey

There will also be a survey available on our website that allows us to do research on DaSEH resource use.

The survey should take no more than 10 minutes to complete. Your feedback helps us learn how to improve our resources. Part of this includes getting a better understanding of who is using our resources and how, so that we can better design our materials. We would greatly appreciate you filling it out if you have the time!

2.2.5 Materials

Our modules teach various topics related to data science for environmental health.

Each module has lecture slides built using ioslides and R Markdown. Most have lab activities built using R Markdown.

Raw R Markdown files (.Rmd files) for each lecture session, as well as rendered HTML and PDF versions, are available on the website and our GitHub repository. Labs and lab keys are available in raw R Markdown and rendered HTML formats.

We also include cheatsheets so that learners can review the functions that they learned that day.

image of the materials and schedule table on the DaSEH website, which shows the module name; the lecture links, available as HTML, PDF, and the raw Rmd file; the lab activity links, including the Rmd file and rendered HTML version, as well as the key version that includes the answers; and additional resources like cheatsheets

2.2.6 Module Summary

  1. An Introduction to R

    This module includes the history of R and how it differs from other options like Python or Stata. It prepares students for the course experience and offers suggestions for how to learn. It also introduces jargon such as “variable”, “row”, “column”, and “object”.

    This module does not have a lab.

  2. Basic R

    This module covers object assignment, vector creation, and simple calculations.

    The lab activity is an HTML webpage (as learners have not yet been introduced to Rmd files). Learners practice performing simple mathematical calculations, assigning vectors, and checking the class and length of vectors.

  3. RStudio

    This module includes a tour of RStudio, and how to write code interactively and in scripts.

    In the lab activity, learners knit the Rmd file, add headers, and practice code chunk functionality: running chunks in the Rmd file, running all previous chunks, and creating new chunks.

  4. Reproducibility

    This module introduces concepts related to repeatability, reproducibility, and replicability. It introduces best practices for improving transparency in our code.

    The lab activity features tasks like cleaning the environment, using the set.seed() function to generate the same random numbers each time, and using the sessionInfo() function to list packages and versions used.
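
    As a minimal sketch of the seed and session tasks described above (the seed value 123 and the object names are arbitrary choices for illustration, not taken from the lab itself):

    ```r
    # Setting the same seed before each draw reproduces the same random numbers.
    set.seed(123)
    first_draw <- rnorm(3)

    set.seed(123)
    second_draw <- rnorm(3)

    # TRUE: both draws started from the same seed.
    identical(first_draw, second_draw)

    # List the R version and the packages (with versions) in the current session.
    sessionInfo()
    ```

    Resetting the seed immediately before each random draw is what makes the two results match; drawing twice after a single set.seed() call would give different numbers.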

  5. Data Input

    This module describes how to manually import data using point-and-click methods as well as programmatically (with code) through the readr package. We cover import of .csv files and other delimited files (e.g., tab delimited files), as well as tools for importing Excel files and SAS, SPSS, and Stata files. We also cover methods for checking your imported data.

    In the lab activity, learners practice the point-and-click and programmatic methods of data import using readr.

  6. Data Subsetting

    This module covers accessing specific parts of a dataset by selecting columns and filtering rows, including AND/OR logic for filtering conditions. We also cover renaming columns (using rename() from the dplyr package and clean_names() from the janitor package) and best practices for naming columns, as this can impact our ability to subset our data. We discuss why pulling the data within a column of a data frame out as a vector is often needed for mathematical operations. In addition, we cover how to arrange/sort data, remove columns, and create columns using the mutate() function.

    The lab activity covers functions like rename(), rename_with(), pull(), select(), filter() so that learners can practice renaming columns, pulling out a single column as a vector, selecting specific columns, and filtering data based on conditions. Learners also practice creating new columns with mutate().

  7. Data Summarization

    This module covers how to apply mathematical functions to get summary statistics from data, including mean, standard deviation, range, max, and min. We show how we can pull the data out as a vector to use these functions, or use the summarize() function on columns of a data frame to create a new data frame with summary statistics. We also talk about the summary() function to find quantiles of data quickly.

    In the lab activity, learners find the dimensions of the data, use the count() function to summarize the data, use pull() and mathematical functions like sum() to calculate summaries of columns, as well as use the summarize() function to summarize the data.

  8. Data Classes

    This module covers information about different types of data classes and how to coerce data from one data type to another. We cover numerical data, character data, logical data, double precision data, integer data, factors, and dates. We also touch on lists and matrices.

    In the lab activity, learners check the class of several vectors, including ones that look very similar (such as character and logical versions of “TRUE” and “FALSE”), convert a vector to a different data type, and convert a column to a new data type within a data table.
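
    The kind of check described above might look like the following base R sketch. The vectors and the small data frame are made up for illustration and are not the lab's actual data.

    ```r
    # These two vectors look alike when printed but have different classes.
    logical_vec <- c(TRUE, FALSE, TRUE)
    character_vec <- c("TRUE", "FALSE", "TRUE")

    class(logical_vec)
    class(character_vec)

    # Coerce the character vector to logical, then to numeric (TRUE becomes 1).
    as_logical <- as.logical(character_vec)
    as_numeric <- as.numeric(as_logical)

    # Convert a column to a new data type within a data frame.
    exposure <- data.frame(id = c("1", "2", "3"))
    exposure$id <- as.integer(exposure$id)
    class(exposure$id)
    ```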

  9. Data Cleaning

    This module covers how to find and work with missing data using the naniar package and count(), how to recode missing data or recode data as NA, and how to recode specific values of a column or create a new column based on conditions of other columns using the case_when() function. We also cover how to separate or unite columns and how to use stringr functions to help modify values or find specific values based on partial patterns.

    In the lab activity, learners evaluate the missing data within a dataset and recode messy data. For example, they correct mixed values such as “n”, “N”, and “No” so that each indicates “no exposure”.

  10. Manipulating Data

    This module covers how to rearrange data into long or wide format. We discuss how wide format can be useful for human interpretation and how long format is typically better for analyses and data visualization in R. We also discuss how to join/combine different datasets and describe why one might want to do this.

    In the lab activity, learners use functions like pivot_longer() and pivot_wider() to change the shape of a dataset. They also practice joining datasets and compare how the different joining functions work.

  11. Intro to Data Visualization

    This module gives learners a taste of data visualization by using a point-and-click option from the esquisse package. This can help learners quickly attempt data visualizations and copy data visualization code.

    In the lab activity, learners get to try out creating different plots with esquisse.

  12. Data Visualization

    This module dives deeper into best practices for data visualization, how to make a ggplot2 plot, and further customization options such as themes. We also cover how to troubleshoot common issues in visualizations, how to make plots interactive with plotly, and how to combine plots with patchwork.

    In the lab activity, learners practice making plots directly with ggplot2, practice applying new themes to their plots, and add facets to their plots.

  13. Factors

    This module covers why factors require special handling and how they allow for meaningful ordering of summaries, analyses, and visualizations. We show how to use the forcats package to reorder factor variables.

    In the lab activity, learners convert a variable to a factor and specify new levels for the variable. They also discover how this changes the order of the variable values within data summaries and plots, when compared to character type data.
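
    To illustrate how level order changes a summary, here is a small sketch. It uses base R's factor() rather than forcats so that it runs without extra packages, and the exposure categories are hypothetical.

    ```r
    # Hypothetical exposure categories, stored as plain character data.
    exposure <- c("low", "high", "medium", "low", "high")

    # Character data are summarized in alphabetical order: high, low, medium.
    table(exposure)

    # A factor with explicit levels is summarized in the order we specify.
    exposure_fct <- factor(exposure, levels = c("low", "medium", "high"))
    table(exposure_fct)
    levels(exposure_fct)
    ```

    The same level order carries over to plots, which is why setting it once on the variable is usually easier than reordering each output.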

  14. Statistics

    In this module, we describe how statistical tests like t-tests, correlation, and regression are performed within R. We do not focus on statistical interpretation, but rather on how one can use R tools to perform statistical tests and get the results.

    In the lab activity, learners perform a correlation test between two vectors, perform a t-test, and perform regressions (including a logistic regression).
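
    A rough sketch of these three kinds of tests is shown below. It assumes the built-in mtcars dataset as a stand-in, since it ships with R; the lab's own data and variables will differ.

    ```r
    # Correlation test between two numeric vectors.
    cor_result <- cor.test(mtcars$mpg, mtcars$wt)
    cor_result$estimate

    # Two-sample t-test comparing mpg between transmission types (am is 0 or 1).
    t_result <- t.test(mpg ~ am, data = mtcars)
    t_result$p.value

    # Linear regression, then logistic regression for a binary outcome.
    linear_fit <- lm(mpg ~ wt, data = mtcars)
    logistic_fit <- glm(am ~ wt, data = mtcars, family = binomial)
    coef(logistic_fit)
    ```

    Each function returns an object whose pieces (such as the estimate or p-value) can be pulled out with $, which is the "getting the results" step the module emphasizes.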

  15. Data Output

    In this module, learners discover that they can save their processed data as .rds (R native) or .csv files so that they don’t have to rerun processing on data (especially if it is large) or if they want to share data with others.

    In the lab activity, learners write a .csv file and an .rds file. They also read an .rds file back into R.
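
    One way this workflow could look is sketched below with base R functions. The course materials may use different functions for the same task, and the data frame and file names here are made up.

    ```r
    # A small made-up data frame standing in for processed data.
    processed <- data.frame(site = c("A", "B"), pm25 = c(8.2, 12.5))

    # Write to a temporary folder so the sketch leaves no files in your project.
    csv_path <- file.path(tempdir(), "processed.csv")
    rds_path <- file.path(tempdir(), "processed.rds")

    write.csv(processed, csv_path, row.names = FALSE)
    saveRDS(processed, rds_path)

    # Reading the .rds file restores the object, including its column types.
    restored <- readRDS(rds_path)
    identical(processed, restored)
    ```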

  16. Functions

    In this final module, learners discover the power of writing their own functions. They also learn about using apply functions and across() to iteratively apply the same function across different columns within a data frame. We discuss why reducing repetition can improve the quality and efficiency of their code.

    In the lab activity, learners create their own simple functions and use the across() function to summarize different columns of a dataset, as well as apply a new function to specific columns of a dataset.
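
    As an illustration of writing a function and applying it across columns, consider this sketch. It uses base R's sapply() as one of the "apply functions" rather than across(), and the function name and temperature readings are hypothetical.

    ```r
    # A simple custom function: convert Fahrenheit to Celsius.
    f_to_c <- function(temp_f) {
      (temp_f - 32) * 5 / 9
    }

    f_to_c(212)

    # Made-up temperature readings in Fahrenheit.
    readings <- data.frame(morning = c(50, 59), evening = c(68, 77))

    # sapply() applies the same function to every column without repeating code.
    column_means <- sapply(readings, mean)
    readings_c <- sapply(readings, f_to_c)
    ```

    Writing the conversion once and applying it to each column avoids copying the formula for every column, which is the reduction in repetition the module discusses.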

See DaSEH in the Classroom for more information about the timing/length of these modules.

2.3 Slide Structure

Overall our slide structure involves the following components:

  1. A recap of the previous day (where appropriate)

  2. Context about why the particular topic matters or why someone might need to know the material

  3. Overviews of functions and the typical command structure to use such functions

  4. Formative Gut Checks to help gauge student comprehension and to help learners stay engaged

    • For Gut Checks, it is good to keep the question very simple and fundamental. That makes people feel successful as they engage with the lecture and helps reinforce the most fundamental information.
  5. Examples using simple data

  6. Applied examples with real data (where possible)

    • Sometimes it is difficult to get a dataset that suits all your needs. In those cases, we make it clear that the data was created by us. This data is shared on our website for easy access and use.
  7. Cautionary examples of what can go wrong

  8. A summary of what we discussed

    • We also include links to other resources, such as Open Case Studies or Posit Cheatsheets

2.3.1 Example Recap Slide

An example of a recap slide which shows previous functions, syntax, and general principles or practices

2.3.2 Example Context Slide

An example of a context slide which shows why it is useful to learn about the dplyr package and why it is named dplyr

2.3.3 Example Overview Slide

An example of a general structure summary for the mutate function showing what each part of a command using the function might include

2.3.4 Example Gut Check Slide

image of a gut check from the basic R lecture that asks: What function would be useful for getting a vector version of a column? (a) pull or (b) select

2.3.5 Example of Simple Data Slide

image of a slide showing how to manipulate the data to reformat it from wide to long using the pivot_longer() function with a small dataset

2.3.6 Example of Real Data Slide

image of a slide showing how to manipulate real data to reformat it from wide to long using the pivot_longer() function with a more complicated dataset

2.3.7 Example of What Can Go Wrong Slide

image of a slide showing how using quotes around column names in filter can have unexpected results

2.3.8 Example Summary Slide

image of an example summary slide which shows functions from part of the subsetting lecture, such as select and filter, with simple explanations; it also includes cautions about avoiding quotes when using filter

2.4 DaSEH GitHub Organization

GitHub is a website and cloud service that enables developers to store, manage, and track changes to their code. DaSEH uses GitHub for both development and distribution purposes. Users have complete access to all DaSEH course materials at our course website repository. Feel free to explore!

This diagram explains the folder structure of our materials:

The folders of our resources on GitHub. The main folders of interest are:

  • data: all the data from the course
  • modules: all the lab, lecture, and homework content

In addition, most of the files outside of folders make up the website pages. Other folders help in creating the website or rendering the content:

  • docker: files for creating the environment to render everything on GitHub
  • github workflows: files for building the website from the OTTR project
  • docs, site_libs, and help_files: files from rendering
  • images: images for the course and website content
  • resources: files that help with settings for rendering, like the dictionary for checking spelling
  • scripts: scripts for automating tasks, like making lab files for students without the answers from the key

2.5 Codeathon Resources

In addition to the modules covered in the online course, we cover some content during our codeathon. You can find these materials on Google Drive, including our introduction and conclusion slides, as well as slides to accompany “stand-ups” to get people talking about their work.

2.5.1 Codeathon Slides

Our Codeathon slides cover:

  • Version Control

    • This resource discusses the basics of using GitHub/Git and R. It includes how to make a new repository, how to upload files such as data or code to GitHub, how to set up git on your local computer, and how to push changes from your local computer to GitHub.
  • Open Data

    • This resource covers places to find data to perform environmental health data science research.
  • Code Review

    • This resource includes the basics of how to do code review on GitHub, including the use of GitHub issues.
  • Data Ethics

    • This covers information about AI use, data protection, data privacy, data provenance and more.
  • Mapping mini-module

    • This covers different tools available to make maps in R, the basics for making map plots, as well as tips and resources.

2.5.2 Codeathon Schedule

Here you can see the schedule from one of our past DaSEH codeathons.

2.5.3 Stand-ups

To facilitate participants communicating with us and one another, we held regular “stand-ups” where participants stood in a circle and shared progress and challenges. We modeled these brief check-ins after stand-up meetings in the software development sphere.

These stand-up check-ins are opportunities for participants to pause and reflect on how their work is going and share that information with others. During the codeathon, learners would take turns talking about what was going well, what was not going well, what they were currently working on, and if they could use help or if they were interested in collaborating.

We often discovered that many of the participants would have common interests or challenges, leading to peer support (e.g., sharing resources) or tackling a problem together. We would encourage learners to help others, as this process can really solidify understanding of a topic. We believe this can also lead to new collaborations between professional learners.

Stand-ups also allowed us (the instructors) to determine who might need extra assistance and what kind of resources we might share with them.

Importantly, stand-ups also helped participants get to know one another, think about next steps, consider pivoting if needed, and take a break from mentally challenging tasks. These are key decision-making skills in real environmental health settings that we felt were important to discuss.

Feedback from participants suggests that they really enjoyed the stand-up process.

2.6 Data

All of the data used in the modules are hosted in the GitHub repository within the data folder. These data can be reused separately outside of our module materials.

To use the DaSEH data, you can:

  1. Interface with the data in R via URL
  2. Download data via URL on the website Data page
  3. Download the full DaSEH GitHub repository and find the data in the folder or directory called data.
  4. Directly download individual files by searching through our data directory on GitHub, clicking on a data file, and then clicking the download raw file button.

Visual showing the download raw file button on github for one of the data files

Information about which modules certain data are used in, as well as information about the data sources, can be found here. This page also includes other relevant datasets which can be used to develop or adapt course materials, such as for homework assignments, extra practice, or projects.

2.6.1 Adapting Materials

If you are interested in creating a similar website from our course materials or you want to take our lecture slides and adapt them, you can do either of the following:

  1. Download the full DaSEH GitHub repository and find the files ending in .Rmd in the folder or directory called modules to copy and paste our code and slide comments.

  2. Directly download individual files by searching through our module directory on GitHub, clicking on a file, and then clicking the download raw file button.

More information about how to adjust the individual files will be described in the DaSEH modification chapter.