Chapter 2 DaSEH Infrastructure 
2.1 Learning Objectives
In this chapter we will describe the overall infrastructure of the DaSEH Project, which includes:
- The DaSEH website resources
- Topics included in modules (lectures and labs)
- Methods for feedback
- The DaSEH GitHub repository
- The DaSEH Codeathon Resources
- Data used for DaSEH
2.2 DaSEH Website
The DaSEH website describes the mission of the DaSEH project, information for learners to apply to participate, a summary of how the course works, links to our learning content, and links to additional resources.
It is built using R Markdown on GitHub at this repository: https://github.com/fhdsl/daseh and includes Google Forms to gather various information from potential participants or those with interest or feedback.
Our landing page describes details about the idea for DaSEH. New applicants interested in participating can apply on our Apply tab.
Links to all of our learning modules (including lab activities and slides) can be found on the DaSEH website on the Course content tab on the Materials + Schedule page.
We also include links to all the data used in the course, as well as additional data that could be used for projects or extra practice, on our data page.
To help users navigate challenges, our Error FAQ page offers help with common errors.
To provide extra resources to help learners, we also have a Resources page under our Support tab where one can find extra help, cheatsheets, and videos of our previous lectures.
We like to share recordings of our lectures (using Zoom) with the students right after we record to:
- allow students who have other commitments to still participate
- allow learners to watch the materials again even if they were present during the live course, to help them review
- allow learners to look back at the material in the future for a refresher
Recordings also allow independent learners to use the other materials more effectively, and they support potential teaching assistants who may need to learn or refresh their knowledge on a topic.
2.2.1 Contact Info
The contact page on the website contains a contact email that may be used to send a message to DaSEH to ask a question or provide suggestions.
2.2.2 Feedback
We are continually striving to make our content better. Please contact us on our feedback form if you have ideas or suggestions for the project.
Also please let us know if you notice typos or errors, or if you are interested in getting involved.
2.2.3 Survey
There will also be a survey available on our website that allows us to do research on DaSEH resource use.
The survey should take no more than 10 minutes to complete. Your feedback helps us learn more about how to improve our resources. Part of this includes getting a better understanding of who is using our resources and how so that we can better design our materials. We would greatly appreciate you filling it out if you have the time!
2.2.4 Materials
We have the following modules on various topics related to data science for environmental public health.
Each one has lecture slides built using ioslides and R Markdown, as well as lab activities also built using R Markdown.
The rendered version of the slides, as well as the raw R Markdown files (called Rmd files), are available on the website or our GitHub repository. In addition, the rendered and raw versions of both the lab and the lab key are also available.
We also include cheatsheets so that learners can review the functions that they learned that day.

2.2.5 Modules
This module includes the history of how R came to be and how it differs from other options like Python or Stata. It also includes how students should anticipate the experience of learning in the course and suggestions for how to learn. It also introduces jargon such as variable, row, column, and object.
This module does not have a lab associated with it.
This module includes how to assign objects, how to create vectors, and how to perform simple calculations.
The lab activity, which is just an HTML webpage (as learners have not yet been introduced to Rmd files), involves performing simple mathematical operations, assigning vectors, and checking the class and length of vectors.
This module includes a tour of RStudio and covers how to write code that can be saved in a file as well as code that is run interactively, for instance when installing a package where the user may have to answer questions.
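As a minimal sketch of the kind of task described above (not taken from the lab itself), the following assigns a vector, performs simple calculations, and checks the class and length. The object name `pm25` and its values are invented for illustration.

```r
# Hypothetical measurements, invented for illustration
pm25 <- c(12.1, 8.4, 15.3, 9.9)

# Simple calculations on the vector
pm25_mean <- sum(pm25) / length(pm25)  # average computed by hand
pm25_doubled <- pm25 * 2               # arithmetic is applied to every element

# Check the class and length of the vector
class(pm25)   # "numeric"
length(pm25)  # 4
```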
The lab activity involves knitting the Rmd file, running code chunks in the Rmd file, running all previous chunks, creating new chunks, and adding headers.
This module introduces concepts related to repeatability, reproducibility, and replicability. It introduces best practices for improving transparency in our code.
The lab activity features tasks like cleaning the environment, using the set.seed() function to generate the same random numbers each time, and using the sessionInfo() function to list packages and versions used.
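The two functions named above can be sketched as follows. This is an illustrative example rather than the lab's own code; the seed value and object names are arbitrary choices.

```r
# Setting the same seed before each draw gives the same "random" numbers
set.seed(123)
first_draw <- rnorm(3)

set.seed(123)
second_draw <- rnorm(3)

identical(first_draw, second_draw)  # TRUE

# Record the R version and the packages (with versions) in use
info <- sessionInfo()
info$R.version$version.string
```

Including the output of `sessionInfo()` at the end of a rendered report is one common way to document the software environment an analysis was run in.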
This module describes how to import data using point-and-click methods within RStudio as well as through using functions of the readr package. We cover import of CSV files and other delimited files like tab-delimited files, as well as tools for importing Excel files and SAS, SPSS, and Stata files. We also cover methods for checking your imported data.
The lab activity covers using the point-and-click option within RStudio and using code with readr.
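A minimal sketch of the code-based import route, assuming the readr package is installed. So that the example is self-contained, it first writes a small made-up CSV to a temporary file; the file contents and the object name `air` are hypothetical and are not DaSEH data.

```r
library(readr)

# Write a small, made-up CSV to a temporary file (stand-in for a real data file)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("site,pm25", "A,12.1", "B,8.4"), csv_path)

# Import the file; show_col_types = FALSE silences the column type message
air <- read_csv(csv_path, show_col_types = FALSE)

# Check the imported data
dim(air)    # number of rows and columns
head(air)   # first few rows
```

For tab-delimited files, readr also provides `read_tsv()`, and `read_delim()` accepts an explicit `delim` argument.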
This module includes information about how to access the specific parts of our data that we are interested in, including using options to select specific columns or filter for specific rows. In doing so we cover AND and OR logic for applying conditions for filtering. We also cover renaming columns (using rename from the dplyr package and clean_names from the janitor package) and best practices for naming columns, as this can impact our ability to subset our data. We discuss why pulling the data within a column of a data frame as a vector is often needed for mathematical operations. In addition, we cover how to arrange data based on a specific order of interest, how to remove columns, and how to create columns using the mutate() function.
The lab activity covers functions like rename(), rename_with(), pull(), select(), filter() so that learners can practice renaming columns, pulling out the vector version of columns, selecting specific columns, and filtering data based on thresholds or other conditions of specific columns. Learners also practice creating new columns with mutate().
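The functions listed above can be combined as in the following sketch, which assumes the dplyr package is installed. The data frame `air` and its column names are invented for illustration.

```r
library(dplyr)

# Hypothetical data, invented for illustration
air <- tibble(
  Site = c("A", "B", "C", "D"),
  PM25 = c(12.1, 8.4, 15.3, 9.9),
  Urban = c(TRUE, FALSE, TRUE, FALSE)
)

air_clean <- air %>%
  rename(site = Site, pm25 = PM25, urban = Urban) %>%  # new_name = old_name
  filter(pm25 > 9 & urban) %>%                         # AND: both conditions must hold
  select(site, pm25) %>%                               # keep only these columns
  mutate(pm25_rounded = round(pm25)) %>%               # create a new column
  arrange(desc(pm25))                                  # sort from highest to lowest

# pull() returns the column as a plain vector, which mean() can work with
mean_pm25 <- air_clean %>% pull(pm25) %>% mean()
```

Replacing `&` with `|` in the `filter()` call would apply OR logic instead, keeping rows where either condition holds.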
This module covers how to apply mathematical functions to get summary statistics from data, including mean, standard deviation, range, max, and min. We show how we can pull the data out as a vector to use these functions, or how we can use the summarize() function on columns of a data frame to create a new data frame with summary statistics. We also talk about the summary() function to find quantiles of data quickly.
In the lab activity, learners find the dimensions of the data, use the count function to summarize the data, use pull and mathematical functions like sum to calculate summaries of columns, as well as use the summarize() function to summarize the data.
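A short sketch of these summary steps, assuming dplyr is installed. The data are made up for illustration and the object names are arbitrary.

```r
library(dplyr)

# Hypothetical data, invented for illustration
air <- tibble(
  region = c("north", "north", "south", "south", "south"),
  pm25 = c(12, 8, 15, 10, 5)
)

dim(air)                              # rows and columns
region_counts <- count(air, region)   # number of rows per region

# Pull a column out as a vector and apply a mathematical function
total_pm25 <- air %>% pull(pm25) %>% sum()

# Or summarize columns directly into a new data frame
air_summary <- air %>%
  summarize(mean_pm25 = mean(pm25), sd_pm25 = sd(pm25), max_pm25 = max(pm25))

summary(air$pm25)                     # quick look at quantiles
```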
This module covers information about different types of data classes and how to coerce data from one data type to another.
We cover numerical data, character data, logical data, double-precision data, integer data, factors, and dates. We also touch on what lists and matrices are.
In the lab activity, learners check the class of several vectors including those that look very similar, such as a character and logical version of “TRUE” and “FALSE”, convert a vector to a different data type, and convert a column to a new data type within a data table.
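The distinction between look-alike vectors, and coercion between types, can be sketched in base R as follows. All object names and values here are invented for illustration.

```r
# These two vectors print similarly but have different classes
logical_vec <- c(TRUE, FALSE)
character_vec <- c("TRUE", "FALSE")
class(logical_vec)    # "logical"
class(character_vec)  # "character"

# Coerce a vector from one type to another
as_logical <- as.logical(character_vec)
as_number <- as.numeric(c("1", "2.5"))
as_whole <- as.integer(c(1.9, 2.1))   # drops the decimal part: 1 2

# Convert a column to a new type within a data frame
readings <- data.frame(id = c("1", "2"), value = c("3.2", "4.8"))
readings$value <- as.numeric(readings$value)
class(readings$value)  # "numeric"
```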
This module covers how to find and work with missing data using the naniar package and count, how to recode missing data or recode data as NA, how to recode specific values of a column or create a new column based on conditions of other columns using the case_when() function. We also cover how to separate or unite columns and how to use stringr functions to help modify values or find specific values based on patterns within the values as opposed to perfect matches.
In the lab activity, learners evaluate the missing data within a dataset and recode data within a dataset that has many different values for the same measurement. For example, “n”, “N”, and “No” to indicate “no exposure”.
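The recoding example above might look something like the following sketch, which assumes dplyr is installed. The data frame `survey`, its values, and the recoded labels are hypothetical. Missing values are counted here with base R's `is.na()` rather than the naniar package, purely to keep the example small.

```r
library(dplyr)

# Hypothetical responses with inconsistent coding, invented for illustration
survey <- tibble(exposure = c("n", "N", "No", "yes", "Y", NA))

survey <- survey %>%
  mutate(exposure_clean = case_when(
    exposure %in% c("n", "N", "No") ~ "no exposure",
    exposure %in% c("yes", "Y") ~ "exposure",
    TRUE ~ NA_character_   # anything else, including NA, stays missing
  ))

# Count the missing values in the recoded column
n_missing <- sum(is.na(survey$exposure_clean))
```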
This module covers how to rearrange data so that it is either in long or wide format. We discuss how wide format can be useful for human interpretation and how long format is useful for R to use the data for analyses and data visualizations. We also discuss how to join different datasets together and describe why one might want to do this.
In the lab activity, learners use functions like pivot_longer and pivot_wider to change the shape of a dataset. They also practice doing joins of datasets together and comparing how the different joining functions work.
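As an illustrative sketch (assuming dplyr and tidyr are installed), reshaping and joining could look like this. The datasets `wide` and `site_info` are invented; site "C" is deliberately absent from `wide` so that the join types give different results.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per year
wide <- tibble(site = c("A", "B"), y2020 = c(12, 8), y2021 = c(11, 9))

# Wide to long: one row per site-year combination
long <- wide %>%
  pivot_longer(cols = c(y2020, y2021), names_to = "year", values_to = "pm25")

# Long back to wide
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = pm25)

# A second hypothetical dataset to join on the shared "site" column
site_info <- tibble(site = c("A", "C"), region = c("north", "south"))

joined_left <- left_join(wide, site_info, by = "site")    # keeps sites A and B
joined_inner <- inner_join(wide, site_info, by = "site")  # keeps only site A
joined_full <- full_join(wide, site_info, by = "site")    # keeps sites A, B, and C
```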
This module gives learners a taste of making data visualizations by using the point-and-click option of the esquisse package so that learners can quickly attempt data visualizations and get the code for generating such visualizations.
In the lab activity, learners get to try out creating different plots with esquisse.
This module dives deeper into best practices for data visualization, how to make a ggplot2 plot, and more customization options for creating data visualizations such as using themes, how to spot common issues in visualizations and fix them, as well as how to make plots interactive or how to combine plots using other packages like plotly and patchwork.
In the lab activity, learners practice making plots directly with ggplot2 and practice applying new themes to their plots and faceting their plots.
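A minimal ggplot2 sketch with a theme and facets, assuming ggplot2 is installed. The data and labels are made up for illustration and do not come from the lab.

```r
library(ggplot2)

# Hypothetical data, invented for illustration
air <- data.frame(
  year = rep(2019:2021, times = 2),
  pm25 = c(12, 11, 10, 8, 9, 7),
  site = rep(c("A", "B"), each = 3)
)

p <- ggplot(air, aes(x = year, y = pm25)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 2019:2021) +  # whole years on the x axis
  facet_wrap(~ site) +                      # one panel per site
  labs(x = "Year", y = "PM2.5") +
  theme_bw()                                # swap in a different built-in theme

p  # printing the object draws the plot
```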
In this module learners discover why factors are a unique type of data that requires special care to make summaries, data analysis results, and visualizations showing data in the proper order. We show how to use the forcats package to reorder a factor variable based on the values of another variable.
In the lab activity, learners convert a variable to a factor class and specify new levels for the variable. They also discover how this changes the order of the variable values within data summaries and plots, as compared to the data just being a character variable instead of a factor.
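The effect of factor levels on ordering can be sketched as follows, assuming the forcats package is installed. The vectors are hypothetical examples.

```r
library(forcats)

# As plain character data, values sort alphabetically
exposure_chr <- c("low", "high", "medium", "low")
sort(unique(exposure_chr))   # "high" "low" "medium"

# As a factor, we can specify a meaningful order for the levels
exposure_fct <- factor(exposure_chr, levels = c("low", "medium", "high"))
levels(exposure_fct)         # "low" "medium" "high"

# Reorder the levels of one variable by the values of another
site <- factor(c("A", "B", "C"))
pm25 <- c(15, 8, 12)
site_by_pm25 <- fct_reorder(site, pm25)
levels(site_by_pm25)         # "B" "C" "A" (lowest to highest pm25)
```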
In this module, we describe how statistical tests like t-tests, correlation, and regression are performed within R. We do not focus on the statistical interpretation, but rather on how one can use R tools to perform statistical tests and get the results.
In the lab activity, learners perform a correlation test between two vectors, perform a t-test, and perform regressions (including a logistic regression).
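These tests are available in base R's stats package. The sketch below uses simulated data, so the variable names and the relationships between them are assumptions made only for illustration.

```r
# Simulated data, invented for illustration
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
group <- rep(c("a", "b"), each = 15)
outcome <- rbinom(30, size = 1, prob = plogis(x))  # simulated 0/1 outcome

cor_result <- cor.test(x, y)    # correlation test between two vectors
t_result <- t.test(y ~ group)   # two-sample t-test comparing the groups

lm_fit <- lm(y ~ x)                               # linear regression
glm_fit <- glm(outcome ~ x, family = binomial)    # logistic regression

# Pull specific results out of the returned objects
cor_result$estimate
t_result$p.value
coef(lm_fit)
summary(glm_fit)
```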
In this module, learners discover that they can save their processed data as RDS (R native) or CSV files. This means they don’t have to rerun processing on their data (which is especially helpful if the data is large), and it makes it easy to share data with others.
In the lab activity, learners write a csv and RDS file, as well as read back in an RDS file.
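A minimal sketch of writing and reading these file types. It uses base R functions and temporary file paths so that it is self-contained; the module itself may use other functions (for example from readr), and the data frame here is invented.

```r
# Hypothetical processed data, invented for illustration
processed <- data.frame(site = c("A", "B"), pm25 = c(12.1, 8.4))

# Temporary paths stand in for real file locations
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(processed, csv_path, row.names = FALSE)  # plain text, easy to share
saveRDS(processed, rds_path)                       # R native, preserves column types

# Read the RDS file back in and confirm nothing changed
restored <- readRDS(rds_path)
identical(processed, restored)  # TRUE
```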
In this final module, learners discover the power of writing their own functions. They also learn about using the apply functions and across() function to apply the same function across different columns within a data frame. We discuss why using less repetitive code can improve the quality and efficiency of their code.
In the lab activity, learners create their own simple functions and use the across function to summarize different columns of a dataset, as well as apply a new function to specific columns of a dataset.
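The following sketch shows a simple user-written function applied to several columns with `across()`, assuming dplyr is installed. The function `to_celsius()`, the data frame `weather`, and its columns are all hypothetical.

```r
library(dplyr)

# A simple user-written function: convert Fahrenheit to Celsius
to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

# Hypothetical data in Fahrenheit, invented for illustration
weather <- tibble(
  site = c("A", "B"),
  temp_min = c(32, 50),
  temp_max = c(68, 86)
)

# Summarize several columns with the same function
temp_means <- weather %>%
  summarize(across(c(temp_min, temp_max), mean))

# Apply our own function to the same columns, storing results in new "_c" columns
weather_c <- weather %>%
  mutate(across(starts_with("temp"), to_celsius, .names = "{.col}_c"))

# Base R alternative from the apply family
temp_maxima <- sapply(weather[c("temp_min", "temp_max")], max)
```

Writing `to_celsius()` once and applying it with `across()` avoids repeating the same conversion formula for every column.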
2.3 Slide Structure
Overall our slide structure involves the following components:
- A recap of the previous day (where appropriate)

- Context about why the particular topic matters or why someone might need to know the material

- Overviews of functions and the typical command structure to use such functions

- Gut Checks to help gauge student comprehension and to help learners stay engaged

For Gut Checks it is good to keep the question very simple and fundamental. This helps people feel successful as they engage with the lecture and helps reinforce the most fundamental information.
- Examples using simpler data

- Applied examples with real data (where possible). Sometimes it is difficult to get a dataset that suits all your needs. In those cases we make it clear that the data was created by us and then we share it on our website for easy access and use.
- Cautionary examples of what can go wrong

- A summary of what we discussed

- Links to course resources and potentially other resources such as Open Case Studies or Posit Cheatsheets

2.4 DaSEH GitHub Organization
GitHub is a website and cloud service that enables developers to store, manage, and track changes to their code. DaSEH uses GitHub for both development and distribution purposes. Users have complete access to all DaSEH course materials at our course website repository.
This diagram explains the folder structure of our materials:

2.5 Codeathon Resources
In addition to the modules, we have resources available that are covered during our codeathon. These resources also include our introduction and conclusion slides, as well as slides about running stand ups to get people talking about their work.
2.5.1 Codeathon Slides
Our codeathon slides include resources about:
- Version Control: This resource discusses the basics of using GitHub/Git and R. It includes how to make a new repository, how to upload files to GitHub like data or code, how to set up usage of Git on your local computer, and how to push changes from your local computer to GitHub.
- Open Data: This resource covers places to find data to perform environmental health data science research.
- Code Review: This resource includes the basics of how to do code review on GitHub, including the use of GitHub issues.
- Data Ethics: This covers information about AI use, data protection, data privacy, data provenance, and more.
- Mapping mini-module: This covers different tools available to make maps in R, the basics for making map plots, as well as tips and resources.
2.5.2 Codeathon Schedule
Here you can see the schedule from one of our past DaSEH codeathons.
2.5.3 Stand ups
To facilitate participants communicating with us and one another, we had what we called “stand ups” throughout the codeathon. These are opportunities for participants to pause and reflect on how their work is going and share that information with others.
We had participants stand up from their desks and stand in a large circle where they would take turns talking about what was going well, what was not going well, what they were currently working on, and if they could use help or if they were interested in collaborating.
We often discovered that many of the participants had common interests or challenges, and they would sometimes decide to tackle these together or share resources on the topic. We would encourage learners to help others, as this process can really solidify understanding of a topic. This can also lead to new collaborations between attendees.
This also allowed us as instructors to determine who might need extra assistance and what kind of resources we might share with them.
Importantly, it also helped participants get to know one another and take a break from their work to think about what their next steps should be and whether they needed to pivot what they were working on. This decision-making process is exactly what happens in real research labs, and was a very important skill to discuss.
Feedback from participants suggested that they really enjoyed the stand up process.
2.6 Data
All of the data used in the modules are also included in the GitHub repository within the data folder. This data can be reused outside of our module materials for other purposes.
To use the DaSEH data, you can either:
- Download the full DaSEH GitHub repository and find the data in the folder or directory called data.
- Directly download individual files by searching through our data directory on GitHub, clicking on a data file, and then clicking the download raw file button.

Information about which modules the data is used in and information about the data sources can be found here: https://daseh.org/data.html. This page also includes other relevant datasets which can be used to develop or adapt course materials, such as for homework assignments, extra practice, or projects.
2.6.1 Adapting materials
If you are interested in creating a similar website from our course materials or you want to take our lecture slides and adapt them, you can do either of the following:
- Download the full DaSEH GitHub repository and find the file ending in .Rmd in the folder or directory called Module to copy and paste our code and slide comments.
- Directly download individual files by searching through our module directory on GitHub, clicking on a module file, and then clicking the download raw file button.
More information about how to adjust the individual files will be described in the DaSEH modification chapter.