Chapter 15 Part 2. Summarizing the Data with Statistics
Now that we have the dataset loaded, let’s explore the data in more depth.
First, we should remove those samples that don’t have soil testing data yet. We could keep them in the dataset, but removing them at this stage will make the analysis a little cleaner. In this case, as we know the reason the data are missing (and that reason will not skew our analysis), we can safely remove these samples. This will not be the case for every data analysis.
We can remove the unanalyzed samples using the drop_na()
function from the tidyr
package. This function removes any rows from a table that contains NA for a particular column. This command follows the code structure:
dataset_new_name <- dataset %>% drop_na(column_name)
The `%>% is called a pipe and it tells R that the commands after it should all be applied to the object in front of it. (In this case, we can filter out all samples missing a value for “As_EPA3051” as a proxy for samples without soil testing data.)
Great! Now let’s calculate some basic statistics. For example, we might want to know what the mean (average) arsenic concentration is for all the soil samples. We can use a combination of two functions: pull()
and mean()
. pull()
lets you extract a column from your table for statistical analysis, while mean()
calculates the average value for the extracted column.
This command follows the code structure:
OBJECT %>% pull(column_name) %>% mean()
pull()
is a command from the tidyverse
package, so we’ll need to load that library before our command.
## [1] 5.10875
We can run similar commands to calculate the standard deviation (sd
), minimum (min
), and maximum (max
) for the soil arsenic values.
## [1] 5.606926
## [1] 0
## [1] 27.3
The soil testing dataset contains samples from multiple geographic regions, so maybe it’s more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the group_by
and summarize
functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the group_by
function, then we tell R to summarize the mean arsenic concentration for each group.
When using the summarize
function, we tell R to make a new table (technically, a tibble in R) that contains two columns: the column used to group the data and the statistical measure we calculated for each group.
This command follows the code structure:
dataset %>% group_by(column_name) %>% summarize(mean(column_name))
## # A tibble: 2 × 2
## region `mean(As_EPA3051)`
## <chr> <dbl>
## 1 Baltimore City 5.56
## 2 Montgomery County 4.66
Now we know that the mean arsenic concentration might be different for each region. If we compare the samples from Baltimore City and Montgomery County, the Baltimore City samples appear to have a higher mean arsenic concentration than the Montgomery County samples.
QUESTIONS:
All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City?
What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value?
Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest?
Let’s say we’re interested in looking at mean concentrations that were determined using EPA Method 3051. Given that there are 8 of these measures in the soil.values
dataset, it would be time consuming to run our code from above for each individual measure.
We can add two arguments to our summarize
statement to calculate statistical measures for multiple columns at once: the across
argument, which tells R to apply the summarize
command to multiple columns; and the ends_with
parameter, which tells R which columns should be included in the statistical calculation.
We are using ends_with
because for this question, all the columns that we’re interested in end with the string ‘EPA3051’.
This command follows the code structure:
dataset %>% group_by(column_name) %>% summarize(across(ends_with(common_column_name_ending), mean))
## # A tibble: 2 × 8
## region As_EPA3051 Cd_EPA3051 Cr_EPA3051 Cu_EPA3051 Ni_EPA3051 Pb_EPA3051
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baltimore C… 5.56 0.359 34.5 35.0 17.4 67.2
## 2 Montgomery … 4.66 0.402 29.9 24.3 23.4 38.7
## # ℹ 1 more variable: Zn_EPA3051 <dbl>
This is a much more efficient way to calculate statistics.
QUESTIONS:
Calculate the maximum values for concentrations that were determined using EPA Method 3051. (HINT: change the function you call in the
summarize
statement.) Which of these metals has the maximum concentration you see, and in which region is it found?Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the terms in the
columns_to_include
vector, as well as the function you call in thesummarize
statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found?