```{r, echo = FALSE, message=FALSE, error = FALSE}
library(knitr)
opts_chunk$set(comment = "", message = FALSE)
suppressWarnings({library(dplyr)})
library(readr)
library(tidyverse)
library(jhur)
```

# Part 1: Numeric / continuous data

## Data Summarization

* Basic statistical summarization
    * `mean(x)`: takes the mean of x
    * `sd(x)`: takes the standard deviation of x
    * `median(x)`: takes the median of x
    * `quantile(x)`: displays sample quantiles of x. Default is min, 25th percentile, median, 75th percentile, max
    * `range(x)`: displays the range. Same as `c(min(x), max(x))`
    * `sum(x)`: sum of x
    * `max(x)`: maximum value in x
    * `min(x)`: minimum value in x
    * **all have the** `na.rm =` **argument for missing data**

## Statistical summarization

The vector getting summarized goes inside the parentheses:

```{r}
x <- c(1, 5, 7, 4, 2, 8)
mean(x)
range(x)
sum(x)
```

## Statistical summarization

Note that many of these functions return `NA` (or, for `quantile()`, an error) when the data contain missing values, unless you set the `na.rm` argument ("remove NAs") to `TRUE`.

```{r error = TRUE}
x <- c(1, 5, 7, 4, 2, 8, NA)
mean(x)
mean(x, na.rm = TRUE)
quantile(x)
quantile(x, na.rm = TRUE)
```

## Statistical summarization{.codesmall}

You can only do summarization on numeric or logical types, not on characters.

```{r error = TRUE}
x <- c(1, 5, 7, 4, 2, 8)
sum(x)
z <- c("hello", "goodbye")
sum(z)
```

## But how do we do this on dataframes?

First we will need to learn about something called the "pipe".

The pipe is this operator in R: `%>%`

It tells R to "pipe" the dataset on the left into the next function.

## Using the pipe `%>%`

```{r}
states <- read_csv("https://hutchdatascience.org/SeattleStatSummer_R/data/states.csv")
states %>% head() # Same as head(states)!
```

## States data

`colnames()` will show us the column names.

```{r}
colnames(states)
```

## States data

We can also use the pipe:

```{r}
states %>% colnames()
```

# Summarizing the data

## Summarize the data: `summarize()` function

`summarize` creates a summary table of a column you're interested in.

```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>% 
  summarize({summary column name} = {operator(source column)})
```

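## Summarize the data: a toy example

To make the general format concrete, here is a minimal sketch on a small made-up table. The `toy_heights` data and its `height` column are invented for illustration only and are not part of the `states` data.

```{r}
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy_heights <- tibble(height = c(150, 160, 170))

# {data to use} is `toy_heights`, the summary column name is `mean_height`,
# and the operator is `mean()` applied to the source column `height`
toy_summary <- toy_heights %>% 
  summarize(mean_height = mean(height))
toy_summary
```

The result is a new one-row table whose only column is the summary column you named.
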
## Summarize the data: `dplyr` `summarize()` function

`summarize` creates a summary table of a column you're interested in.

```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>% 
  summarize({summary column name} = {operator(source column)})
```

```{r}
states %>% 
  summarize(mean_population = mean(population))
```

## What if there are NAs in my data?

```{r}
states %>% 
  summarize(mean_cesarean = mean(cesarean_percent))
states %>% 
  summarize(mean_cesarean = mean(cesarean_percent, na.rm = TRUE))
```

If a column contains `NA`s, its mean is `NA` too. To ignore the missing values, add `na.rm = TRUE`.

## Summarize the data: `dplyr` `summarize()` function

`summarize()` can do multiple operations at once. Separate them with a comma. Putting each one on its own line keeps things tidy!

```{r}
states %>% 
  summarize(mean_population = mean(population),
            median_population = median(population))
```

## `summary()` Function

Using `summary()` can give you a rough snapshot of each column: quartiles and the mean for numeric columns, and only length, class, and mode for character columns:

```{r}
summary(states)
```

Can also be written with the pipe:

```{r}
states %>% summary()
```

# Let's practice!

## Practice

Modify the code below to `summarize()` the `fertility_rate_per_1000` column of the `states` dataset. Find the mean, min, and max.

```{r eval=FALSE}
states %>% 
  summarize(___ = mean(___),
            ___ = min(___),
            ___ = max(___))
```

## Practice

Modify the code below to `summarize()` the `fertility_rate_per_1000` column of the `states` dataset. Find the mean, min, and max.

```{r}
states %>% 
  summarize(mean_fert = mean(fertility_rate_per_1000),
            min_fert = min(fertility_rate_per_1000),
            max_fert = max(fertility_rate_per_1000))
```

## Summary Part 1

- don't forget the `na.rm = TRUE` argument!
- `summary(x)`: quantile information
- `summarize`: creates a summary table of columns of interest

# Part 2: Categorical data

## `count` function

Use `count` to return the number of rows of data.

```{r}
states %>% count()
```

## `count` function

Use `count` to return a frequency table of unique elements of a category (column).

```{r}
states %>% count(state_region)
```

## `count` function

Listing multiple columns further subdivides the count.

```{r, message = FALSE}
states %>% count(state_region, state_division)
```

# Grouping

## Perform Operations By Groups: dplyr

`group_by` allows you to group the data set by variables/columns you specify:

```{r}
# Regular data
states
```

## Perform Operations By Groups: dplyr

`group_by` allows you to group the data set by variables/columns you specify:

```{r}
states_grouped <- states %>% group_by(state_region)
states_grouped
```

## Summarize the grouped data

It's grouped! Grouping doesn't change the data itself, only how **functions operate on it**. Now we can summarize `population` by group:

```{r}
states_grouped %>% summarize(total_population = sum(population))
```

## Use the `pipe` to string these together!

Pipe `states` into `group_by`, then pipe that into `summarize`:

```{r}
states %>%
  group_by(state_region) %>%
  summarize(total_population = sum(population))
```

# Let's practice!

## Practice

Modify the code to group by `state_region` and summarize the average `fertility_rate_per_1000`.

```{r eval=FALSE}
states %>% 
  group_by(___) %>% 
  summarize(___ = mean(___))
```

## Practice

Modify the code to group by `state_region` and summarize the average `fertility_rate_per_1000`.

```{r}
states %>% 
  group_by(state_region) %>% 
  summarize(avg_fert = mean(fertility_rate_per_1000))
```

## Counting

`n()` can also give you the sample size per group (NAs included).

```{r}
states %>%
  group_by(state_region) %>%
  summarize(total_population = sum(population),
            sample_size = n())
```

## Summary

- don't forget the `na.rm = TRUE` argument!
- `summary()`: quantile information
- `summarize`: creates a summary table of columns of interest
- `count(x)`: what unique values do you have?
- `group_by(x)`: changes all subsequent functions
    - combine with `summarize()` to get statistics per group
- `summarize()` with `n()` gives the sample size (NAs included)

🏠 [Workshop Website](https://hutchdatascience.org/SeattleStatSummer_R/)
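
## Extra: putting it together on toy data

As a recap, the sketch below combines `group_by()`, `summarize()`, `na.rm = TRUE`, and `n()` on a small made-up table. The `toy_births` data, its `region` and `births` columns, and the `n_non_missing` summary are hypothetical and chosen only to show that `n()` counts every row, while `na.rm = TRUE` drops missing values from the mean.

```{r}
library(dplyr)

# Hypothetical toy data with one missing value, made up for illustration only
toy_births <- tibble(
  region = c("East", "East", "West", "West"),
  births = c(10, NA, 30, 50)
)

toy_by_region <- toy_births %>% 
  group_by(region) %>% 
  summarize(mean_births = mean(births, na.rm = TRUE), # missing value is dropped
            n_non_missing = sum(!is.na(births)),      # counts only observed values
            sample_size = n())                        # counts every row, NAs included
toy_by_region
```

Comparing `n_non_missing` with `sample_size` is one way to see how many values each group's mean is actually based on.
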