```{r, echo = FALSE, message=FALSE, error = FALSE}
library(knitr)
opts_chunk$set(comment = "", message = FALSE)
suppressWarnings({library(dplyr)})
library(readr)
library(tidyverse)
library(jhur)
```
# Part 1: Numeric / continuous data
## Data Summarization
* Basic statistical summarization
    * `mean(x)`: takes the mean of x
    * `sd(x)`: takes the standard deviation of x
    * `median(x)`: takes the median of x
    * `quantile(x)`: displays sample quantiles of x. Default is min, 25th percentile, median, 75th percentile, max
    * `range(x)`: displays the range. Same as `c(min(x), max(x))`
    * `sum(x)`: sum of x
    * `max(x)`: maximum value in x
    * `min(x)`: minimum value in x
* **All have the `na.rm =` argument for missing data**
## Statistical summarization
The vector getting summarized goes inside the parentheses:
```{r}
x <- c(1, 5, 7, 4, 2, 8)
mean(x)
range(x)
sum(x)
```
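## Statistical summarization

As a small sketch, here are the remaining functions from the list applied to the same vector. The median in the comment was worked out by hand for this `x`; `quantile()` uses R's default interpolation method.

```{r}
x <- c(1, 5, 7, 4, 2, 8)
median(x)    # 4.5: midpoint of the two middle values, 4 and 5
sd(x)        # sample standard deviation (divides by n - 1)
quantile(x)  # 0%, 25%, 50%, 75%, 100%
# range(x) returns the same two numbers as c(min(x), max(x))
identical(range(x), c(min(x), max(x)))
```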
## Statistical summarization
By default, `mean()` returns `NA` when any value is missing, and `quantile()` stops with an error. Set `na.rm = TRUE` ("remove NAs") to drop the missing values first.
```{r error = TRUE}
x <- c(1, 5, 7, 4, 2, 8, NA)
mean(x)
mean(x, na.rm = TRUE)
quantile(x)
quantile(x, na.rm = TRUE)
```
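## Statistical summarization

Before dropping missing values, it can help to know how many there are. A minimal sketch, using a made-up vector `y` (not from the workshop data): `is.na()` flags the missing entries and `sum()` counts them.

```{r}
y <- c(1, 5, 7, 4, 2, 8, NA)   # made-up example values
is.na(y)             # TRUE only for the missing entry
sum(is.na(y))        # number of missing values
mean(y[!is.na(y)])   # same result as mean(y, na.rm = TRUE)
```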
## Statistical summarization{.codesmall}
Arithmetic summaries such as `sum()` and `mean()` work only on numeric or logical vectors, not on character vectors.
```{r error = TRUE}
x <- c(1, 5, 7, 4, 2, 8)
sum(x)
z <- c("hello", "goodbye")
sum(z)
```
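## Statistical summarization{.codesmall}

Logical vectors can be summarized because R treats `TRUE` as 1 and `FALSE` as 0. A quick sketch with a made-up vector `passed`:

```{r}
passed <- c(TRUE, FALSE, TRUE, TRUE)   # made-up example values
sum(passed)    # how many are TRUE
mean(passed)   # proportion that are TRUE
```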
## But how do we do this on dataframes?
First we will need to learn about something called the "pipe".
The pipe is this operator in R:
`%>%`
It tells R to "pipe" the dataset on the left into the function on the right, where it becomes that function's first argument.
## Using the pipe `%>%`
```{r}
states <- read_csv("https://hutchdatascience.org/SeattleStatSummer_R/data/states.csv")
states %>% head() # Same as head(states)!
```
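## Using the pipe `%>%`

The pipe is most useful when several steps are chained. The sketch below uses a made-up vector `values` rather than the workshop data, and compares nested calls with the piped equivalent.

```{r}
library(dplyr)  # provides the %>% pipe

values <- c(1, 4, 9)   # made-up example values

# Nested calls read inside-out...
sum(sqrt(values))

# ...while the piped version reads left to right
values %>% sqrt() %>% sum()
```

Both lines compute the same thing: take square roots, then add them up.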
## States data
`colnames()` will show us the column names.
```{r}
colnames(states)
```
## States data
We can also use the pipe:
```{r}
states %>% colnames()
```
# Summarizing the data
## Summarize the data: `summarize()` function
`summarize` creates a summary table of a column you're interested in.
```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>%
  summarize({summary column name} = {operator(source column)})
```
## Summarize the data: `dplyr` `summarize()` function
`summarize` creates a summary table of a column you're interested in.
```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>%
  summarize({summary column name} = {operator(source column)})
```
```{r}
states %>%
  summarize(mean_population = mean(population))
```
## What if there are NAs in my data?
```{r}
states %>%
  summarize(mean_cesarean = mean(cesarean_percent))
states %>%
  summarize(mean_cesarean = mean(cesarean_percent, na.rm = TRUE))
```
Add `na.rm = TRUE` to ignore the missing values.
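## What if there are NAs in my data?

It can be useful to report how many values were missing alongside the summary. A minimal sketch on a hypothetical toy table `toy` (not the workshop's `states` file):

```{r}
library(dplyr)

# Hypothetical toy data, not the workshop's `states` file
toy <- tibble(
  name  = c("A", "B", "C", "D"),
  value = c(10, NA, 30, 20)
)

toy %>%
  summarize(n_missing  = sum(is.na(value)),          # count of NAs
            mean_value = mean(value, na.rm = TRUE))  # mean of the rest
```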
## Summarize the data: `dplyr` `summarize()` function
`summarize()` can compute several summaries at once: separate them with commas. Putting each one on its own line keeps things tidy!
```{r}
states %>%
  summarize(mean_population = mean(population),
            median_population = median(population))
```
## `summary()` Function
Using `summary()` can give you a rough snapshot of each numeric column (character columns show only their length, class, and mode):
```{r}
summary(states)
```
This can also be written with the pipe:
```{r}
states %>% summary()
```
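## `summary()` Function

`summary()` also works on a single numeric vector or column. Unlike `mean()`, it does not need `na.rm`: missing values are counted separately. A short sketch with a made-up vector `scores`:

```{r}
scores <- c(2, 4, 4, 6, NA)   # made-up example values
summary(scores)               # adds an "NA's" count when values are missing
```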
# Let's practice!
## Practice
Fill in the blanks below to `summarize()` the `fertility_rate_per_1000` column of the `states` dataset. Find the mean, min, and max.
```{r eval=FALSE}
states %>%
  summarize(___ = mean(___),
            ___ = min(___),
            ___ = max(___))
```
## Practice
Fill in the blanks below to `summarize()` the `fertility_rate_per_1000` column of the `states` dataset. Find the mean, min, and max.
```{r}
states %>%
  summarize(mean_fert = mean(fertility_rate_per_1000),
            min_fert = min(fertility_rate_per_1000),
            max_fert = max(fertility_rate_per_1000))
```
## Summary Part 1
- don't forget the `na.rm = TRUE` argument!
- `summary(x)`: quantile information
- `summarize`: creates a summary table of columns of interest
# Part 2: Categorical data
## `count` function
With no arguments, `count` returns the number of rows of data.
```{r}
states %>% count()
```
## `count` function
Use `count` to return a frequency table of unique elements of a category (column).
```{r}
states %>% count(state_region)
```
## `count` function
Listing multiple columns further subdivides the counts.
```{r, message = FALSE}
states %>% count(state_region, state_division)
```
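## `count` function

To see both behaviors on data small enough to check by hand, here is a sketch on a hypothetical toy table `pets`. It also uses `count()`'s `sort = TRUE` argument, which puts the most frequent values first.

```{r}
library(dplyr)

# Hypothetical toy data, not the workshop's `states` file
pets <- tibble(
  kind  = c("cat", "dog", "dog", "cat", "dog"),
  color = c("black", "brown", "black", "black", "brown")
)

pets %>% count(kind, sort = TRUE)   # most frequent kind first
pets %>% count(kind, color)         # one row per kind/color combination
```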
# Grouping
## Perform Operations By Groups: dplyr
`group_by` allows you to group the dataset by the variables/columns you specify:
```{r}
# Regular data
states
```
## Perform Operations By Groups: dplyr
`group_by` allows you to group the dataset by the variables/columns you specify:
```{r}
states_grouped <- states %>% group_by(state_region)
states_grouped
```
## Summarize the grouped data
It's grouped! Grouping doesn't change the data itself, only how **functions operate on it**. Now we can summarize `population` by group:
```{r}
states_grouped %>% summarize(total_population = sum(population))
```
## Use the `pipe` to string these together!
Pipe `states` into `group_by`, then pipe that into `summarize`:
```{r}
states %>%
  group_by(state_region) %>%
  summarize(total_population = sum(population))
```
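## Use the `pipe` to string these together!

The same pattern on a hypothetical toy table `toy_regions`, small enough that the per-group totals can be checked by hand:

```{r}
library(dplyr)

# Hypothetical toy data, not the workshop's `states` file
toy_regions <- tibble(
  region     = c("East", "East", "West", "West", "West"),
  population = c(10, 20, 5, 5, 30)
)

toy_regions %>%
  group_by(region) %>%
  summarize(total_population = sum(population))
# East: 10 + 20 = 30; West: 5 + 5 + 30 = 40
```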
# Let's practice!
## Practice
Fill in the blanks to group by `state_region` and compute the average `fertility_rate_per_1000` for each region.
```{r eval=FALSE}
states %>%
  group_by(___) %>%
  summarize(___ = mean(___))
```
## Practice
Fill in the blanks to group by `state_region` and compute the average `fertility_rate_per_1000` for each region.
```{r}
states %>%
  group_by(state_region) %>%
  summarize(avg_fert = mean(fertility_rate_per_1000))
```
## Counting
`n()` can also give you the sample size per group (NAs included).
```{r}
states %>%
  group_by(state_region) %>%
  summarize(total_population = sum(population),
            sample_size = n())
```
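## Counting

Because `n()` counts rows whether or not a value is missing, it can differ from the number of values a summary actually used. A sketch on a hypothetical toy table `toy_rates`, comparing `n()` with a count of non-missing values:

```{r}
library(dplyr)

# Hypothetical toy data with one missing value
toy_rates <- tibble(
  region = c("East", "East", "West", "West"),
  rate   = c(1.5, NA, 2.0, 3.0)
)

toy_rates %>%
  group_by(region) %>%
  summarize(n_rows    = n(),                # all rows, NA or not
            n_present = sum(!is.na(rate)),  # rows with a value
            mean_rate = mean(rate, na.rm = TRUE))
```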
## Summary
- don't forget the `na.rm = TRUE` argument!
- `summary()`: quantile information
- `summarize`: creates a summary table of columns of interest
- `count(x)`: what unique values do you have?
- `group_by(x)`: changes how all subsequent functions operate on the data
    - combine with `summarize()` to get statistics per group
    - `summarize()` with `n()` gives the sample size (NAs included)
🏠 [Workshop Website](https://hutchdatascience.org/SeattleStatSummer_R/)