mean(x)
: takes the mean of x
sd(x)
: takes the standard deviation of x
median(x)
: takes the median of x
quantile(x)
: displays sample quantiles of x. Default is min, quartiles (25%, 50%, 75%), max
range(x)
: displays the range. Same as c(min(x), max(x))
sum(x)
: sum of x
max(x)
: maximum value in x
min(x)
: minimum value in x
All of these accept the na.rm = argument for missing data.
The vector getting summarized goes inside the parentheses:
x <- c(1, 5, 7, 4, 2, 8)
mean(x)
[1] 4.5
range(x)
[1] 1 8
sum(x)
[1] 27
Note that many of these functions have additional inputs regarding missing data, typically requiring the na.rm
argument (“remove NAs”).
x <- c(1, 5, 7, 4, 2, 8, NA)
mean(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 4.5
quantile(x)
Error in quantile.default(x): missing values and NaN's not allowed if 'na.rm' is FALSE
quantile(x, na.rm = TRUE)
  0%  25%  50%  75% 100%
 1.0  2.5  4.5  6.5  8.0
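As a small sketch of the same idea with other summary functions, it can help to count the missing values before dropping them. The vector y below is made up for illustration and is not part of the lesson data.

```r
# Hypothetical vector with two missing values (illustration only)
y <- c(2, 4, NA, 8, NA, 10)

# is.na() returns TRUE/FALSE per element; sum() counts the TRUEs
n_missing <- sum(is.na(y))

# With na.rm = TRUE the NAs are dropped before the statistic is computed
y_median <- median(y, na.rm = TRUE)
y_range  <- range(y, na.rm = TRUE)

n_missing
y_median
y_range
```

Checking the number of NAs first makes it clear how many values the summary is actually based on.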
Summarization works only on numeric or logical vectors, not on character vectors.
x <- c(1, 5, 7, 4, 2, 8)
sum(x)
[1] 27
z <- c("hello", "goodbye")
sum(z)
Error in sum(z): invalid 'type' (character) of argument
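Logical vectors can be summarized because R treats TRUE as 1 and FALSE as 0. A minimal sketch, reusing the x vector from above (the names over_four, n_over_four, and prop_over_four are illustrative):

```r
x <- c(1, 5, 7, 4, 2, 8)

# Comparison produces a logical vector, one TRUE/FALSE per element
over_four <- x > 4

# sum() counts the TRUEs; mean() gives the proportion of TRUEs
n_over_four    <- sum(over_four)
prop_over_four <- mean(over_four)

n_over_four
prop_over_four
```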
First we will need to learn about something called the “pipe”.
The pipe is this operator in R:
%>%
It tells R to “pipe” the dataset on the left into the next function.
library(tidyverse) # loads readr (read_csv) and dplyr (%>%)
states <- read_csv("https://hutchdatascience.org/SeattleStatSummer_R/data/states.csv")
states %>% head() # Same as head(states)!
# A tibble: 6 × 14
  entity     state_abb state_area_sq_mil… state_division state_region population
  <chr>      <chr>                  <dbl> <chr>          <chr>             <dbl>
1 Alabama    AL                     51609 East South Ce… South           4903185
2 Alaska     AK                    589757 Pacific        West             731545
3 Arizona    AZ                    113909 Mountain       West            7278717
4 Arkansas   AR                     53104 West South Ce… South           3017804
5 California CA                    158693 Pacific        West           39512223
6 Colorado   CO                    104247 Mountain       West            5758736
# … with 8 more variables: births_in_2021 <dbl>, fertility_rate_per_1000 <dbl>,
#   cesarean_percent <dbl>, life_expect <dbl>, cancer_rate_per_100000 <dbl>,
#   cancer_mortality <dbl>, Administered_Dose1_Pop_Pct <dbl>,
#   Series_Complete_Pop_Pct <dbl>
colnames()
will show us the column names.
colnames(states)
 [1] "entity"                     "state_abb"
 [3] "state_area_sq_miles"        "state_division"
 [5] "state_region"               "population"
 [7] "births_in_2021"             "fertility_rate_per_1000"
 [9] "cesarean_percent"           "life_expect"
[11] "cancer_rate_per_100000"     "cancer_mortality"
[13] "Administered_Dose1_Pop_Pct" "Series_Complete_Pop_Pct"
We can also use the pipe:
states %>% colnames()
 [1] "entity"                     "state_abb"
 [3] "state_area_sq_miles"        "state_division"
 [5] "state_region"               "population"
 [7] "births_in_2021"             "fertility_rate_per_1000"
 [9] "cesarean_percent"           "life_expect"
[11] "cancer_rate_per_100000"     "cancer_mortality"
[13] "Administered_Dose1_Pop_Pct" "Series_Complete_Pop_Pct"
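The pipe also works when the function takes extra arguments: whatever is on the left becomes the first argument of the function on the right. A minimal sketch, assuming dplyr is installed; the toy tibble is hypothetical and stands in for the states data.

```r
library(dplyr)  # provides %>% (re-exported from magrittr)

# Hypothetical toy data, not the states dataset
toy <- tibble(name = c("a", "b", "c", "d"), value = c(10, 20, 30, 40))

# The left-hand side becomes the first argument of head()
piped  <- toy %>% head(2)
nested <- head(toy, 2)

# Both forms return the same two rows
identical(piped, nested)
```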
summarize() function
summarize creates a summary table of a column you’re interested in.
# General format - Not the code!
{data to use} %>% summarize({summary column name} = {operator(source column)})
states %>% summarize(mean_population = mean(population))
# A tibble: 1 × 1
  mean_population
            <dbl>
1        6373716.
states %>% summarize(mean_cesarean = mean(cesarean_percent))
# A tibble: 1 × 1
  mean_cesarean
          <dbl>
1            NA
The result is NA because cesarean_percent contains missing values. To get a number, add na.rm = TRUE.
states %>% summarize(mean_cesarean = mean(cesarean_percent, na.rm = TRUE))
# A tibble: 1 × 1
  mean_cesarean
          <dbl>
1          30.9
dplyr summarize() function
summarize() can do multiple operations at once. Separate them with commas. Putting each one on its own line keeps things tidy!
states %>%
  summarize(mean_population = mean(population),
            median_population = median(population))
# A tibble: 1 × 2
  mean_population median_population
            <dbl>             <dbl>
1        6373716.           4342705
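Several summaries can be combined with na.rm = TRUE in one call. The following is a sketch on a hypothetical toy tibble (the names toy, rate, and toy_summary are illustrative), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data with one missing value (not the states dataset)
toy <- tibble(
  region = c("East", "East", "West", "West"),
  rate   = c(10, NA, 30, 50)
)

toy_summary <- toy %>%
  summarize(
    mean_rate = mean(rate, na.rm = TRUE),  # mean of the three non-missing values
    max_rate  = max(rate, na.rm = TRUE),   # largest non-missing value
    n_missing = sum(is.na(rate))           # how many values were NA
  )

toy_summary
```

Reporting the number of missing values next to the statistics keeps track of what na.rm = TRUE dropped.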
summary() Function
Using summary() can give you a rough snapshot of each numeric column (character columns show only their length, class, and mode):
summary(states)
    entity           state_abb         state_area_sq_miles state_division
 Length:52          Length:52          Min.   :    68      Length:52
 Class :character   Class :character   1st Qu.: 32675      Class :character
 Mode  :character   Mode  :character   Median : 54629      Mode  :character
                                       Mean   : 69654
                                       3rd Qu.: 82587
                                       Max.   :589757
 state_region         population       births_in_2021   fertility_rate_per_1000
 Length:52          Min.   :  578759   Min.   :  5384   Min.   :30.80
 Class :character   1st Qu.: 1790876   1st Qu.: 18778   1st Qu.:53.83
 Mode  :character   Median : 4342705   Median : 50312   Median :56.45
                    Mean   : 6373716   Mean   : 70838   Mean   :56.36
                    3rd Qu.: 7362761   3rd Qu.: 82266   3rd Qu.:60.70
                    Max.   :39512223   Max.   :420608   Max.   :68.60
 cesarean_percent  life_expect    cancer_rate_per_100000 cancer_mortality
 Min.   :23.40    Min.   :71.90   Min.   :121.0          Min.   : 1093
 1st Qu.:28.62    1st Qu.:75.38   1st Qu.:140.7          1st Qu.: 3514
 Median :31.05    Median :76.80   Median :150.8          Median : 8921
 Mean   :30.93    Mean   :76.62   Mean   :150.3          Mean   :12085
 3rd Qu.:33.58    3rd Qu.:78.10   3rd Qu.:159.2          3rd Qu.:14356
 Max.   :38.50    Max.   :80.70   Max.   :184.7          Max.   :59503
 NA's   :2        NA's   :2       NA's   :2              NA's   :2
 Administered_Dose1_Pop_Pct Series_Complete_Pop_Pct
 Min.   :60.70              Min.   :52.90
 1st Qu.:69.10              1st Qu.:59.55
 Median :77.20              Median :66.25
 Mean   :78.99              Mean   :68.19
 3rd Qu.:90.72              3rd Qu.:75.10
 Max.   :95.00              Max.   :87.40
This can also be written with the pipe, which produces the same output:
states %>% summary()
Modify the code below from the states dataset to summarize() the fertility_rate_per_1000 column. Find the mean, min, and max.
states %>%
  summarize(___ = mean(___),
            ___ = min(___),
            ___ = max(___))
states %>%
  summarize(mean_fert = mean(fertility_rate_per_1000),
            min_fert = min(fertility_rate_per_1000),
            max_fert = max(fertility_rate_per_1000))
# A tibble: 1 × 3
  mean_fert min_fert max_fert
      <dbl>    <dbl>    <dbl>
1      56.4     30.8     68.6
Remember the na.rm = TRUE argument!
summary(x)
: quantile information
summarize
: creates a summary table of columns of interest
count function
Use count to return the number of rows of data.
states %>% count()
# A tibble: 1 × 1
      n
  <int>
1    52
count function
Use count to return a frequency table of unique elements of a category (column).
states %>% count(state_region)
# A tibble: 5 × 2
  state_region      n
  <chr>         <int>
1 North Central    12
2 Northeast         9
3 South            17
4 West             13
5 <NA>              1
count function
Listing multiple columns further subdivides the count.
states %>% count(state_region, state_division)
# A tibble: 10 × 3
   state_region  state_division         n
   <chr>         <chr>              <int>
 1 North Central East North Central     5
 2 North Central West North Central     7
 3 Northeast     Middle Atlantic        3
 4 Northeast     New England            6
 5 South         East South Central     4
 6 South         South Atlantic         9
 7 South         West South Central     4
 8 West          Mountain               8
 9 West          Pacific                5
10 <NA>          <NA>                   1
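count() also has a sort argument that orders the frequency table from most to least common. A brief sketch on a hypothetical toy tibble (toy and region_counts are illustrative names), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data, not the states dataset
toy <- tibble(region = c("West", "East", "West", "South", "West", "East"))

# sort = TRUE puts the most frequent category first
region_counts <- toy %>% count(region, sort = TRUE)

region_counts
```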
group_by allows you to group the data set by variables/columns you specify:
# Regular data
states
# A tibble: 52 × 14
   entity      state_abb state_area_sq_m… state_division state_region population
   <chr>       <chr>                <dbl> <chr>          <chr>             <dbl>
 1 Alabama     AL                   51609 East South Ce… South           4903185
 2 Alaska      AK                  589757 Pacific        West             731545
 3 Arizona     AZ                  113909 Mountain       West            7278717
 4 Arkansas    AR                   53104 West South Ce… South           3017804
 5 California  CA                  158693 Pacific        West           39512223
 6 Colorado    CO                  104247 Mountain       West            5758736
 7 Connecticut CT                    5009 New England    Northeast       3565287
 8 Delaware    DE                    2057 South Atlantic South            973764
 9 Florida     FL                   58560 South Atlantic South          21477737
10 Georgia     GA                   58876 South Atlantic South          10617423
# … with 42 more rows, and 8 more variables: births_in_2021 <dbl>,
#   fertility_rate_per_1000 <dbl>, cesarean_percent <dbl>, life_expect <dbl>,
#   cancer_rate_per_100000 <dbl>, cancer_mortality <dbl>,
#   Administered_Dose1_Pop_Pct <dbl>, Series_Complete_Pop_Pct <dbl>
group_by allows you to group the data set by variables/columns you specify:
states_grouped <- states %>% group_by(state_region)
states_grouped
# A tibble: 52 × 14
# Groups:   state_region [5]
   entity      state_abb state_area_sq_m… state_division state_region population
   <chr>       <chr>                <dbl> <chr>          <chr>             <dbl>
 1 Alabama     AL                   51609 East South Ce… South           4903185
 2 Alaska      AK                  589757 Pacific        West             731545
 3 Arizona     AZ                  113909 Mountain       West            7278717
 4 Arkansas    AR                   53104 West South Ce… South           3017804
 5 California  CA                  158693 Pacific        West           39512223
 6 Colorado    CO                  104247 Mountain       West            5758736
 7 Connecticut CT                    5009 New England    Northeast       3565287
 8 Delaware    DE                    2057 South Atlantic South            973764
 9 Florida     FL                   58560 South Atlantic South          21477737
10 Georgia     GA                   58876 South Atlantic South          10617423
# … with 42 more rows, and 8 more variables: births_in_2021 <dbl>,
#   fertility_rate_per_1000 <dbl>, cesarean_percent <dbl>, life_expect <dbl>,
#   cancer_rate_per_100000 <dbl>, cancer_mortality <dbl>,
#   Administered_Dose1_Pop_Pct <dbl>, Series_Complete_Pop_Pct <dbl>
It’s grouped! Grouping doesn’t change the data itself, only how functions operate on it. Now we can summarize population by group:
states_grouped %>% summarize(total_population = sum(population))
# A tibble: 5 × 2
  state_region  total_population
  <chr>                    <dbl>
1 North Central         68329004
2 Northeast             55982803
3 South                125580448
4 West                  78347268
5 <NA>                   3193694
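To see that grouping only attaches metadata, the grouped and ungrouped versions can be compared directly. This is a sketch on a hypothetical toy tibble, assuming dplyr is installed; is_grouped_df() and group_vars() are dplyr helpers for inspecting the grouping.

```r
library(dplyr)

# Hypothetical toy data, not the states dataset
toy <- tibble(
  region     = c("East", "East", "West"),
  population = c(100, 200, 400)
)

toy_grouped <- toy %>% group_by(region)

# The grouping is stored as metadata on the tibble
grouped_flag  <- is_grouped_df(toy_grouped)
grouping_cols <- group_vars(toy_grouped)

# The column values themselves are unchanged by group_by()
same_values <- identical(toy_grouped$population, toy$population)

grouped_flag
grouping_cols
same_values
```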
Use the pipe to string these together!
Pipe states into group_by, then pipe that into summarize:
states %>%
  group_by(state_region) %>%
  summarize(total_population = sum(population))
# A tibble: 5 × 2
  state_region  total_population
  <chr>                    <dbl>
1 North Central         68329004
2 Northeast             55982803
3 South                125580448
4 West                  78347268
5 <NA>                   3193694
Modify the code to group by state_region and summarize by average fertility_rate_per_1000.
states %>%
  group_by(___) %>%
  summarize(___ = mean(___))
states %>%
  group_by(state_region) %>%
  summarize(avg_fert = mean(fertility_rate_per_1000))
# A tibble: 5 × 2
  state_region  avg_fert
  <chr>            <dbl>
1 North Central     60.1
2 Northeast         51.2
3 South             58.0
4 West              56.3
5 <NA>              30.8
n() can also give you the sample size per group (NAs included).
states %>%
  group_by(state_region) %>%
  summarize(total_population = sum(population),
            sample_size = n())
# A tibble: 5 × 3
  state_region  total_population sample_size
  <chr>                    <dbl>       <int>
1 North Central         68329004          12
2 Northeast             55982803           9
3 South                125580448          17
4 West                  78347268          13
5 <NA>                   3193694           1
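Counting rows per group with n() gives the same numbers as count() on that column. A minimal sketch on a hypothetical toy tibble (including one NA to show it gets its own group), assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical toy data with one missing region (not the states dataset)
toy <- tibble(region = c("West", "East", "West", NA, "East", "West"))

# Two routes to the same frequency table
via_count     <- toy %>% count(region)
via_summarize <- toy %>% group_by(region) %>% summarize(n = n())

# Same groups in the same order, with the same counts
identical(via_count$region, via_summarize$region)
identical(via_count$n, via_summarize$n)
```

count() is the shorter form when only the group sizes are needed; group_by() with summarize() is the one to reach for when other statistics go alongside n().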
Remember the na.rm = TRUE argument!
summary()
: quantile information
summarize
: creates a summary table of columns of interest
count(x)
: what unique values do you have?
group_by(x)
: changes how all subsequent functions operate; follow it with summarize() to get statistics per group
summarize() with n()
: gives the sample size (NAs included)